Are LLMs Fully Multilingual?

When I first started to play with ChatGPT in December of last year, I was pleasantly surprised to discover it could handle multiple languages. Needless to say, the LLM can translate from and into multiple languages. But it can also take non-English prompts and respond in the original query language. A recent test comparing various automated translation platforms suggests ChatGPT beats the competition in most languages. I do not find that surprising, as ChatGPT uses a much more sophisticated LLM that has been fine-tuned in several ways.

If you are using the free version of ChatGPT, you are limited to 4,096 tokens per query. That includes both your query and the response provided by the bot. So the shorter the prompt, the longer the possible response. Words comprise one or more tokens, so 100 words will typically require more than 100 tokens. You can try OpenAI’s tokenizer here to see how that works. OpenAI assumes that, on average, one token corresponds to about 0.75 English words. A query plus its response can therefore span at most about 3,072 words, on average. That is the equivalent of roughly five single-spaced pages. While the interactive ChatGPT 3.5 is free, the API charges per thousand tokens used: 0.0015 USD for inputs and 0.002 USD for outputs. Generating a 3,072-word response (roughly 4,096 tokens) would thus cost about 0.008 USD, on top of the tokens used in the query. Very affordable indeed. However, the number of tokens per word varies across languages, with some having higher requirements and, thus, higher development costs.
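To see how token counts drift across languages, here is a minimal sketch using OpenAI’s tiktoken library with the cl100k_base encoding used by the GPT-3.5 chat models; the sample sentences are my own and only meant as an illustration.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by the GPT-3.5 chat models
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English":    "The dog eats the apple.",
    "Spanish":    "El perro se come la manzana.",
    "Indonesian": "Anjing itu memakan apel itu.",
}

for language, sentence in samples.items():
    tokens = enc.encode(sentence)
    words = len(sentence.split())
    print(f"{language:10s} {words} words -> {len(tokens)} tokens "
          f"({len(tokens) / words:.2f} tokens per word)")
```

The same idea expressed in a language that tokenizes less efficiently will burn through the 4,096-token budget, and the API bill, faster.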

Using the API is instrumental for prompt engineering, a powerful way to use LLMs efficiently. Designing the prompts beforehand and submitting them in a specific order can help get much-improved responses from the model. It also makes trial and error, and outright automation, feasible. But of course, using the API demands some Python knowledge and access to a platform (local or cloud) where one can run the code. The other advantage is that one could avoid ChatGPT translations altogether and instead use one of the “open source” LLMs, as shown here. I replicated that example in my own environment, and it works fine, albeit slowly, as I do not have GPU access. But I did not have to pay for token use and abuse!
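As an illustration of what I mean by designing prompts beforehand, here is a minimal sketch using the openai Python package (the pre-1.0 ChatCompletion interface). The system prompt, the model name, and the queries are placeholders of my own, not anything prescribed by OpenAI.

```python
# pip install openai   (pre-1.0 interface shown here)
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# A prompt designed ahead of time; individual queries are then sent in a fixed order
system_prompt = (
    "You are a careful translator. Translate the user's text into Spanish, "
    "preserving idiomatic constructions such as reflexive and pronominal verbs."
)
queries = [
    "The dog eats the apple.",
    "Raúl does not like apples.",
]

for text in queries:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        temperature=0,  # keep the output as repeatable as possible for testing
    )
    print(text, "->", response["choices"][0]["message"]["content"])
```

Looping over a list of queries like this is what makes trial and error, and eventually full automation, practical.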

But how do these “multilingual” models really work? An example of failure is asking ChatGPT or Bard to translate “The dog eats the apple” into Spanish. Both spit out “El perro come la manzana,” which is incorrect, as they seem oblivious to reflexive verbs. The correct answer is “El perro SE come la manzana.” It is roughly the same as translating “A Raúl no le gustan las manzanas” as “Raul no like apples.” However, the open source model I tested provides the correct result! How could that be?
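For anyone who wants to reproduce the open-source check, here is a minimal sketch of the kind of setup I used, assuming the Helsinki-NLP OPUS-MT English-to-Spanish model served through the Hugging Face transformers pipeline; the exact model in the example I replicated may differ.

```python
# pip install transformers sentencepiece
from transformers import pipeline

# Helsinki-NLP publishes OPUS-MT models for a large number of language pairs;
# opus-mt-en-es handles English -> Spanish. It runs on CPU, just slowly.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")

result = translator("The dog eats the apple.")
print(result[0]["translation_text"])
```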

The recent report on multilingual LLMs published by the Center for Democracy and Technology casts plenty of light on the subject. For starters, we all know that most of the content that has been digitized and is available on the Internet is in English. In specialized areas such as academia and industry, the gap between English and content in other languages is even more significant. Consequently, multilingual LLMs are trained on datasets that vary in size, the largest being in English, followed by some of the other top languages. However, some of the non-English coverage relies on automated translation to and from English. Of course, such models can be further refined by adding new data to the original set, but that is usually done in English. In other words, I get additional text in, say, Indonesian, which I translate into English and then use to refine my multilingual model. Here, the potential for compounding errors is significant, not to mention issues of bias, discrimination and lack of local context.

The report makes a series of recommendations on addressing this challenge, aimed at tech companies, researchers and funders, and governments. Some of them are not unique to LLMs, as expected. Moreover, the report does not factor in the role of governments in the Global South, which, working alongside researchers and funders, could create local ecosystems to develop more comprehensive non-English LLMs. For example, the UAE recently released its state-of-the-art Falcon LLM. Such a development could serve as a role model for developing more sophisticated non-English LLMs. Do not expect Big Tech or other super-profits-driven entities to do this. The market does not point in that direction anyway.

The model used in the example I replicated on my not-so-powerful laptop was created by the University of Helsinki, which offers more than one thousand translation models covering many language pairs in different flavors. While the report does not mention it, that is an ongoing initiative that should be supported. So there is light at the end of the multilingual tunnel. But we still need to walk the talk all the way to the end.

Raúl