What language do language models think in?

How does your brain work when you speak a foreign language? Does it first translate everything into your native language, or does it understand and form phrases directly in the foreign one? Most likely everyone will answer this in their own way, and the answer will depend on your level of proficiency, on how you were taught the language, and on how your thinking and speech work in general. It is even more interesting to ask how things stand with large language models. They are trained mostly on English text, yet somehow they also start speaking other languages quite well. Worse than English, yes, but still quite decently. So it is only natural that, on the general wave of interest in AI interpretability, people want to understand where this multilingualism comes from.

Intuitively (and simply by analogy with people), it seems that since the model was trained mostly on English, English must be its “native” language. That is, when we ask GPT something in Russian, it first translates the question into English, formulates the answer there, and then translates it back into Russian. If this is really the case, the model is biased not only toward English grammar and vocabulary, but also toward the corresponding metaphors, logic, and behavior, that is, toward the mentality of the English-speaking world. But what if it is not the case? Then things are even stranger: it becomes completely unclear how the model achieves such good results from such a modest amount of training data in those languages.

A team from EPFL in Lausanne ran an experiment to find out what happens inside an LLM when it is addressed in different languages. The authors took models from the Llama-2 family. These were trained on multilingual text, but the overwhelming majority of it (89.7%) is English. It is worth saying right away that since we are talking about a huge amount of training data, even a small percentage is still a lot: for example, 0.13% of Chinese tokens is in fact about 2.6 billion tokens, more than there are people in China.

To interpret the model's hidden states, the authors used the logit lens technique. The idea is to turn not only the final hidden states of the transformer's last block into tokens, but the intermediate ones as well. They all have the same form, so there is no fundamental obstacle to this: we simply take the hidden states out early and decode them. If the hypothesis holds, these non-final hidden states should contain something like the model's native language. To avoid ambiguity, the authors compiled a set of prompts with a single correct one-word answer. For example, they gave the model pairs of words in French and Chinese and asked it to continue the series with the correct Chinese word:

Français: "vertu" - 中文: "德"

Français: "siège" - 中文: "座"

Français: "neige" - 中文: "雪"

Français: "montagne" - 中文: "山"

Français: "fleur" - 中文: "

The diagram below shows the output tokens obtained at different layers with the logit lens. The output layer produces the correct character “花” (flower), the initial layers produce something incoherent that has nothing to do with flowers in any language, and the middle layers produce the right meaning, but with a preference for English.
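Below is a minimal sketch of this per-layer decoding, assuming access to a Llama-2 checkpoint through Hugging Face transformers (the checkpoint name and the decoding details are my assumptions, not the authors' code). It feeds the prompt above into the model and applies the logit lens, i.e. the final norm and the unembedding matrix, to each layer's hidden state at the last prompt position:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; any Llama-2 size works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = (
    'Français: "vertu" - 中文: "德"\n'
    'Français: "siège" - 中文: "座"\n'
    'Français: "neige" - 中文: "雪"\n'
    'Français: "montagne" - 中文: "山"\n'
    'Français: "fleur" - 中文: "'
)
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors of shape (1, seq_len, d)
for layer, h in enumerate(out.hidden_states):
    h_last = h[0, -1]  # hidden state at the last prompt token
    # Logit lens: apply the final norm and the unembedding matrix early
    logits = model.lm_head(model.model.norm(h_last))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax().item())!r}")
```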

This and several other tests were run for German, French, Chinese, and Russian. To look for a hypothetical reference language inside Llama-2, the authors applied the logit lens to the hidden state of the last input token at each layer. This yields a probability distribution over the next token, and since the correct answer is a single word in a single language, that distribution can be collapsed into the probability of one language or another.
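As a rough proxy for these language probabilities (my own simplification, not the paper's exact protocol, which aggregates over many prompts and over whole words), one can check at every layer how much probability mass the logit-lens distribution puts on the first token of the correct Chinese answer versus its English translation, reusing tok, model and out from the sketch above:

```python
import torch

def first_token_id(tokenizer, word: str) -> int:
    # First sub-token of the word, without special tokens such as BOS
    return tokenizer(word, add_special_tokens=False)["input_ids"][0]

zh_id = first_token_id(tok, "花")      # the correct Chinese answer
en_id = first_token_id(tok, "flower")  # the same concept in English

for layer, h in enumerate(out.hidden_states):
    probs = torch.softmax(model.lm_head(model.model.norm(h[0, -1])).float(), dim=-1)
    print(f"layer {layer:2d}: P(zh)={probs[zh_id].item():.3f}  P(en)={probs[en_id].item():.3f}")
```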

The graph shows the language probabilities by layer (from left to right: the 7B, 13B and 70B models). In the first half of the layers, the probability of the (correct) Chinese answer is zero, and the same goes for English. Somewhere in the middle, English makes a sharp jump and then falls off closer to the output layers, while Chinese grows slowly, and only in the last few layers does it overtake English and shoot up toward one. This pattern stays essentially the same across model sizes and across tasks.

Now let's build a geometric picture that helps trace the transformer's path. Greatly simplifying, the transformer's job is to map input embeddings to output embeddings. Each layer modifies the internal vector it receives from the previous layers. Geometrically, this corresponds to a trajectory in d-dimensional Euclidean space, where d is the embedding dimension. The hidden states live roughly on a hypersphere of radius √d. On this sphere the authors plot the trajectory of a translation (in the example, from French to Chinese), introducing, in addition to probability, also “energy” and entropy. Energy reflects how much of the hidden state goes toward predicting the next token. As a result, the trajectory through the depths of the transformer consists of three phases (a small entropy sketch follows the list):

  1. High entropy (around 14 bits), low token energy, and no dominant language. At this stage, the authors believe, the model builds suitable representations of the input tokens. It is not yet trying to predict the next token (the low “energy” points to this: the hidden states are nearly orthogonal to the space of output tokens). Hence the high degree of freedom.

  2. Low entropy (1-2 bits), energy still low, but English dominance emerging. This is a kind of abstraction or concept region. The embeddings move closer to the output and reflect the general idea, which can surface in different languages and as different close meanings. English variants gain an edge because English clearly dominates the training data. The energy is still not high at this point, because the hidden states carry more information about the input than about the output.

  3. Energy rises to 20-30%, entropy stays low, Chinese becomes the main language. At this stage the abstract concept is tied to the target language. Information that is not needed for the next token is discarded, i.e. all the “energy” is directed at generating the answer.
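For reference, the entropy quoted above is just the Shannon entropy, in bits, of the logit-lens next-token distribution at each layer; the “token energy” is the authors' own projection-based measure and is not reproduced here. A minimal sketch, again reusing tok, model and out from the first code block:

```python
import math

import torch

for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.model.norm(h[0, -1])).float()
    log_probs = torch.log_softmax(logits, dim=-1)
    # Shannon entropy of the next-token distribution, converted from nats to bits
    entropy_bits = -(log_probs.exp() * log_probs).sum().item() / math.log(2)
    print(f"layer {layer:2d}: entropy = {entropy_bits:.1f} bits")
```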

The results can be interpreted in different ways. On the one hand, somewhere in the depths of the model the correct (or at least semantically close) answer really does appear in English first, and only then in the required language. This can be read as the model first translating into its native English.

But if we use the notions of energy and entropy that the authors introduce, it turns out that the model first generates a meaning, a concept, an abstract idea. In English, yes, but only because it has far more English words in stock. That is, the LLM does have a native language, only it is not English but the language of concepts. English still remains the model's foundation, but in a completely different sense.

More of our AI article reviews — on the Pro AI channel.
