The N-gram language model in the era of LLMs – how it works and why it is needed

Trends are trends, and there will always be those who swim against the tide. While everyone else is trying to shrink their models, researchers from the University of Washington decided to ignore size altogether and test whether it makes sense to return to N-gram language models in the era of LLMs. It turns out that it does – at the very least, out of curiosity.

Perhaps that is why no one had really paid attention to N-grams for a long time: the scaling techniques that took Transformers to stratospheric heights were never applied to them. But the authors of the paper Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens trained an N-gram model on 1.4 trillion tokens – the largest model of this architecture ever built. It takes up 10 tebibytes, yet counting any n-gram takes only 20 milliseconds, no matter how large n is. And the most interesting part is exactly which values of n are possible.

The development of n-gram models quickly ran into an exponential growth of the model with n. Already at n = 5, on any reasonably sized dataset, the number of combinations whose counts have to be stored in the probability table is enormous, so in practice nobody went beyond 5. But 5 tokens (really 4, since the last one is the token being predicted) is such a short context that any comparison with LLMs was out of the question. The authors not only pushed n above 5 – they removed the restriction entirely: n can be as large as you like. They therefore call these ∞-grams, or "infini-grams". Of course, this required a slightly modified approach.
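
To get a feel for the blowup, here is a back-of-the-envelope calculation (the vocabulary size of 50,000 is an arbitrary assumption for illustration, not a number from the paper):

```python
# Upper bound on the number of entries in a full n-gram count table,
# assuming a (hypothetical) vocabulary of 50,000 tokens.
V = 50_000
for n in range(2, 6):
    print(f"{n}-grams: up to {V ** n:.1e} combinations")
# 5-grams alone allow ~3.1e+23 combinations, which is why classic
# count-based models stopped at n = 5 (and pruned heavily even then).
```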

The numerator and denominator of the fraction that defines the probability of a token after a given context can both be zero. This is the straightforward counting definition: the probability is the number of times the full n-gram occurs in the corpus divided by the number of times the (n−1)-token context occurs. If such a sequence of tokens never appears in the corpus, there is simply nothing to count. To cope with this, even for infini-grams, the authors use the backoff technique: if the denominator (and therefore the numerator) is zero, decrease n by one until the denominator becomes positive (the numerator may still be zero). The backoff notionally starts from infinity; in practice, the initial "infinite" n is computed as the length of the longest suffix of the prompt that occurs in the corpus, plus one.
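
For intuition, here is a minimal brute-force sketch of this estimate. It scans the corpus directly instead of using the suffix-array machinery described below, so it is only practical on toy data; the function name and the unigram fallback are my own choices, not part of the paper.

```python
from collections import Counter

def infgram_next_token_probs(corpus, prompt):
    """∞-gram estimate with backoff: use the longest suffix of the prompt
    that still occurs (with a continuation) in the corpus.

    `corpus` and `prompt` are lists of token ids.
    """
    # Try suffixes of the prompt from longest to shortest.
    for start in range(len(prompt)):
        context = prompt[start:]
        continuations = Counter()
        denominator = 0
        for i in range(len(corpus) - len(context)):
            if corpus[i:i + len(context)] == context:
                denominator += 1                        # context occurrences
                continuations[corpus[i + len(context)]] += 1
        if denominator > 0:
            # numerator / denominator for every observed next token
            return {tok: c / denominator for tok, c in continuations.items()}
    # Nothing matched (or empty prompt): fall back to unigram frequencies.
    return {tok: c / len(corpus) for tok, c in Counter(corpus).items()}
```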

10 tebibytes is, of course, a remarkably small size for an infini-gram model over 1.4 trillion tokens (a classic 5-gram model would weigh 28 tebibytes). To achieve this, the authors built a special data structure – a suffix array. Watch closely: for a token array of length N, take all N of its suffixes and sort them lexicographically. Position 0 of the suffix array then holds the position in the original array where the lexicographically first suffix starts, position 1 holds where the second one starts, and so on.
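
A minimal sketch of the idea (a naive sort rather than the linear-time construction the authors use; fine for toy corpora only):

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes of `tokens`,
    ordered by the suffixes themselves (lexicographically)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

# Toy corpus of token ids
corpus = [5, 3, 5, 3, 7]
sa = build_suffix_array(corpus)
# sa == [1, 3, 0, 2, 4]: e.g. sa[0] == 1 because the suffix starting at
# position 1, [3, 5, 3, 7], is lexicographically the smallest.
```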

The suffix array is built in time linear in the length of the tokenized corpus. On RedPajama with 1.4 trillion tokens, the build took 56 hours on 80 CPUs with 512 GiB of RAM.

To compute an n-gram probability, you need to count the occurrences of a "needle" (a given string of tokens) in the "haystack" (the corpus). And since the suffix array is an ordered list of all suffixes, all occurrences of a given needle correspond to one contiguous segment of the suffix array. So it is enough to find the first and last entries of that segment and take the difference. This takes time logarithmic in the size of the array and linear in the length of the search string.
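
A sketch of that lookup on top of the toy suffix array from above (the helper name is mine; the `key=` argument of bisect requires Python 3.10+):

```python
from bisect import bisect_left, bisect_right

def count_needle(corpus, sa, needle):
    """Count occurrences of `needle` (a list of token ids) in `corpus`
    using the precomputed suffix array `sa`.

    All suffixes that start with `needle` occupy one contiguous segment
    of `sa`, so two binary searches find its boundaries.
    """
    n = len(needle)
    key = lambda i: corpus[i:i + n]   # compare only the first n tokens of each suffix
    lo = bisect_left(sa, needle, key=key)
    hi = bisect_right(sa, needle, key=key)
    return hi - lo

# Continuing the toy example: [5, 3] occurs twice in [5, 3, 5, 3, 7], so
# count_needle(corpus, sa, [5, 3]) == 2
```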

The authors attached ∞-grams to large language models via linear interpolation and evaluated perplexity on the Pile dataset. On GPT-2, GPT-Neo and LLaMA-2 the improvements turned out to be very convincing – from 12% to 42% on the test set. The authors add, however, that the especially noticeable improvement for GPT-2 may be due to the fact that it was the only one of these models not trained on the Pile.
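
The interpolation itself is the simplest possible mixing of two distributions. A sketch under my own assumptions (a fixed weight `lam`; the paper tunes the interpolation weights rather than hard-coding them):

```python
import numpy as np

def mix_with_infgram(p_llm, p_infgram, lam=0.5):
    """Linear interpolation of a dense LLM next-token distribution with a
    sparse ∞-gram estimate.

    `p_llm` is a probability vector over the whole vocabulary,
    `p_infgram` maps token ids to ∞-gram probabilities (zeros omitted),
    `lam` is the weight given to the ∞-gram component.
    """
    mixed = (1.0 - lam) * np.asarray(p_llm, dtype=np.float64)
    for tok, p in p_infgram.items():
        mixed[tok] += lam * p
    return mixed   # still sums to 1 if both inputs are proper distributions
```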

One more important note from the authors: ∞-grams can significantly improve LLM perplexity, but in open-ended text generation they may not only fail to help but actively get in the way. The ∞-gram can, for example, pull in completely irrelevant tokens, and the model will produce nonsense. So the authors are not trying to replace LLMs at all, but they do plan to explore ways of using the two together.

More of our AI article reviews on the Pro AI channel.
