A Bird's Eye View of the Evolution of Language Models for Text Generation
In this article, I would like to share my notes on how language models (LMs) have evolved over the past decades. This text can serve as a tutorial for beginners and help them understand the key concepts of language models throughout their history. It is worth noting that I do not go into implementation details or mathematical derivations, but the level of description is sufficient for a proper understanding of the evolution of LMs.
What is language modeling?
In a broad sense, language modeling is the process of formalizing a language, in particular natural language, to make it machine-readable and processable in various ways. Thus, it concerns not only text generation but also language representation.
Thanks to generative AI, the most common association with "language modeling" today is text generation. This is why my article considers the evolution of language models exclusively from the perspective of text generation.
N-gram based language models
Although the foundations of n-gram language models were laid in the mid-20th century, their widespread adoption began in the 1980s and 1990s.
N-gram language models rely on the Markov assumption, which, in the context of language modeling, states that the probability of the next word in a sequence depends only on the previous word(s). Therefore, the approximation of the probability of a word given its context in an n-gram model can be formalized as follows:

P(w_t | w_1, …, w_(t-1)) ≈ P(w_t | w_(t-N+1), …, w_(t-1))
where t is the number of words in the whole sequence and N is the size of the context (unigram (1), bigram (2), etc.). Now the question arises: how do we calculate these n-gram probabilities? The simplest approach is to count n-gram occurrences in a large text corpus (no labels are needed, i.e. the counting is unsupervised):

P(w_t | w_(t-N+1), …, w_(t-1)) = count(w_(t-N+1), …, w_t) / count(w_(t-N+1), …, w_(t-1))
Obviously, estimating probabilities with the formula above may seem naive. What if the numerator, or even the denominator, is zero? This is why more advanced probability estimation methods use smoothing or backoff (for example, additive smoothing, stupid backoff, Kneser-Ney smoothing). We will not discuss these methods here, but the conceptual approach to probability estimation does not change with any smoothing or backoff method. The following example shows a high-level representation of an n-gram language model:
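To make the count-based estimate concrete, here is a minimal sketch of a bigram (N=2) model with additive (Laplace) smoothing. The toy corpus and the `alpha` parameter are illustrative assumptions, not taken from any specific implementation:

```python
from collections import Counter

# Toy corpus; a real model would be trained on a much larger one.
corpus = "the cat sat on the mat the cat ate the fish".split()
vocab = set(corpus)

# Count unigrams and bigrams in a single pass over the corpus.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word, alpha=1.0):
    """P(word | prev) with additive (Laplace) smoothing:
    (count(prev, word) + alpha) / (count(prev) + alpha * |V|)."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(vocab))

# "the" is followed by "cat" in 2 of its 4 occurrences, so the
# smoothed estimate (2+1)/(4+7) ≈ 0.273 sits a bit below the raw 2/4.
print(round(bigram_prob("the", "cat"), 3))  # → 0.273
```

Note that additive smoothing guarantees a non-zero probability for every word in the vocabulary, so unseen bigrams no longer break the model.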
Given the "counter" data, how can we generate text from such a language model? In fact, the answer to this question applies to all the language models discussed below. The process of choosing the next word according to the probability distribution produced by a language model is called sampling. Here are some sampling strategies applicable to n-gram models:
greedy sampling – choose the word with the highest probability;
random sampling – select a random word according to the probability distribution.
Language models based on feedforward neural networks
Despite smoothing and backoff techniques, probability estimation in n-gram language models is still intuitively too simple for modeling natural language. A breakthrough approach proposed by Yoshua Bengio and his co-authors in 2000 was very simple yet innovative: what if, instead of n-gram "counters", we use neural networks to estimate word probabilities? Although the paper notes that recurrent neural networks (RNNs) can also be used for this task, its focus is on the feedforward neural network (FFNN) architecture.
The FFNN architecture proposed by Bengio is a simple multi-class classifier (the number of classes is equal to the vocabulary size V). Training is based on the task of predicting a missing word w in a sequence of context words c: P(w|c), where |c| is the size of the context window. The FFNN architecture proposed by Bengio et al. is shown below:
Such FFNN-based language models can be trained on large text corpora in an unsupervised manner (i.e. no explicitly labeled dataset is required).
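As an illustration, here is a rough sketch of the forward pass of such a classifier in plain Python: concatenated context embeddings, a tanh hidden layer, and a softmax over the vocabulary. All sizes are made-up toy values, the weights are random and untrained, and biases as well as Bengio's direct input-to-output connections are omitted for brevity:

```python
import math
import random

random.seed(0)
V = 5   # vocabulary size
D = 3   # embedding dimension
C = 2   # context window size
H = 4   # hidden layer size

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(V, D)        # word embedding table
W_h = rand_matrix(C * D, H)  # context -> hidden weights
W_o = rand_matrix(H, V)      # hidden -> output weights

def matvec(x, W):
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def next_word_probs(context_ids):
    # 1. Look up and concatenate the embeddings of the context words.
    x = [v for i in context_ids for v in E[i]]
    # 2. Hidden layer with a tanh non-linearity.
    h = [math.tanh(v) for v in matvec(x, W_h)]
    # 3. Softmax over the whole vocabulary (a V-way classifier).
    return softmax(matvec(h, W_o))

probs = next_word_probs([1, 3])
print(len(probs), round(sum(probs), 6))  # → 5 1.0
```

Training would adjust E, W_h, and W_o by gradient descent on the cross-entropy of the predicted word; a useful side effect is that the learned embedding table E places similar words close together.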
As for sampling, besides the greedy and random strategies, there are two more that can be applied to neural network-based language models:
top-k sampling — the same as random sampling, but restricted to the k most probable words (the softmax is renormalized over these top-k words),
nucleus sampling (top-p) — the same as top-k, but instead of a fixed number k, the smallest set of words whose cumulative probability reaches a threshold p is used.
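A minimal sketch of both strategies, again over a hypothetical next-word distribution (the words and probabilities are invented for illustration):

```python
import random

def top_k_sample(dist, k, rng=random):
    # Keep only the k most probable words and renormalize over them.
    top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words, probs = zip(*top)
    return rng.choices(words, weights=probs, k=1)[0]

def nucleus_sample(dist, p, rng=random):
    # Keep the smallest set of most probable words whose
    # cumulative probability reaches the threshold p.
    items = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for word, prob in items:
        kept.append((word, prob))
        total += prob
        if total >= p:
            break
    words, probs = zip(*kept)
    return rng.choices(words, weights=probs, k=1)[0]

dist = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "tree": 0.05}
print(top_k_sample(dist, k=2))      # "cat" or "dog"
print(nucleus_sample(dist, p=0.8))  # "cat" or "dog" (0.5 + 0.3 >= 0.8)
```

Unlike top-k with its fixed cutoff, the nucleus cutoff adapts to the shape of the distribution: a confident (peaked) distribution keeps few words, while a flat one keeps many.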
Language models based on recurrent neural networks
So far, we have assumed that the probability of the next word depends only on a fixed number of previous ones, i.e. on a fixed context or n-gram size. But what if the dependencies between distant words also matter? What if we want to consider the entire sequence of preceding words to predict the next one? This is exactly what recurrent neural networks (RNNs) can model naturally!
The natural advantage of RNNs is that they are able to capture dependencies across the entire word sequence by adding the output of the hidden layer from the previous step (t-1) to the input of the current step