AI models collapse when trained on recursively generated data

Earlier, we analyzed a paper by Leopold Aschenbrenner on the inevitability of AGI, which talked about the "Data Wall": there is only a finite amount of data on the Internet for training new AI models. One approach to getting past this wall is to create synthetic data, i.e. data generated by AI itself. Researchers from Oxford and Cambridge published a paper exploring whether such data can be used to train new models. Spoiler: the models broke.

Terms used

Expressivity — the breadth of ideas that can be represented and communicated. The more expressive a language is, the more ideas it can represent. In other words, the expressive power of a language determines how well it is suited to describing different concepts and ideas. A highly expressive language can express complex ideas clearly and precisely, while a less expressive one may be limited in its ability to convey certain thoughts, or may require more verbose constructions to achieve the same goal.

Approximation — the process of estimating the value of a function for a given input using a simpler function that closely resembles the original. In other words, when the exact value of a function is difficult or impossible to compute directly, we can look for an approximate solution using another, simpler function. This simplified function should behave similarly and produce results close to those of the original function.

Abstract

The Stable Diffusion model has revolutionized the way we generate images from descriptive text. GPT-3.5 and GPT-4 have demonstrated strong performance across a wide range of tasks. When these models first appeared, it became clear that generative AI is not a passing phenomenon: it will (and already has) significantly change the way we produce text and images.

In this study, we look at what happens to generative models when LLMs are trained on data generated by the models themselves. We find that indiscriminate use of such data eventually leads to irreversible defects. We call this effect "model collapse" and show that it can occur in LLMs as well as in VAEs and GMMs. The phenomenon arises across all learned generative models, and we want to demonstrate that this problem should be taken seriously.

Main part

Developing an LLM is a complex process that requires a huge amount of training data. While today’s LLMs have been trained primarily on human-generated text, this may change in the future. If training data for future models is also collected from the internet, then a certain percentage of that data will be AI-generated content. In this article, we explore what happens if text generated by, say, ChatGPT makes up a large portion of the training data for subsequent models. What happens to GPT-{n} as n gets larger?

We find that indiscriminate training on data generated by other models leads to “model collapse,” a degenerative process in which the model forgets the true distribution of the data over time, even in the absence of a shift in the distribution. Over time, the model begins to lose information about the true distribution, initially manifested by the disappearance of “tails,” and the learned behavior converges over generations to a point estimate with very low variance. We also find that this process is unavoidable even in cases with near-perfect conditions for long-term learning, i.e., in the absence of function estimation error.

There are two concepts in the existing literature related to model collapse: catastrophic forgetting, which arises in task-free continual learning, and data poisoning, in which malicious data causes unintended behavior. Neither can fully explain the phenomenon of model collapse, since the conditions are fundamentally different, but they do provide another perspective on the observed phenomenon. We note that access to the original data distribution is crucial: in tasks where the "tails" of the underlying distribution matter, access to real human-generated data is necessary. In other words, widespread use of LLMs to publish content on the Internet will inevitably pollute the training data of their successors over time.

What is model collapse?

Model collapse is a degenerative process affecting generations of generative models, in which the data they generate ends up polluting the training set of the next generation. Trained on polluted data, the models misperceive reality. This process is depicted in the figure below. We distinguish two stages: early model collapse and late model collapse. In early model collapse, the model begins to lose information about the tails of the distribution; in late model collapse, it converges to a distribution that bears little resemblance to the original, often with significantly reduced variance.

Model collapse

This process occurs because of three specific sources of error that accumulate over generations and cause deviations from the original model:

  1. Statistical approximation error. This is the basic type of error that arises because the number of samples is finite and disappears as the number of samples tends to infinity. This occurs because of the non-zero probability of information loss at each resampling step.

  2. Functional expressivity error. This is a secondary type of error that occurs due to the limited expressiveness of the function approximator. In particular, neural networks are universal approximators only when their size tends to infinity. A simple example of an expressiveness error is an attempt to approximate a mixture of two Gaussian distributions with a single Gaussian distribution. Even if we have perfect information about the data distribution (i.e. an infinite number of samples), model errors are inevitable. However, in the absence of the other two types of errors, this can only happen in the first generation.

  3. Functional approximation error. This is a secondary type of error, arising mainly from limitations of the training procedures, such as structural biases in stochastic gradient descent or the choice of the objective function. This error can be viewed as arising in the limit of infinite data and perfect expressiveness at each generation.

Each of the above errors can contribute to model collapse to some degree. There are also other types of error; for example, in practice computers have limited floating-point precision. Now let us try to explain how these factors lead to the errors we observe, how the different sources accumulate, and how we can quantify the model's mean deviation.
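
As a small illustration of the expressivity error mentioned above, the following sketch fits a single Gaussian to samples drawn from a mixture of two Gaussians (the mixture parameters are arbitrary choices for illustration); even with a very large number of samples, a single Gaussian cannot represent the bimodal shape.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from a mixture of two Gaussians (parameters chosen arbitrarily for illustration).
n = 100_000
component = rng.random(n) < 0.5
samples = np.where(component, rng.normal(-3.0, 1.0, n), rng.normal(3.0, 1.0, n))

# The best single-Gaussian fit simply matches the sample mean and variance.
mu, sigma = samples.mean(), samples.std()
print(f"single-Gaussian fit: mu={mu:.2f}, sigma={sigma:.2f}")
# Even with effectively unlimited data, the fit (mu near 0, sigma near 3.2) puts most
# of its mass where the bimodal mixture has almost none: an expressivity error that
# more samples cannot remove.
```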

Mathematical justification

In this section, we provide a theoretical justification for the model collapse phenomenon and argue that the process is universal for generative models recursively trained on data generated by previous generations. We quantify the sources of error discussed in the previous section by examining two mathematical models that are simple enough to yield analytical expressions for the quantities of interest while still exhibiting model collapse: a discrete distribution in the absence of functional expressivity and approximation errors, and a multivariate Gaussian approximation that combines functional expressivity and statistical errors.

The general stochastic process we consider, which we call learning with generational data, is as follows. The dataset at generation $i$ is $\mathcal{D}_i$, consisting of independent and identically distributed random variables $X_j^i$ with distribution $p_i$, where $j \in \{1, \dots, M_i\}$ and $M_i$ denotes the size of the dataset. Going from generation $i$ to generation $i+1$, we aim to estimate the distribution of the samples in $\mathcal{D}_i$ with an approximation $p_{\theta_{i+1}} = \mathcal{F}_\theta(p_i)$; we call this step functional approximation. The dataset $\mathcal{D}_{i+1}$ is then generated by sampling from $p_{i+1} = \alpha_i p_{\theta_{i+1}} + \beta_i p_i + \gamma_i p_0$, with non-negative parameters $\alpha_i$, $\beta_i$, $\gamma_i$ whose sum equals 1; they represent the proportions of data used from different generations. This corresponds to mixing data coming from the original distribution ($\gamma_i$), data used by the previous generation ($\beta_i$), and data generated by the new model ($\alpha_i$). We call this the sampling step. For the mathematical models presented below, we consider $\alpha_i = \gamma_i = 0$, that is, only data from a single previous step is used, while the numerical experiments are carried out with more realistic parameter choices.
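
To make the notation concrete, here is a minimal numpy sketch of one version of this process, using a one-dimensional Gaussian as the functional approximator; the mixing weights and sample sizes are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def functional_approximation(data):
    # The functional-approximation step F_theta: here, fit a Gaussian by matching moments.
    return data.mean(), data.std()

def sampling_step(theta, prev_data, original_data, m, alpha, beta, gamma):
    # The sampling step: draw the next dataset from
    # p_{i+1} = alpha * p_theta + beta * p_i + gamma * p_0.
    n_new = int(alpha * m)
    n_prev = int(beta * m)
    n_orig = m - n_new - n_prev
    new = rng.normal(theta[0], theta[1], n_new)
    prev = rng.choice(prev_data, n_prev, replace=True)
    orig = rng.choice(original_data, n_orig, replace=True)
    return np.concatenate([new, prev, orig])

m = 1_000
original = rng.normal(0.0, 1.0, m)   # generation-0 data drawn from p_0
data = original

# alpha=1, beta=gamma=0 reproduces the purely self-consuming case analysed below.
for i in range(10):
    theta = functional_approximation(data)
    data = sampling_step(theta, data, original, m, alpha=1.0, beta=0.0, gamma=0.0)
    print(f"generation {i + 1}: mean={theta[0]:+.3f}, std={theta[1]:.3f}")
```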

Discrete distributions with exact approximation

In this subsection, we consider a discrete probability distribution in the absence of functional approximation and expressivity errors, that is, $\mathcal{F}(p) = p$. In this case, model collapse arises only from statistical errors at the sampling step. First, the tails (low-probability events) begin to disappear because of the low probability of sampling them, and over time the support of the distribution shrinks. Denoting the sample size by $M$, if we consider a state $i$ with probability $q \le \frac{1}{M}$, the expected number of samples with value $i$ coming from these events will be less than 1; in practice, this means that we lose information about them. Considering a more general state $i$ with probability $q$, using standard conditional probability we can show that the probability of losing the information (i.e., sampling no data at some generation) is $1 - q$, which implies that the distribution must converge to a delta function located at some state, with the probability of ending up at a particular state equal to the probability of sampling that state from the original distribution.

This can be shown directly by viewing the process $X^i \to \mathcal{F}_\theta \to p_{i+1} \to X^{i+1}$ as a Markov chain, since $X^{i+1}$ depends only on $X^i$. Furthermore, if all $X_j^i$ take the same value, then in the next generation the approximated distribution will be exactly a delta function, and hence all $X_j^{i+1}$ will also take the same value. This means that the Markov chain contains at least one absorbing state, and therefore it converges to one of the absorbing states with probability 1. For this chain, the only absorbing states are those corresponding to delta functions. As a result, as model collapse progresses, we are guaranteed to end up in a constant state, losing all information about the original distribution once the chain is absorbed. This argument also holds in general, because floating-point representations are discrete, making the Markov chain over the model parameters discrete. Thus, as long as the model parameterization allows delta functions, we will arrive at one, because due to sampling errors the only possible absorbing states are delta functions. Based on this discussion, we can see how both early model collapse, where only low-probability events are cut off, and late model collapse, where the process converges to a single mode, must occur for discrete distributions with perfect functional approximation.
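
A toy simulation of this discrete case (a hypothetical three-state distribution with a low-probability tail, resampled with the exact approximation $\mathcal{F}(p) = \hat{p}$) shows the chain being absorbed into a delta function:

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.60, 0.39, 0.01])  # original distribution p_0; the last state is the "tail"
m = 100                           # samples drawn at each generation

for generation in range(1, 2001):
    counts = rng.multinomial(m, p)
    p = counts / m                # exact approximation: the next p is the empirical distribution
    if p.max() == 1.0:            # absorbing state: all samples share one value (a delta function)
        print(f"absorbed into state {p.argmax()} at generation {generation}")
        break

# Typically the tail state vanishes within the first few generations, and the chain is
# later absorbed into one of the two remaining states.
```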

Gaussian Model Collapse

Assume that the original data are drawn from a distribution $\mathcal{D}_0$ (not necessarily Gaussian) with non-zero sample variance. Suppose that $X^n$ are fitted recursively using unbiased estimates of the sample mean and variance from the previous generation, $X_j^n \mid \mu_n, \Sigma_n \sim \mathcal{N}(\mu_n, \Sigma_n)$, with a fixed sample size. Then,

$$\mathbb{E}\left[\mathbb{W}_2^2\big(\mathcal{N}(\mu_n, \Sigma_n),\, \mathcal{D}_0\big)\right] \to \infty; \qquad \Sigma_n \xrightarrow{\text{a.s.}} 0 \quad \text{as } n \to \infty,$$

where $\mathbb{W}_2$ denotes the Wasserstein-2 distance between the true distribution and its approximation at generation $n$.

In other words, this means that not only does the n-th generation approximation deviate arbitrarily far from the original, but it also collapses to zero variance with probability 1 as the number of generations increases. The results are very similar to those observed in the discrete case, with this theorem illustrating the effect of late model collapse, when the process begins to collapse to zero variance. Early model collapse can also be seen, and the interested reader is referred to the supplementary material for a more detailed discussion.
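
The variance collapse predicted by this theorem is easy to reproduce numerically: repeatedly estimate the mean and variance from a finite sample and resample from the fitted Gaussian. The sample size and number of generations below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 100                          # fixed sample size per generation
data = rng.standard_normal(m)    # generation-0 data from D_0 = N(0, 1)

for n in range(1, 1001):
    mu = data.mean()                   # sample mean
    sigma = data.std(ddof=1)           # square root of the unbiased sample variance
    data = rng.normal(mu, sigma, m)    # X^n | mu_n, Sigma_n ~ N(mu_n, Sigma_n)
    if n % 200 == 0:
        print(f"generation {n:4d}: mu={mu:+.3f}, sigma={sigma:.4f}")

# The theorem above predicts that sigma tends to 0 almost surely while the expected
# Wasserstein distance to D_0 diverges, as mu drifts like a random walk.
```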

Model Collapse in Language Models

Model collapse is universal across different families of machine learning models. However, while small models like GMMs and VAEs are typically trained from scratch, large language models (LLMs) are a different story. They are so expensive to retrain from scratch that they are typically initialized from pre-trained models such as BERT, RoBERTa, or GPT-2, which are trained on large text corpora, and then fine-tuned for various downstream tasks.

We investigated what happens to language models when they are successively fine-tuned on data generated by other models. We could easily reproduce all the experiments described in this paper with larger language models trained from scratch to demonstrate model collapse, but given that training a single moderately large model produces twice as much CO2 as one human produces in a lifetime, we decided not to run such an experiment and instead focus on a more realistic setting for a proof of concept. Note that even the language experiments described here took several weeks to complete.

We evaluate the most common setting for training language models: fine-tuning, in which each training cycle starts from a pre-trained model and uses recent data, and the data come from another fine-tuned pre-trained model. Since training here is constrained to produce models close to the original pre-trained model, and the data points generated by the models typically yield very small gradients, we can expect the model to change only moderately after fine-tuning. We fine-tuned the OPT-125m language model released by Meta via Hugging Face.

We fine-tuned it on the wikitext2 dataset. To generate data from the trained models, we used 5-way beam search. We limited the training sequences to 64 tokens; then, for each token sequence in the training set, we asked the model to predict the next 64 tokens. We went through the entire original training dataset and produced a synthetic dataset of the same size. Each experiment was run five times, and the results are shown as five separate runs with different random seeds. The original model fine-tuned on real wikitext2 data reaches an average perplexity of 34, down from the zero-shot baseline of 115, meaning it successfully learns the task. We then considered two fine-tuning setups:

  • Five epochs, no original training data. Here the model is fine-tuned for five epochs, starting from the original dataset but without keeping any of the original data for subsequent runs. The overall performance on the original task is shown in the figure below. We found that training on generated data allows adaptation to the underlying task while losing some performance, from 20 to 28 perplexity points.

  • Ten epochs, 10% of original training data retained. Here the model is fine-tuned for ten epochs, and 10% of the original data points are randomly sampled and included at each new training generation. The overall performance on the original task is shown in the figure below. We found that retaining part of the original data allows for better fine-tuning and leads to only a minor drop in performance.

Both training regimes led to degraded performance of our models, but we found that learning from generated data is possible and the models can successfully learn (some of) the underlying task. Importantly, the observed behavior is consistent with the theoretical justification in the Mathematical justification section.
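
For readers who want to experiment, the following sketch outlines one generation of the pipeline described above using the Hugging Face transformers and datasets libraries. It is an illustration under assumptions (model loading, subset size, decoding details), not the authors' actual training code; only the beam width of 5 and the 64-token blocks come from the text.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_synthetic_corpus(model, texts, block=64):
    """For each 64-token prefix from the training set, ask the model to predict the next 64 tokens."""
    synthetic = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=block)
        with torch.no_grad():
            out = model.generate(**inputs, num_beams=5, max_new_tokens=block, do_sample=False)
        synthetic.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return synthetic

wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [t for t in wikitext["text"] if t.strip()][:1000]  # small subset to keep this cheap
synthetic_texts = generate_synthetic_corpus(model, texts)

# The next generation would then be fine-tuned on `synthetic_texts` (optionally mixed
# with 10% of the original data), and the generate-then-fine-tune loop repeated.
```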

Example

Below we give an example of how the responses of the OPT-125m model, which is subject to model collapse, deteriorate from generation to generation.

  • Input: some started before 1360 — was typically accomplished by a master mason and a small team of itinerant masons, supplemented by local parish laborers, according to Poyntz Wright. But other authors reject this model, suggesting instead that leading architects designed the parish church towers based on early examples of Perpendicular.

  • Generation 0: Revival architecture such as St. John's Cathedral in London. The earliest surviving example of Perpendicular Revival architecture is found in the 18th @-@ century Church of Our Lady of Guernsey, which dates from the late 19th century. There are two types of perpendicular churches: those.

  • Generation 1: architecture such as St. Peter's Basilica in Rome or St. Peter's Basilica in Buenos Aires. There is no evidence that any of these buildings were built during the reign of Pope Innocent III, but it is possible that they may have been built during the reign of his successor, Pope Innocent.

  • Generation 5: ism, which had been translated into more than 100 languages including English, French, German, Italian, Spanish, Portuguese, Dutch, Swedish, Norwegian, Polish, Hungarian, Slovak, Lithuanian, Estonian, Finnish, Romanian, Bulgarian, Turkish, Croatian, Serbian, Ukrainian, Russian, Kazakh, Kyrgyz.

  • Generation 9: architecture. In addition to being home to some of the world's largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.

We found that the data generated by the language models in our experiments ended up containing a large number of repeated phrases, as in the example above. The repetition problem is observed in almost all generative models, and to rule it out as a cause of model collapse, we additionally ran experiments in which the models were explicitly encouraged to produce non-repetitive sequences with a repetition penalty of 2.0. We found that this forced the models to produce continuations with fewer repetitions, which in turn led to worse performance in subsequent models, in particular a doubling of perplexity compared to the original results. The models remained just as susceptible to model collapse, if not more so.
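
In the Hugging Face generation API, such a penalty corresponds to the repetition_penalty argument of generate; a minimal sketch (with an arbitrary prompt) might look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

prompt = "some started before 1360"  # arbitrary prompt for illustration
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,
    max_new_tokens=64,
    repetition_penalty=2.0,  # penalise tokens that have already been generated
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```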

The process described demonstrates that fine-tuning of language models does not constrain the effects of model collapse, and models that undergo fine-tuning are also vulnerable.

The source code for all experiments can be viewed here. The original, untranslated paper can be read here.
