Data selection, small language models, and what Schmidhuber has to do with it

Large language models are good, but one wonders whether comparable quality can be achieved with small models. You cannot hold a dialogue with GPT-2, let alone have it write a thesis or a scientific article. It and other small language models (SLMs) mostly produce weak text, even if you train them on all of Wikipedia.

Perhaps it is worth recalling here the theory of Schmidhuber, who, as everyone knows, invented everything.

Since 1990 he has been developing a theory of creativity and intrinsic motivation. Formally it applies not only, and not even primarily, to machine learning, but to any creative process. The main ingredient of Schmidhuber's theory is intrinsic motivation driven by interest: the reward mechanism is not external but internal, depending on the learning agent itself. The essence is that it is only interesting to learn something that yields a new pattern or opens up a new skill (that is the reward). If an action brings nothing new, it is uninteresting and earns no reward.
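
A minimal, self-contained sketch of this idea with a toy character-level bigram model (purely illustrative, not code from Schmidhuber's papers): the intrinsic reward for a piece of data is the learning progress it produces, that is, how much the model's prediction loss on that data drops after learning from it.

```python
import math
from collections import defaultdict

class BigramModel:
    """Toy character-level bigram model with add-one smoothing."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def nll(self, text):
        """Average negative log-likelihood (nats per character pair)."""
        total, n = 0.0, 0
        for a, b in zip(text, text[1:]):
            row = self.counts[a]
            p = (row[b] + 1) / (sum(row.values()) + 256)  # smoothing over a nominal 256-char alphabet
            total += -math.log(p)
            n += 1
        return total / max(n, 1)

    def update(self, text):
        for a, b in zip(text, text[1:]):
            self.counts[a][b] += 1

def intrinsic_reward(model, text):
    """Schmidhuber-style curiosity: reward = learning progress,
    i.e. how much the loss on the data drops after learning from it."""
    before = model.nll(text)
    model.update(text)
    after = model.nll(text)
    return before - after

model = BigramModel()
print(intrinsic_reward(model, "abababababab"))  # new, learnable pattern -> clearly positive reward
print(intrinsic_reward(model, "abababababab"))  # already familiar -> much smaller reward
```

Data that is either already known or pure noise yields little or no reward, which is exactly the "uninteresting" case described above.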

Returning to language models: trying to train a small model on a huge dataset is much like handing a calculus textbook to a preschooler. Nothing new can be learned from it, so there is no reward. The result is predictably bad: the child cannot retell a single theorem, and the model produces incoherent text. Following this logic, a small model needs a small (but high-quality!) dataset: one it can digest, one that, in Schmidhuber's terms, it will find interesting.

There are at least two approaches that can support this kind of internal motivation. The first is to generate the required dataset with an LLM. The second is to select the desired data from a raw corpus according to given rules. In both cases the small dataset must preserve the main characteristics of the language: grammar, logic, structure. Yes, in both cases the resulting model's abilities will be lexically limited to that small dataset, but high-quality within those limits.

A high-profile example of the first approach was demonstrated at Microsoft in April of this year. Using an "adult" GPT, they created a corpus of stories for 3-4 year old children. Training an SLM on this dataset, called TinyStories, takes less than a day on a single GPU. The authors trained several networks with small parameter counts (starting from 1 million) and compared them with GPT-2 XL (1.5 billion) on the same prompts.
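
Roughly, such a corpus can be produced by asking a strong "teacher" model to write stories constrained to a small vocabulary. Below is a hedged sketch using the OpenAI Python client; the prompt wording, word list and model name are illustrative assumptions, not the exact setup from the TinyStories paper.

```python
import random
from openai import OpenAI  # assumes the openai>=1.0 Python client

# Illustrative word list: forcing each story to contain a few randomly chosen
# simple words keeps the corpus lexically small but still diverse.
SIMPLE_WORDS = ["cat", "ball", "happy", "jump", "tree", "rain", "friend", "big"]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_story() -> str:
    words = random.sample(SIMPLE_WORDS, 3)
    prompt = (
        "Write a short story that a 3-4 year old child would understand, "
        "using only simple words. The story must contain the words: "
        + ", ".join(words) + "."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable "teacher" model; the name is just an example
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return resp.choices[0].message.content

corpus = [generate_story() for _ in range(10)]  # scale up to millions of stories
```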

The largest SLM trained on TinyStories (80 million parameters) achieves almost perfect grammar and consistency scores, although it is inferior to GPT-4 in creativity (the scoring is done by GPT-4). The authors pursued the same approach further in phi-1 and phi-1.5, which were trained on high-quality content from The Stack and StackOverflow, on GPT-generated textbooks, and on a set of Python problems with solutions. The larger model (1.3 billion parameters) performs no worse than Meta's Llama-7b.

The second possible approach is to select high-quality data from a raw dataset with the desired outcome in mind. That is what they did at Stanford. The authors proposed the Data Selection with Importance Resampling (DSIR) framework. To select good data from a large corpus (they took The Pile: 890 GB of text from scientific articles, Wikipedia, forums and news), a small target dataset with the required "quality" of language is used. The advantage here is that the target language can be anything (not just, say, stories for 3-4 year olds). The training dataset is selected from the source corpus so that it reduces the Kullback-Leibler divergence to the target dataset compared with a random sample.
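
A simplified sketch of that idea (not the authors' implementation): represent each document by hashed n-gram counts, fit cheap bag-of-n-grams models of the target and the raw corpus, score each raw document by its log importance weight, and resample in proportion to it.

```python
import hashlib
import numpy as np

N_BUCKETS = 10_000  # size of the hashed n-gram feature space

def hashed_ngram_counts(text, n_buckets=N_BUCKETS):
    """Bag of hashed word unigrams and bigrams."""
    counts = np.zeros(n_buckets)
    words = text.lower().split()
    for ng in words + [" ".join(p) for p in zip(words, words[1:])]:
        h = int(hashlib.md5(ng.encode()).hexdigest(), 16) % n_buckets
        counts[h] += 1
    return counts

def fit_bag_of_ngrams(docs):
    """Smoothed categorical distribution over hash buckets."""
    total = np.ones(N_BUCKETS)  # add-one smoothing
    for d in docs:
        total += hashed_ngram_counts(d)
    return total / total.sum()

def select(raw_docs, target_docs, k):
    p_target = fit_bag_of_ngrams(target_docs)
    p_raw = fit_bag_of_ngrams(raw_docs)
    log_ratio = np.log(p_target) - np.log(p_raw)
    log_w = np.array([hashed_ngram_counts(d) @ log_ratio for d in raw_docs])
    # Gumbel-top-k: sample k documents without replacement with probability ~ exp(log_w)
    noisy = log_w + np.random.gumbel(size=len(raw_docs))
    return [raw_docs[i] for i in np.argsort(-noisy)[:k]]
```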

A similar, but model-driven, idea was proposed at Google. There, the quality of a sample is assessed by its self-influence (SI) score, that is, by how much the sample affects the model's own performance. Using SI, the authors filter noise out of the original dataset. First, the samples with the better SI scores are given priority and training runs on them; at the second stage the priority is removed.
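
Self-influence can be approximated, for example, in the TracIn spirit: the influence of a training sample on its own loss, accumulated over saved checkpoints. The sketch below illustrates that approximation and the two-stage schedule; it is an assumption about the mechanics, not the exact procedure from the Google paper.

```python
import torch

def self_influence(model, loss_fn, sample, checkpoints, lr=1e-3):
    """TracIn-style self-influence: sum over checkpoints of lr * ||grad loss(z)||^2.
    `sample` is a batched (x, y) pair; `checkpoints` is a list of saved state_dicts."""
    x, y = sample
    score = 0.0
    for state in checkpoints:
        model.load_state_dict(state)
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
        score += lr * sum(g.pow(2).sum().item() for g in grads)
    return score

# Two-stage use, as described above: first train only on the samples whose SI
# scores look clean (e.g. below a threshold), then drop the restriction.
```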

Almost quoting Schmidhuber, authors from Oxford accelerated training by making sure the model does not waste time on data from which nothing can be learned or which has already been "covered". In other words, it almost literally ignores the sources that Schmidhuber would call uninteresting. With this "training program", the same accuracy is reached 18 times faster, and the peak accuracy is 2% higher.
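
The selection rule can be sketched as keeping, from each large candidate batch, the points whose current training loss most exceeds the loss of a small reference model trained on held-out data, so that unlearnable noise (which no amount of training fixes) is not picked up. A schematic reconstruction, not the authors' code:

```python
import torch

def select_batch(model, holdout_model, loss_fn, xb, yb, keep_frac=0.1):
    """loss_fn must return per-sample losses (reduction='none')."""
    with torch.no_grad():
        train_loss = loss_fn(model(xb), yb)           # what the model still gets wrong
        irreducible = loss_fn(holdout_model(xb), yb)  # what no training on this data will fix
        reducible = train_loss - irreducible
    k = max(1, int(keep_frac * len(xb)))
    idx = torch.topk(reducible, k).indices  # learnable, worth learning, not yet learnt
    return xb[idx], yb[idx]
```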

We think a fuller implementation of Schmidhuber's approach to curriculum learning for language models can be expected soon. After all, there is a simple and intuitive measure of how difficult a text is for a language model, namely its perplexity on that text, and building a curriculum around it may well speed up training.
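
A hedged sketch of what such a perplexity-based curriculum could look like with a HuggingFace-style causal language model; ordering the data from "easy" to "hard" texts is our assumption here, not a published recipe.

```python
import math
import torch

def perplexity(model, tokenizer, text, device="cpu"):
    """Perplexity of `text` under a HuggingFace-style causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return math.exp(loss.item())

def curriculum(docs, model, tokenizer):
    """Order documents from lowest to highest perplexity ("easy" to "hard")."""
    return sorted(docs, key=lambda d: perplexity(model, tokenizer, d))
```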

More of our AI reviews are on the Pro AI channel.
