From Pre-training to Instruction Fine-tuning

This is a translation of my article about training language models, originally published on medium.com. A year ago I prepared a short study on language models, and to consolidate the theory in practice I began fine-tuning large language models (LLMs) for various applied tasks. Initially I got contradictory results, which prompted me to study the training theory in more depth. In this article I describe that theoretical research and supplement it with a number of examples from personal experience.

Open-source models are trained on a variety of datasets, and checkpoints from different stages of training are often published, so the success of fine-tuning depends on choosing the right base model. In practice, the pipeline for training large language models consists of several fixed stages. The first stage is pre-training on a massive text corpus with the task of predicting the next token; at this stage the model learns a model of the language or languages. Supervised fine-tuning follows on task-specific query-response pairs; one of the most common goals at this stage is teaching the model to respond to queries in a chat format. Finally, the model is tuned to user preferences using Reinforcement Learning from Human Feedback (RLHF), also called instruct fine-tuning. A good description of this process is the technical report on the Qwen language model [1]: the authors describe the training stages and publish three models: a pre-trained base model, a supervised fine-tuned chat model, and a model further trained on user preferences with RLHF.

Transformer Architecture: Tokenization and Positional Encoding

Tokenization is a fundamental part of a language model; it reminds me of Noam Chomsky's generative grammar. Chomsky proposed splitting a sentence into tokens and building a graph describing the grammatical relations within the sentence. In the Transformer architecture, the attention mechanism acts as an effective detector of token interdependencies. Researchers from Stanford and Facebook AI analyzed the attention mechanism in transformers [2] and found that different attention heads specialize in different types of relationships. For example, one head may focus on the relationship between verbs and objects, another on the relationship between objects and prepositions, as shown in Figure 1.

Figure 1. Specialization of attention heads.

Currently, one of the most common tokenization techniques is compression with Byte Pair Encoding (BPE). It was applied to Neural Machine Translation for efficient tokenization of out-of-vocabulary words [3]. The tokenizer is trained on a text corpus: the most frequent words become individual tokens, less frequent words are tokenized as combinations of existing tokens, or, in the worst case, as sequences of individual character tokens. If a language is rare or underrepresented in the corpus, it will also be underrepresented in the tokenizer vocabulary, which can affect the quality of the text the model generates in that language. After tokenization come the embedding layer, positional encoding, and attention layers with normalization. There is a detailed article about the decoder-only variant of the Transformer architecture [4]. The best-known decoder architectures are LLaMA and GPT, BERT is an example of an encoder-only architecture, and T5 is a classic encoder-decoder architecture. Generation proceeds as a series of steps, each of which computes a probability for every possible next token in the vocabulary. To select a specific token for the response, greedy decoding or beam search can be used, and the generation parameters can significantly affect the diversity of the result.
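
A minimal sketch of these two steps with the transformers library is shown below; the gpt2 checkpoint is used only as a small illustrative example of a BPE tokenizer and a decoder-only model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # a BPE tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Rare words are split into subword units by the BPE tokenizer.
print(tokenizer.tokenize("Tokenization of rare words"))

inputs = tokenizer("The Transformer architecture", return_tensors="pt")

# Greedy decoding picks the most probable token at every step,
# while beam search keeps several candidate sequences in parallel.
greedy = model.generate(**inputs, max_new_tokens=20, do_sample=False)
beams = model.generate(**inputs, max_new_tokens=20, num_beams=4)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(beams[0], skip_special_tokens=True))
```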

Language modeling (Causal and Masked Language Modeling)

The decoder is trained with only the previous tokens as input. Its key property is autoregressive prediction of the next token, which follows the causal structure of language. This is functionally similar to a recurrent neural network (RNN), but with a simpler, parallelizable training procedure that does not require backpropagation through time. There is a paper on the ability of decoders to model RNNs [5].
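
The causal language-modeling objective boils down to shifting the inputs by one position to obtain the targets. A minimal sketch with random tensors (the vocabulary size is GPT-2's, purely for illustration):

```python
import torch
import torch.nn.functional as F

vocab_size = 50257                                  # GPT-2 vocabulary size, for illustration
logits = torch.randn(1, 6, vocab_size)              # decoder outputs: (batch, sequence, vocab)
input_ids = torch.randint(0, vocab_size, (1, 6))    # the tokenized training sequence

shift_logits = logits[:, :-1, :]                    # predictions for positions 1..5
shift_labels = input_ids[:, 1:]                     # targets are simply the next tokens
loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1))
print(loss)
```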

BERT (Bidirectional Encoder Representations from Transformers) is an encoder architecture. It is pre-trained without supervision by masking words in the input [6]; the second stage is supervised fine-tuning, pioneered by OpenAI [7]. They explain that pre-training a base language model is computationally expensive, while the subsequent supervised fine-tuning for specific tasks is much cheaper. Among other things, BERT outperformed OpenAI's GPT model at the time the paper was published in 2018. Ultimately, the GPT decoder architecture, with its autoregressive and causal properties, showed better results for text generation, while the encoder architecture is widely used for text classification, named entity recognition, and similar tasks [8]. Figure 2 shows a visualization of the different Transformer architectures: decoder, encoder, and encoder-decoder.

Figure 2. Hierarchy of transformer architectures.
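
The masked-language-modeling objective described above can be tried directly with the transformers fill-mask pipeline; the bert-base-uncased checkpoint is just one example.

```python
from transformers import pipeline

# BERT predicts the token hidden behind [MASK] from the bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```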

With the advent of the GPT and BERT architectures, research on language models split into two directions: pre-training a model to capture the relationships between tokens, and supervised fine-tuning for a specific applied task, such as text generation based on a chat template, text classification, paraphrasing, etc.

Supervised fine-tuning

In the OpenAI paper [7], supervised fine-tuning is proposed to adapt the model to specific applied tasks, such as answering questions, determining the relationship between two text sequences, semantic similarity of text sequences, text summarization, etc. Fine-tuning is carried out on prepared question-answer pairs: the question tokens are masked out of the loss, and the model is trained to predict the next token of the answer. I ran fine-tuning experiments with models such as Mistral and LLaMA, which out of the box tend toward abstractive summarization. Among other things, I found that the mosaicml/mpt-7b model tends toward extractive summarization. I fine-tuned it using the Hugging Face libraries TRL, transformers, peft, and bitsandbytes on the Samsung/samsum dataset, which is essentially an abstractive summarization dataset. It was easy to achieve a significant improvement on the abstractive summarization task after just one epoch.
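
A minimal sketch of this label masking is shown below; the gpt2 tokenizer and the prompt/answer strings are purely illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # any causal tokenizer works for the illustration

prompt = "Summarize the dialogue:\n...\nSummary:"
answer = " The meeting was moved to Friday."

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + answer_ids
# Prompt tokens get the label -100, which the cross-entropy loss ignores,
# so only the answer tokens act as next-token targets.
labels = [-100] * len(prompt_ids) + answer_ids
```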

The transformers library class for running a causal LM is AutoModelForCausalLM, the LoRA fine-tuning configuration class in peft is LoraConfig, and the trl class that runs the training is SFTTrainer. There is also a good practical example of fine-tuning Phi-2 on a medical dataset [9].
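
Below is a minimal sketch of how these classes fit together on the Samsung/samsum dataset; the exact argument names vary between trl versions, and the prompt format and hyperparameters are illustrative assumptions rather than the settings I used.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b", device_map="auto", trust_remote_code=True
)

# Format each dialogue/summary pair as a single training string.
dataset = load_dataset("Samsung/samsum", split="train")
dataset = dataset.map(
    lambda ex: {"text": f"Summarize the dialogue:\n{ex['dialogue']}\nSummary: {ex['summary']}"}
)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    args=SFTConfig(
        output_dir="mpt-7b-samsum-lora",
        dataset_text_field="text",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
    ),
)
trainer.train()
```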

Reinforcement Learning from Human Feedback (RLHF)

In 2022, OpenAI made significant progress with the publication of the InstructGPT model [10]. Their goal was to align language models with user intent, making them more useful, accurate, and safe. The approach was simple: multiple answers to the same question were collected and presented to humans for ranking (Figure 3). A reward model is trained on these rankings, and the language model is then optimized against it, with a Kullback-Leibler penalty keeping the updated policy close to the original supervised model; this signal is propagated to the trained model in the form of weight updates.

Figure 3. Human feedback.
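
As a rough sketch of that objective (assuming per-token log-probabilities for one sampled response), the reward model's score is combined with a KL penalty toward the frozen reference model; this is a toy illustration, not the InstructGPT implementation.

```python
import torch

def rlhf_objective(reward_score: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   reference_logprobs: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    # Estimate of the KL divergence between the tuned policy and the frozen
    # reference (SFT) model for one sampled response, summed over its tokens.
    kl_penalty = (policy_logprobs - reference_logprobs).sum()
    # The quantity maximized during RLHF: reward minus the KL penalty.
    return reward_score - beta * kl_penalty
```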

In 2023, the Direct Preference Optimization (DPO) algorithm was published [11]; it offers greater training stability and lower computational cost. RLHF research is currently developing very quickly: a recent article describes an algorithm for multi-turn chat optimization [12]. Almost all open-source models are now trained in three stages: pre-training for language modeling, supervised fine-tuning for applied tasks, and RLHF optimization for user preferences [13]. In my understanding, RLHF does not improve quality much; it helps reduce variability and makes the model's responses more stable and better aligned with user preferences.
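
A minimal sketch of preference training with trl's DPOTrainer is shown below; the model name, the dataset, and the argument names (which differ between trl versions) are assumptions for illustration.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"          # any supervised fine-tuned checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A preference dataset with "prompt", "chosen" and "rejected" columns.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-example", beta=0.1, per_device_train_batch_size=1),
    train_dataset=dataset,
    processing_class=tokenizer,                  # older trl versions call this argument tokenizer
)
trainer.train()
```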

Low-Rank Adaptation

In 2021, the LoRA algorithm for fine-tuning language models was published [14]. It makes it possible to train only a small fraction of the model parameters, with only a slight drop in accuracy compared to full fine-tuning. Full fine-tuning is a memory-intensive operation because the error gradient and optimizer state must be stored for every model parameter, which requires much more memory than inference. Currently, the most widely used optimizer for language models is AdamW [15]; its key idea is to decouple weight decay regularization from the gradient update.
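
The LoRA idea can be illustrated with a toy module (a sketch, not the peft implementation): the frozen weight matrix is augmented with a trainable low-rank product B·A, so only r·(d_in + d_out) parameters receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # the original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        # Output of the frozen layer plus the scaled low-rank update.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 131072 trainable parameters versus ~16.8M in the full weight matrix
```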

QLoRA is a technique for quantizing the frozen model weights; it made fine-tuning even more accessible by reducing the amount of memory required to load the model, allowing it to be loaded in reduced precision without significant loss of inference accuracy [16]. In practice, the trainable parameters and their gradients are kept in higher precision (16 or 32 bits), while the frozen parameters can be stored in 8-bit or 4-bit form with minimal loss of precision. However, running the model in 4-bit mode usually costs more accuracy than 8-bit mode, so various techniques, such as weight scaling, are used to preserve precision. These techniques are implemented in the peft library; another important library for running quantized models on CUDA cores is bitsandbytes. There are three main types of LoRA, shown in Figure 4 [19].

Figure 4. Different types of LoRA.
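
In code, loading a frozen base model in 4-bit precision with bitsandbytes looks roughly like this; the model name and quantization settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store the frozen weights in 4-bit NF4 format
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # computations still run in higher precision
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants as well
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```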

A recent paper comparing full fine-tuning and parameter-efficient fine-tuning suggests that LoRA also serves as a natural form of regularization against the catastrophic forgetting that occurs during full fine-tuning [17]. In my experiments, LoRA gives the best results on models with at least 7B parameters, while my attempts to fine-tune GPT-2 models with 774M and 1.5B parameters did not give decent results. The most recent research applies LoRA to fine-tuning large Mixture-of-Experts (MoE) language models, complementing it with separate tuning of the routing part of the MoE architecture [18].

Synthetic datasets

Success in fine-tuning large language models depends on the quality of the data. Various techniques for data augmentation and transformation, as well as quality measurement, are thoroughly reviewed in [20]. Notably, Microsoft conducted a study on generating an instruction dataset for additional training of a language model [21]; the specific prompts used in the study are presented in their paper.

There is a detailed step-by-step guide on how to transform a list of documents into a dataset of question-answer pairs [22]. The most common pipeline for supervised fine-tuning is to first split the documents into chunks, group the chunks into contexts, and then use a language model to generate questions for each context and answers to those questions, as sketched below.
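
The sketch below compresses that pipeline into a few functions; the chunk size, prompts, and model name are assumptions for illustration, not the guide's exact code.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2-0.5B-Instruct",
                     max_new_tokens=128, return_full_text=False)

def chunk(text, size=1000):
    # Naive fixed-size chunking; in practice split on paragraph or section boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

def make_qa_pairs(documents):
    pairs = []
    for doc in documents:
        for context in chunk(doc):
            # Ask the model for a question grounded in the context, then for its answer.
            question = generator(f"Write one question answered by this text:\n{context}\nQuestion:")[0]["generated_text"]
            answer = generator(f"{context}\nQuestion: {question}\nAnswer:")[0]["generated_text"]
            pairs.append({"question": question, "answer": answer})
    return pairs
```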

An example of successfully fine-tuning a language model to convert text into Cypher queries for the Neo4j graph database can be found in [23]. Fine-tuning for text-to-SQL generation was also successfully implemented [24] using public SQL query datasets.

Links

  1. Qwen Technical Report — 2023 — https://arxiv.org/pdf/2309.16609

  2. What Does BERT Look At? An Analysis of BERT's Attention — 2019 — https://nlp.stanford.edu/pubs/clark2019what.pdf

  3. Neural Machine Translation of Rare Words with Subword Units — 2016 — https://arxiv.org/pdf/1508.07909

  4. Decoder-Only Transformers: The Workhorse of Generative LLMs — https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse

  5. How Powerful are Decoder-Only Transformer Neural Models? — 2023 — https://ar5iv.labs.arxiv.org/html/2305.17026v3

  6. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — 2019 — https://arxiv.org/pdf/1810.04805

  7. Improving Language Understanding by Generative Pre-Training — 2018 — https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

  8. Understanding Encoder And Decoder LLMs — https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder

  9. Fine-tune a Large Language Model — https://medium.com/@prasadmahamulkar/fine-tuning-phi-2-a-step-by-step-guide-e672e7f1d009

  10. Training language models to follow instructions with human feedback — 2022 — https://arxiv.org/pdf/2203.02155

  11. Direct Preference Optimization: Your Language Model is Secretly a Reward Model — 2023 — https://arxiv.org/pdf/2305.18290

  12. Multi-turn Reinforcement Learning from Preference Human Feedback — 2024 — https://arxiv.org/pdf/2405.14655

  13. RLHF Workflow: From Reward Modeling to Online RLHF — 2024 — https://arxiv.org/pdf/2405.07863

  14. LoRA: Low-Rank Adaptation of Large Language Models — 2021 — https://arxiv.org/pdf/2106.09685

  15. Decoupled Weight Decay Regularization — 2019 — https://arxiv.org/pdf/1711.05101

  16. QLoRA: Efficient Finetuning of Quantized LLMs — 2023 — https://arxiv.org/pdf/2305.14314

  17. LoRA Learns Less and Forgets Less — 2024 — https://arxiv.org/pdf/2405.09673

  18. Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models — 2024 — https://arxiv.org/html/2407.01906v1

  19. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey — 2024 — https://ar5iv.labs.arxiv.org/html/2403.14608

  20. Generative AI for Synthetic Data Generation: Methods, Challenges and the Future — 2024 — https://ar5iv.labs.arxiv.org/html/2403.04190v1

  21. WizardLM: Empowering Large Language Models to Follow Complex Instructions — 2023 — https://arxiv.org/pdf/2304.12244

  22. Using LLMs for Synthetic Data Generation: The Definitive Guide — https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms

  23. SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task — 2024 — https://arxiv.org/pdf/2406.10710

  24. Fine-Tuning Language Models for Context-Specific SQL Query Generation — 2023 — https://arxiv.org/pdf/2312.02251
