What is supervised fine-tuning?
Supervised fine-tuning (SFT) is a technique used to adapt pre-trained Large Language Models (LLMs) to a specific task using labeled data.
In the SFT process, pre-trained LLMs are fine-tuned on a labeled dataset using supervised learning techniques. The model weights are adjusted based on gradients derived from a task-specific loss function that measures the difference between the LLM predictions and the reference labeling.
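As a minimal illustration of such a loss, the sketch below computes the cross-entropy between a model's predicted distribution over tokens and the reference label; the vocabulary size and probabilities are invented for the example:

```python
import math

def cross_entropy(predicted_probs, reference_index):
    """Task-specific loss: the negative log-probability the model
    assigns to the reference (labeled) token."""
    return -math.log(predicted_probs[reference_index])

# Toy example: the model predicts a distribution over a 4-token
# vocabulary; the reference label is token 2.
probs = [0.1, 0.2, 0.6, 0.1]
loss = cross_entropy(probs, 2)  # -log(0.6) ≈ 0.51
```

During SFT, gradients of this loss with respect to the model weights drive the weight updates; the closer the predicted probability of the reference token is to 1, the smaller the loss.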
This process allows the model to learn the patterns and nuances of a particular task, adapting its parameters according to the distribution of specific data and the requirements of the task.
SFT is usually performed after the model has been pre-trained and is commonly used to teach the model to follow user-supplied instructions. It is more computationally expensive than unsupervised fine-tuning, but has a better chance of achieving improved accuracy.
The amount of additional training required depends on the complexity of the problem and the size of the dataset. For simple style transfer using OpenAI models like GPT-3.5 or GPT-4, 30-50 high-quality examples are usually enough to get excellent results.
Transforming a basic Large Language Model (LLM) into an instruction-executing LLM (e.g. turning Mistral into Mistral Instruct) typically requires training on tens of thousands of examples.
Further training of Zephyr-7B, for example, ran on 16 Nvidia A100 GPUs for about four hours. This can serve as a reference point for a model with 7 billion parameters.
How does supervised fine-tuning work?
Supervised fine-tuning is a three-step process used to optimize LLMs for specific tasks.
- Step 1: Pre-training — the base (foundation) model is first trained on a large dataset, learning the patterns, grammar, and context of the language by predicting the next word in the current context. This stage gives the model a broad understanding of language.
- Step 2: Data labeling — a dataset is prepared for the additional training, with each data sample labeled with the correct output or answer. This labeled data is essential for supervised learning, because it guides the model as it adjusts its parameters during fine-tuning.
- Step 3: Fine-tuning — the pre-trained model is then trained on the task-specific labeled dataset, adjusting its parameters to improve accuracy on that task. Such tasks include text classification, sentiment analysis, and question answering.
The term supervised is used because the process relies on labeled data, where the correct output is known in advance, to guide learning. This technique improves the model's accuracy by applying the broad understanding of language gained in pre-training to a narrow, specific task during fine-tuning.
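The fine-tuning step itself can be sketched with a toy model: a "pre-trained" one-parameter linear model is adjusted on labeled input-output pairs by gradient descent on a squared-error loss. All numbers here are illustrative:

```python
# Toy fine-tuning loop: a "pre-trained" linear model y = w * x is
# adjusted on labeled (input, output) pairs. The data implies the
# ideal weight is 2; gradient descent moves w toward it.
w = 1.0  # weight inherited from "pre-training"
labeled_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

learning_rate = 0.05
for epoch in range(200):
    for x, y_true in labeled_data:
        y_pred = w * x
        grad = 2 * (y_pred - y_true) * x  # d/dw of (y_pred - y_true)**2
        w -= learning_rate * grad
```

A real LLM repeats the same loop over billions of weights, with cross-entropy on labeled token sequences in place of squared error.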
Benefits of supervised fine-tuning
Supervised fine-tuning (SFT) is a powerful technique used to adapt pre-trained LLMs to specific narrow tasks using labeled data. Supervised fine-tuning has the following advantages:
- Task-specific patterns and nuances — SFT allows the model to learn patterns and nuances of the task, adapting its parameters to the distribution of specific data and the requirements of the task.
- Improving accuracy — fine-tuning a pre-trained model allows it to use knowledge and ideas learned from large amounts of data, which improves the accuracy of more specific tasks.
- Data efficiency — SFT can be applied to a wide range of real-world use cases even when the number of labeled examples is limited, allowing data to be used more efficiently.
- Resource efficiency — fine-tuning a pre-trained model saves much of the time and computing resources that training a model from scratch would require.
- Customization — SFT lets you tailor the LLM's behavior, writing style, or domain knowledge to specific nuances, tones, or terminology, aligning it closely with a particular style or field.
- Reduced overfitting — during fine-tuning, you can apply techniques such as early stopping, dropout, and data augmentation to reduce the risk of overfitting on small datasets and to encourage generalization to new data.
Common methods of supervised fine-tuning
Popular supervised fine-tuning methods for LLM include LoRA (Low-Rank Adaptation) and its memory-efficient variant QLoRA (Quantized LoRA). These methods belong to the Parameter-Efficient Fine-Tuning (PEFT) family, which aims to improve the efficiency and availability of fine-tuning.
LoRA fine-tuning uses low-rank factorization to represent the weight update as the product of two smaller matrices, which reduces the number of trainable parameters and makes fine-tuning more efficient. It outperforms full fine-tuning in some situations while avoiding catastrophic forgetting, the phenomenon where a model loses knowledge gained during pre-training.
- LoRA (Low-Rank Adaptation) — a parameter-efficient fine-tuning technique using low-rank decomposition.
- QLoRA (Quantized LoRA) — a memory-efficient variant of LoRA that further reduces memory requirements for fine-tuning LLM models.
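The parameter savings behind LoRA come down to a quick calculation: instead of learning a full d_out × d_in update matrix, it learns two small matrices B (d_out × r) and A (r × d_in). The layer dimensions and rank below are illustrative, not taken from any particular model:

```python
# LoRA replaces a full weight update with a low-rank product B @ A.
d_out, d_in, r = 4096, 4096, 8  # hypothetical layer size and rank

full_update_params = d_out * d_in        # full fine-tuning: one big matrix
lora_params = d_out * r + r * d_in       # LoRA: matrices B and A

reduction = full_update_params / lora_params  # 256x fewer trainable params
```

With a rank of 8, a 4096×4096 layer needs roughly 16.8M trainable parameters under full fine-tuning but only about 65K under LoRA, a 256-fold reduction for that layer.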
Common problems with supervised fine-tuning
The most common problems of supervised fine-tuning of large language models are:
- Overfitting — occurs when a model becomes too specialized to the training data, generalizing poorly to unseen data. This problem is common in machine learning and can also arise when fine-tuning LLMs.
- Hyperparameter tuning — choosing inappropriate hyperparameters can lead to slow convergence, poor generalization, or even unstable training. Mastering the art of hyperparameter tuning can be extremely difficult and costly in time and resources.
- Data quality issues — fine-tuning depends heavily on the quality of the data passed to the LLM. Gaps or errors in the data can lead to poor fine-tuning results.
- Catastrophic forgetting — this occurs when fine-tuning a pre-trained model causes it to forget previously learned knowledge, destabilizing the fine-tuning process.
- Fluctuating accuracy — fine-tuned LLMs can show inconsistent accuracy on edge cases, or it may be impossible to fit enough few-shot prompts into the context window to steer the model.
Many different strategies are used to address common problems in supervised fine-tuning of LLM models. Hyperparameter tuning can be automated using techniques such as grid search or Bayesian optimization, which essentially explore the hyperparameter space to find optimal configurations. Overfitting and catastrophic forgetting can be addressed using techniques such as knowledge distillation. Ensuring the quality and relevance of the data is also critical. Finally, the fine-tuning process should be monitored and evaluated through continuous error analysis and iterative improvement.
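A grid search like the one mentioned above can be sketched in a few lines: every combination of hyperparameter values is scored and the best one kept. The scoring function here is a hypothetical stand-in for a real validation run:

```python
import itertools

def validation_score(learning_rate, batch_size):
    # Hypothetical stand-in for training + evaluating on a validation
    # set: pretends lr = 1e-4 and batch size 32 work best.
    return -abs(learning_rate - 1e-4) - abs(batch_size - 32) / 1000

grid = {
    "learning_rate": [1e-3, 1e-4, 1e-5],
    "batch_size": [16, 32, 64],
}

# Evaluate every combination and keep the best-scoring configuration.
best = max(
    itertools.product(grid["learning_rate"], grid["batch_size"]),
    key=lambda cfg: validation_score(*cfg),
)
```

Bayesian optimization follows the same outer loop but chooses the next configuration to try based on previous scores instead of exhaustively enumerating the grid.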
Questions and answers
What is the difference between supervised fine-tuning and unsupervised fine-tuning?
Supervised fine-tuning improves a pre-trained model using labeled data, feeding it explicit input-output pairs for training. This targeted approach often provides improved accuracy, but is computationally expensive because of the work of minimizing a task-specific loss over the labeled data.
Unsupervised fine-tuning does not use labeled data, but instead relies on the model's intrinsic ability to detect patterns and structures in the data. This technique is less computationally expensive, but may not reach the accuracy levels of supervised methods due to the lack of direct feedback.
The decision between supervised and unsupervised fine-tuning depends on the specific task, availability of labeled data, budget of computing power, and accuracy requirements. Supervised fine-tuning is usually chosen when maximum accuracy is required, and unsupervised fine-tuning is chosen when resources are limited or labeled data is unavailable.
How is supervised fine-tuning different from transfer learning?
While both techniques adapt a pre-trained model to a new task, supervised fine-tuning typically requires labeled data and focuses on aligning the model's weights to that data. Transfer learning can also use a pre-trained model as a starting point and train it on a new dataset, which may or may not be labeled.
Can supervised fine-tuning be applied to any LLM?
In theory, yes. However, the effectiveness of fine-tuning may depend on factors such as the size of the LLM, the quality and quantity of fine-tuning data, and the similarity between the pre-training tasks and the target task.
What helps avoid overfitting in supervised fine-tuning?
It is recommended to use techniques such as dropout, early stopping, and data augmentation, as well as carefully select hyperparameters and apply regularization methods. It is also important to use a validation dataset to monitor the accuracy of the model and not allow it to learn from the noise in the training data.
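Early stopping, one of the techniques listed above, is simple to implement: stop fine-tuning once the validation loss has not improved for a set number of epochs. The loss values below are hypothetical:

```python
# Early stopping sketch: halt once validation loss stops improving
# for `patience` consecutive epochs. Losses are made up for the demo.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.54]

patience = 3
best_loss = float("inf")
epochs_without_improvement = 0
stopped_at = None

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss = loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            stopped_at = epoch  # training would halt here
            break
```

In practice you would also restore the weights from the best epoch, so the model you keep is the one that generalized best on validation data rather than the one from the final step.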
How to measure the success of supervised fine-tuning?
The success of supervised fine-tuning is measured with evaluation metrics computed on a held-out test dataset the model did not see during training. Standard metrics include accuracy, precision, recall, and F1 score, but ultimately the right way to evaluate an LLM depends on the goals of fine-tuning and the nature of the problem.
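The standard metrics just mentioned reduce to simple counts over the model's predictions. A toy binary-classification example (the labels are invented):

```python
# Computing accuracy, precision, recall, and F1 from predictions
# on a held-out test set (toy binary-classification data).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

For generative tasks these classification metrics are often replaced or supplemented by task-specific measures such as exact-match scores, human preference ratings, or LLM-as-judge evaluations.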
Did you like the article? You can find even more content on data labeling, Data Mining, and ML in our Telegram channel “Where is the data, Lebowski?”
- How to prepare for data collection so as not to fail in the process?
- How to Work with Synthetic Data in 2024?
- What is specific about working on ML projects? And how do you label 1,500 ore bubbles in one photo without going crazy?