LLM development cycle
In this article I use my experience in training large language models (watch the video series on the channel Ruslan Dev) to identify and describe the main phases of developing your own LLM.
To date, developers GenAI models, as a rule, use the weights of foundational models, rather than training the neural network from scratch. Responses are often used as training data state-of-the-art LLM, such as GPT-4. This approach has become widespread since the creators Stanford Alpaca showed that the inference of a small model like Llama 7B can be brought closer in quality to GPT-3 by fine tuning the responses of the latter.
Since then, both commercial and open source models have moved forward. I worked with the basic model Llama-3, trained on an unprecedentedly huge corpus of 15 trillion text tokens, which provides ample opportunities for fine tuning. And I collected the training dataset using the latest model OpenAI – GPT-4o. As you can see, the variables have changed, but the equation remains the same – the Alpaca approach still works.
The first phase of creating an LLM is data collection – I have already mentioned, it is actually followed by education (more precisely, fine tuning), evaluation (assessment) and quantization. The AI industry is still emerging, and there is no single standard among the tools for LLM development – for each of these phases, programmers use what suits them. I believe that having a framework that offers a standard implementation instead of messy scripting will not hurt innovation – quite the contrary. Therefore, I implemented the two most computationally complex development phases – training and model quantization – in my framework gptchain.
As for data collection, this process in the case considered comes down to making requests to the GPT-4o API, for which I wrote a separate script – it is attached to my dataset published on Huggingface.
I took the prompts from a multilingual dataset Tagengo. At the same time, I decided to filter out prompts to which the model responds like – sorry, I can’t help with that. Most often this is because the prompt is not censored.
I made a selection of prompts in several languages. I'm interested in Russianand also I decided to add English And Chinese. I set the sample size for each of these languages to a thousand examples in order to start with a small dataset and see what happens.
After receiving a response to each prompt, all dialogs are merged using Pandasand then converted to a dataset in the format hugging face. As you can see, the data collection script turned out to be quite simple in functionality. The idea is that the loudly announced GPT-4o model with improved multilingual capabilities will help significantly improve the skills of Lama-3 to cope with tasks on following instructions.
For the training phase I used a virtual machine with GPU NVIDIA H100 on a cloud service immers.cloud. How to set up a virtual machine, install the python virtual environment and the necessary packages, I recommend reading in this article or look video.
I added support for my new dataset in gptchainwhich, like Tagengoloaded from hugging face V ShareGPT format and then converted by my framework into ChatML.
I taught for 3 eras. Since the dataset is small – 3 thousand examples, 3 epochs did not take much time – about half an hour. Please note that this is for Llama-3 8B, QLoRA with 4-bit quantization.
As shown wandb metrics, after three training eras train loss decreased from 1.4 to 0.8.
Before moving on to the next phase – evaluating the model on a bench, let's compare its inference with my previous experiments in order to feel the difference purely subjectively.
After fine tuning on the dataset Tagengo (consisting of GPT-4 responses):
Me: What does a neural network consist of?
AI: A neural network consists of several main components that work together to process and analyze data. Here are the main components of a neural network:
Neurons (or nodes): These are the basic computing units of a neural network. Neurons receive input, process it, and generate output. […]
After fine tuning on my dataset (on the answers GPT-4o):
Me: What does a neural network consist of?
AI: A neural network consists of a set of nodes, or neurons, that mimic the functioning of biological neurons. These neurons are connected to each other through synaptic connections, which are weights or coefficients that determine the strength of the signals transmitted. The main components of a neural network include:
Input neurons: accept data for processing. Depending on the task, the data can be in the form of numerical values, images, sound signals, etc. […]
I'll leave it to you to judge which answer is better. It is important to cancel that to generate the first response we used Llama-3 70Bsecond – Llama-3 8B.
Now it's time to evaluate my new model. I used multilingual MT-Benchand generating answers to test questions in English and Russian took me longer than the fine tuning itself.
By the way, technical instructions, how to run multilingual MT-Bench on your modellook in this article.
The result of my model on the English benchmark was 6.6, which is significantly worse than that of the same Suzumewhich is not surprising – there are only a thousand English-language answers in my dataset, versus 15 thousand in the dataset Tagengo.
I also tried the Russian-language benchmark ru_mt_bench. On it I got 7.3, not so bad, close to GPT-3.5but still weaker Suzume.
Let's summarize. For my experiment, one Virtual Machine with one H100, but what would this process look like if it were a serious production? Obviously, I would add model parallelism – for more serious training the lama would be paralleled by several H100 at the same time. I did not touch on the last phase – quantization, which is needed to deploy the model to the user’s computer or server. Good for this GGUF formatsupported by the library llama.cpp.
One can imagine that for “LLM factories” similar to what works behind closed doors of “Open” AI, all these phases must be automated – including the data pipeline and benchmarks. This is easier to do in the cloud with a framework like gptchain, but still remains a non-trivial task. All the development phases described are LLM can be seen in action – in my last video.