Fine-tuning ruGPT-3.5 13B with LoRA
On the project’s GitHub page I found a Jupyter Notebook, tune_llama_7b.ipynb, with detailed instructions on how to fine-tune LLaMA 2 7B, but nothing similar for ruGPT-3.5. The GigaSaiga model and its configuration file gigasaiga_13b.json were mentioned, however, so I decided to use that configuration as the basis for my experiments.
So, let’s create a directory in which we will do all the work:
mkdir ruGPT-3.5-training
cd ruGPT-3.5-training
For the rest of the work we will need Python 3.10; everything may well work on 3.11, but I haven’t tested it. You will also need the Python virtual environment module, venv (I prefer this solution, since I’m not a big fan of Conda), and, of course, the Nvidia drivers, including CUDA (I trained on 12.2).
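Before going further, it can help to check these prerequisites from Python. This is only a minimal sketch and not part of the original setup: the helper name check_environment is hypothetical, and it checks just the interpreter version and whether the nvidia-smi tool is on PATH, which is a rough proxy for an installed driver and says nothing about the CUDA version.

```python
import shutil
import sys


def check_environment(min_python=(3, 10)):
    """Return a small report about the local environment.

    Hypothetical helper for illustration: it does not verify the CUDA
    version, only that the interpreter is new enough and that the
    nvidia-smi executable can be found on PATH.
    """
    return {
        "python_version": tuple(sys.version_info[:3]),
        "python_ok": tuple(sys.version_info[:2]) >= tuple(min_python),
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


report = check_environment()
print(report)
```

If python_ok is False or nvidia_smi_found is False, it is worth fixing that first, before creating the virtual environment.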
Let’s create a virtual environment and switch to it:
python3 -m venv venv
source venv/bin/activate
Let’s install the dependencies (here is an example requirements.txt file):
pip install -r requirements.txt
Now we clone the rulm repository:
git clone https://github.com/IlyaGusev/rulm.git
Next, let’s copy some configuration files:
mkdir {configs,internal_prompts,output,output_ggml}
cp rulm/self_instruct/configs/gigasaiga_13b.json configs/rugpt35_13b.json
cp rulm/self_instruct/internal_prompts/gigasaiga.json internal_prompts/rugpt35.json
Then, in the file configs/rugpt35_13b.json, change the model_name field to ai-forever/ruGPT-3.5-13B.
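If you would rather not edit the file by hand, the same change can be scripted. The sketch below assumes only that the config is a JSON object with a top-level model_name field, as described above; the helper name set_model_name is my own and is not part of rulm.

```python
import json
from pathlib import Path


def set_model_name(config_path, model_name):
    """Rewrite the model_name field of a JSON config file in place."""
    path = Path(config_path)
    with path.open("r", encoding="utf-8") as fp:
        config = json.load(fp)
    config["model_name"] = model_name
    with path.open("w", encoding="utf-8") as fp:
        json.dump(config, fp, indent=4, ensure_ascii=False)
    return config


# Usage, run from the project root after the files have been copied:
# set_model_name("configs/rugpt35_13b.json", "ai-forever/ruGPT-3.5-13B")
```

All other fields of the configuration are written back unchanged.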
Training the model
The whole process consists of four simple steps. If you only need the LoRA layer and not the GGML version of the model, you can skip the last two.
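The four steps correspond to the four scripts created in the sections that follow (1_dataset.py, 2_train.py, 3_merge.py and 4_ggml.py). As a rough sketch, a small driver could pick the scripts to run depending on whether the GGML version is needed; the helper names plan_steps and run_steps are hypothetical and exist only for this illustration.

```python
import subprocess
import sys

# Script names as used in this article, in execution order.
LORA_STEPS = ["1_dataset.py", "2_train.py"]
GGML_STEPS = ["3_merge.py", "4_ggml.py"]


def plan_steps(need_ggml=True):
    """Return the scripts to run; the last two are needed only for GGML."""
    return LORA_STEPS + (GGML_STEPS if need_ggml else [])


def run_steps(steps):
    """Run each script with the current interpreter, stopping on the first failure."""
    for script in steps:
        subprocess.run([sys.executable, script], check=True)


print(plan_steps(need_ggml=False))  # ['1_dataset.py', '2_train.py']
```

Calling run_steps(plan_steps()) from the project root would then run all four steps in order, although in practice I ran them one by one.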
Step 1 – Preparing Datasets
Training most models requires two datasets, training and validation, and to build them you first need some large general dataset. But where do you get the data? The rulm project comes to the rescue again: it includes three scripts that build datasets from data prepared in advance by the rulm team.
Each of them assembles the specific version of the dataset used to train a particular version of the Saiga family of models. Personally, I was most interested in create_chat_set.py, since it merges 7 different datasets at once and prepares them so that training on them yields a chatbot-style model. Here is the complete list:
By the way, the original GigaSaiga was trained on 6 of them; the gpt_roleplay_realm dataset, which plays out funny and non-standard role-play scenarios of communication between the model and the user, was not used.
Let’s download all the mentioned datasets and combine them into one large dataset, then shuffle it and split it into training and validation sets. Let’s create a script for this, say 1_dataset.py, and fill it with the following content:
import subprocess
from pathlib import Path
# Set up paths
content_dir = Path('.').resolve()
train_full_path = content_dir / 'train_full.jsonl'
val_full_path = content_dir / 'val_full.jsonl'
# Run create_chat_set script from rulm
module_directory = Path('rulm/self_instruct').resolve()
subprocess.run(
    ['python', '-m', 'src.data_processing.create_chat_set', str(train_full_path), str(val_full_path)],
    cwd=module_directory,
    check=True
)
# Check if train_full.jsonl exists
if not train_full_path.exists():
    raise FileNotFoundError(f"{train_full_path} does not exist")
Source here.
Let’s run it and wait a while:
python3 1_dataset.py
As a result, two files will appear in the project root:
train_full.jsonl – the complete training dataset, containing approximately 59 thousand documents formatted as chats between the user and the neural network.
val_full.jsonl – a complete validation dataset containing almost 3 thousand documents.
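To check the result, you can count the records in each file and look at the first one. The sketch below assumes only that the files are in JSON Lines format (one JSON value per non-empty line), which is what the .jsonl extension suggests; it makes no assumption about the fields inside each record, and the helper name summarize_jsonl is my own.

```python
import json
from pathlib import Path


def summarize_jsonl(path, preview=1):
    """Count the records in a JSON Lines file and return the first few.

    Assumes one JSON value per non-empty line; the structure of each
    record is not assumed and is returned as parsed.
    """
    count = 0
    head = []
    with Path(path).open("r", encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if len(head) < preview:
                head.append(record)
            count += 1
    return count, head


# Usage after running 1_dataset.py:
# count, head = summarize_jsonl("train_full.jsonl")
# print(count, head)
```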
Step 2 – Training the Model
The datasets are ready, so we can start training the model itself, using the torch, transformers and peft libraries. In a nutshell, at this stage we need to download the ruGPT-3.5-13B model from its repository on HuggingFace, copy the configuration files to the output folder, and then make minor edits to them.
I won’t give the full code of the 2_train.py script here (you can study it here); let us dwell only on the part that actually starts the training process:
import json
from huggingface_hub import snapshot_download
from pathlib import Path
import subprocess
content_dir = Path('.').resolve()
original_config_path = content_dir / 'configs/rugpt35_13b.json'
model_dir = content_dir / "ruGPT-3.5-13B"
base_model = "ai-forever/ruGPT-3.5-13B"
output_dir = content_dir / 'output'
config_path = content_dir / 'configs/rugpt35_13b_colab.json'
# Paths to datasets
train_full_path = content_dir / 'train_full.jsonl'
train_small_path = content_dir / 'train.jsonl'
train_path = train_full_path # change to train_small_path if you need
val_full_path = content_dir / 'val_full.jsonl'
val_small_path = content_dir / 'val.jsonl'
val_path = val_full_path # change to val_small_path if you need
# Download binaries
snapshot_download(repo_id=base_model, local_dir=model_dir, ignore_patterns=["LICENSE", "README.md", ".gitattributes"])
...
# Load configurations
with original_config_path.open('r') as fp:
    config = json.load(fp)
# Colab adjustments
config['trainer']['per_device_train_batch_size'] = 2
config['trainer']['per_device_eval_batch_size'] = 1
config['trainer']['gradient_accumulation_steps'] = 128
config['trainer']['eval_steps'] = 50
config['trainer']['save_steps'] = 50
config['max_tokens_count'] = 1000
#config['model_name'] = str(model_dir)
config['templates_path'] = str(content_dir / 'internal_prompts/rugpt35.json')
config['load_in_8bit'] = True
config['load_in_4bit'] = False
# Demo adjustments
config['trainer']['eval_steps'] = 2
config['trainer']['logging_steps'] = 1
config['trainer']['num_train_epochs'] = 1
with config_path.open('w') as fp:
    json.dump(config, fp, indent=4)
# Run training
module_directory = Path('rulm/self_instruct').resolve()
subprocess.run(
    [
        'python', '-m', 'src.train',
        '--config-file', config_path,
        '--train-file', train_path,
        '--val-file', val_path,
        '--output-dir', output_dir,
        '--report-to', 'none'
    ],
    cwd=module_directory,
    check=True
)
...
As the code shows, the src.train module is launched with rulm/self_instruct as the working directory, and the options passed to it specify the configuration file, the datasets, and the directory where the result will be saved.
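It is worth pausing on two of the values written to the config: with per_device_train_batch_size set to 2 and gradient_accumulation_steps set to 128, each optimizer update is based on 256 samples. The sketch below works through this arithmetic, assuming a single GPU and taking the dataset size as the approximate 59 thousand documents mentioned earlier; the step count that the trainer actually reports may differ slightly.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples contributing to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_updates_per_epoch(num_samples, batch):
    """Rough upper bound on optimizer updates in one pass over the data."""
    return math.ceil(num_samples / batch)


batch = effective_batch_size(per_device_batch=2, grad_accum_steps=128)
updates = approx_updates_per_epoch(59_000, batch)
print(batch, updates)  # 256 231
```

In other words, one epoch over the full dataset amounts to only a couple of hundred weight updates, each of which is accumulated from 128 small forward and backward passes.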
Let’s launch it with the command:
python3 2_train.py
On my RTX 4090, training took about 26 hours and required about 19GB of VRAM, so I had to close many applications that were using the video card. By the way, you can reduce the required VRAM to about 13GB by switching the on-the-fly quantization to load_in_4bit:
config['load_in_8bit'] = False
config['load_in_4bit'] = True
I have not tested this option myself, though, since I believe the quality of the trained model may deteriorate.
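The VRAM figures above are roughly what a back-of-the-envelope estimate of the weights alone predicts. The sketch below assumes that "13B" means exactly 13 billion parameters and ignores everything else that lives in VRAM (activations, the LoRA weights, optimizer state, quantization metadata), so treat it as an illustration rather than a measurement.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 13e9  # assumption: "13B" taken as exactly 13 billion parameters

mem_8bit = weight_memory_gb(NUM_PARAMS, 8)
mem_4bit = weight_memory_gb(NUM_PARAMS, 4)
print(mem_8bit, mem_4bit, mem_8bit - mem_4bit)  # 13.0 6.5 6.5
```

The estimated saving of about 6.5GB is in line with the drop from about 19GB to about 13GB mentioned above.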
As a result of the script running, the following files will appear in the output directory:
adapter_config.json
adapter_model.bin
added_tokens.json
generation_config.json
merges.txt
README.md
special_tokens_map.json
tokenizer_config.json
vocab.json
We are primarily interested in the first two files in the list: adapter_model.bin contains the weights of the LoRA layer, and adapter_config.json contains a configuration describing which model the LoRA layer was created for, how to apply it, which weights of the original model it acts on, and so on.
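To see what exactly ended up in adapter_config.json, you can print its main fields. The field names in this sketch follow the LoRA configuration format used by the peft library; I am assuming they are present in the generated file, and any that are missing are simply reported as None. The helper name summarize_adapter_config is my own.

```python
import json
from pathlib import Path

# Field names follow the PEFT LoRA configuration format; any of them
# may be absent, in which case None is reported.
FIELDS = ("base_model_name_or_path", "peft_type", "r", "lora_alpha",
          "lora_dropout", "target_modules")


def summarize_adapter_config(path):
    """Return the main LoRA settings stored in adapter_config.json."""
    with Path(path).open("r", encoding="utf-8") as fp:
        config = json.load(fp)
    return {name: config.get(name) for name in FIELDS}


# Usage after training, run from the project root:
# print(summarize_adapter_config("output/adapter_config.json"))
```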
For your convenience, I have prepared a repository on HuggingFace containing this LoRA layer and everything necessary for it to work correctly; a test example of its use can be viewed here.
Step 3 – Merging the LoRA layer and the base model
This step is intermediate, but it is necessary in order to obtain the GGML version of the model later. To merge the layers, the rulm team has prepared a special script called convert_to_native.py, but unfortunately it is not compatible with ruGPT-3.5, as it is tailored to the LLaMA architecture, so I had to modify it a little. Put the modified script in the root of the project under the same name, then create a file 3_merge.py with the following content:
from pathlib import Path
from convert_to_native import convert_to_native
content_dir = Path('.').resolve()
output_dir = content_dir / 'output'
merged_path = output_dir / 'pytorch_model.bin'
convert_to_native(
    model_name=str(output_dir),
    output_path=str(merged_path),
    device="cpu",
    enable_offloading=True
)
assert merged_path.exists()
Source here.
The script merges the LoRA layer into the base ruGPT-3.5 model (loaded in float32). Running it requires approximately 60GB of RAM, since the merge takes place in system RAM, as can be seen from the option device="cpu".
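To make it clearer what "merging" means here: in the standard LoRA formulation, each adapted weight matrix W is replaced by W + (alpha / r) * B * A, where A and B are the two small low-rank matrices. The toy sketch below shows this arithmetic on tiny matrices in pure Python. It is not the code of convert_to_native.py, which works on the real model tensors, and all the numbers are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy example: a 2x2 weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (rank, in_features)
B = [[0.5], [-1.0]]   # shape (out_features, rank)
x = [[3.0], [4.0]]    # one input column vector

merged = merge_lora(W, A, B, alpha=2, rank=1)

# Applying the merged matrix matches the base output plus the scaled adapter output.
y_merged = matmul(merged, x)
y_base = matmul(W, x)
y_adapter = matmul(B, matmul(A, x))
y_separate = [[b[0] + 2 * a[0]] for b, a in zip(y_base, y_adapter)]
print(y_merged, y_separate)  # [[14.0], [-18.0]] [[14.0], [-18.0]]
```

After the merge the adapter is no longer needed at inference time, which is why the result can be saved as an ordinary model file.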
Let’s run it:
python3 3_merge.py
As a result, the file pytorch_model.bin, weighing approximately 56GB, will appear in the output directory; the merge takes approximately 10-15 minutes.
The most interesting point is that this file can already be used for inference: just point AutoModelForCausalLM (from the transformers package) to the output folder.
Usage example here.
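For reference, a minimal sketch of such inference with the transformers API might look like this. It assumes that the output folder contains everything from_pretrained needs (weights, model config and tokenizer files) and that you have enough memory for a float32 model of this size; the function name generate_text and the prompt are purely illustrative and are not taken from the linked example.

```python
def generate_text(model_dir, prompt, max_new_tokens=50):
    """Load a causal LM from a local folder and generate a continuation.

    Imports are done lazily because loading transformers and torch is
    slow; the model itself needs tens of gigabytes of memory.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage (hypothetical prompt), run from the project root after step 3:
# print(generate_text("output", "Вопрос: Почему трава зеленая? Ответ:"))
```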
Step 4 – Creating GGML Models
We are now ready to convert pytorch_model.bin to the GGML format. For this we will use the llm-rs-python library, which is a Python wrapper for the llm library written in Rust.
Let’s create a file 4_ggml.py and fill it with the following code:
from llm_rs.convert import AutoConverter
from llm_rs import AutoQuantizer, QuantizationType, ContainerType
from pathlib import Path
content_dir = Path('.').resolve()
input_dir = content_dir / 'ruGPT-3.5-13B-lora'
output_dir = content_dir / 'output_ggml'
# Convert the model to fp16 format
converted_model = AutoConverter.convert(input_dir, output_dir)
# Quantize the model to different formats
AutoQuantizer.quantize(converted_model, quantization=QuantizationType.Q4_0, container=ContainerType.GGML)
AutoQuantizer.quantize(converted_model, quantization=QuantizationType.Q4_1, container=ContainerType.GGML)
AutoQuantizer.quantize(converted_model, quantization=QuantizationType.Q5_0, container=ContainerType.GGML)
AutoQuantizer.quantize(converted_model, quantization=QuantizationType.Q5_1, container=ContainerType.GGML)
AutoQuantizer.quantize(converted_model, quantization=QuantizationType.Q8_0, container=ContainerType.GGML)
Let’s run the script; it will take about 30 minutes to convert to all formats:
python3 4_ggml.py
As the code shows, the model is first converted into a GGML-compatible format, with the weights converted from float32 to float16, and the converted model is then saved in the output_ggml directory under the name ruGPT-3.5-13B-lora-f16.bin.
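To get a feel for what the float32 to float16 conversion does to an individual weight, here is a small sketch using only the standard struct module, whose "e" format code is IEEE 754 half precision. It is unrelated to how llm-rs-python performs the conversion internally and only illustrates the halved storage size and the loss of precision.

```python
import struct


def to_float16_and_back(value):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


print(struct.calcsize("<f"), struct.calcsize("<e"))  # 4 2
print(to_float16_and_back(0.5))                      # 0.5
print(to_float16_and_back(0.1))                      # close to 0.1, not exact
```

Each weight takes 2 bytes instead of 4, so the converted file should be roughly half the size of the float32 one.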
After the conversion, quantization is run, and as a result we get 5 versions of the model in GGML format, which can be run, for example, with the gpt-2 binary built as part of the ggml project, with llm, with llm-rs-python, and so on. All models are saved in the output_ggml directory.
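The general idea behind these quantized formats is to store weights in small blocks as low-bit integers together with a per-block scale. The sketch below is a deliberately simplified illustration of that idea, not the actual GGML implementation: the real Q4_0, Q4_1, Q5_0, Q5_1 and Q8_0 formats differ in details such as block size, offsets and storage layout, and the function names here are my own.

```python
def quantize_block(values, bits=4):
    """Quantize one block to signed integers sharing a single scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale, quants):
    """Recover approximate floating-point values from a quantized block."""
    return [q * scale for q in quants]


block = [0.12, -0.70, 0.33, 0.04]
scale, quants = quantize_block(block, bits=4)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(quants, scale, max_error)
```

The rounding error of each weight is at most half of the block’s scale, which is why formats with more bits per weight (such as Q8_0) stay closer to the original model, while the smaller ones save memory at the cost of accuracy.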
An example of using these models here.
By the way, I have prepared a repository on HuggingFace, so you can try them out right away.
Acknowledgments
Finally, I would like to express my sincere gratitude to the following authors and teams:
Team Sber AI for the original ruGPT-3.5 13B model.
IlyaGusev and the rulm project team for the datasets and scripts used to train models of the Saiga family.
ggerganov and the ggml project team for the documentation and source code that helped me figure out how to properly run models on the CPU.
iashchak for the ruGPT-3.5-13B-ggml repository on HuggingFace; studying it helped me find the llm-rs-python project.
Conclusion
Well, here we are at the finish line. I hope this article will be useful to everyone who is interested in training deep neural networks and plans to use the ruGPT-3.5 13B model in their research and projects.
I wish you success in your machine learning endeavors!
P.S. The solution described in this publication is also compatible with the mGPT-13B model, since its architecture is fundamentally no different from that of ruGPT-3.5.