Fine-tuning ruGPT-3.5 13B with LoRA

The author of the rulm project described in detail in his publication how he managed to collect a large Russian-language dataset and fine-tune many different models, including LLaMA (2) and ruGPT-3.5.

On the GitHub page of that project I found a Jupyter Notebook, tune_llama_7b.ipynb, with detailed instructions on how to fine-tune LLaMA 2 7B, but there was nothing similar for ruGPT-3.5, although the GigaSaiga model and its configuration gigasaiga_13b.json were mentioned, which I decided to use as a basis for my experiments.

So, let's create a directory in which we will do all the work:

mkdir ruGPT-3.5-training
cd ruGPT-3.5-training

For further work we will need Python 3.10 (everything may well work on 3.11 too, but I haven't tested it). In addition, you need the Python VirtualEnv module (I prefer this solution, since I'm not a big fan of Conda) and, of course, Nvidia drivers with CUDA (I trained on 12.2).

Let’s create a virtual environment and switch to it:

python3 -m venv venv
source venv/bin/activate

Let's install the dependencies from requirements.txt (an example of its contents is shown below the command):

pip install -r requirements.txt
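
If you are assembling this file yourself, a rough, unpinned approximation (my assumption based on the libraries used later in this article, not the author's exact list) would contain something like:

torch
transformers
peft
bitsandbytes
accelerate
datasets
huggingface_hub
llm-rs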

Now we clone the rulm repository:

git clone https://github.com/IlyaGusev/rulm.git

Next, let’s copy some configuration files:

mkdir {configs,internal_prompts,output,output_ggml}
cp rulm/self_instruct/configs/gigasaiga_13b.json configs/rugpt35_13b.json
cp rulm/self_instruct/internal_prompts/gigasaiga.json internal_prompts/rugpt35.json

Then, in the file configs/rugpt35_13b.json, change the model_name field to ai-forever/ruGPT-3.5-13B.
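
If you prefer to make this edit programmatically rather than by hand, a minimal equivalent looks like this:

import json

config_path = 'configs/rugpt35_13b.json'

# Read the config, point model_name at the base model, write it back
with open(config_path, 'r', encoding='utf-8') as fp:
    config = json.load(fp)

config['model_name'] = 'ai-forever/ruGPT-3.5-13B'

with open(config_path, 'w', encoding='utf-8') as fp:
    json.dump(config, fp, indent=4, ensure_ascii=False)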

Training the model

The whole training process consists of four simple steps; if you only need the LoRA layer and do not need the GGML version of the model, you can skip the last two steps.

Step 1 – Preparing Datasets

To train most models you need two datasets, a training one and a validation one, and to build them you first have to assemble one large general dataset. But where do we get the data? The rulm project comes to the rescue again: it contains three scripts for building datasets from data prepared in advance by the rulm team.

Each of them assembles a particular version of the dataset used to train different versions of the Saiga family of models, but I was personally most interested in create_chat_set.py, since it merges seven different datasets at once and prepares them so that training on them yields a chatbot-style model; the full list of datasets can be seen in the script itself.

By the way, the original GigaSaiga was trained on six of them: the gpt_roleplay_realm dataset, which plays out funny and non-standard game scenarios of communication between the model and the user, was not used.

Let's download all the mentioned datasets, combine them into one large dataset, then shuffle it and split it into training and validation samples. Let's create a script for this, call it, say, 1_dataset.py, and fill it with the following content:

import subprocess
from pathlib import Path

# Set up paths
content_dir = Path('.').resolve()
train_full_path = content_dir / 'train_full.jsonl'
val_full_path = content_dir / 'val_full.jsonl'

# Run create_chat_set script from rulm
module_directory = Path('rulm/self_instruct').resolve()
subprocess.run(
    ['python', '-m', 'src.data_processing.create_chat_set', str(train_full_path), str(val_full_path)],
    cwd=module_directory,
    check=True
)

# Check if train_full.jsonl exists
if not train_full_path.exists():
    raise FileNotFoundError(f"{train_full_path} does not exist")

Source here.

Let’s run it and wait a while:

python3 1_dataset.py

Dataset preparation process

As a result, two files will appear in the project root:

  • train_full.jsonl – the full training dataset, containing approximately 59 thousand documents formatted as chats between the user and the neural network.

  • val_full.jsonl – the full validation dataset, containing almost 3 thousand documents.
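
To get a feel for what these records look like before training, you can peek at the first document; the exact field names are defined by create_chat_set.py, so it is safer to inspect them than to assume:

import json

# Print the keys and a short preview of the first training record
with open('train_full.jsonl', 'r', encoding='utf-8') as fp:
    record = json.loads(fp.readline())

print(list(record.keys()))
print(json.dumps(record, ensure_ascii=False, indent=2)[:1000])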

Step 2 – Training the Model

The datasets are ready, which means we can start training the model itself; the torch, transformers, and peft libraries will be used for this. In a nutshell, at this stage we need to download the ruGPT-3.5-13B model from its repository on HuggingFace, copy the configurations to the output folder, and then make minor edits to them.

I won't include the full code of the 2_train.py script; you can study it here. Let's dwell only on the part that directly starts the training process:

import json
from huggingface_hub import snapshot_download
from pathlib import Path
import subprocess

content_dir = Path('.').resolve()
original_config_path = content_dir / 'configs/rugpt35_13b.json'
model_dir = content_dir / "ruGPT-3.5-13B"
base_model = "ai-forever/ruGPT-3.5-13B"
output_dir = content_dir / 'output'
config_path = content_dir / 'configs/rugpt35_13b_colab.json'

# Paths to datasets
train_full_path = content_dir / 'train_full.jsonl'
train_small_path = content_dir / 'train.jsonl'
train_path = train_full_path  # switch to train_small_path for a quick test run
val_full_path = content_dir / 'val_full.jsonl'
val_small_path = content_dir / 'val.jsonl'
val_path = val_full_path  # switch to val_small_path for a quick test run

# Download binaries
snapshot_download(repo_id=base_model, local_dir=model_dir, ignore_patterns=["LICENSE", "README.md", ".gitattributes"])

...

# Load configurations
with original_config_path.open('r') as fp:
    config = json.load(fp)

# Colab adjustments
config['trainer']['per_device_train_batch_size'] = 2
config['trainer']['per_device_eval_batch_size'] = 1
config['trainer']['gradient_accumulation_steps'] = 128
config['trainer']['eval_steps'] = 50
config['trainer']['save_steps'] = 50
config['max_tokens_count'] = 1000
#config['model_name'] = str(model_dir)
config['templates_path'] = str(content_dir / 'internal_prompts/rugpt35.json')
config['load_in_8bit'] = True
config['load_in_4bit'] = False

# Demo adjustments
config['trainer']['eval_steps'] = 2
config['trainer']['logging_steps'] = 1
config['trainer']['num_train_epochs'] = 1

with config_path.open('w') as fp:
    json.dump(config, fp, indent=4)

# Run training
module_directory = Path('rulm/self_instruct').resolve()
subprocess.run(
    [
        'python', '-m', 'src.train',
        '--config-file', config_path,
        '--train-file', train_path,
        '--val-file', val_path,
        '--output-dir', output_dir,
        '--report-to', 'none'
    ],
    cwd=module_directory,
    check=True
)

...

The code shows that the src.train module is launched in the context of rulm/self_instruct; it is passed options that specify the configuration file, the dataset files, and the directory where the result will be written.

Let’s launch it with the command:

python3 2_train.py

On my RTX 4090, the training took about 26 hours and required about 19GB of VRAM, so I had to close many applications that were using the video card. By the way, you can reduce the required VRAM to roughly 13GB; to do this, switch the on-the-fly quantization to load_in_4bit:

config['load_in_8bit'] = False
config['load_in_4bit'] = True

But I personally have not tested this possibility, since I believe that the quality of model training may deteriorate.

As a result of the script running, the following files will appear in the output directory:

  • adapter_config.json

  • adapter_model.bin

  • added_tokens.json

  • generation_config.json

  • merges.txt

  • README.md

  • special_tokens_map.json

  • tokenizer_config.json

  • vocab.json

We are primarily interested in the first two files: adapter_model.bin contains the weights of the LoRA layer, while adapter_config.json holds a configuration describing which model the LoRA layer was created for, how to apply it, which weights of the original model it acts on, and so on.

For your convenience, I have prepared a repository on HuggingFace containing this LoRA layer and everything necessary for it to work correctly; a test example of its use can be viewed here.
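
As an illustration of how the layer is applied, here is a minimal inference sketch with peft. Note that the prompt format is simplified (the real chat template lives in internal_prompts/rugpt35.json) and the generation parameters are illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "ai-forever/ruGPT-3.5-13B"
lora_dir = "output"  # directory with adapter_config.json and adapter_model.bin

# The tokenizer files (vocab.json, merges.txt, ...) were saved to the output directory
tokenizer = AutoTokenizer.from_pretrained(lora_dir)

# Load the base model in 8-bit and attach the LoRA layer on top of it
model = AutoModelForCausalLM.from_pretrained(base_model, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, lora_dir)
model.eval()

prompt = "Вопрос: Почему трава зелёная?\nОтвет:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=120, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))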

Step 3 – Merging the LoRA layer and the base model

This step is intermediate, but it is necessary in order to later obtain the GGML version of the model. To perform the layer merge, the rulm team has prepared a special script called convert_to_native.py, but unfortunately it is not compatible with ruGPT-3.5, as it is tailored to the LLaMA architecture, so I had to modify it a little. Put the modified script in the project root under the same name, then create a file 3_merge.py with the following content:

from pathlib import Path
from convert_to_native import convert_to_native

content_dir = Path('.').resolve()
output_dir = content_dir / 'output'
merged_path = output_dir / 'pytorch_model.bin'

convert_to_native(
    model_name=str(output_dir),
    output_path=str(merged_path),
    device="cpu",
    enable_offloading=True
)

assert merged_path.exists()

Source here.

The script merges the LoRA layer into the base ruGPT-3.5 model (loaded in float32). To run it you will need approximately 60GB of RAM, since the merging takes place in system memory, as the option device="cpu" indicates.

Let’s run it:

python3 3_merge.py

As a result, the file pytorch_model.bin, weighing approximately 56GB, will appear in the output directory; the merging procedure takes about 10-15 minutes.
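
Conceptually, the merge boils down to roughly the following (a sketch of the idea, not the author's modified convert_to_native.py, which additionally handles offloading):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model in float32, attach the LoRA adapter, fold its weights
# into the base weights, and save everything as a single checkpoint
base = AutoModelForCausalLM.from_pretrained('ruGPT-3.5-13B', torch_dtype=torch.float32)
model = PeftModel.from_pretrained(base, 'output')
model = model.merge_and_unload()
torch.save(model.state_dict(), 'output/pytorch_model.bin')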

The most interesting point is that this file can already be used for inference: just point AutoModelForCausalLM (from the transformers package) at the output folder.

Usage example here.
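
A minimal sketch of such a call might look like this (float16 and device_map are my illustrative choices, not taken from the linked example):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "output"  # contains pytorch_model.bin and the tokenizer files

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Вопрос: Что такое нейронная сеть?\nОтвет:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output[0], skip_special_tokens=True))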

Step 4 – Creating GGML Models

We are now ready to convert pytorch_model.bin to the GGML format; for this we will use the llm-rs-python library, a Python wrapper around the llm library written in Rust.

Let’s create a file 4_ggml.py and fill it with the following code:

from llm_rs.convert import AutoConverter
from llm_rs import AutoQuantizer, QuantizationType, ContainerType
from pathlib import Path

content_dir = Path('.').resolve()
# Directory with the merged model from step 3; here it is assumed to be named
# ruGPT-3.5-13B-lora, so adjust the path if you left the merged files in ./output
input_dir = content_dir / 'ruGPT-3.5-13B-lora'
output_dir = content_dir / 'output_ggml'

# Convert the model to fp16 format
converted_model = AutoConverter.convert(input_dir, output_dir)

# Quantize the model to different formats
AutoQuantizer.quantize(converted_model, quantization=QuantizationType.Q4_0, container=ContainerType.GGML)
AutoQuantizer.quantize(converted_model, quantization=QuantizationType.Q4_1, container=ContainerType.GGML)
AutoQuantizer.quantize(converted_model, quantization=QuantizationType.Q5_0, container=ContainerType.GGML)
AutoQuantizer.quantize(converted_model, quantization=QuantizationType.Q5_1, container=ContainerType.GGML)
AutoQuantizer.quantize(converted_model, quantization=QuantizationType.Q8_0, container=ContainerType.GGML)

Let’s run the script; it will take about 30 minutes to convert to all formats:

python3 4_ggml.py

The code shows that the model is first converted to a GGML-compatible format, with the weights cast from float32 to float16, and the converted model is saved in the output_ggml directory under the name ruGPT-3.5-13B-lora-f16.bin.

After the conversion, quantization is performed, and as a result we get five versions of the model in GGML format, which can be run, for example, with the gpt-2 binary built as part of the ggml project, or with llm, or with llm-rs-python, and so on. All models are saved in the output_ggml directory.

An example of using these models here.
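
For example, with llm-rs-python the call might look roughly like this (a sketch that assumes the Gpt2 class and generate() API shown in the llm-rs-python README; the file name is the f16 model produced above, and the quantized files next to it are used the same way):

from llm_rs import Gpt2

# Load the GGML model produced in the previous step and generate a completion
model = Gpt2("output_ggml/ruGPT-3.5-13B-lora-f16.bin")
result = model.generate("Вопрос: Почему трава зелёная?\nОтвет:")
print(result.text)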

By the way, I have prepared a repository on HuggingFace, so you can already try these models out.

Acknowledgments

Finally, I would like to express my sincere gratitude to the following authors and teams:

  • The Sber AI team for the original ruGPT-3.5 13B model.

  • IlyaGusev and the rulm project team for the datasets and scripts used to train the Saiga family of models.

  • ggerganov and the ggml project team for the documentation and source code that helped me figure out how to properly run models on the CPU.

  • iashchak for the ruGPT-3.5-13B-ggml repository on HuggingFace; studying it helped me discover the llm-rs-python project.

Conclusion

Well, here we are at the finish line. I hope this article will be useful to everyone who is interested in training deep neural networks and plans to use the ruGPT-3.5 13B model in their research and projects.

I wish you success in your machine learning endeavors!

P.S. The solution described in this publication is also compatible with the mGPT-13B model, since its architecture is fundamentally no different from ruGPT-3.5.
