Mastering T5 (Text-to-Text Transfer Transformer): Fine-Tuning

It sometimes happens that when working through a tutorial article, something does not work even though the code is copied straight from the article.
In this case, fine-tuning of the T5 model (Text-to-Text Transfer Transformer) on a machine translation task was done following the tutorial, and overall everything worked.

The original tutorial article on HuggingFace was suggested by colleagues in the chat, for which they have my thanks.

We work in Colab.
We translate from English to French.

Install

Installing the libraries.
Note: the usual installation of transformers produces an error here; the transformers[torch] variant is needed instead.

!pip install transformers[torch] datasets evaluate sacrebleu

Dataset

We download the data and divide it into training and test sets.

from datasets import load_dataset
books = load_dataset("opus_books", "en-fr")
books = books["train"].train_test_split(test_size=0.2)
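
If the split needs to be reproducible between runs, the same call accepts a seed (not used in the article):

# Optional alternative to the call above: a fixed seed makes repeated runs
# produce the same train/test partition.
# books = books["train"].train_test_split(test_size=0.2, seed=42)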

You can see an example of a pair like this:

books["train"][0]

Preprocess

In Colab, t5-small and t5-base worked correctly.
For t5-large the standard approach did not have enough memory.

checkpoint = "t5-small"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "

def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

tokenized_books = books.map(preprocess_function, batched=True)
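
An optional sanity check, not in the original article, that the mapping produced the expected columns:

# After map(), the tokenized split should also contain
# 'input_ids', 'attention_mask' and 'labels'.
print(tokenized_books["train"].column_names)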

from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

Evaluate

Add the BLEU metric.

import evaluate
metric = evaluate.load("sacrebleu")

import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
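
This is also why postprocess_text wraps every label in its own list: sacrebleu expects one list of reference strings per prediction. A minimal standalone check, not from the article:

# sacrebleu takes a list of predictions and, for each prediction,
# a list of acceptable references (a single reference each here).
check = metric.compute(
    predictions=["Quel est votre nom ?"],
    references=[["Quel est votre nom ?"]],
)
print(check["score"])  # identical strings give a score of 100.0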

Login

For this fine-tuning you need to log in with a token that has "write" rights.

from huggingface_hub import notebook_login
notebook_login()
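
If the notebook widget is inconvenient, the same login can be done programmatically; a small alternative, where the token value is a placeholder:

# Alternative to the widget: pass the access token directly
# ("hf_..." is a placeholder, not a real token).
from huggingface_hub import login
login(token="hf_...")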

Train

We load the selected pre-trained model.

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

And we define the name of the new model and the training arguments.
Compared to the article, overwrite_output_dir=True has been added so that the output directory can be overwritten after interrupted runs.

new_model="my_t5_small_test"

training_args = Seq2SeqTrainingArguments(
    output_dir=new_model,
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True
)
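
For t5-large, which ran out of memory in Colab (see the Preprocess section), the usual levers are a smaller per-device batch size combined with gradient accumulation, and optionally gradient checkpointing. A sketch of such arguments, not tested in this article:

# Memory-saving variant for larger checkpoints (untested here): smaller
# batches, gradient accumulation to keep the effective batch size,
# and gradient checkpointing to trade compute for memory.
training_args_large = Seq2SeqTrainingArguments(
    output_dir=new_model,
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
)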

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

The fine-tuning completed successfully.

Once it has finished, we push the model to HuggingFace.

trainer.push_to_hub()
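
After the push, the model can also be loaded back from the Hub by its repository id; the account name below is a placeholder:

# Hypothetical repository id: replace 'username' with your Hub account name.
# translator = pipeline("translation", model="username/my_t5_small_test")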

Inference

text = "translate English to French: What is your name?"

from transformers import pipeline
translator = pipeline("translation", model=new_model)
translator(text)

>>> [{'translation_text': 'Quel est votre nom ?'}]

The pipeline reports that it expects a task like "translation_XX_to_YY" instead of "translation", but no fundamental difference was noticed after making that correction.
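
One form of that correction looks like this:

# Same call with the explicit task name the pipeline asks for;
# the result was reported to be the same.
translator = pipeline("translation_en_to_fr", model=new_model)
translator(text)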

More about your own data

As an additional example, another dataset was created "by hand", not following the article.

texts = [
    {'en': 'The Wanderer', 'fr': 'Le grand Meaulnes'},
    {'en': 'Hello', 'fr': 'Bonjour'}
    ]

from datasets import Dataset, DatasetDict

data_dict = {'id': list(range(len(texts))), 'translation': texts}
my_dataset = Dataset.from_dict(data_dict)
dataset_dict = DatasetDict({"train": my_dataset})

Everything worked this way as well.
This fragment shows how to create a dataset regardless of the form the pairs are in initially: the texts list can always be built with loops and some transformations, as in the sketch below.
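
For instance, a minimal sketch assuming the pairs start out as two parallel lists (en_sentences and fr_sentences are hypothetical names):

# Hypothetical starting point: two parallel lists of sentences.
en_sentences = ["The Wanderer", "Hello"]
fr_sentences = ["Le grand Meaulnes", "Bonjour"]

# Build the same 'texts' structure as above with a comprehension.
texts = [{"en": en, "fr": fr} for en, fr in zip(en_sentences, fr_sentences)]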

Questions that remain unclear

The original tutorial article suggested 2 epochs.
That is enough to demonstrate and test that everything works, but for a practical result it is unclear at what point the model starts translating exactly as in the additional dataset rather than as it did before. This can probably only be determined with specific examples and a side-by-side comparison, for instance as in the sketch below.
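
Translating the same inputs with the original checkpoint and with the fine-tuned model and printing them side by side (the sample sentence is a placeholder; in practice, take pairs from the additional dataset):

from transformers import pipeline

# Placeholder input; in practice, iterate over pairs from the added data.
samples = ["translate English to French: The Wanderer"]

base = pipeline("translation", model=checkpoint)
tuned = pipeline("translation", model=new_model)

for s in samples:
    print(s)
    print("  base :", base(s)[0]["translation_text"])
    print("  tuned:", tuned(s)[0]["translation_text"])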

Notes

If you find an inaccuracy in the article, or something useful to add, please report it in the comments.
