Fine-Tuning a Transformer-Based Model (RuBERT) for Text Classification

The task of text classification has long been established in many companies: it is used to determine customer sentiment, to sort documents into a set of known topics, to detect fake news, and so on. Today I will present a state-of-the-art approach to solving a binary classification problem, namely detecting messages that contain a complaint about an employee.

I will also compare the accuracy of two approaches: fine-tuning BERT versus obtaining pre-trained embeddings and classifying them with a fully connected neural network.

Approaches to text classification

There are many approaches to working with text, each with its pros and cons. There are simpler solutions that give lower accuracy than more complex approaches but can be used when time-to-market needs to be kept to a minimum. And there are harder approaches, such as fine-tuning BERT, that give the highest possible accuracy for the specific nature of the data on which the model was trained.

1. Logistic Regression + Vectorizer

This is one of the easiest approaches to implement. Logistic Regression and similar algorithms take numeric values as input, so before training the model the text must be vectorized. The easiest way to do this is to use a bag of words.

For this you can use CountVectorizer or its more advanced version, TfidfVectorizer, which also takes word frequency into account.

But to use this approach you first need to clean up the text: remove stop words, special characters and emoji, and lemmatize the words.
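Putting this together, a minimal sketch of such a baseline (assuming scikit-learn and hypothetical lists of already cleaned texts and labels) could look like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score

# Bag-of-words baseline: TF-IDF weighting plus a linear classifier
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

baseline.fit(clean_train_texts, train_labels)            # hypothetical cleaned, lemmatized texts
print(f1_score(test_labels, baseline.predict(clean_test_texts)))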

Pros: it is not resource-intensive and gives a pretty good baseline, considering that it is one of the simplest methods.

Cons: correct text preprocessing is required, and the accuracy can be much worse than that of more powerful tools, because this method takes into account neither the semantics of the text nor the word order in a sentence. The resulting text representation is also very sparse.

2. Word2vec and fastText

Word2vec is a method for building a dense (compressed) vector space of words: it takes a text corpus as input and maps each word to a vector, with the representation based on contextual proximity.

fastText is an improvement on Word2vec that uses character n-grams, which helps with words the model has never seen and thus has a positive effect on model quality.

Pros: there are a lot of pre-trained models, and Gensim is quite a handy tool for working with them.

Cons: it still does not capture the full semantics of a sentence, and these models assign a single vector to each word regardless of context.
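A rough sketch of this approach (assuming gensim 4.x and a hypothetical list of tokenized sentences) could look like this; averaging the word vectors gives a fixed-size sentence vector that can then be fed to any classifier:

import numpy as np
from gensim.models import Word2Vec

# tokenized_sentences is a hypothetical list of token lists obtained after preprocessing
w2v = Word2Vec(sentences=tokenized_sentences, vector_size=300, window=5, min_count=2, workers=4)

def sentence_vector(tokens):
    # Average the vectors of the words the model knows; zero vector if none are known
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

X = np.vstack([sentence_vector(s) for s in tokenized_sentences])  # features for a downstream classifier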

3. Obtaining embeddings from a pre-trained BERT and classifying them with a neural network

BERT is a neural network that has shown results ahead of other methods by a wide margin on a number of text-processing tasks, since it takes into account both word order and the semantics of texts.

There are also plenty of pre-trained models (for example, from Hugging Face or DeepPavlov) that can be used quite easily to get embeddings. The resulting vectors can then be separated, for example, by a linear model, but this will not work very well, since the text vectors obtained this way are not linearly separable. This approach will definitely be better than Logistic Regression + Vectorizer, but still worse than separating the vectors with several fully connected layers of a neural network, which can capture non-linear dependencies. More details about this approach can be found here (link).

Pros: there are lots of pre-trained BERT models, and it is easy to obtain embeddings from them. The neural network that separates the vectors can capture non-linear dependencies in the data, which again has a positive effect on accuracy.

Cons: this approach does not take into account the nature of the original text data that we want to classify. For example, BERT has been trained on a lot of wiki articles, while I need to classify whether a text written by a user is a complaint against an employee of our organization or not. Wiki articles are written in a scientific or literary, well-structured language, whereas user messages (complaints, in my case) have their own semantics and may be far less well structured.
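To make this third approach concrete, here is a minimal sketch (assuming the Hugging Face model DeepPavlov/rubert-base-cased-sentence and hypothetical lists of texts): the last hidden states are mean-pooled into one vector per text, and a small fully connected head is trained on top of the frozen embeddings:

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased-sentence")
bert = AutoModel.from_pretrained("DeepPavlov/rubert-base-cased-sentence")
bert.eval()  # the encoder stays frozen; only the classifier head below is trained

def embed(texts):
    # Mean pooling over the last hidden states -> one 768-dim vector per text
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc).last_hidden_state              # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1).float()   # ignore padding tokens
    return (out * mask).sum(1) / mask.sum(1)

# A small fully connected head that can capture non-linear dependencies
classifier = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 2),
)
# classifier is then trained with e.g. nn.CrossEntropyLoss on embed(train_texts) and the labels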

4. Fine-Tune BERT

Each pre-trained BERT has its own weights, obtained by training it on a large corpus of texts. You can further train BERT on your own text data, thereby updating the model's weights so that it better captures the semantics of your data.

Pros: with this method you can get a classification accuracy much higher than with the methods described above.

Cons: fine-tuning BERT is quite time-consuming and computationally expensive.

Fine-tuning rubert-base-cased-sentence

In this section I will present the code for fine-tuning BERT and compare the two approaches: obtaining pre-trained embeddings and classifying them with a fully connected neural network versus fine-tuning BERT.

To solve the problem of detecting messages with a complaint against an employee, I will use the data from the previous publication (link), but the train and val datasets will be merged into one to increase the size of the training sample.

First, I import the necessary libraries:

import pandas as pd
import numpy as np
import random
import torch
import transformers
import torch.nn as nn
from transformers import AutoModel, BertTokenizer, BertForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_metric, Dataset
from sklearn.metrics import classification_report, f1_score

Loading data for additional training of the model and test data to check the operation of the future model:

train_df = pd.read_excel('tmp/train.xlsx', engine="openpyxl", index_col = 0)
test_df = pd.read_excel('tmp/test.xlsx', engine="openpyxl", index_col = 0)
train_text = train_df['text'].astype('str')
train_labels = train_df['target']
test_text = test_df['text'].astype('str')
test_labels = test_df['target']

Let’s look at the data:
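For example, the first few rows and the class balance:

print(train_df.head())
print(train_df['target'].value_counts())   # the classes are not balanced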

Setting all seeds:

def seed_all(seed_value):
    # Fix all sources of randomness to make runs reproducible
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True   # deterministic cuDNN kernels (at some cost in speed)
        torch.backends.cudnn.benchmark = False

seed_all(42)

Next, the pre-trained model is loaded, along with the number of classes the data is divided into. For fine-tuning I will use the GPU:

model = BertForSequenceClassification.from_pretrained('rubert_base_cased_sentence/', num_labels=2).to("cuda")
tokenizer = BertTokenizer.from_pretrained('rubert_base_cased_sentence/')

This model accepts sequences no longer than 512 tokens, so first I check the maximum length in train and test:

seq_len_train = [len(str(i).split()) for i in train_df['text']]
seq_len_test = [len(str(i).split()) for i in test_df['text']]
max_seq_len = max(max(seq_len_test), max(seq_len_train))
max_seq_len

Within a batch, BERT expects sequences of equal length, so for train and test I pad and truncate to a length of 417 (the maximum found above) rather than the default 512, in order to reduce the amount of padding:

tokens_train = tokenizer.batch_encode_plus(
    train_text.tolist(),
    max_length = max_seq_len,
    padding = 'max_length',
    truncation = True
)
tokens_test = tokenizer.batch_encode_plus(
    test_text.tolist(),
    max_length = max_seq_len,
    padding = 'max_length',
    truncation = True
)

This code wraps the tokenized text data in a torch Dataset:

class Data(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = list(labels)   # positional access regardless of the original pandas index

    def __getitem__(self, idx):
        # Build one example: token ids, attention mask and the label as tensors
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = Data(tokens_train, train_labels)
test_dataset = Data(tokens_test, test_labels)

I will write a function to calculate the metric. I use the F1 metric, since the classes are not balanced:

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds)
    return {'F1': f1}

Below are all the parameters that will be used for training:

training_args = TrainingArguments(
    output_dir="./results",                  # output directory
    num_train_epochs = 3,                    # number of training epochs
    per_device_train_batch_size = 8,         # batch size per device during training
    per_device_eval_batch_size = 8,          # batch size per device during evaluation
    weight_decay = 0.01,                     # weight decay
    logging_dir="./logs",                    # directory for storing logs
    load_best_model_at_end = True,           # whether to load the best model after training
    learning_rate = 1e-5,                    # learning rate
    evaluation_strategy ='epoch',            # evaluate after each epoch (could also be after a given number of steps)
    logging_strategy = 'epoch',              # log after each epoch
    save_strategy = 'epoch',                 # save after each epoch
    save_total_limit = 1,
    seed=21)

Passing the pre-trained model, the tokenizer, the training data, the validation data and the metric function to the trainer (since train and val were merged earlier, validation here runs on the training set):

trainer = Trainer(model=model,
                  tokenizer = tokenizer,
                  args = training_args,
                  train_dataset = train_dataset,
                  eval_dataset = train_dataset,
                  compute_metrics = compute_metrics)

Running model training:

trainer.train()

Saving the trained model:

model_path = "fine-tune-bert"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
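Later, the fine-tuned model can be loaded back for inference in the usual way (a short sketch, assuming the same environment):

model = BertForSequenceClassification.from_pretrained(model_path).to("cuda")
tokenizer = BertTokenizer.from_pretrained(model_path)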

Writing a function to get predictions:

def get_prediction():
    test_pred = trainer.predict(test_dataset)
    labels = np.argmax(test_pred.predictions, axis = -1)
    return labels
pred = get_prediction()

Checking the result and printing all the information needed to assess the quality of the model:

print(classification_report(test_labels, pred))
print(f1_score(test_labels, pred))

As a result I get F1 = 0.976, which is a very good result, and it can still be improved.

Suggestions for improvement:

1. Study the model's hyperparameters in more detail and try tuning them, for example by replacing the constant learning rate with a learning rate decay schedule (see the sketch after this list).

2. Build a cross-validation stacking ensemble of this model. This greatly increases the total fine-tuning time, but it can reduce the prediction variance and add roughly 0.5-1% of accuracy.
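For the first suggestion, here is a sketch of what learning rate decay with warm-up could look like in TrainingArguments (assuming a reasonably recent transformers version; the other parameters repeat the configuration above):

training_args = TrainingArguments(
    output_dir = "./results",
    num_train_epochs = 3,
    learning_rate = 1e-5,
    lr_scheduler_type = "linear",   # linearly decay the learning rate after warm-up
    warmup_ratio = 0.1,             # spend the first 10% of steps warming up
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 8,
    weight_decay = 0.01,
    evaluation_strategy = 'epoch',
    logging_strategy = 'epoch',
    save_strategy = 'epoch',
    load_best_model_at_end = True,
    save_total_limit = 1,
    seed = 21)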

Conclusion

This approach increased the F1 metric by almost 10% relative to the approach described in the article linked above. Fine-tuning BERT allows you to classify data of the same nature as the data it was fine-tuned on with very high accuracy, and it works not only for binary classification but also for multi-class tasks. It is also worth noting that as the amount of training data decreases, the metric will drop, as expected, but it will still be better than with the other approaches, since they do not take into account the nature of the specific data.
