How we built a system to save the Internet from toxicity

Online toxicity is a problem everyone has encountered. With the rapid development of AI, a natural solution is to automatically remove toxic patterns while preserving the original meaning and the author's style. One such approach uses seq2seq NLP models, which we train on pairs of the form (toxic sentence; non-toxic sentence):

Dataset slice for training a seq2seq detoxification model

Before training the model, we also need to decide on the metrics by which we will measure quality:

  • Content preservation, SIM – cosine similarity between the embeddings of the original sentence and its detoxified version.

  • Fluency, FL – the percentage of natural-sounding sentences, computed with a RoBERTa-base classifier trained on the CoLA dataset.

  • Style transfer accuracy, STA – the percentage of non-toxic outputs, identified with a RoBERTa classifier trained on half of the merged Jigsaw datasets from different years (2018, 2019, 2020).

For convenience, we introduce the Joint score, J – the product of the three metrics described above. Detoxification quality is assessed on the basis of J:

J = \dfrac{1}{n} \sum_{i=1}^{n} \mathrm{STA}(y_i) \cdot \mathrm{SIM}(x_i, y_i) \cdot \mathrm{FL}(y_i)

n – the number of sentences in the evaluation dataset;

x_i – the original sentence, y_i – its detoxified version.
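
A minimal sketch of the Joint score in Python, assuming the per-sentence STA, SIM, and FL values have already been computed by the classifiers described above:

```python
# Joint score J: mean over sentences of STA(y_i) * SIM(x_i, y_i) * FL(y_i).
def joint_score(sta, sim, fl):
    """Average of per-sentence products of the three metrics."""
    return sum(s * m * f for s, m, f in zip(sta, sim, fl)) / len(sta)

# Example: three detoxified sentences and their per-sentence scores.
print(joint_score(sta=[1.0, 0.0, 1.0], sim=[0.9, 0.8, 0.7], fl=[1.0, 1.0, 0.0]))
# -> (0.9 + 0.0 + 0.0) / 3 = 0.3
```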

For English, effective text detoxification solutions already exist, e.g. the one developed by my colleagues at Skoltech. But what about a system that detects universal patterns of toxicity without being tied to a specific language? This is exactly the problem we solved at the PAN-2024 competition. A summary of our solution is available in English here.

Data processing and augmentation

For augmentation, we used the following strategy (a sketch of the pipeline follows the list):

  1. Translation from English into the target language using Deep-translator.

  2. Checking that meaning is preserved during translation by comparing LaBSE embeddings of the original and translated pairs.

  3. Checking that toxicity is preserved using an XLM-R classifier.

As a result, the data distribution was evened out and additional samples were generated for low-resource languages.
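
A sketch of this filtering pipeline, assuming deep-translator for translation, LaBSE via sentence-transformers, and placeholders for the similarity threshold and the XLM-R toxicity checkpoint (the actual values we used are not shown here):

```python
# Augmentation filter: translate a pair, then keep it only if meaning
# and toxicity survive. SIM_THRESHOLD and the toxicity checkpoint name
# are placeholders, not the values used in the competition.
from deep_translator import GoogleTranslator
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

labse = SentenceTransformer("sentence-transformers/LaBSE")
toxicity = pipeline("text-classification", model="path/to/xlm-r-toxicity")  # placeholder
SIM_THRESHOLD = 0.8  # hypothetical cut-off for meaning preservation

def augment_pair(toxic_en: str, neutral_en: str, lang: str):
    """Translate an English (toxic, neutral) pair into `lang` and keep it
    only if both meaning and toxicity are preserved."""
    # 1. Translate both sides into the target language.
    tr = GoogleTranslator(source="en", target=lang)
    toxic_tr, neutral_tr = tr.translate(toxic_en), tr.translate(neutral_en)

    # 2. Meaning check: cosine similarity of LaBSE embeddings between
    #    each original sentence and its translation.
    emb = labse.encode([toxic_en, toxic_tr, neutral_en, neutral_tr])
    if (util.cos_sim(emb[0], emb[1]).item() < SIM_THRESHOLD
            or util.cos_sim(emb[2], emb[3]).item() < SIM_THRESHOLD):
        return None

    # 3. Toxicity check: the toxic side must stay toxic and the neutral
    #    side must stay neutral (label names depend on the checkpoint).
    if (toxicity(toxic_tr)[0]["label"] != "toxic"
            or toxicity(neutral_tr)[0]["label"] == "toxic"):
        return None
    return toxic_tr, neutral_tr

print(augment_pair("you are an idiot", "you are wrong", "de"))
```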

Detoxification Methods

Supervised fine-tuning

As the main approach, we chose to fine-tune various multilingual language models. We consider the mT0 family the most promising for further training: a family of sequence-to-sequence Transformer models derived from mT5. The mT0 family was chosen for its strong performance on multilingual data, and these models were adapted for each competition language. We also experimented with the newer Aya-101 model, a version of mT5-XXL fine-tuned on multilingual instructions.

The training hyperparameters were almost identical for all models. The learning rate was set to 1e-5, the total batch size was 8, and the weight decay was 0.01, with a cosine learning-rate scheduler. All models were trained for 4 epochs; all other parameters were the HuggingFace Seq2SeqTrainer defaults. The only difference was that for mT0-XL we updated the weights of the entire model, since our computing resources allowed it, while for larger models such as Aya-101 and mT0-XXL only a LoRA adapter was trained. The LoRA settings were as follows: r and lora_alpha were set to 32, lora_dropout to 0.1, and the rest were left at their defaults. The best checkpoint was selected by validation loss.
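
A sketch of this setup, assuming the HuggingFace transformers, datasets, and peft APIs; the one-row dataset is a placeholder for the real augmented corpus:

```python
# Fine-tuning sketch with the hyperparameters described in the text.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "bigscience/mt0-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# For Aya-101 / mT0-XXL we trained a LoRA adapter instead of the full model:
# model = get_peft_model(model, LoraConfig(
#     r=32, lora_alpha=32, lora_dropout=0.1, task_type="SEQ_2_SEQ_LM"))

raw = Dataset.from_dict({
    "toxic":   ["you are an idiot"],  # source side (placeholder example)
    "neutral": ["you are wrong"],     # target side (placeholder example)
})

def tokenize(batch):
    enc = tokenizer(batch["toxic"], truncation=True)
    enc["labels"] = tokenizer(text_target=batch["neutral"], truncation=True)["input_ids"]
    return enc

train_ds = val_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="detox-mt0-xl",
    learning_rate=1e-5,
    per_device_train_batch_size=8,   # total batch size 8
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    num_train_epochs=4,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # best checkpoint by validation loss
)

trainer = Seq2SeqTrainer(
    model=model, args=args,
    train_dataset=train_ds, eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```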

To give the models language context, we prepended a language-specific prefix to each toxic sentence. As a result, during training the model received toxic sentences carrying this prefix.
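
For illustration, with hypothetical prefixes (the actual wording of our prefixes is not reproduced here):

```python
# Hypothetical language-dependent prefixes, prepended to each toxic input.
PREFIXES = {"en": "Detoxify: ", "de": "Entgifte: ", "ru": "Детоксифицируй: "}

def add_prefix(sentence: str, lang: str) -> str:
    return PREFIXES[lang] + sentence

print(add_prefix("you are an idiot", "en"))  # -> "Detoxify: you are an idiot"
```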

Selection of candidates for inference

During inference, we generated 10 hypotheses and kept the 5 most likely ones using diverse beam search: 10 beams split into 5 beam groups, with a diversity penalty of 2.5 and a repetition penalty of 1.2. To pick the best option, we computed a relevance score as the product of similarity and non-toxicity scores. Similarity was calculated from LaBSE embeddings, and toxicity was assessed with the xlm-roberta-large toxicity classifier. After computing the relevance scores, we selected the candidate with the highest score.
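
A sketch of this generation-and-reranking step, assuming the fine-tuned checkpoint from the previous section and a placeholder path for the toxicity classifier:

```python
# Diverse beam search generation plus relevance reranking.
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("bigscience/mt0-xl")  # fine-tuned checkpoint in practice
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-xl")
labse = SentenceTransformer("sentence-transformers/LaBSE")
toxicity = pipeline("text-classification", model="path/to/xlm-roberta-large-toxicity")  # placeholder

def detoxify(toxic: str) -> str:
    out = model.generate(
        **tokenizer(toxic, return_tensors="pt"),
        num_beams=10,            # 10 beams ...
        num_beam_groups=5,       # ... split into 5 diverse groups
        num_return_sequences=5,  # keep the 5 most likely hypotheses
        diversity_penalty=2.5,
        repetition_penalty=1.2,
        max_new_tokens=64,
    )
    candidates = tokenizer.batch_decode(out, skip_special_tokens=True)

    # relevance = similarity(source, candidate) * P(candidate is non-toxic)
    sims = util.cos_sim(labse.encode(toxic), labse.encode(candidates))[0]
    non_toxic = [1.0 - r["score"] if r["label"] == "toxic" else r["score"]
                 for r in toxicity(candidates)]  # label names depend on the checkpoint
    relevance = [s.item() * t for s, t in zip(sims, non_toxic)]
    return candidates[relevance.index(max(relevance))]

print(detoxify("you are an idiot"))
```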

ORPO

After fine-tuning, we decided to further optimize the model using the Odds Ratio Preference Optimization (ORPO) method. Unlike DPO, this optimization does not require a reference model. Alignment was performed on an unseen test set.

To build the preference dataset, we generated hypotheses with diverse beam search over samples from the test set and annotated them with the relevance score described above. Only the candidate with the highest relevance score was selected as the best one, while all others were classified as rejected samples.

The resulting ORPO alignment dataset contained a prompt (the toxic sentence), a rejected sample (a negative candidate), and a chosen sample (the best candidate). We trained the model on this dataset with the same parameters used for the other models; the beta parameter for ORPO was set to 0.1. For the final submission, we used the aligned model together with the candidate-selection algorithm described above.
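
A hedged sketch of this alignment step using TRL's ORPOTrainer (whether we used TRL specifically is an assumption here); the one-row dataset stands in for the preference dataset built from the reranked hypotheses:

```python
# ORPO alignment sketch with a prompt/chosen/rejected preference dataset.
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

tokenizer = AutoTokenizer.from_pretrained("bigscience/mt0-xl")  # fine-tuned checkpoint in practice
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-xl")

pref_ds = Dataset.from_dict({
    "prompt":   ["you are an idiot"],  # toxic sentence (with its language prefix in practice)
    "chosen":   ["you are wrong"],     # candidate with the highest relevance score
    "rejected": ["you are an idiot"],  # any lower-scoring hypothesis
})

config = ORPOConfig(
    output_dir="detox-mt0-xl-orpo",
    beta=0.1,                          # as in the text
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    num_train_epochs=4,
)
trainer = ORPOTrainer(model=model, args=config, train_dataset=pref_ds,
                      tokenizer=tokenizer)  # processing_class= in newer TRL
trainer.train()
```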

Results

The final results of the automatic evaluation are presented in the table below:

The mT0-XL model with ORPO showed the best results among all approaches on the leaderboard for all languages. Compared to mT0-XL before ORPO alignment, the aligned version slightly improved performance, raising the average score by 0.01 points. Surprisingly, the larger models did not perform better. For example, the mT0-XXL model with 13 billion parameters performed even worse than mT0-XL with 3.7 billion, and the Aya-101 model, based on mT5-XXL and additionally tuned on instruction data for different languages, also performed worse than the other models. Because Aya-101 and mT0-XXL underperformed mT0-XL, we did not run the ORPO alignment step for them. Among the other teams in the automatic evaluation, our submissions, mainly mT0-XL-ORPO and mT0-XL, are the two best approaches for all languages except Chinese.

Conclusion

In conclusion, our system demonstrates a powerful pipeline for enriching training data for low-resource languages and beyond. In future research, we may explore translation-free approaches, since machine translation quality for less common languages tends to be poor. Another direction is model interpretability: understanding which lexemes the model replaced while editing the text, and why.
