Telegram bot with language model trained on 2ch

If you want to spice up the conversation in a Telegram chat with ridiculous, but often well-aimed and funny comments, or you are looking for information on integrating a language model into a bot, or you want to train a language model yourself on data from 2ch, then this article describes the steps for doing so.

Bot

I launched a bot that can be added to chats, and it will respond to messages the way people respond to posts on 2ch.hk/b/.

For this, a language model was fine-tuned on dialogues from 2ch, an API was set up around it, and a Telegram bot was connected to that API. More on each step in order:

Training

Hugging Face

The easiest way to train a language model is to use the transformers library. It provides tools for the automated training and application of neural networks (including language models).

Their hub also hosts a lot of pre-trained models and datasets, which simplifies things greatly: training a model from scratch is expensive, while fine-tuning one is much cheaper.

Base model

We take a dialogue model from the list of available models. Models are grouped by language and task, and it just so happened that there is exactly one Russian-language dialogue model there. Language models are, of course, universal things, and a dialogue model can be made from a non-dialogue one, but the closer the domain of the pre-trained model is to the target domain, the better.

The model Grossmend/rudialogpt3_medium_based_on_gpt2 was selected because of its size: 1.3B parameters is large enough for the model to generate meaningful text, but not too large.
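Loading such a model with transformers takes a few lines. A minimal sketch, assuming the standard AutoModel API; the generation parameters here are illustrative, and the exact dialogue input format this model expects is described on its model card:

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Grossmend/rudialogpt3_medium_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# generation parameters are illustrative assumptions
inputs = tokenizer("Привет!", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))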

Data

To train the model, data was collected from 2ch.hk/b/. I looked for a ready-made dataset for a long time, but did not find anything suitable, so I decided to collect the data myself, using api2ch. Threads were downloaded, parsed, cleaned up and converted to a dialogue format.

The final dataset consisted of about 60k dialogues with an average length of 3 messages – enough for fine-tuning a medium-sized model.

Dialogue example (messages ordered from last to first):

{
  "dialogue": ["Рад слышать!", "Хорошо!", "Как дела?"]
}
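To give an idea of how threads become such dialogues, here is a simplified sketch; it assumes each post replies to the previous one, whereas the real code on GitHub reconstructs reply chains from >>-links, so the function and parameter names here are hypothetical:

def thread_to_dialogues(posts, max_len=3):
    """posts: cleaned post texts of one thread, in chronological order."""
    dialogues = []
    for i in range(len(posts)):
        chain = posts[max(0, i - max_len + 1):i + 1]
        # messages are stored from last to first, as in the example above
        dialogues.append({"dialogue": list(reversed(chain))})
    return dialogues

print(thread_to_dialogues(["Как дела?", "Хорошо!", "Рад слышать!"])[-1])
# {'dialogue': ['Рад слышать!', 'Хорошо!', 'Как дела?']}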

The code for collecting and cleaning the data can be found on GitHub. The dataset can be found on HuggingFace.

Data filtering

To increase the toxicity of the data, it was filtered with the classifier model sismetanin/rubert-toxic-pikabu-2ch. The model was created to moderate toxic content, but nothing stops us from using it for evil.

Data toxicity:

count  63187.000000
mean       0.675554
25%        0.487243
50%        0.721271
75%        0.928254

It was decided to keep dialogues above the 75th percentile of toxicity, which corresponds to 0.93 out of 1.00 on the toxicity scale.
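A sketch of how such filtering can be done with the transformers text-classification pipeline; the label name for the toxic class and the decision to score a dialogue as one concatenated text are assumptions, not the article's exact code:

from transformers import pipeline

clf = pipeline("text-classification", model="sismetanin/rubert-toxic-pikabu-2ch")

def toxicity(dialogue):
    # "LABEL_1" as the toxic class is an assumption; check the model card
    result = clf(" ".join(dialogue["dialogue"]), truncation=True)[0]
    return result["score"] if result["label"] == "LABEL_1" else 1 - result["score"]

# dialogues: list of {"dialogue": [...]} records, as built above
scores = [toxicity(d) for d in dialogues]
threshold = sorted(scores)[int(len(scores) * 0.75)]  # the 75th percentile, ~0.93
dataset = [d for d, s in zip(dialogues, scores) if s >= threshold]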

Training process and result

A Jupyter notebook with the training code can be found on GitHub.

The finished model can be found on HuggingFace.
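In condensed form, the training loop looks roughly like the sketch below; it assumes the dialogues have already been serialized into the model's input format, and the hyperparameters are placeholders, so the notebook on GitHub remains the authoritative version:

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

name = "Grossmend/rudialogpt3_medium_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

texts = ["..."]  # placeholder: toxic dialogues serialized into the model's input format
ds = Dataset.from_dict({"text": texts}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()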

Example of the model's response before fine-tuning:

Hello!

Hello!

And after:

Hello!

>all you can do is not be a jerk…

The training went well.

Bringing up the model and connecting it to the bot

API

A very simple Flask server was written to work with the model.

POST request: {"text": "Привет!"}
Response: {"toxified": "Пока!"}
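A minimal sketch of such a server; the route, the model path and the generation parameters are assumptions:

import torch
from flask import Flask, jsonify, request
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_num_threads(4)  # PyTorch CPU multithreading

app = Flask(__name__)
path = "path/to/finetuned-model"  # hypothetical path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

@app.route("/", methods=["POST"])
def toxify():
    text = request.get_json()["text"]
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64, do_sample=True)
    reply = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)
    return jsonify({"toxified": reply})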

HuggingFace has excellent documentation, and the details of running models should be looked up there.

The model itself runs on the CPU with PyTorch multithreading, so I decided it was not worth building an API with queues and workers. Since the model runs on the CPU, it would be possible to get a serious speedup by scripting the model (TorchScript), but I could not make scripting work together with HuggingFace's generation utilities, so scripting had to be abandoned.

Metrics

For beauty, metrics collection and Grafana dashboards have been set up.
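The article does not spell out the metrics stack beyond Grafana; here is a sketch assuming Prometheus as the data source, via prometheus_client, with hypothetical metric names:

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("bot_requests_total", "Number of generation requests")
LATENCY = Histogram("bot_generation_seconds", "Time spent generating a reply")

start_http_server(9090)  # endpoint for Prometheus to scrape

# inside the request handler:
# REQUESTS.inc()
# with LATENCY.time():
#     reply = generate(text)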

Python Telegram API

Good, detailed posts have already been written about it (for example).

Let me just say that it can work asynchronously: while waiting for a response from the model, the program can process other requests.
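As an illustration of that asynchrony, a sketch of a handler assuming python-telegram-bot version 20 or newer and the Flask API above; the token, the URL and the names are placeholders:

import httpx
from telegram import Update
from telegram.ext import (ApplicationBuilder, ContextTypes,
                          MessageHandler, filters)

async def reply(update: Update, context: ContextTypes.DEFAULT_TYPE):
    async with httpx.AsyncClient() as client:
        # while this await is pending, the bot keeps processing other updates
        r = await client.post("http://model-api:5000/",
                              json={"text": update.message.text})
    await update.message.reply_text(r.json()["toxified"])

app = ApplicationBuilder().token("TELEGRAM_BOT_TOKEN").build()
app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, reply))
app.run_polling()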

The code

All the code for data collection, model training and the bot is publicly available on GitHub. For ease of use, the bot can be brought up with docker-compose.