Telegram bot with language model trained on 2ch

If you want to spice up the conversation in a Telegram chat with ridiculous, but often well-aimed and funny comments, or you are looking for information on integrating a language model into a bot, or you want to train a language model yourself on data from 2ch, then this article describes the steps for doing so.

Bot

I launched a bot that can be added to chats, and it will respond to messages the way people respond to posts on 2ch.hk/b/.

For this, a language model was fine-tuned on dialogues from 2ch, an API was set up around it, and a Telegram bot was connected to that API. More on each step in order:

Training

Hugging Face

The easiest way to train a language model is to use the transformers library. It provides tools for the automated training and application of neural networks (including language models).

Their hub also hosts a lot of pre-trained models and datasets, which simplifies things greatly: training a model from scratch is expensive, while fine-tuning one is much cheaper.

Base model

We take a dialogue model from the list of available models. Models are grouped by language and task, and it just so happened that there is exactly one Russian-language dialogue model there. Language models are, of course, universal things, and a dialogue model can be made from a non-dialogue one, but the closer the domain of the pre-trained model is to the target domain, the better.

The model Grossmend/rudialogpt3_medium_based_on_gpt2 was selected because of its size: 1.3B parameters is large enough for the model to generate meaningful text, but not too large.
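Loading such a model with transformers takes a few lines. A minimal sketch, assuming the standard AutoModel API; the generation parameters here are illustrative, and the exact dialogue input format this model expects is described on its model card:

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Grossmend/rudialogpt3_medium_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# generation parameters are illustrative assumptions
inputs = tokenizer("Привет!", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))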

Data

To train the model, data was collected from 2ch.hk/b/. I looked for a ready-made dataset for a long time, but did not find anything suitable, so I decided to collect the data myself, using api2ch. Threads were downloaded, parsed, cleaned up and converted to a dialogue format.

The final dataset consisted of about 60k dialogues with an average length of 3 messages – enough for fine-tuning a medium-sized model.

Dialogue example (messages ordered from last to first):

{
  "dialogue": ["Рад слышать!", "Хорошо!", "Как дела?"]
}
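To give an idea of how threads become such dialogues, here is a simplified sketch; it assumes each post replies to the previous one, whereas the real code on GitHub reconstructs reply chains from >>-links, so the function and parameter names here are hypothetical:

def thread_to_dialogues(posts, max_len=3):
    """posts: cleaned post texts of one thread, in chronological order."""
    dialogues = []
    for i in range(len(posts)):
        chain = posts[max(0, i - max_len + 1):i + 1]
        # messages are stored from last to first, as in the example above
        dialogues.append({"dialogue": list(reversed(chain))})
    return dialogues

print(thread_to_dialogues(["Как дела?", "Хорошо!", "Рад слышать!"])[-1])
# {'dialogue': ['Рад слышать!', 'Хорошо!', 'Как дела?']}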

The code for collecting and cleaning the data can be found on GitHub. The dataset can be found on HuggingFace.

Data filtering

To increase the toxicity of the data, it was filtered with the classifier model sismetanin/rubert-toxic-pikabu-2ch. The model was created to moderate toxic content, but nothing stops us from using it for evil.

Data toxicity:

count  63187.000000
mean       0.675554
25%        0.487243
50%        0.721271
75%        0.928254

It was decided to keep dialogues above the 75th percentile of toxicity, which corresponds to 0.93 out of 1.00 on the toxicity scale.
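A sketch of how such filtering can be done with the transformers text-classification pipeline; the label name for the toxic class and the decision to score a dialogue as one concatenated text are assumptions, not the article's exact code:

from transformers import pipeline

clf = pipeline("text-classification", model="sismetanin/rubert-toxic-pikabu-2ch")

def toxicity(dialogue):
    # "LABEL_1" as the toxic class is an assumption; check the model card
    result = clf(" ".join(dialogue["dialogue"]), truncation=True)[0]
    return result["score"] if result["label"] == "LABEL_1" else 1 - result["score"]

# dialogues: list of {"dialogue": [...]} records, as built above
scores = [toxicity(d) for d in dialogues]
threshold = sorted(scores)[int(len(scores) * 0.75)]  # the 75th percentile, ~0.93
dataset = [d for d, s in zip(dialogues, scores) if s >= threshold]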

Training process and result

A Jupyter notebook with the training code can be found on GitHub.

The finished model can be found on HuggingFace.
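In condensed form, the training loop looks roughly like the sketch below; it assumes the dialogues have already been serialized into the model's input format, and the hyperparameters are placeholders, so the notebook on GitHub remains the authoritative version:

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

name = "Grossmend/rudialogpt3_medium_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

texts = ["..."]  # placeholder: toxic dialogues serialized into the model's input format
ds = Dataset.from_dict({"text": texts}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()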

Example of the model's response before fine-tuning:

Hello!

Hello!

And after:

Hello!

>all you can do is not be a jerk…

The training went well.

Bringing up the model and connecting it to the bot

API

A very simple Flask server was written to work with the model.

POST request: {"text": "Привет!"}
Response: {"toxified": "Пока!"}
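A minimal sketch of such a server; the route, the model path and the generation parameters are assumptions:

import torch
from flask import Flask, jsonify, request
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_num_threads(4)  # PyTorch CPU multithreading

app = Flask(__name__)
path = "path/to/finetuned-model"  # hypothetical path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

@app.route("/", methods=["POST"])
def toxify():
    text = request.get_json()["text"]
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64, do_sample=True)
    reply = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)
    return jsonify({"toxified": reply})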

HuggingFace has excellent documentation, and the details of running models should be looked up there.

The model itself runs on the CPU with PyTorch multithreading, so I decided it was not worth building an API with queues and workers. Since the model runs on the CPU, it would be possible to get a serious speedup by scripting the model (TorchScript), but I could not make scripting work together with HuggingFace's generation utilities, so scripting had to be abandoned.

Metrics

For beauty, metrics collection and Grafana dashboards have been set up.
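The article does not spell out the metrics stack beyond Grafana; here is a sketch assuming Prometheus as the data source, via prometheus_client, with hypothetical metric names:

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("bot_requests_total", "Number of generation requests")
LATENCY = Histogram("bot_generation_seconds", "Time spent generating a reply")

start_http_server(9090)  # endpoint for Prometheus to scrape

# inside the request handler:
# REQUESTS.inc()
# with LATENCY.time():
#     reply = generate(text)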

Python Telegram API

Good, detailed posts have already been written about it (for example).

Let me just say that it can work asynchronously: while waiting for a response from the model, the program can process other requests.
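As an illustration of that asynchrony, a sketch of a handler assuming python-telegram-bot version 20 or newer and the Flask API above; the token, the URL and the names are placeholders:

import httpx
from telegram import Update
from telegram.ext import (ApplicationBuilder, ContextTypes,
                          MessageHandler, filters)

async def reply(update: Update, context: ContextTypes.DEFAULT_TYPE):
    async with httpx.AsyncClient() as client:
        # while this await is pending, the bot keeps processing other updates
        r = await client.post("http://model-api:5000/",
                              json={"text": update.message.text})
    await update.message.reply_text(r.json()["toxified"])

app = ApplicationBuilder().token("TELEGRAM_BOT_TOKEN").build()
app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, reply))
app.run_polling()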

The code

All the code for data collection, model training and the bot is publicly available on GitHub. For ease of use, the bot can be brought up with docker-compose.