Markov chains in a Telegram bot

Sooner or later I had to start writing articles for Habr.

Hi, my name is Alexander and I am a hardcore self-taught AI (artificial intelligence) guy. 5 years ago I set out to create a strong artificial intelligence (SAI).

A visual example of Markov chains

I think I should start with how I created a Telegram bot with Markov chains.

This is not a formal guide, although it does walk through the steps one by one. Mostly it is my little story about how I read 10 articles and decided to write my first serious chatbot. All the links are at the end.

We perceive this as scientific knowledge.

Why?

I set out to create a text generator, and this is what came out of it. After studying the topic, I found that the simplest text generator is an algorithm based on Markov chains.

Why did I decide to put this in a Telegram bot? Because I can!

What are Markov Chains

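A Markov chain is a model in which the next state depends only on the current one. Applied to text, the algorithm looks at the current word, picks the next word from those that followed it in the source text, and in this way builds a mindless text: it looks like language but carries no meaning.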

How does this work?

It is relatively simple. Let's take a tongue twister as an example. It will be our corpus, the source from which the new text is generated. The text consists of 19 words, 8 of which are unique; these unique words are the links of the chain.

Example
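To make the word counts concrete, here is a minimal sketch that splits a text into a corpus and counts the total and the unique words. The tongue twister in it is a stand-in chosen for illustration, not the one from the picture, so its counts are not meant to match the figure.

```python
# A stand-in tongue twister (an assumption, not the article's original text)
text = "Peter Piper picked a peck of pickled peppers a peck of pickled peppers Peter Piper picked"

corpus = text.split()        # every word, in order
unique_words = set(corpus)   # each distinct word once

total = len(corpus)
unique = len(unique_words)
print(total, unique)
```

The unique words are the links; the repeated ones are what give the chain more than one path to follow.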

The algorithm counts each word to form the future pairs and creates a dictionary.

Dictionary

Next, the chain itself is created, where all previous words are taken into account. The numbers indicate the number of paths. We can see all possible paths and the final version from here. At this stage, the algorithm creates an array with all the chains.

Compiling a Markov chain
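The path counts in such a picture can be sketched in a few lines. This is an illustration, not the article's code: the helper name count_transitions and the sample sentence are assumptions, and it uses collections.Counter from the standard library to count how often each word follows another.

```python
from collections import Counter, defaultdict


def count_transitions(corpus):
    """Map each word to a Counter of the words that follow it."""
    transitions = defaultdict(Counter)
    for current_word, next_word in zip(corpus, corpus[1:]):
        transitions[current_word][next_word] += 1
    return transitions


# hypothetical sample text, chosen only for illustration
corpus = "the cat saw the dog and the cat ran".split()
transitions = count_transitions(corpus)

# "the" is followed by "cat" twice and by "dog" once
print(dict(transitions["the"]))
```

Dividing each count by the total for its word would turn the counts into probabilities, for example 2/3 for "cat" after "the".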

Creating an algorithm

I took all the code here, but I modified it so that it can be conveniently reused later. I write in Python. Let's create a file ChainMarkov.py and put the following code in it:

import numpy as np


def ChainMarkovForFiles(text):
    text = open(text, encoding='utf8').read()
    n_words = 100
    corpus = text.split()

I start by importing the numpy library, because it works well with arrays. Then we create a function and add the starting variables:

  • numpy – a library for working with arrays (it has to be installed separately)

  • ChainMarkovForFiles – a function that reads a file and returns the generated text

  • text – receives the path to a text file as input and is then overwritten with the file's contents

  • n_words – the maximum number of words to generate after the first one. It can be anything, but 100 works well

  • corpus – a list of all the words from the text

Next we will create a pair generator, make_pairs. A generator suits this better than a function that returns a list, because yield does not keep all the pairs in memory: it produces a pair, forgets it, and moves on.

def make_pairs(corpus):
    for i in range(len(corpus) - 1):
        yield corpus[i], corpus[i + 1]

pairs = make_pairs(corpus)
word_dict = {}

  • pairs – a generator that produces every pair of adjacent words

  • word_dict – an empty dictionary to begin with
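A small sketch of what "generates, forgets, and moves on" means in practice. It reuses make_pairs on a made-up three-word corpus: the generator produces pairs only when asked, and once it has been consumed it is empty.

```python
def make_pairs(corpus):
    for i in range(len(corpus) - 1):
        yield corpus[i], corpus[i + 1]


corpus = ["one", "two", "three"]  # made-up corpus for illustration
pairs = make_pairs(corpus)        # nothing is computed yet

first_pass = list(pairs)          # consumes the generator
second_pass = list(pairs)         # the generator is now exhausted

print(first_pass)   # [('one', 'two'), ('two', 'three')]
print(second_pass)  # []
```

This is why the dictionary has to be filled in a single loop over pairs: a second loop over the same generator would see nothing.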

Let's write a loop that goes through all the pairs. Instead of calculating probabilities, for each word we store a list of every word that can follow it, repeats included. Later we will simply pick from that list at random.

for word_1, word_2 in pairs:
    if word_1 in word_dict:
        word_dict[word_1].append(word_2)
    else:
        word_dict[word_1] = [word_2]
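Keeping repeated words in the list is what stands in for the probability calculation: a word that followed twice appears twice, so a uniform random pick chooses it about twice as often. A rough sketch of that idea, using the standard random module instead of numpy and a made-up list of followers:

```python
import random
from collections import Counter

# made-up followers of some word: "cat" was seen twice, "dog" once
followers = ["cat", "dog", "cat"]

rng = random.Random(0)  # fixed seed so the run is repeatable
picks = Counter(rng.choice(followers) for _ in range(3000))

# "cat" should come up roughly twice as often as "dog"
print(picks["cat"], picks["dog"])
```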

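We pick a random word to start with, and it should begin with a capital letter. To do that, we collect every word that is not entirely lowercase and choose one of them at random. If the text has no such word, we fall back to the whole corpus; a loop that keeps drawing until it finds a capitalized word would never finish on an all-lowercase text.

# words that are not entirely lowercase; fall back to the whole corpus if there are none
starters = [word for word in corpus if not word.islower()] or corpus
first_word = np.random.choice(starters)

  • first_word – a variable that stores the first word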
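One detail worth knowing: str.islower() is only an approximation of "starts with a capital letter". A quick check, on made-up sample words, of what it returns, next to a stricter test that looks only at the first character:

```python
samples = ["hello", "Hello", "HELLO", "123", "iPhone"]

# islower() is True only when there is at least one cased character
# and every cased character is lowercase
results = {word: word.islower() for word in samples}
print(results)

# a stricter test that looks only at the first character
capitalized = [word for word in samples if word[:1].isupper()]
print(capitalized)
```

So numbers and words such as "iPhone" also count as possible first words under the islower check; whether that matters depends on the corpus.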

Next we need to assemble our chain, link by link, and return it.

chain = [first_word]

for i in range(n_words):
    # stop early if the last word has no recorded follower (avoids a KeyError)
    if chain[-1] not in word_dict:
        break
    chain.append(np.random.choice(word_dict[chain[-1]]))

return ' '.join(chain)

  • chain – the list of generated words; it starts with our first word

That is really all there is to it: you can already run it by calling the function and passing it the path to some txt file. Here is the whole code:

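import numpy as np


def ChainMarkovForFiles(text):
    # read the whole file; the with block closes it afterwards
    with open(text, encoding='utf8') as file:
        text = file.read()
    n_words = 100
    corpus = text.split()

    def make_pairs(corpus):
        # lazily yield every pair of adjacent words
        for i in range(len(corpus) - 1):
            yield corpus[i], corpus[i + 1]

    pairs = make_pairs(corpus)
    word_dict = {}

    # map each word to the list of words that followed it
    for word_1, word_2 in pairs:
        if word_1 in word_dict:
            word_dict[word_1].append(word_2)
        else:
            word_dict[word_1] = [word_2]

    # start from a word that is not all lowercase, or from any word if none exists
    starters = [word for word in corpus if not word.islower()] or corpus
    first_word = np.random.choice(starters)

    chain = [first_word]

    for i in range(n_words):
        # stop early if the last word has no recorded follower
        if chain[-1] not in word_dict:
            break
        chain.append(np.random.choice(word_dict[chain[-1]]))

    return ' '.join(chain)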

Next, I wrote a second, almost identical function whose input is a string rather than a file. It is needed so that the user can send an ordinary message to the bot and the algorithm can process it.

def ChainMarkovForText(input_text):
    n_words = 100
    corpus = input_text.split()

    def make_pairs(corpus):
        for i in range(len(corpus) - 1):
            yield corpus[i], corpus[i + 1]

    pairs = make_pairs(corpus)
    word_dict = {}

    for word_1, word_2 in pairs:
        if word_1 in word_dict:
            word_dict[word_1].append(word_2)
        else:
            word_dict[word_1] = [word_2]

    # start from a word that is not all lowercase, or from any word if none exists
    starters = [word for word in corpus if not word.islower()] or corpus
    first_word = np.random.choice(starters)

    chain = [first_word]

    for i in range(n_words):
        last_word = chain[-1]
        if last_word in word_dict:  # check that the current word is in the dictionary
            next_word = np.random.choice(word_dict[last_word])
            chain.append(next_word)
        else:
            break  # no more words to follow, leave the loop

    return ' '.join(chain)
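Since the two functions differ only in where the text comes from, they could share one core. The following is a sketch of that refactoring, not the article's code: the names generate_chain, chain_from_text and chain_from_file are hypothetical, and it uses the standard random module in place of numpy so that it runs without extra packages.

```python
import random


def generate_chain(corpus, n_words=100, rng=random):
    """Walk a word-level Markov chain built from a list of words."""
    if not corpus:
        return ''

    # map each word to the list of words that followed it
    word_dict = {}
    for word_1, word_2 in zip(corpus, corpus[1:]):
        word_dict.setdefault(word_1, []).append(word_2)

    # prefer a start word that is not all lowercase, else any word
    starters = [word for word in corpus if not word.islower()] or corpus
    chain = [rng.choice(starters)]

    for _ in range(n_words):
        followers = word_dict.get(chain[-1])
        if not followers:  # dead end: nothing ever followed this word
            break
        chain.append(rng.choice(followers))

    return ' '.join(chain)


def chain_from_text(input_text, n_words=100):
    return generate_chain(input_text.split(), n_words)


def chain_from_file(path, n_words=100):
    with open(path, encoding='utf8') as file:
        return generate_chain(file.read().split(), n_words)
```

With this layout a fix to the generation logic only has to be made in one place.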

Developing a bot

First, you need to register a bot in Telegram and get a token. I won't explain how to do this, because the article would become insanely long. Here is a link to the official guide from Telegram.

I will develop the bot with the telebot library (the pyTelegramBotAPI package); you need to install it as well.

Create a file main.py and write the following code:

import telebot
import os
import ChainMarkov

bot = telebot.TeleBot('YOUR_TOKEN')

  • os – a standard library module; we need it to delete the temporary file

  • ChainMarkov – the file we created earlier

  • bot – the bot object, created from our token. Paste the token there

'OUR_TOKEN'
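Pasting the token straight into the source makes it easy to publish by accident. One common alternative is to read it from an environment variable. A minimal sketch, where the variable name TELEGRAM_BOT_TOKEN and the helper load_token are assumptions made for illustration:

```python
import os


def load_token(variable_name='TELEGRAM_BOT_TOKEN'):
    """Return the bot token from the environment, or raise a clear error."""
    token = os.environ.get(variable_name)
    if not token:
        raise RuntimeError(f'Set the {variable_name} environment variable first')
    return token
```

The bot would then be created with telebot.TeleBot(load_token()), and the source file can be shared without the secret in it.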

Next, we'll write a handler, registered with a decorator, that processes the /start command and replies with something like: hello, I'm waiting for a message or a file from you.

@bot.message_handler(commands=['start'])
def start(message):
    bot.send_message(message.chat.id, 'Hi, I am waiting for a message or a txt file from you. '
                                      'Based on it, I will send you a mindless text '
                                      'generated with Markov chains')

Let's add another handler for the /help command, for the sake of decency:

@bot.message_handler(commands=['help'])
def help_command(message):
    bot.send_message(message.chat.id, 'Just a reminder: I was created '
                                      'to make up meaningless text. I am waiting for any message '
                                      'from 100 words up to 4096 characters, or a plain txt file')

Let's check the message type, so that every kind of message other than text or a file gets a reminder of what the bot expects:

@bot.message_handler(content_types=['sticker', 'photo', 'animation', 'video', 'image', 'voice'])
def check(message):
    bot.send_message(message.chat.id, 'I want to see real text, from 100 words up to 4096 characters!')

Let's create a function that processes the user's message and runs the chain algorithm. I wrap it in an exception handler just in case.

@bot.message_handler(content_types=['text'])
def text_processing(message):
    try:
        word_count = len(message.text.split())

        # the bot asks for "at least 100 words", so exactly 100 is accepted
        if word_count < 100:
            bot.send_message(message.chat.id, 'Too few words, I need more. Send at least 100 words')
        else:
            generated_text = ChainMarkov.ChainMarkovForText(message.text)
            bot.send_message(message.chat.id, f'Your generated text:\n\n{generated_text}')
    except Exception:
        bot.send_message(message.chat.id, 'Well, I have come to the conclusion that something went wrong')

  • word_count – the number of words in the user's message

  • generated_text – our generated text, which we will ultimately send to the user
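The 4096-character figure is Telegram's limit for a single text message, and a generated text plus its header can in principle exceed it. Below is a sketch of a helper that cuts a long text into chunks on word boundaries. The name split_message is hypothetical, and this is plain Python rather than a telebot feature.

```python
def split_message(text, limit=4096):
    """Split text into chunks of at most `limit` characters, on spaces where possible."""
    chunks = []
    current = ''
    for word in text.split(' '):
        # a single word longer than the limit is cut into hard slices
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f'{current} {word}' if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be sent with its own bot.send_message call.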

File processing is a bit more complicated. We need to download the file, save it, generate the text, and delete the file. I also added a check that it really is a txt file.

@bot.message_handler(content_types=['document'])
def file_processing(message):
    try:
        file_info = bot.get_file(message.document.file_id)

        if message.document.file_name.endswith('.txt'):
            downloaded_file = bot.download_file(file_info.file_path)

            with open('temp.txt', 'wb') as new_file:
                new_file.write(downloaded_file)

            try:
                generated_text = ChainMarkov.ChainMarkovForFiles("temp.txt")
                bot.send_message(message.chat.id, f'Your generated text:\n\n{generated_text}')
            finally:
                # delete the temporary file even if generation or sending fails
                os.remove('temp.txt')
        else:
            bot.send_message(message.chat.id, 'This is not a txt file')
    except Exception:
        bot.send_message(message.chat.id, 'Well, I have come to the conclusion that something went wrong')
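A fixed name like temp.txt can collide if two users send files at the same moment. Since the downloaded content is already in memory as bytes (the handler writes it in 'wb' mode), one alternative is to decode it and hand the string to ChainMarkovForText, with no file on disk at all. A sketch under that assumption; the helper name decode_upload is hypothetical:

```python
def decode_upload(raw_bytes, encoding='utf8'):
    """Turn the bytes of an uploaded txt file into a string.

    Bytes that are not valid in the given encoding are replaced
    with U+FFFD instead of raising an error.
    """
    return raw_bytes.decode(encoding, errors='replace')


# stand-in for the bytes a download would return
sample = 'Привет, мир'.encode('utf8')
print(decode_upload(sample))
```

In the handler this would look like generated_text = ChainMarkov.ChainMarkovForText(decode_upload(downloaded_file)), which also removes the need for os.remove.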

One last line keeps the bot running and polling for new messages:

bot.infinity_polling()

Final code

import telebot
import os
import ChainMarkov

# never publish a real token; paste your own here
bot = telebot.TeleBot('YOUR_TOKEN')


@bot.message_handler(content_types=['sticker', 'photo', 'animation', 'video', 'image', 'voice'])
def check(message):
    bot.send_message(message.chat.id, 'I want to see real text, from 100 words up to 4096 characters!')


@bot.message_handler(commands=['start'])
def start(message):
    bot.send_message(message.chat.id, 'Hi, I am waiting for a message or a txt file from you. '
                                      'Based on it, I will send you a mindless text '
                                      'generated with Markov chains')


@bot.message_handler(commands=['help'])
def help_command(message):
    bot.send_message(message.chat.id, 'Just a reminder: I was created '
                                      'to make up meaningless text. I am waiting for any message '
                                      'from 100 words up to 4096 characters, or a plain txt file')


@bot.message_handler(content_types=['text'])
def text_processing(message):
    try:
        word_count = len(message.text.split())

        # the bot asks for "at least 100 words", so exactly 100 is accepted
        if word_count < 100:
            bot.send_message(message.chat.id, 'Too few words, I need more. Send at least 100 words')
        else:
            generated_text = ChainMarkov.ChainMarkovForText(message.text)
            bot.send_message(message.chat.id, f'Your generated text:\n\n{generated_text}')
    except Exception:
        bot.send_message(message.chat.id, 'Well, I have come to the conclusion that something went wrong')


@bot.message_handler(content_types=['document'])
def file_processing(message):
    try:
        file_info = bot.get_file(message.document.file_id)

        if message.document.file_name.endswith('.txt'):
            downloaded_file = bot.download_file(file_info.file_path)

            with open('temp.txt', 'wb') as new_file:
                new_file.write(downloaded_file)

            try:
                generated_text = ChainMarkov.ChainMarkovForFiles("temp.txt")
                bot.send_message(message.chat.id, f'Your generated text:\n\n{generated_text}')
            finally:
                # delete the temporary file even if generation or sending fails
                os.remove('temp.txt')
        else:
            bot.send_message(message.chat.id, 'This is not a txt file')
    except Exception:
        bot.send_message(message.chat.id, 'Well, I have come to the conclusion that something went wrong')


bot.infinity_polling()

Conclusion

That's all for today. We have been introduced, and I tried to tell and show you something interesting. I hope this is not where we stop.

I'm not sure how self-promotion is treated here, but here is my Telegram channel.
And this is the list of literature I used; there were 9 articles.

P.S.

I can't vouch for originality, but I don't think I found anything like this in Russian, or maybe I just didn't search well enough.
