Markov chains in a Telegram bot
Sooner or later I had to start writing articles for Habr.
Hi, my name is Alexander and I am a hardcore self-taught AI (artificial intelligence) guy. 5 years ago I set out to create a strong artificial intelligence (SAI).
I think I should start with how I created a Telegram bot with Markov chains.
This is not a guide, though it does include a step-by-step walkthrough. Mostly it is my little story about how I read 10 articles and decided to write my first serious chatbot. All links are below.
We perceive this as scientific knowledge.
Why?
I set out to create a text generator and this is what came out of it. After studying the topic, I found that the simplest text generator is an algorithm based on Markov chains.
Why did I decide to put this in a Telegram bot? Because I can!
What are Markov Chains
A Markov chain is a model in which the next state depends only on the current one. For text, this means an algorithm that picks each next word based on the current word alone and so builds a thoughtless text.
How does this work?
It is relatively simple. Let's take a tongue twister as an example. It becomes our corpus, the source from which the future text is generated. The text consists of 19 words, 8 of which are unique; these unique words are the links of the chain.
The algorithm counts each word, forms the future pairs, and creates a dictionary.
Next, the chain itself is created, where all previous words are taken into account. The numbers indicate the number of paths. We can see all possible paths and the final version from here. At this stage, the algorithm creates an array with all the chains.
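To make the idea concrete, here is a minimal sketch on a made-up corpus (the sentence below is an assumption for illustration, not the tongue twister from the article). It lists the unique words, which play the role of links, and counts how often each word is followed by each other word.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, chosen only for illustration.
corpus = "the cat saw the dog and the dog saw the cat".split()

# The unique words are the links (states) of the chain.
links = sorted(set(corpus))

# For every word, count which words follow it and how often.
transitions = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word][next_word] += 1

for word in links:
    print(word, dict(transitions[word]))
```

Here "the" is followed twice by "cat" and twice by "dog", so from "the" the chain has two equally likely paths.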
Creating an algorithm
I took all the code here, but I modified it so that it is convenient to reuse in the future. I write in Python. Let's create a file ChainMarkov.py and write the following code there:
import numpy as np


def ChainMarkovForFiles(text):
    # Read the whole corpus from a UTF-8 text file
    with open(text, encoding='utf8') as file:
        text = file.read()
    n_words = 100
    corpus = text.split()
I'll start by importing the numpy library, because it works great with arrays. Let's create a function and add the starting variables:
numpy – a library for working with arrays (must be installed separately)
ChainMarkovForFiles – a function that reads a file and returns the generated text to us
text – a variable that receives the path to a text file as input
n_words – a variable that stores the maximum number of words in the final text. It can be anything, but 100 works best
corpus – a list with all the words from the text
Next we will create a pair generator, make_pairs. A generator suits this better than a function that returns a list, because yield hands out one pair at a time and does not keep the earlier ones in memory. It generates a pair, forgets it, and moves on.
    def make_pairs(corpus):
        # Yield every word together with the word that follows it
        for i in range(len(corpus) - 1):
            yield corpus[i], corpus[i + 1]

    pairs = make_pairs(corpus)
    word_dict = {}
pairs – the generator that yields all pairs of adjacent words
word_dict – a dictionary, empty to begin with
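A small sketch of how the generator behaves, on a hypothetical four-word list: pairs are produced only when something asks for them, and once consumed they are gone.

```python
def make_pairs(corpus):
    # Yield each word together with the word that follows it.
    for i in range(len(corpus) - 1):
        yield corpus[i], corpus[i + 1]


words = ["one", "two", "three", "four"]  # hypothetical sample
pairs = make_pairs(words)

print(next(pairs))  # ('one', 'two'): produced only when requested
print(list(pairs))  # [('two', 'three'), ('three', 'four')]
print(list(pairs))  # []: an exhausted generator yields nothing more
```

This is why the dictionary has to be built in a single pass over pairs: a second loop over the same generator would see no data.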
Let's create a loop that iterates over all the pairs. Instead of calculating probabilities, for each word we store a list of every word that comes after it in the corpus, repeats included. Later we will simply choose from that list at random.
    for word_1, word_2 in pairs:
        if word_1 in word_dict:
            word_dict[word_1].append(word_2)
        else:
            word_dict[word_1] = [word_2]
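Keeping the repeats in each list is what lets a uniform random choice stand in for a probability table. The sketch below uses a hypothetical corpus and plain Python instead of numpy; it builds the same kind of dictionary and derives the probabilities that the lists imply.

```python
from collections import Counter

corpus = "a b a b a c".split()  # hypothetical sample

word_dict = {}
for word_1, word_2 in zip(corpus, corpus[1:]):
    word_dict.setdefault(word_1, []).append(word_2)

# 'a' is followed by 'b' twice and by 'c' once.
followers = word_dict["a"]
counts = Counter(followers)
probabilities = {word: n / len(followers) for word, n in counts.items()}

print(followers)
print(probabilities)
```

A uniform pick from the followers of "a" returns "b" two times out of three on average, which is exactly the transition probability a Markov chain would assign.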
We start from a randomly selected word that begins with a capital letter. To get one, we keep drawing a random word from the corpus for as long as the drawn word is entirely lowercase.
    first_word = np.random.choice(corpus)
    while first_word.islower():
        first_word = np.random.choice(corpus)

first_word – a variable that stores the first word
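Two details of this loop are worth keeping in mind: str.islower() is also False for tokens without any letters (digits, punctuation), so such a token can end up as the first word, and the loop never finishes if every token in the corpus is lowercase. A minimal sketch of one way around both, using a hypothetical helper pick_first_word and the standard random module instead of numpy:

```python
import random


def pick_first_word(corpus, rng=random):
    # Hypothetical helper: prefer tokens that start with an uppercase letter.
    capitalized = [word for word in corpus if word[:1].isupper()]
    # Fall back to the whole corpus when no capitalized token exists,
    # which avoids looping forever on all-lowercase input.
    return rng.choice(capitalized or corpus)


print(pick_first_word(["hello", "World", "again"]))
```

The helper assumes a non-empty corpus, as the original loop does.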
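Next we need to assemble our chain link by link and return it. If the current word has no followers in the dictionary (this happens with the last word of the corpus when it occurs nowhere else), we stop early.
    chain = [first_word]
    for i in range(n_words):
        last_word = chain[-1]
        if last_word not in word_dict:  # no known follower, stop here
            break
        chain.append(np.random.choice(word_dict[last_word]))
    return ' '.join(chain)

chain – the list of links; it starts with our first word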
In fact, that's all. You can already run it by simply calling the function and giving it the path to some txt file. Here's the whole code:
import numpy as np


def ChainMarkovForFiles(text):
    # Read the whole corpus from a UTF-8 text file
    with open(text, encoding='utf8') as file:
        text = file.read()
    n_words = 100
    corpus = text.split()

    def make_pairs(corpus):
        # Yield every word together with the word that follows it
        for i in range(len(corpus) - 1):
            yield corpus[i], corpus[i + 1]

    pairs = make_pairs(corpus)
    word_dict = {}
    for word_1, word_2 in pairs:
        if word_1 in word_dict:
            word_dict[word_1].append(word_2)
        else:
            word_dict[word_1] = [word_2]

    # Draw random words until one is not entirely lowercase
    first_word = np.random.choice(corpus)
    while first_word.islower():
        first_word = np.random.choice(corpus)

    chain = [first_word]
    for i in range(n_words):
        last_word = chain[-1]
        if last_word not in word_dict:  # no known follower, stop here
            break
        chain.append(np.random.choice(word_dict[last_word]))
    return ' '.join(chain)
Next, I wrote a second, almost identical function whose input is the text itself rather than a file. This is necessary so that the user can send a regular message to the bot and the algorithm can work on it.
def ChainMarkovForText(input_text):
    n_words = 100
    corpus = input_text.split()

    def make_pairs(corpus):
        # Yield every word together with the word that follows it
        for i in range(len(corpus) - 1):
            yield corpus[i], corpus[i + 1]

    pairs = make_pairs(corpus)
    word_dict = {}
    for word_1, word_2 in pairs:
        if word_1 in word_dict:
            word_dict[word_1].append(word_2)
        else:
            word_dict[word_1] = [word_2]

    first_word = np.random.choice(corpus)
    while first_word.islower():
        first_word = np.random.choice(corpus)

    chain = [first_word]
    for i in range(n_words):
        last_word = chain[-1]
        if last_word in word_dict:  # check that the current word is in the dictionary
            next_word = np.random.choice(word_dict[last_word])
            chain.append(next_word)
        else:
            break  # no more words, leave the loop
    return ' '.join(chain)
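Since the two functions differ only in where the text comes from, the shared part could live in one place. The following is a sketch under assumptions, not the code the bot runs: the names build_word_dict, generate, chain_markov_for_text and chain_markov_for_file are hypothetical, the standard random module replaces numpy, and the first word is picked from a prepared list instead of a retry loop.

```python
import random


def build_word_dict(corpus):
    # Map every word to the list of words that follow it (repeats kept).
    word_dict = {}
    for word_1, word_2 in zip(corpus, corpus[1:]):
        word_dict.setdefault(word_1, []).append(word_2)
    return word_dict


def generate(corpus, n_words=100, rng=random):
    # Shared core: everything the two functions have in common.
    if not corpus:
        return ''
    word_dict = build_word_dict(corpus)
    # Prefer capitalized start words; fall back to any word if there are none.
    starts = [word for word in corpus if word[:1].isupper()] or corpus
    chain = [rng.choice(starts)]
    for _ in range(n_words):
        followers = word_dict.get(chain[-1])
        if not followers:  # dead end: this word never has a successor
            break
        chain.append(rng.choice(followers))
    return ' '.join(chain)


def chain_markov_for_text(input_text):
    return generate(input_text.split())


def chain_markov_for_file(path):
    with open(path, encoding='utf8') as file:
        return generate(file.read().split())


print(chain_markov_for_text("Go on Go on Go"))
```

With this layout a fix such as the dead-end check only has to be made once.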
Developing a bot
First, you need to register a bot in Telegram and get a token. I won't tell you how to do this, because the article would become insanely long. Here's a link to the official guide from Telegram.
I will develop the bot using the telebot library (the pyTelegramBotAPI package), which also needs to be installed.
Create a file main.py and write the following code:
import telebot
import os
import ChainMarkov

bot = telebot.TeleBot('YOUR_TOKEN')

os – a system library, needed here to delete the temporary file
ChainMarkov – our previous file, the one we created earlier
bot – the variable that holds the bot object created from our token. Paste the token there
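A token pasted straight into the source is easy to publish by accident together with the code. A minimal sketch of one alternative is to read it from an environment variable; the helper name read_token and the variable name TELEGRAM_BOT_TOKEN are assumptions for illustration, not part of telebot.

```python
import os


def read_token(env=os.environ, name="TELEGRAM_BOT_TOKEN"):
    # Hypothetical helper; the variable name is an assumption.
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable first")
    return token


# Possible use in main.py:
# bot = telebot.TeleBot(read_token())
```

The function fails loudly at startup when the variable is missing, instead of letting the bot start with an empty token.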
Next, we'll write a handler, registered through a decorator, that processes the /start command and replies with something like: hello, waiting for a message or file from you.
@bot.message_handler(commands=['start'])
def main(message):
    bot.send_message(message.chat.id, 'Hi, I am waiting for a message or a txt file from you, '
                                      'and based on it I will send you a thoughtless text '
                                      'using Markov chains')
Let's add another handler for the /help command, for the sake of decency:
@bot.message_handler(commands=['help'])
def main(message):
    bot.send_message(message.chat.id, 'So, just a reminder: I was created '
                                      'to make up meaningless text. I am waiting for any message '
                                      'from 100 words up to 4096 characters, or a plain txt file')
Let's check the message type, so that for every type except text and file the bot asks for proper text:
@bot.message_handler(content_types=['sticker', 'photo', 'animation', 'video', 'image', 'voice'])
def check(message):
    bot.send_message(message.chat.id, 'I want to see real text, from 100 words up to 4096 characters!')
Let's create a function that will process the user's message and use the chain algorithm. I use an exception handler just in case.
@bot.message_handler(content_types=['text'])
def text_processing(message):
    try:
        word_count = len(message.text.split())
        if word_count < 100:  # the bot promises to accept messages "from 100 words"
            bot.send_message(message.chat.id, 'Too few words, I need more. Send at least 100 words')
        else:
            generated_text = ChainMarkov.ChainMarkovForText(message.text)
            bot.send_message(message.chat.id, f'Your generated text:\n\n{generated_text}')
    except Exception:
        bot.send_message(message.chat.id, 'Well, I have come to the conclusion that something went wrong')

word_count – the number of words in the user's message
generated_text – our generated text, which we will ultimately send to the user
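The bot's own messages mention Telegram's 4096-character limit for a single text message. A reply of about 100 words normally fits, but a corpus with very long tokens could push it over. A minimal sketch of a splitter, assuming a hypothetical helper split_message that packs whole words into chunks and collapses line breaks into spaces along the way:

```python
MAX_MESSAGE_LENGTH = 4096  # Telegram's limit for one text message


def split_message(text, limit=MAX_MESSAGE_LENGTH):
    # Hypothetical helper: pack whole words into chunks of at most `limit` characters.
    chunks, current = [], ""
    for word in text.split():
        # A single word longer than the limit is cut into hard slices.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


# Possible use inside a handler:
# for chunk in split_message(generated_text):
#     bot.send_message(message.chat.id, chunk)
print(split_message("aa bb cc", limit=5))
```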
The file processing function is a bit more complicated. We need to download the file, save it, generate the text, and delete the file. I added a check to make sure that it is a txt file.
@bot.message_handler(content_types=['document'])
def file_processing(message):
    try:
        file_info = bot.get_file(message.document.file_id)
        if message.document.file_name.endswith('.txt'):
            downloaded_file = bot.download_file(file_info.file_path)
            with open('temp.txt', 'wb') as new_file:
                new_file.write(downloaded_file)
            generated_text = ChainMarkov.ChainMarkovForFiles("temp.txt")
            bot.send_message(message.chat.id, f'Your generated text:\n\n{generated_text}')
            # Delete the temporary file
            os.remove('temp.txt')
        else:
            bot.send_message(message.chat.id, 'This is not a txt file')
    except Exception:
        bot.send_message(message.chat.id, 'Well, I have come to the conclusion that something went wrong')
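With a fixed name like temp.txt, two users sending files at the same moment would overwrite each other's data, and the file stays on disk if generation fails before os.remove runs. A minimal sketch of one way to handle both, using a hypothetical helper with_temp_file built on the standard tempfile module:

```python
import os
import tempfile


def with_temp_file(data, process):
    # Hypothetical helper: `data` is the downloaded bytes, `process` is a
    # callable that takes a file path (for example ChainMarkovForFiles).
    fd, path = tempfile.mkstemp(suffix=".txt")  # unique name per call
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        return process(path)
    finally:
        os.remove(path)  # runs even if `process` raises


def read_back(path):
    # Stand-in for the generator function, used only in this demo.
    with open(path, encoding="utf8") as file:
        return file.read()


print(with_temp_file("Hello chain".encode("utf8"), read_back))
```

In the handler this would become something like with_temp_file(downloaded_file, ChainMarkov.ChainMarkovForFiles).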
One last line keeps the bot running: it polls Telegram for new messages until the program is stopped.
bot.infinity_polling()
Final code
import telebot
import os
import ChainMarkov
bot = telebot.TeleBot('YOUR_TOKEN')
@bot.message_handler(content_types=['sticker', 'photo', 'animation', 'video', 'image', 'voice'])
def check(message):
    bot.send_message(message.chat.id, 'I want to see real text, from 100 words up to 4096 characters!')


@bot.message_handler(commands=['start'])
def main(message):
    bot.send_message(message.chat.id, 'Hi, I am waiting for a message or a txt file from you, '
                                      'and based on it I will send you a thoughtless text '
                                      'using Markov chains')


@bot.message_handler(commands=['help'])
def main(message):
    bot.send_message(message.chat.id, 'So, just a reminder: I was created '
                                      'to make up meaningless text. I am waiting for any message '
                                      'from 100 words up to 4096 characters, or a plain txt file')


@bot.message_handler(content_types=['text'])
def text_processing(message):
    try:
        word_count = len(message.text.split())
        if word_count < 100:  # the bot promises to accept messages "from 100 words"
            bot.send_message(message.chat.id, 'Too few words, I need more. Send at least 100 words')
        else:
            generated_text = ChainMarkov.ChainMarkovForText(message.text)
            bot.send_message(message.chat.id, f'Your generated text:\n\n{generated_text}')
    except Exception:
        bot.send_message(message.chat.id, 'Well, I have come to the conclusion that something went wrong')


@bot.message_handler(content_types=['document'])
def file_processing(message):
    try:
        file_info = bot.get_file(message.document.file_id)
        if message.document.file_name.endswith('.txt'):
            downloaded_file = bot.download_file(file_info.file_path)
            with open('temp.txt', 'wb') as new_file:
                new_file.write(downloaded_file)
            generated_text = ChainMarkov.ChainMarkovForFiles("temp.txt")
            bot.send_message(message.chat.id, f'Your generated text:\n\n{generated_text}')
            # Delete the temporary file
            os.remove('temp.txt')
        else:
            bot.send_message(message.chat.id, 'This is not a txt file')
    except Exception:
        bot.send_message(message.chat.id, 'Well, I have come to the conclusion that something went wrong')


bot.infinity_polling()
Conclusion
That's all for today. We got acquainted, and I tried to tell and show you something interesting. I hope this is not the last time.
I don't know what the rules on self-promotion are here, but here's my Telegram channel.
And this is the list of literature I used; there were 9 articles.
P.S.
I can't vouch for originality, but I don't think I found anything like this in Russian, or maybe I didn't search well enough.