Training a text generation algorithm on the statements of philosophers and writers

Surely you have dreamed of talking with a great philosopher: asking him a question about your life, getting his opinion, or just chatting. Nowadays, this is possible thanks to chatbots that maintain a dialogue, imitating the manner of a living person's speech. Such chatbots are built with natural language processing and text generation technologies. There are already pre-trained models that do a good job of this task.

In this article, we will talk about our experience of training a text generation algorithm on the statements of great personalities. For the training dataset, we collected quotes from ten famous philosophers, writers and scientists, so the final text is generated from the statements of ten different thinkers. And if you want to “talk” with someone specific – for example, with Socrates or Nietzsche – the notebook in which the work was carried out is attached at the end. With it, you can experiment with the sentences of just your chosen philosopher.

Data collection

Work plan

Models like \textbf{LSTM} and \textbf{GRU} do a good job of generating text, but Transformers often work best: they produce text that is more meaningful and understandable to a person. The problem is that Transformers are very “heavyweight” and take a long time to train.

We have a dataset of quotes from ten philosophers, on which we need to train a model so that it generates texts similar to the real statements of these people. The questions are: which model to take, how to process the data for it, how to evaluate the model's result, and how to train it?

As true researcher-practitioners, we will start with a simpler model – \textbf{LSTM}. Perhaps it will show a decent result right away, and we will not have to use Transformers.

Let’s look at the results of already pre-trained models and fine-tune them on our dataset. As a loss we will use plain cross-entropy, and as a quality metric we will take plain accuracy – in this case it works quite well. After that, we will try to generate an ending for a simple sentence and evaluate, from a human point of view, how well the model composes the text. After all, a good metric is wonderful, but if the generated sentences make no sense, what is the use of them?

Working with RNN

Data processing

For \textbf{LSTM}, let’s combine the data from the dataset, look at the sentences, and tokenize them. In the notebook itself, you can see that there turned out to be too many tokens – 8760:

What is the problem with a large number of tokens? When predicting and generating the next word in the text, the model calculates, for each word, the probability that it will be the next one in the sentence. So if there are a lot of words, the number of probabilities will be correspondingly large. We compute the probabilities with softmax, but it is difficult to take the softmax of a vector of such a large dimension. There are several ways to solve the problem:

  • Drop rare words (tokens) and replace them all with a special \<UNK\> token assigned to everything that is not in the dictionary. Clearly this is a very simple and not very efficient option, since we greatly shrink the dictionary.

  • Use characters as tokens instead of words. This simple approach has a clear drawback: a model trained on individual characters finds it much harder to learn the relationships between the letters within each word. Because of this, a significant part of the words the model generates may not be words of the language of the input texts at all.

  • Use \href{}{Byte Pair Encoding} – a more modern and powerful method, with an implementation by the team at VK. The attached notebook briefly describes how the method works and links to related articles and GitHub. The idea is to find frequent n-grams and replace each of them with a single symbol, thereby encoding them. Obviously, in large texts we can find a huge number of such substitutions.

Here is a small explanatory picture:

Of course, we will choose the last method of working with data.
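To make the idea concrete, here is a minimal sketch of the BPE merge loop on a toy corpus. This illustrates the algorithm itself, not the VK implementation mentioned above; the corpus and the number of merges are made up:

```python
from collections import Counter

def get_pair_counts(words):
    """Count frequencies of adjacent symbol pairs across the corpus."""
    counts = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, words):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged, replacement = " ".join(pair), "".join(pair)
    return {w.replace(merged, replacement): f for w, f in words.items()}

# Toy corpus: each word is a space-separated sequence of characters,
# mapped to its frequency in the corpus.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(5):  # learn 5 merge operations
    best = max(get_pair_counts(words), key=get_pair_counts(words).get)
    words = merge_pair(best, words)
    merges.append(best)

print(merges)  # the first merges are ('e', 's') and then ('es', 't')
```

Each learned merge becomes one entry of the subword dictionary, so frequent character n-grams end up encoded as single tokens.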

Model Training

The model will be trained in the standard way. To do this, we apply a classification head – a linear layer – to each hidden state of the RNN. At each time step, we get a separate probability prediction for the next token.

After that, the loss is averaged over the received predictions. In short, it works as follows: the input is the sequence x_1, x_2, ..., x_{n}, and at every time step t we must predict the symbol x_t from the symbols x_1, ..., x_{t - 1}. To do this, the model takes the sequence \<BOS\>, x_1, x_2, ..., x_{n} as input and the sequence x_1, x_2, ..., x_{n}, \<EOS\> as targets.

Also, for further work, all sentences must be brought to the same length. The idea is simple: choose the desired sentence length and, for sentences that are too short, append a special \<PAD\> token, meaning that the sentence has been extended; after embedding, no gradient will be run through such tokens. Sentences that are too long are simply cut off. All the subtleties of the implementation can be viewed in the notebook.

The following architecture was taken as a model:
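A minimal sketch of this padding/truncation step (the target length of 8 and the token spellings are illustrative assumptions, not the values from the notebook):

```python
PAD, BOS, EOS = "<PAD>", "<BOS>", "<EOS>"
MAX_LEN = 8  # assumed target length; the real value is chosen from the data

def pad_or_truncate(tokens, max_len=MAX_LEN):
    # Reserve two positions for <BOS>/<EOS>, cutting off overly long sentences.
    tokens = [BOS] + tokens[: max_len - 2] + [EOS]
    # Extend short sentences with <PAD> up to the fixed length.
    return tokens + [PAD] * (max_len - len(tokens))

print(pad_or_truncate(["to", "be", "or", "not"]))
```

After this step every sequence in the batch has the same length, so it can be stacked into a single tensor.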

The first layer is an embedding, which translates the tokens into a machine-friendly form. We explicitly pass the index of the \<PAD\> token so that, as already noted, the gradient does not flow through such words. Convolution and normalization layers are used as a feature extractor – experiments with the model showed that this gives a good result. Next comes the recurrent network itself, and at the end a linear layer serving as the classifier.

As a result, after 30 epochs of training, the following results were obtained:
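The described architecture could be sketched in PyTorch roughly like this. All layer sizes here are assumptions for illustration, not the exact values from the notebook:

```python
import torch
from torch import nn

class QuoteLSTM(nn.Module):
    """Embedding -> conv + norm feature extractor -> LSTM -> linear classifier."""
    def __init__(self, vocab_size=8000, emb_dim=128, hidden=256, pad_idx=0):
        super().__init__()
        # padding_idx keeps the gradient from flowing through <PAD> embeddings
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        # convolution + normalization as a feature extractor
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)  # per-step classifier

    def forward(self, x):                        # x: (batch, seq_len)
        h = self.emb(x).transpose(1, 2)          # (batch, emb_dim, seq_len)
        h = self.norm(self.conv(h)).transpose(1, 2)
        out, _ = self.rnn(h)
        return self.head(out)                    # (batch, seq_len, vocab_size)

model = QuoteLSTM()
logits = model(torch.randint(1, 8000, (2, 16)))
print(logits.shape)  # (2, 16, 8000)
```

The output gives one distribution over the vocabulary per time step, which is exactly what the averaged cross-entropy above expects.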

A pretty good metric. It is interesting to look at the text generation itself.

Text generation

There are several ways to generate text. Here are some of them:

  • Greedy generation: at each step, we take the token with the highest probability. This approach has an obvious problem: by always choosing the most likely token, we can lose the meaning of the sentence, because the most likely word is not always the most appropriate one.

  • Top-k sampling: to predict the next token, we look at the probability distribution, select the k tokens with the maximum probability, and sample one of them with probability proportional to the probability predicted for it. This partially solves the problem of greedy generation, but not completely.

  • Beam search: we consider several options for building a sentence and, at certain time steps, prune the candidates with the lowest overall probability, growing the rest further. At the end, we take the sentence with the highest probability. Here is a picture explaining this method:
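The first two methods can be sketched in a few lines; here a toy distribution stands in for the model's softmax output:

```python
import numpy as np

rng = np.random.default_rng(42)

def greedy(probs):
    """Greedy generation: always take the most probable token."""
    return int(np.argmax(probs))

def top_k_sample(probs, k=3):
    """Top-k sampling: keep the k most probable tokens and sample
    among them proportionally to their predicted probabilities."""
    top = np.argsort(probs)[-k:]
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

probs = np.array([0.05, 0.4, 0.3, 0.15, 0.1])  # toy next-token distribution
print(greedy(probs))        # always token 1
print(top_k_sample(probs))  # token 1, 2, or 3
```

Greedy generation is deterministic, while top-k sampling introduces the controlled randomness that lets the model escape repetitive continuations.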

In this case, we will use the first two methods because of the simplicity of their implementation.

Let’s find out what our model produces with greedy generation. As a test, let’s take the beginning of a sentence: “There is beauty in everything, but”. Let’s see how the model continues the statement. The results are as follows:

It can be seen that the text turned out to be incoherent and meaningless.

Let’s try to continue the same sentence with the \textit{Top k sampling} generation method. The results are as follows:

It can be seen that the text similarly turned out to be meaningless.

Working with Transformers

The RNN did not do well, so let’s try heavier-weight solutions, namely Transformers. They almost always give a good result, especially on this problem. Why did we not use this model right away? For all its power, it takes a long time to train – some models take hours even for a single epoch. Therefore, already trained models are more often used and fine-tuned for specific cases. That is what we will do: we will take a model from the famous site \href{}{Hugging Face}. After a short search, we find the following model, which suits our task well:

This is a pre-trained model from Sberbank, based on the well-known Transformer \textbf{GPT-2}.

Checking the pre-trained model

Transformers are famous for working well right out of the box. So let’s immediately see how this model continues the sentence from the previous section:

The text looks much more meaningful – you could even recite it to your beloved. But it is still interesting: if we fine-tune the model on our dataset, will it be able to imitate plausible statements of great people?

Data processing

Data processing largely repeats what we did for the \textbf{RNN}. We add tokens for the end and the beginning of each sentence and bring all sentences to the same length, either by trimming or by padding with the same \<PAD\> symbol. The only difference is that, to handle these symbols, the Transformer must also be given a so-called “mask”, which contains zeros wherever the gradient should not flow – that is, in the positions of the \<PAD\> tokens.
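Building such a mask is straightforward; a minimal sketch (the token ids here are made up for illustration):

```python
PAD = 0
batch = [
    [101, 8, 15, 23, 102, PAD, PAD],   # a padded sentence
    [101, 4, 102, PAD, PAD, PAD, PAD],  # a shorter one
]

# 1 where there is a real token, 0 where <PAD> stands;
# the Transformer ignores the zeroed positions, so no gradient
# flows through the padding.
attention_mask = [[int(tok != PAD) for tok in seq] for seq in batch]
print(attention_mask)
```

The mask has the same shape as the token batch and is passed to the model alongside the input ids.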

The result looks very much like the statement of a great man named “I don’t know”.

In general, the result is not bad.


Unfortunately, although RNNs are not particularly heavyweight models, they perform worse than Transformers. We saw that pre-trained Transformers model human language well and can easily continue your thought as if it were the thought of a great philosopher.


The notebook in which the work was carried out: \url{};
