FastText is an excellent source of ready-made word vector representations for a wide range of ML and NLP problems. Its main drawback is size: the FastText model trained on the Russian-language Wikipedia corpus currently occupies a little more than 16 GB, which significantly narrows the range of situations where this technology can be used.
Examples of this kind of compression have already been described on Habr by David Dale in the article "How to compress the fastText model 100 times." I followed the recommendations from that article, and I will return to them, but it has partly lost its relevance: some of the methods it uses no longer work in Gensim 4.0. In addition, the approach described there is general-purpose — the compressed model is not tailored to any narrow task — and, as practice shows, on narrower tasks such a model loses noticeably more quality than the examples in that article suggest.
In this article I will describe how I compressed a FastText model for a specific, local task, with the primary goal that its results should not differ from those of the original FastText model.
The essence of my method is to exclude unused words from the FastText vocabulary. The "wiki_ru" model, for example, stores 1.88 million vocabulary words and 2 million n-gram hash buckets, each mapped to a 300-dimensional vector.
For my local task I reduced this to 80 thousand words and 100 thousand n-gram buckets, and thus obtained an almost 80-fold reduction in model size. At the same time, I did not want to reduce the dimensionality of the vectors or apply quantization, since both inevitably degrade quality by discarding part of the information the vectors carry.
So, the first step was to collect the list of all words (tokens) from my training corpus, which was stored in the train_input.txt file.
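A minimal sketch of this step might look as follows; the function name and the whitespace-plus-word tokenizer are my assumptions, since the article does not show how the corpus was tokenized:

```python
from collections import Counter
import re

def corpus_tokens(path):
    """Count word tokens in a training corpus (one document per line).
    A sketch: the real tokenizer used for train_input.txt may differ."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # lowercase and keep alphanumeric runs as tokens
            counts.update(re.findall(r"\w+", line.lower()))
    return counts
```

The frequency counts are useful later, when the vocabulary has to stay sorted by frequency of occurrence.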
To build my own vocabulary, I used the gensim library and its FastText trainer. This is probably not the optimal way, but it seemed flexible enough for exactly this problem, since it let me control the size of the resulting vocabulary with parameters such as min_count.
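A sketch of vocabulary construction with gensim 4.x (the toy in-memory corpus below stands in for the real train_input.txt, and the exact parameter values are my assumptions):

```python
from gensim.models import FastText

# Toy corpus standing in for train_input.txt; in the real pipeline
# you would stream the file instead.
sentences = [
    ["компрессия", "модели", "fasttext"],
    ["компрессия", "словаря", "модели"],
    ["векторы", "модели"],
]

# build_vocab alone is enough to get the filtered vocabulary;
# min_count drops tokens seen fewer than that many times
model = FastText(vector_size=300, min_count=2, min_n=3, max_n=6)
model.build_vocab(corpus_iterable=sentences)

corpus_words = list(model.wv.key_to_index)  # frequency-ordered vocabulary
print(corpus_words)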
As the next step, I wanted the resulting model to return similar results for similarity queries on related words. Therefore, in addition to the words from my corpus, I also added to the future vocabulary of my model, for each of those words, its top-10 similar words obtained with the most_similar method.
But even this may not be enough. We know that the fastText model stores not only words but also their character n-grams: it breaks each word into n-grams and keeps those in its vocabulary as well. Therefore, as the next step, I split every word of the resulting vocabulary into n-grams and added them to the vocabulary too.
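The splitting itself follows fastText's scheme: the word is first wrapped in `<` and `>` boundary markers, then all substrings of length min_n to max_n are taken. A minimal sketch, mirroring gensim's compute_ngrams (the function name here is my own):

```python
def fasttext_char_ngrams(word, minn=3, maxn=6):
    """Split a word into character n-grams the fastText way:
    wrap it in '<' and '>' markers, then take every substring
    of length minn..maxn. A sketch."""
    extended = f"<{word}>"
    ngrams = []
    for n in range(minn, min(len(extended), maxn) + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams
```

For a short word like "кот" this yields n-grams such as `<ко`, `кот`, `от>`, `<кот`, `кот>`, and `<кот>`.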
Thus, I obtained a combined vocabulary of 32 thousand words and 50 thousand word n-grams, 82 thousand entries in total. However, I decided not to stop there: following the recommendation from the article mentioned above, I added another 8,000 of the most frequent words from the "wiki_ru" FastText model, to make the model more robust, including to new, unseen words.
Next, the final vocabulary was assembled from the collected words. It was important that the word order did not differ from the base model, since its vocabulary is sorted by frequency of occurrence.
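Since gensim stores index_to_key most-frequent-first, preserving the parent's ordering amounts to filtering that list. A sketch (the helper name is hypothetical):

```python
def order_like_parent(parent_index_to_key, kept_words):
    """Build the final vocabulary in the parent model's order.
    gensim's index_to_key is sorted most-frequent-first, so
    filtering it preserves the frequency ordering. A sketch."""
    kept = set(kept_words)
    return [w for w in parent_index_to_key if w in kept]
```

Here kept_words would be the union of the corpus words, their most_similar neighbors, and the 8,000 most frequent words from "wiki_ru".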
After generating the vocabulary, an important step in setting up the new FastText model is repacking the n-gram hash matrices. The method was described in the article by Andrey Vasnetsov… However, that code also had to be slightly modified because of the gensim library update.
As a result
These transformations produced a model that, for the task at hand, is in no way inferior in the quality of its results to the parent model. This is clearly visible when running most_similar queries.
Vector representations of words and sentences, as well as the similarity result tables, were comparable between the two models.
The code for compressing the model and then applying it is available in my repositories on GitHub.
If you have further ideas on how to improve a model compressed this way, share them in the comments; I will be very glad to see them.