Translation of a pre-trained Keras model to matrix calculations

What this article is about

For one of my projects, I needed to build a news aggregator in Telegram. The aggregator collects news from a fixed list of news portals and then has to filter it by relevance: remove advertising and other items that, for various reasons, do not meet the requirements. It was impossible to formulate exact criteria for “bad” news, but the news items had been labeled (by “natural intelligence”, i.e. by a person) as 0 for “good” and 1 for “bad”. Continuous manual filtering is very time-consuming, so the idea of automatic filtering based on machine learning arose.

After a long search for an implementation (more on that below), I built a Keras neural network that gave good quality. It then turned out that Keras could not be installed on the target infrastructure (there was simply no suitable build), so I had to work out how to move the trained Keras model to an implementation that does not require Keras. I did not find relevant material on the Internet (except that here the author did something similar, only for LSTM), so I did it myself.

This article is about how I rewrote a network trained in Keras as plain matrix operations in Python with NumPy. Along the way, it helped me “look under the hood” of the neural network.

Note that the code is simplified for clarity, but on the whole it is fully functional.

A little about myself

I think it is important to say that when the task arrived, I had no experience in data science, only amateur experience in creating Telegram bots in Python. I simply started looking on the Internet at how others do text classification and found the classic page from sklearn. The first “commercial” version was built along those lines.

Most importantly, the success with this task led me to decide to change my specialty (in which I have 20 years of experience) and become a data science professional; a couple of years later I completed the corresponding specialized training.

Choosing a Text Classification Model

The purpose of this article is not to analyze the methods of text classification, but without a description of the path that was taken to create a working model, the article would be incomplete.

The customer's task is a typical text classification problem, and since there are only two categories (bad / good news), it is the subtype usually referred to as “binary classification”.

Classically in machine learning problems (see CRISP-DM), you must 1) pre-process the initial data, 2) prepare the features, including the target, and 3) train the model (and most likely return to the first stage again to improve the quality of the model).

For the models used in the project (except, perhaps, BERT), the text has to be lemmatized (and possibly stemmed), stop words removed, and HTML tags and various “garbage” cleaned out (after all, the news is collected from different sites). My project uses pymorphy2 for lemmatization and regular expressions to filter out everything except text. This article does not cover these steps; there is plenty of detailed material on the Web.

By the way, most of the news sites had no RSS feed (they are “local” sites and may not be very familiar with it), so I had to make heavy use of beautifulsoup to parse the HTML versions of the sites and extract the news from them. (I wonder how big aggregators such as Google or Yandex solve this. Do they write their own parser for each site?)

We have an unbalanced sample: news of the positive class makes up 30% of the entire sample, so it is reasonable to apply some kind of balancing method during training. I used “upsampling” (duplicating news of the positive class) and clearly saw that this simple method significantly improves the quality of the model.
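The upsampling idea can be sketched roughly as follows. This is a minimal illustration rather than the project's code: the function name upsample_minority, the random choice of which items to duplicate, and the toy data are all assumptions made for the example.

```python
import random


def upsample_minority(texts, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority samples until both classes have equal size."""
    rng = random.Random(seed)
    minority = [(t, l) for t, l in zip(texts, labels) if l == minority_label]
    majority = [(t, l) for t, l in zip(texts, labels) if l != minority_label]

    # Nothing to do if the "minority" class is missing or already large enough.
    if not minority or len(minority) >= len(majority):
        return list(texts), list(labels)

    # Draw duplicates (with replacement) to close the gap between the classes.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)

    new_texts, new_labels = zip(*combined)
    return list(new_texts), list(new_labels)


# Toy data: 1 is the positive ("bad") class, 0 is "good".
texts = ["ad 1", "news 1", "news 2", "news 3", "ad 2", "news 4", "news 5"]
labels = [1, 0, 0, 0, 1, 0, 0]

up_texts, up_labels = upsample_minority(texts, labels)
print(up_labels.count(0), up_labels.count(1))
```

Usually only the training split is upsampled, so that the validation metrics are still measured on the original class proportions.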

Implementation based on TF-IDF and sklearn models

In the first implementation of the classifier, which was in production for a long time, I used TF-IDF to create a vector representation of the text, namely TfidfVectorizer with the parameter max_features = 20, selected empirically.

A lot can be written about metrics; the article “Metrics in Machine Learning Problems” helped me a lot. For this article I'll use F1, which combines precision and recall and so reflects the model's ability to distinguish between classes on an unbalanced sample better than plain accuracy.
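For reference, F1 is the harmonic mean of precision and recall. A small sketch of the calculation in plain Python is given below; the function name f1_score_binary and the toy labels are illustrative assumptions, and in practice a library function such as the one in sklearn would normally be used.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # With no true positives, precision and recall are both zero (or undefined).
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

f1 = f1_score_binary(y_true, y_pred)
print(f1)  # precision = recall = 0.75, so F1 = 0.75
```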

After many stages of preprocessing, selection and training of models, the following results were obtained on the validation set (see table below).

Model	F1 score
Random Forest	0.82
SGDClassifier	0.82
LogisticRegression	0.81
MultinomialNB	0.69
KNeighborsClassifier	0.80
LGBMClassifier*	0.82

* This is, of course, a model not from sklearn but from LightGBM; it is included for quality comparison.

What conclusions can be drawn from the TF-IDF-based models?

All models have roughly the same quality.
For a practical task, you can choose any of them. I ended up choosing SGDClassifier.

But the search for a better model did not stop.

Models based on BERT

The study also tested a BERT-based model (including fine-tuning of the last layer). The implementation used the rubert-tiny2 version from the following source, pre-trained on a large number of texts in Russian.

That is, BERT was used to vectorize the source text without cleaning (BERT tolerates raw text, since that is what it is trained on), and the models previously used with TF-IDF were trained on these vectors.

The F1 metric in this case rose to 0.87, significantly higher than with TF-IDF, but it was not possible to deploy BERT in the production environment because no corresponding version is available for it.

It was decided to go with a model based on a neural network.

Neural network classifier

After studying the materials and testing various options and configurations, I chose a fairly simple neural network model with a good quality-to-resources ratio. Its F1 metric is 0.88, higher than anything obtained previously.

The model diagram is shown in Figure 1

Figure 1. Neural network model for text classification

In code form, this model looks like this:

import tensorflow as tf

vocab_size = 1000 # number of unique words in the vocabulary
embedding_dim = 40 # size of the embedding vector for each word
max_length = 100 # maximum length of a news item, in words

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss="binary_crossentropy",optimizer="adam",
              metrics=[tf.metrics.BinaryAccuracy(threshold=0.5)])

num_epochs = 10
history=model.fit(features_train, 
                  training_labels_final, 
                  epochs=num_epochs, 
                  validation_data=(features_valid, testing_labels_final))

So we selected and trained the model and got the quality we wanted. But there is no TensorFlow on the target system, only standard Python 3 and, at most, the NumPy library; that is, we can't just save the model and implement the prediction like

predictions = model.predict(news)

The model has to be converted into ordinary “matrix calculations”, which requires the following steps:

get the weights of the trained model,
understand how each stage works,
create a code to calculate the prediction.

Obtaining Weights and Bias of a Trained Model

You can get the weights of the i-th layer using the following command:

weights = model.layers[i].get_weights()[0]

The bias, if the layer has one, is obtained using the command:

bias = model.layers[i].get_weights()[1]
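For the model defined above, the arrays returned by get_weights() should have the shapes listed in the sketch below. The Embedding layer has a single weight matrix and no bias, each Dense layer has a weight matrix and a bias vector, and the pooling layer has no weights at all, so get_weights() returns an empty list for it and indexing it with [0] would fail. The labels in the list are just descriptive names chosen for this example.

```python
from math import prod

vocab_size = 1000
embedding_dim = 40
hidden_units = 6

# Expected shapes of the arrays in layer.get_weights(), layer by layer.
expected_shapes = [
    ("Embedding", [(vocab_size, embedding_dim)]),
    ("GlobalAveragePooling1D", []),  # no trainable weights
    ("Dense(6)", [(embedding_dim, hidden_units), (hidden_units,)]),
    ("Dense(1)", [(hidden_units, 1), (1,)]),
]

total = 0
for name, shapes in expected_shapes:
    # Number of parameters is the product of each shape, summed over the arrays.
    n_params = sum(prod(shape) for shape in shapes)
    total += n_params
    print(f"{name:24s} {shapes} -> {n_params} parameters")

print("total:", total)  # 40000 + 0 + 246 + 7 = 40253
```

Almost all of the parameters sit in the embedding matrix, which is why the exported configuration is dominated by emb_weights.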

Checking the Trained Model

There is a very useful self-check tool for implementing the model: you can run the trained model on a sample and see the intermediate values it calculates at each stage.

from tensorflow import keras
from tensorflow.keras import layers

extractor = keras.Model(inputs=model.inputs,
                        outputs=[layer.output for layer in model.layers])

features = extractor( features_valid[0].numpy().reshape(-1,100))
print(features)

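The output is something like this (judging by the shapes, this sample comes from a run with an embedding size of 20 rather than the 40 used above):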

[<tf.Tensor: shape=(1, 100, 20), dtype=float32, numpy=
array([[[-0.3023754 ,  0.0460441 , -0.03640036, ...,  0.14973998,
          0.04820368, -0.16159618],
        [-0.16039295,  0.25132295, -0.13751882, ...,  0.16573162,
         -0.15154448, -0.0574923 ],
        [-0.3023754 ,  0.0460441 , -0.03640036, ...,  0.14973998,
          0.04820368, -0.16159618],
        ...,
        [-0.3023754 ,  0.0460441 , -0.03640036, ...,  0.14973998,
          0.04820368, -0.16159618],
        [-0.22955681, -0.08269349,  0.13517892, ...,  0.00153243,
          0.13046908, -0.16767927],
        [-0.3023754 ,  0.0460441 , -0.03640036, ...,  0.14973998,
          0.04820368, -0.16159618]]], dtype=float32)>, <tf.Tensor: shape=(1, 20), dtype=float32, numpy=
array([[-0.18203291,  0.11690798, -0.08938053,  0.10450792, -0.09504858,
        -0.08279163,  0.29856998, -0.23120254, -0.2559827 , -0.12028799,
         0.00566523, -0.06708373,  0.05338131, -0.15103005,  0.08447236,
         0.10225956, -0.33394486,  0.15348543, -0.04525973, -0.07986856]],
      dtype=float32)>, <tf.Tensor: shape=(1, 6), dtype=float32, numpy=
array([[1.9048874 , 0.07643622, 1.4660159 , 1.907875  , 0.02882011,
        0.        ]], dtype=float32)>, <tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[0.0283242]], dtype=float32)>]

That is, the code displays the output of each layer, so you can check whether we have implemented the calculations correctly.
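Comparing such dumps by eye is tedious, so the check can be automated. Below is a minimal sketch of a comparison helper; the name compare_stage, the tolerance, and the stand-in numbers are assumptions for illustration. A small tolerance is needed because Keras computes in float32, while NumPy works in float64 by default.

```python
import numpy as np


def compare_stage(name, expected, actual, atol=1e-5):
    """Report whether two intermediate outputs agree within an absolute tolerance."""
    expected = np.asarray(expected, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)

    if expected.shape != actual.shape:
        print(f"{name}: shape mismatch {expected.shape} vs {actual.shape}")
        return False

    max_diff = float(np.max(np.abs(expected - actual))) if expected.size else 0.0
    ok = max_diff <= atol
    print(f"{name}: max abs difference {max_diff:.2e} -> {'OK' if ok else 'MISMATCH'}")
    return ok


# Stand-in values. In practice `expected` would come from the Keras extractor,
# e.g. features[2].numpy()[0], and `actual` from the NumPy implementation.
expected = np.array([1.9048874, 0.07643622, 1.4660159, 1.907875, 0.02882011, 0.0],
                    dtype=np.float32)
actual = expected.astype(np.float64) + 1e-7

ok = compare_stage("dense_6", expected, actual)
```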

TextVectorization

TextVectorization is the tf.keras.layers layer that converts text to numeric tensors. It can perform text standardization, tokenization and vectorization. It can also create a dictionary of frequently occurring words and map them to integer indices.

In my implementation, it does the following (see Figure 2):

Assigns a numeric identifier (from 2 up to the number of unique words) to every unique word in the text corpus. The hyperparameter max_tokens specifies the maximum number of unique words; all other words are denoted by 1.
Converts the text to vector form, in which each word is replaced by the numeric identifier chosen in the previous step. The text is truncated to the maximum length given by the parameter output_sequence_length; conversely, if the text is shorter than the maximum, it is padded with zeros.

Figure 2. How the TextVectorization module works

Note that texts must be preprocessed (lemmatized and cleaned) before they are passed to TextVectorization.

TextVectorization with an example (but remember that we will need to do this without using Keras):

import tensorflow as tf

# define TextVectorization: at most 10 unique words, maximum text length of 8 words
vectorize_layer = tf.keras.layers.TextVectorization(
#     standardize=custom_standardization,
    max_tokens=10,
    output_mode="int",
    output_sequence_length=8)


test_texts=["chatgpt чатбот с искусственный интеллект разработать компания openai и способен работать в диалоговый режим",
            "чатбот нет аналоги в россия разработка"]

vectorize_layer.adapt(test_texts)

features_train = vectorize_layer(test_texts)

print("Transformed sample:", features_train)

print("Vocabulary. The index of a word in the vocabulary is its numeric identifier:", vectorize_layer.get_vocabulary())

Transformed sample: tf.Tensor(
[[1 2 5 1 1 9 1 1]
 [2 1 1 3 6 8 0 0]], shape=(2, 8), dtype=int64)
Vocabulary. The index of a word in the vocabulary is its numeric identifier: ['', '[UNK]', 'чатбот', 'в', 'способен', 'с', 'россия', 'режим', 'разработка', 'разработать']
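To make the mechanics concrete, the same vocabulary and index vectors can be reproduced in plain Python. The sketch below is only an approximation of what the layer does, under stated assumptions: words are split on whitespace, index 0 is reserved for padding and index 1 for unknown words, and ties in frequency are broken in reverse alphabetical order simply because that reproduces the sample output above. The function names build_vocabulary and vectorize are made up for this example.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Return the vocabulary list: padding, unknown token, then the most frequent words."""
    counts = Counter(word for text in texts for word in text.split())
    # Two-step stable sort: reverse alphabetical first, then by descending frequency.
    # The tie-breaking rule is an assumption that matches the sample output.
    by_name = sorted(counts, reverse=True)
    ranked = sorted(by_name, key=lambda word: -counts[word])
    return ["", "[UNK]"] + ranked[: max_tokens - 2]


def vectorize(text, word_to_index, sequence_length):
    """Map words to indices (1 for unknown), truncate, and pad with zeros."""
    ids = [word_to_index.get(word, 1) for word in text.split()[:sequence_length]]
    return ids + [0] * (sequence_length - len(ids))


test_texts = ["chatgpt чатбот с искусственный интеллект разработать компания openai и способен работать в диалоговый режим",
              "чатбот нет аналоги в россия разработка"]

vocabulary = build_vocabulary(test_texts, max_tokens=10)
word_to_index = {word: index for index, word in enumerate(vocabulary)}
vectors = [vectorize(text, word_to_index, sequence_length=8) for text in test_texts]

print(vocabulary)
print(vectors)
```

In the project itself the vocabulary is not rebuilt this way; it is exported from the trained layer with get_vocabulary(), which guarantees that the indices match the ones the network was trained on.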

In the real task, the vocabulary size is 1000 words and the maximum text length is 100 words (the average length of a news text in the sample is 187 words, so the text is cut somewhat).

The “matrix implementation” of TextVectorization that I made is the following:

import numpy as np

# vocab_size, max_length and vocal_dict (word -> index) come from the trained model
# one-hot vector of the padding token (index 0), used to pad short texts
zero_line = [1] + [0] * (vocab_size - 1)

def text_to_numbers(text):
    out = []
    for word in text.split()[:max_length]:
        # create a vocabulary-sized vector of zeros
        line = [0] * vocab_size

        # put a one at the word's index; unknown words get index 1 ([UNK])
        line[vocal_dict.get(word, 1)] = 1
        out.append(line)

    # if the text is shorter than the maximum, pad it with padding-token vectors
    out += [zero_line] * (max_length - len(out))
    return np.array(out)

The embedding layer is described next, but note here that I moved the conversion to a sparse (one-hot) matrix into the text_to_numbers function. That is, to simplify the code, I combined TextVectorization with part of the Embedding layer's work (creating the sparse matrix).

Neural network layers

The neural network model itself consists of the following layers:

Embedding layer

In general, embeddings are needed in order to represent categorical features as numerical vectors of smaller dimensions. This allows you to improve the performance and accuracy of machine learning models, as well as extract the semantic and syntactic properties of language entities.

The Embedding layer is the tf.keras.layers layer that converts integer sequences to dense vectors. The output of the Embedding layer is a 3D tensor with shape (batch_size, output_sequence_length, embedding_dim). Unlike popular pre-trained embeddings, in our case, it is trained during neural network training (back-propagation).

The layer receives as input the numerical indices prepared by TextVectorization (see Fig. 2) and then converts them into vectors of the form:

[
[0, 0, ….., 1, 0, …, 0]
[0, 0, ….., 0, 1, …, 0]
…
[1, 0, ….., 0, 0, …, 0],
[0, 0, ….., 0, 1, …, 0]
[0, 0, ….., 1, 0, …, 0]
…
[0, 0, ….., 0, 0, …, 1],
]

In this representation, each word corresponds to a vector consisting of zeros except for a single one, placed at the index equal to the word's numeric identifier (recall that in my implementation this is done in the function text_to_numbers).

Thus, if the input was a vector of size (100), i.e. 100 numeric indices corresponding to words, it is converted into a matrix of size (100, 1000): 100 vectors, each consisting of 1000 elements, zeros and ones.

In the resulting model, the embedding vector has 40 elements (this value was selected empirically). Each one-hot vector is multiplied by the trained weight matrix of size (1000, 40), so the output of the embedding layer is a matrix of size (100, 40).

“Matrix implementation” of the Embedding layer:

# the input is a sparse (one-hot) matrix of shape (max_length, vocab_size)
def embedding(data):
    emb_out = []
    for char_hot in data:
        # multiplying a one-hot vector by the weights selects that word's embedding
        emb_out.append(np.dot(char_hot, emb_weights))
    return np.array(emb_out)

This code has one unpleasant feature: the matrices it multiplies are sparse (almost all zeros), so the multiplication wastes significant computational resources. In my project this doesn't matter, because only several dozen news items are checked at a time. But if you need to process several thousand news items, it can take a long time (on a Core i5 2500K with 8 GB of RAM and no GPU, 2000 news items take about 3 minutes). To get around this limitation, you can use libraries that implement sparse matrix operations, for example scipy.
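Another way around it, sketched here as an alternative rather than what the project does: because every input row is one-hot, multiplying it by the weight matrix just picks out one row of emb_weights, so plain NumPy indexing by the word indices gives the same result without ever building the (100, 1000) matrix. The weights below are random stand-ins, not the trained ones, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, max_length = 1000, 40, 100

# Random stand-ins for the trained embedding matrix and one tokenized news item.
emb_weights = rng.normal(size=(vocab_size, embedding_dim))
token_ids = rng.integers(0, vocab_size, size=max_length)

# Route 1: build the one-hot matrix and multiply, as in the embedding() function.
one_hot = np.zeros((max_length, vocab_size))
one_hot[np.arange(max_length), token_ids] = 1.0
via_matmul = one_hot @ emb_weights

# Route 2: select the rows directly by index.
via_lookup = emb_weights[token_ids]

same = np.allclose(via_matmul, via_lookup)
print(via_lookup.shape, same)
```

With this approach, text_to_numbers would return the list of word indices instead of one-hot vectors, and the rest of the pipeline would stay unchanged.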

GlobalAveragePooling1D layer

This layer simply averages the matrix over the word axis: its input is a matrix (100, 40), and its output is a vector of 40 elements.

The code for its implementation is the following:

def average(data):
    # mean over the word axis: (100, 40) -> (40,)
    av_out = np.mean(data, axis=0)
    return av_out

Dense layer

This one is also quite simple: it is essentially the multiplication of the input vector by the weights of the “neurons”, plus the bias, followed by the ReLU activation. In our case there are 6 neurons, and the weights in this layer have the size (40, 6) = (embedding size, number of neurons).

def ReLU(x):
        return x * (x > 0)

def dense_6(data):
        dense_6_out = ReLU(np.dot(data, dense_6_weights) + dense_6_bias)
        return dense_6_out

The output is a vector of 6 elements.

Output layer

Next, we multiply the resulting vector by the weight vector of the output layer, add the bias, and apply the sigmoid.

def sigmoid(data):
    return 1 / (1 + np.exp((-1) * data))

def dense_out(data):
    # weights_out has shape (6, 1); bias_out holds a single value
    _dense_out = sigmoid(np.dot(data, weights_out) + bias_out[0])
    return _dense_out

The prediction is ready!

The complete predictor class looks like this:

import numpy as np


class Predictor:
    def __init__(self, emb_weights, dense_6_weights, dense_6_bias,
                 weights_out, bias_out, vocal_dict, vocab_size,
                 max_length, show_intermediate_data=False):

        self.emb_weights = emb_weights
        self.dense_6_weights = dense_6_weights
        self.dense_6_bias = dense_6_bias
        self.weights_out = weights_out
        self.bias_out = bias_out
        self.max_length = max_length
        self.vocab_size = vocab_size
        self.show_data = show_intermediate_data
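        # invert the vocabulary list into a word -> index dictionary
        self.vocal_dict = {vocal_dict[k]: k for k in range(self.vocab_size)}
        # one-hot vector of the padding token (index 0)
        self.zero_line = [1] + [0] * (self.vocab_size - 1)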
       
    def text_to_numbers(self, text):
        out = []
        for word in text.split()[:self.max_length]:
            line = [0] * self.vocab_size
            line[self.vocal_dict.get(word, 1)] = 1
            out.append(line)
        
        out += [self.zero_line] * (self.max_length - len(out))

        return np.array(out)


    def predict(self, x):
        results = []
        for sentence in x:
            emb_out = self.embedding(self.text_to_numbers(sentence))
            out_average = self.average(emb_out)
            out_dense_6 = self.dense_6(out_average)
            results.append(self.dense_out(out_dense_6))
        return np.array(results)

    def embedding(self, data):
        emb_out = []
        for char_hot in data:
            emb_out.append(np.dot(char_hot, self.emb_weights))
        emb_out = np.array(emb_out)

        if self.show_data:
            print(f'embedding out:{emb_out}')

        return emb_out

    def average(self, data):
        av_out = np.mean(data, axis=0)

        if self.show_data:
            print(f'average out:{av_out}')

        return av_out

    def dense_6(self, data):
        dense_6_out = self.ReLU(np.dot(data, self.dense_6_weights) + self.dense_6_bias)

        if self.show_data:
            print(f'Dense 6 out:{dense_6_out}')

        return dense_6_out

    def dense_out(self, data):
        _dense_out = self.sigmoid(np.dot(data, self.weights_out) + self.bias_out[0])

        if self.show_data:
            print(f'Final out:{_dense_out}')

        return _dense_out


    def ReLU(self, x):
        return x * (x > 0)

    def sigmoid(self, data):
        return 1 / (1 + np.exp((-1) * data))

config_dict={
    
    'emb_weights':model.layers[0].get_weights()[0].tolist(),
    'dense_6_weights':model.layers[2].get_weights()[0].tolist(),
    'dense_6_bias': model.layers[2].get_weights()[1].tolist(),
    'weights_out':model.layers[3].get_weights()[0].tolist(),
    'bias_out':model.layers[3].get_weights()[1].tolist(),
    "vocab_size":vocab_size,
    "max_length":max_length,
    "vocal_dict":vectorize_layer.get_vocabulary()
    
}

# Usage
predictor=Predictor(**config_dict, show_intermediate_data=False) 

prediction = predictor.predict(testing_sentences)

I save the configuration (after training) and load it (in the production environment) with the following code:

import json

# save the config
with open('./data/config.json', 'w') as fp:
    json.dump(config_dict, fp)

import json

# load the config
with open('./data/config.json', 'r') as fp:
    config_dict = json.load(fp)
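One detail worth keeping in mind: after json.load the weights come back as nested Python lists. np.dot accepts lists, but it converts them to arrays on every call, so it may be worth converting them once after loading. The sketch below shows a round trip on a tiny made-up config; the key names follow config_dict above, while the values, the temporary path, and the WEIGHT_KEYS tuple are assumptions for the example.

```python
import json
import os
import tempfile

import numpy as np

# A tiny stand-in config with the same kind of content as config_dict.
config = {
    "dense_6_weights": [[0.1, -0.2], [0.3, 0.4]],
    "dense_6_bias": [0.0, 0.5],
    "vocab_size": 1000,
}

# Save and load through a temporary file instead of ./data/config.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config.json")
    with open(path, "w") as fp:
        json.dump(config, fp)
    with open(path, "r") as fp:
        loaded = json.load(fp)

# Convert only the weight entries to arrays; scalars and the vocabulary stay as they are.
WEIGHT_KEYS = ("dense_6_weights", "dense_6_bias")
for key in WEIGHT_KEYS:
    loaded[key] = np.asarray(loaded[key], dtype=np.float32)

print(type(loaded["dense_6_weights"]).__name__, loaded["dense_6_weights"].shape)
```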

Conclusion

As a result, I was able to convert the trained Keras model to run on “standard” Python libraries, without installing Keras/TensorFlow, which made it usable in the production environment.

If layers are added to the model, the predictor class can be extended with the corresponding calculations.
