CBOW architecture learning algorithm for word vectorization
This article takes a detailed look at the learning algorithm of the CBOW (Continuous Bag of Words) architecture, which appeared in 2013 and gave a strong impetus to the problem of vector representation of words, since it was one of the first times a neural-network approach was used for this task in practice. The CBOW architecture is not very demanding of a GPU and can be trained on a CPU (albeit more slowly). Large off-the-shelf models trained on Wikipedia or news feeds can run on a 4-core processor with acceptable response times.
CBOW architecture
The CBOW architecture was proposed by a group of researchers from Google in 2013, together with the Skip-gram model. The logic of CBOW is very simple: predict a word from the context in which it occurs. Figure 1 shows the neural network architecture for the simplest implementation of CBOW, which uses just one context word before the target word. Later in the article we will move from this simple one-word model to a larger context.

The neural network consists of an input layer whose dimension equals the size of the vocabulary of the text corpus, a hidden layer of dimension N, and an output layer of the same dimension as the input layer. The activation function of the hidden layer is linear, and in the output layer it is the generalized logistic (softmax) activation function:

y_j = exp(u_j) / Σ_{j'=1}^{V} exp(u_{j'}),

where u_j is the value arriving at output node j. The weight matrix between the input layer and the hidden layer is denoted by W and has dimension V×N. The weights between the hidden layer and the output layer are denoted by W’; they have dimension N×V. It is worth noting that the words must be vectorized before training. This can be done using a simple one-hot (unitary) code or, for example, the TF-IDF statistical measure.
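For illustration, a one-hot (unitary) encoding can be built in a few lines of Python; the toy vocabulary and helper name below are purely illustrative:

# A toy vocabulary and a one-hot encoder (illustrative example).
vocab = ["the", "cat", "sat", "on", "mat"]            # V = 5
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length V with a single 1 at the position of the word."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("cat"))   # [0, 1, 0, 0, 0]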
Let us consider how the output value is calculated for the architecture shown in Figure 1. First, the weights are initialized from the standard normal distribution. Then the values of the hidden layer are calculated using the formula:

h = W^T·x

In the same way, the vector u = W’^T·h is calculated after passing through the hidden layer, before it enters the activation function of the output layer.
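As a sketch, this forward pass can be written in NumPy as follows (the dimensions V and N and the seed are illustrative; the matrix names follow the notation introduced above):

import numpy as np

V, N = 5, 3                                 # vocabulary size and hidden layer size (illustrative)
rng = np.random.default_rng(0)
W = rng.standard_normal((V, N))             # input -> hidden weights, dimension V x N
W_prime = rng.standard_normal((N, V))       # hidden -> output weights, dimension N x V

x = np.zeros(V)
x[2] = 1.0                                  # one-hot vector of the context word

h = W.T @ x                                 # hidden layer (linear activation): h = W^T x
u = W_prime.T @ h                           # scores before the output activation: u = W'^T h
y = np.exp(u) / np.exp(u).sum()             # softmax (generalized logistic) output, sums to 1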
The choice of a loss function for evaluating the output layer of the neural network is based on obtaining values in the range from 0 to 1. Given that the output uses a logistic activation function, which in this problem can be interpreted as the conditional probability of a word appearing after a certain context, we can define the loss function as follows:

E = -log p(w_t | w_c),

where w_t is the target word and w_c is the context word.
With a little transformation, you can get a simpler notation:

To solve the problem, we need to minimize this loss function, thereby maximizing the probability that the model assigns to the target word given the context (which at this stage consists of a single word).
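Assuming a softmax output and writing j* for the index of the target word w_t, the simplified notation referred to above is usually written as

E = -log y_{j*} = -u_{j*} + log Σ_{j'=1}^{V} exp(u_{j'}),

so the loss depends on the weights only through the vector of scores u.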
Learning CBOW with one word in context
Having obtained an expression for the loss function, we need to find the values of W and W’ that minimize it. The problem of neural network optimization is most often solved using gradient descent. In order to implement the backpropagation algorithm based on gradient descent, it is necessary to find the partial derivatives of the loss function with respect to W and W’:

It is easiest to establish the relationship between W, W’ and the loss function through the vector u,

that is, we can write the following relation:

Thus, for partial derivatives it will be true:


These relations are sufficient for training the CBOW architecture.
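In sketch form, these relations follow from the chain rule: the loss depends on W and W’ only through u, since u = W’^T·h and h = W^T·x, so

∂E/∂W’ = (∂E/∂u)·(∂u/∂W’)   and   ∂E/∂W = (∂E/∂u)·(∂u/∂h)·(∂h/∂W).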
Let us consider how the model can be simplified. Note that the weight W’_ij, which is an element of W’, connects node i of the hidden layer with node j of the output layer. Therefore W’_ij affects only the score u_j (and hence y_j), as shown in Figure 2. In addition, in the simplified model (with one word of context) the vector x fed to the input of the neural network contains zeros in every position except position k, where there is a one. As a result, after passing through the input layer, only the k-th row of W contributes to the hidden layer; all other products are equal to 0. This assumption is valid when one-hot (unitary) encoding is used for vectorization.

Based on the foregoing, it can be noted that the partial derivative

is equal to zero everywhere except when k = j. The derivative can then be rewritten as follows:

Let us now rewrite

in the form

where

is the Kronecker delta, a function of two variables that is equal to 1 when its arguments are equal, and equal to 0 otherwise. For the second part of the equation we have:

Then, after combining the left and right parts, we get:
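Writing t_j for the one-hot target indicator (t_j = 1 at the index of the target word and 0 otherwise), the combined result takes the standard form

∂E/∂w’_{ij} = (y_j - t_j)·h_i,

that is, the prediction error of output node j multiplied by the value of hidden node i.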

Similarly, you can perform the same operations for the derivative

however, it is worth noting that, with the input x_k fixed, the output of node j depends on all elements of the matrix W, as shown in Figure 3.

Before transforming

we can write out the component u_k of the vector u separately:

From this equation one can easily write out the derivative

since the only term of the sum that remains is the one in which l = i and m = j. Thus,

Putting it all together, we get:
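With W of dimension V×N, so that w_{ki} connects input node k to hidden node i, this combined result takes the standard form

∂E/∂w_{ki} = x_k · Σ_{j=1}^{V} (y_j - t_j)·w’_{ij},

and since x is a one-hot vector, only the row of W corresponding to the input word receives a non-zero gradient.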

Having carried out all these transformations, we have obtained all the relations needed to implement the first “training” pass of the gradient descent algorithm. In order to apply gradient descent, it now remains to choose a learning rate η

In general, the learning rate is a hyperparameter of any machine learning model or neural network. Difficulties may arise in choosing this parameter: with too large a value, the algorithm can jump over the minimum point of the function or even “get stuck”, bouncing from side to side and never reaching the minimum. With smaller values, learning will be very slow, although in that case it is much harder to miss the minimum point. It is recommended to use a coefficient in the range

The weights W and W’ are updated according to the following formula:
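In element form, with the gradients obtained above and learning rate η, this gradient descent step can be sketched as

w’_{ij} ← w’_{ij} - η·(y_j - t_j)·h_i,
w_{ki} ← w_{ki} - η·x_k·Σ_{j=1}^{V}(y_j - t_j)·w’_{ij}.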

A single step is not enough to find the minimum, even for an architecture as simple as CBOW. Therefore, when training neural networks there is another important parameter, namely the number of epochs.
An epoch is one complete training cycle, that is, a single pass through all of the training examples. The choice of the number of epochs is up to the researcher and depends on how long they are prepared to wait for the network to train. If a large number of epochs can be afforded, the learning rate is chosen close to zero.
CBOW training in general
Knowing how the backpropagation learning algorithm works with one word of context, we can add complexity by including more context words, as shown in Figure 4. It is worth noting that the context may include words not only before the target word but also after it.

The input data are the context words, each encoded as a one-hot (unitary) vector. The number of context words C depends on our choice (how many context words we want to use). The hidden layer becomes the average of the values obtained from each context word.
The general CBOW architecture is described by the same equations as the one-word architecture, only generalized to the case of an arbitrary number of context words C:
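In sketch form, the only change is in the hidden layer, which becomes the average of the projections of the C one-hot context vectors x_1, …, x_C:

h = (1/C)·W^T·(x_1 + x_2 + … + x_C)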

The activation function looks exactly the same as when using a single word in context:

The derivatives of the loss function are calculated in the same way as when using a single word in context, except that the input vector

is replaced by the average of the context input vectors:


As you can see, the overall architecture and learning process can easily be derived from a simple model using just one word in context.
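To make the whole procedure concrete, here is a minimal NumPy sketch of CBOW training with several context words; the toy corpus, window size, dimensions, learning rate and number of epochs are all illustrative, and no speed-ups such as negative sampling or hierarchical softmax are included:

import numpy as np

# Toy corpus and vocabulary (illustrative).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
V, N = len(vocab), 10                        # vocabulary size and hidden layer size
word_to_index = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(42)
W = rng.standard_normal((V, N)) * 0.1        # input -> hidden weights, V x N
W_prime = rng.standard_normal((N, V)) * 0.1  # hidden -> output weights, N x V

window = 2        # number of context words taken on each side of the target
eta = 0.05        # learning rate
epochs = 100      # number of complete passes over the training examples

def softmax(u):
    e = np.exp(u - u.max())                  # shift for numerical stability
    return e / e.sum()

for epoch in range(epochs):
    for pos, target in enumerate(corpus):
        # Indices of the context words before and after the target word.
        context = [word_to_index[corpus[i]]
                   for i in range(max(0, pos - window), min(len(corpus), pos + window + 1))
                   if i != pos]
        t = word_to_index[target]

        # Forward pass: the hidden layer is the average of the context rows of W.
        h = W[context].mean(axis=0)           # equivalent to (1/C) * W^T * (x_1 + ... + x_C)
        u = W_prime.T @ h                     # scores of the output layer
        y = softmax(u)                        # predicted probability of every word

        # Backward pass: prediction error and gradients derived above.
        e = y.copy()
        e[t] -= 1.0                           # e_j = y_j - t_j
        grad_W_prime = np.outer(h, e)         # dE/dW', dimension N x V
        grad_h = W_prime @ e                  # dE/dh, dimension N

        # Gradient descent update; only the rows of W for the context words change.
        W_prime -= eta * grad_W_prime
        for c in context:
            W[c] -= eta * grad_h / len(context)

After training, the rows of W (or, alternatively, the columns of W’) can be used as the word vectors.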

The text vectorization algorithm based on the CBOW architecture can still be used in modern systems because, unlike BERT and ELMo, it is much faster when running on a CPU. Such implementations are especially common for clients who do not have sufficient computing power. You can read more about one such case in our next article.