Introducing the backpropagation method

Hello! The New Year holidays are over, which means we are ready to share useful material with you again. This translation was prepared ahead of the launch of a new cohort of the “Algorithms for developers” course.


Backpropagation of error is probably the most fundamental building block of a neural network. It was first described in the 1960s and popularized almost 30 years later by Rumelhart, Hinton and Williams in a paper entitled “Learning representations by back-propagating errors”.

The method is used to train a neural network efficiently using the so-called chain rule (the rule for differentiating a composite function). Simply put, after each forward pass through the network, backpropagation performs a backward pass and adjusts the model's parameters (weights and biases).

In this article, I would like to walk through, in mathematical detail, the process of training and optimizing a simple 4-layer neural network. I believe this will help the reader understand how backpropagation works and appreciate its significance.

Defining the neural network model

The four-layer neural network consists of four neurons in the input layer, four neurons in the hidden layers (two in each of the two hidden layers) and one neuron in the output layer.

A simple image of a four-layer neural network.

Input layer

In the figure, the purple neurons represent the input. They can be simple scalar quantities or more complex ones – vectors or multidimensional matrices.

Equation describing the inputs x_i.

The first set of activations (a^1) is equal to the input values, that is, a^1 = x. An “activation” is the value of a neuron after the activation function has been applied. See below for more details.

Hidden layers

The final values of the hidden neurons (shown in green) are computed from z^l, the weighted inputs of layer l, and a^l, the activations of layer l. For layers 2 and 3, the equations are as follows:

For l = 2: z^2 = W^1 x + b^1, a^2 = f(z^2)

For l = 3: z^3 = W^2 a^2 + b^2, a^3 = f(z^3)

W^1 and W^2 are the weight matrices feeding layers 2 and 3, and b^1 and b^2 are the corresponding biases.

The activations a^2 and a^3 are computed using the activation function f. Typically, f is non-linear (for example sigmoid, ReLU or hyperbolic tangent), which allows the network to learn complex patterns in the data. We won’t dwell on how activation functions work, but if you’re interested, I highly recommend reading this wonderful article.
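As a rough illustration, the three activation functions mentioned above could be sketched in plain Python as follows. The helper name apply_elementwise and the sample values are made up for this example and are not part of the original network.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: keeps positive values, replaces the rest with 0."""
    return max(0.0, z)

def tanh(z):
    """Hyperbolic tangent: maps any real number into the interval (-1, 1)."""
    return math.tanh(z)

def apply_elementwise(f, z):
    """Apply a scalar activation f to every entry of the vector z (a list)."""
    return [f(v) for v in z]

# Hypothetical weighted inputs of one layer, chosen only for illustration.
z_example = [-1.0, 0.0, 2.0]
a_example = apply_elementwise(sigmoid, z_example)
print(a_example)
```

Any of the three functions can be passed as f; which one works best depends on the task and is outside the scope of this sketch.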

If you look closely, you will see that x, z^2, a^2, z^3, a^3, W^1, W^2, b^1 and b^2 are written without the subscripts shown in the figure of the four-layer neural network. That is because we have combined all the parameter values into matrices grouped by layer. This is the standard way of working with neural networks, and it is quite convenient. Still, I will walk through the equations so that there is no confusion.

Let’s take layer 2 and its parameters as an example. The same operations can be applied to any layer of the neural network.

W^1 is a weight matrix of dimension (n, m), where n is the number of output neurons (neurons in the next layer) and m is the number of input neurons (neurons in the previous layer). In our case n = 2 and m = 4.

Here, the first number in the subscript of any weight is the index of the neuron in the next layer (in our case, the first hidden layer), and the second number is the index of the neuron in the previous layer (in our case, the input layer).

x is an input vector of dimension (m, 1), where m is the number of input neurons. In our case m = 4.

b^1 is the bias vector of dimension (n, 1), where n is the number of neurons in the current layer. In our case n = 2.

Starting from the equation for z^2, we can substitute the definitions of W^1, x and b^1 given above to obtain z^2:
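Written out in full for the dimensions given above (n = 2, m = 4), the substitution would look roughly like this. The element names follow the subscript convention described earlier (first index: neuron in the next layer, second index: neuron in the previous layer); this is a sketch built from those definitions rather than a copy of the original figure.

```latex
z^{2} = W^{1} x + b^{1}
      = \begin{bmatrix}
          W^{1}_{11} & W^{1}_{12} & W^{1}_{13} & W^{1}_{14} \\
          W^{1}_{21} & W^{1}_{22} & W^{1}_{23} & W^{1}_{24}
        \end{bmatrix}
        \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \end{bmatrix}
      + \begin{bmatrix} b^{1}_{1} \\ b^{1}_{2} \end{bmatrix}
      = \begin{bmatrix}
          W^{1}_{11} x_{1} + W^{1}_{12} x_{2} + W^{1}_{13} x_{3} + W^{1}_{14} x_{4} + b^{1}_{1} \\
          W^{1}_{21} x_{1} + W^{1}_{22} x_{2} + W^{1}_{23} x_{3} + W^{1}_{24} x_{4} + b^{1}_{2}
        \end{bmatrix}
```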

Now look carefully at the illustration of the neural network above:

As you can see, z^2 can be expressed through z_1^2 and z_2^2, where z_1^2 and z_2^2 are the sums of the products of each input value x_i with the corresponding weight W_ij^1, plus the corresponding bias.

This leads to the same equation for z^2 and shows that the matrix representations of z^2, a^2, z^3 and a^3 are correct.
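The equivalence of the two forms can also be checked numerically. The sketch below uses made-up values for W^1, x and b^1 (they are assumptions for illustration only) and compares the matrix form with the per-neuron sums.

```python
import math

# Hypothetical parameters: 4 inputs (m = 4), 2 neurons in the first hidden layer (n = 2).
W1 = [[0.1, 0.2, 0.3, 0.4],
      [0.5, 0.6, 0.7, 0.8]]
x = [1.0, 2.0, 3.0, 4.0]
b1 = [0.5, -0.5]

def affine(W, v, b):
    """Matrix form: returns W v + b, with W a list of rows and v, b plain lists."""
    return [sum(w * v_i for w, v_i in zip(row, v)) + b_j
            for row, b_j in zip(W, b)]

# Matrix form of z^2.
z2_matrix = affine(W1, x, b1)

# Per-neuron form: each z_j^2 written out as an explicit sum over the inputs.
z2_1 = W1[0][0] * x[0] + W1[0][1] * x[1] + W1[0][2] * x[2] + W1[0][3] * x[3] + b1[0]
z2_2 = W1[1][0] * x[0] + W1[1][1] * x[1] + W1[1][2] * x[2] + W1[1][3] * x[3] + b1[1]

# Both forms should agree up to floating-point rounding.
same = (math.isclose(z2_matrix[0], z2_1) and math.isclose(z2_matrix[1], z2_2))
print(same)
```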

Output layer

The last part of the neural network is the output layer, which produces the predicted value. In our simple example it is a single neuron, colored blue, and is computed as follows:

Again, we use the matrix representation to simplify the equation. You can use the above methods to understand the underlying logic.

Forward propagation and evaluation

The equations above make up the forward pass through the neural network. Here is a quick overview:

(1) – input layer
(2) – neuron value in the first hidden layer
(3) – activation value in the first hidden layer
(4) – neuron value in the second hidden layer
(5) – activation value in the second hidden layer
(6) – output layer
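The six steps above can be sketched as a small forward-pass function. This is a minimal sketch under several assumptions that are not stated in the article: sigmoid is used as the activation f, the output neuron is a plain weighted sum of the last activations with no bias, and all parameter values are made up.

```python
import math

def sigmoid(z):
    """Activation function f (assumed to be sigmoid for this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, a, b):
    """Returns W a + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, a)) + b_j
            for row, b_j in zip(W, b)]

def forward(x, W1, b1, W2, b2, W3):
    a1 = x                                   # (1) input layer
    z2 = affine(W1, a1, b1)                  # (2) neuron values, first hidden layer
    a2 = [sigmoid(v) for v in z2]            # (3) activations, first hidden layer
    z3 = affine(W2, a2, b2)                  # (4) neuron values, second hidden layer
    a3 = [sigmoid(v) for v in z3]            # (5) activations, second hidden layer
    s = sum(w * a for w, a in zip(W3, a3))   # (6) output (assumed linear, no bias)
    return s

# Hypothetical parameters for a 4-2-2-1 network.
x = [1.0, 2.0, 3.0, 4.0]
W1 = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
b1 = [0.0, 0.0]
W2 = [[0.1, -0.2], [0.4, 0.3]]
b2 = [0.0, 0.1]
W3 = [0.7, -0.5]

s = forward(x, W1, b1, W2, b2, W3)
print(s)
```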

The final step of the forward pass is to evaluate the predicted output s against the expected output y.

The output y is part of the training data set (x, y), where x is the input (as we recall from the previous section).

The comparison between s and y is made by a loss function. It can be as simple as mean squared error (MSE) or more complex, such as cross-entropy.

We will call this loss function C and write it as C = cost(s, y), where cost can be mean squared error, cross-entropy, or any other loss function.
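For a single prediction, the two loss functions mentioned could look roughly like this. The factor 1/2 in the squared error and the binary form of cross-entropy are common conventions assumed here, not something the article specifies.

```python
import math

def squared_error(s, y):
    """Squared error for one sample; the 1/2 factor is a convention that simplifies the derivative."""
    return 0.5 * (s - y) ** 2

def binary_cross_entropy(s, y):
    """Cross-entropy for a binary target y in {0, 1}; requires 0 < s < 1."""
    return -(y * math.log(s) + (1 - y) * math.log(1 - s))

# Hypothetical prediction and target.
s, y = 0.8, 1.0
print(squared_error(s, y), binary_cross_entropy(s, y))
```

In both cases the loss shrinks as the prediction s approaches the target y, which is the property the optimization relies on.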

Based on the value of C, the model “knows” how much it needs to adjust its parameters in order to get closer to the expected output value y. This happens using the backpropagation method.

Backpropagation and computing gradients

According to the 1986 paper, backpropagation:

repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector.
… the ability to create useful new features distinguishes back-propagation from earlier, simpler methods …

In other words, backpropagation aims to minimize the loss function by adjusting the weights and biases of the network. The size of each adjustment is determined by the gradients of the loss function with respect to these parameters.

One question arises: Why calculate gradients?

To answer this question, we first need to revisit some concepts from calculus:

The gradient of a function C(x_1, x_2, …, x_m) at a point x is the vector of the partial derivatives of C with respect to x: ∇C = (∂C/∂x_1, ∂C/∂x_2, …, ∂C/∂x_m).

The derivative of a function C measures how sensitive the function's value (the output) is to a change in its argument x (the input). In other words, the derivative tells us in which direction C is moving.

The gradient shows how much the parameter x needs to change (in the positive or negative direction) to minimize C.
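To make the idea of a gradient more concrete, here is a small sketch that approximates it numerically with central differences. The function numerical_gradient and the toy function C are made up for this example; real training computes gradients analytically rather than this way.

```python
def numerical_gradient(C, x, h=1e-6):
    """Approximate the vector of partial derivatives of C at the point x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h    # nudge only the i-th argument up
        x_minus[i] -= h   # and down
        grad.append((C(x_plus) - C(x_minus)) / (2 * h))
    return grad

def C(x):
    """Toy function: C(x1, x2) = x1^2 + 3 * x2, with analytic gradient (2 * x1, 3)."""
    return x[0] ** 2 + 3 * x[1]

g = numerical_gradient(C, [2.0, 1.0])
print(g)
```

Each entry of the result describes how strongly C reacts to a small change in one argument, and its sign indicates the direction in which C grows.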

These gradients are computed using a technique called the chain rule.

For a single weight w_jk^l the gradient is:

(1) Chain rule
(2) By definition, m is the number of neurons in layer l − 1
(3) Derivative calculation
(4) Final value

A similar set of equations can be applied to b_j^l:

(1) Chain rule
(2) Derivative calculation
(3) Final value
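The two derivations can be sketched as follows. This assumes the indexing implied by step (2) of the weight derivation, in which w^l_{jk} connects neuron k of layer l − 1 to neuron j of layer l, so that w^l here plays the role that W^{l-1} plays in the forward-pass equations.

```latex
% Gradient with respect to a single weight
\frac{\partial C}{\partial w^{l}_{jk}}
  = \frac{\partial C}{\partial z^{l}_{j}} \, \frac{\partial z^{l}_{j}}{\partial w^{l}_{jk}}
  \qquad \text{(1) chain rule}

z^{l}_{j} = \sum_{k=1}^{m} w^{l}_{jk} \, a^{l-1}_{k} + b^{l}_{j}
  \qquad \text{(2) by definition}

\frac{\partial z^{l}_{j}}{\partial w^{l}_{jk}} = a^{l-1}_{k}
  \qquad \text{(3) derivative calculation}

\frac{\partial C}{\partial w^{l}_{jk}} = \frac{\partial C}{\partial z^{l}_{j}} \, a^{l-1}_{k}
  \qquad \text{(4) final value}

% Gradient with respect to a single bias
\frac{\partial C}{\partial b^{l}_{j}}
  = \frac{\partial C}{\partial z^{l}_{j}} \, \frac{\partial z^{l}_{j}}{\partial b^{l}_{j}}
  \qquad \text{(1) chain rule}

\frac{\partial z^{l}_{j}}{\partial b^{l}_{j}} = 1
  \qquad \text{(2) derivative calculation}

\frac{\partial C}{\partial b^{l}_{j}} = \frac{\partial C}{\partial z^{l}_{j}}
  \qquad \text{(3) final value}
```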

The common part of both equations is often called the “local gradient” and is expressed as follows: δ_j^l = ∂C/∂z_j^l

The local gradient can be determined easily using the chain rule. I will not go through that derivation here.

The gradients make it possible to optimize the model's parameters:

Until the stopping criterion is met, the following updates are repeated: w := w − ε · ∂C/∂w and b := b − ε · ∂C/∂b

Algorithm for optimizing weights and biases (also called gradient descent)

  • The initial values of w and b are chosen randomly.
  • Epsilon (ε) is the learning rate. It determines how strongly the gradient influences each update.
  • w and b are the matrix representations of the weights and biases.
  • The derivative of C with respect to w or b can be computed from the partial derivatives of C with respect to the individual weights or biases.
  • The termination condition is satisfied as soon as the loss function is minimized.
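The procedure described in the list above can be sketched in a few lines. This is a minimal sketch with a single scalar weight and bias and a toy loss whose minimum is known; the function names, the fixed starting values (instead of random ones, to keep the run reproducible) and the stopping rule based on a small gradient are all assumptions made for illustration.

```python
def gradient_descent(grad_C, w, b, epsilon=0.1, max_steps=1000, tol=1e-8):
    """Repeat w := w - epsilon * dC/dw and b := b - epsilon * dC/db until the gradient is tiny."""
    for _ in range(max_steps):
        dw, db = grad_C(w, b)
        # Assumed stopping criterion: both partial derivatives are close to zero.
        if abs(dw) < tol and abs(db) < tol:
            break
        w -= epsilon * dw
        b -= epsilon * db
    return w, b

def grad_C(w, b):
    """Gradient of the toy loss C(w, b) = (w - 3)^2 + (b + 1)^2, minimized at w = 3, b = -1."""
    return 2 * (w - 3), 2 * (b + 1)

w_opt, b_opt = gradient_descent(grad_C, w=0.0, b=0.0)
print(w_opt, b_opt)
```

A larger epsilon takes bigger steps and may overshoot the minimum, while a smaller one converges more slowly; in a real network w and b would be matrices and grad_C would come from backpropagation.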

I want to devote the final part of this section to a simple example in which we compute the gradient of C with respect to a single weight, w_22^2.

Let’s zoom in on the bottom part of the neural network shown above:

Visual representation of backpropagation in a neural network

The weight w_22^2 connects a_2^2 and z_2^3, so computing the gradient requires applying the chain rule through z_2^3 and a_2^3:

Computing the final value of the derivative of C with respect to a_2^3 requires knowing the function C. Since C depends on a_2^3, calculating this derivative should be straightforward.
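The chain-rule product for this weight, ∂C/∂w_22^2 = ∂C/∂a_2^3 · ∂a_2^3/∂z_2^3 · ∂z_2^3/∂w_22^2, can be checked numerically. The sketch below assumes a sigmoid activation, a linear output s = W^3 a^3 without bias, a squared-error loss and made-up parameter values; none of these choices are fixed by the article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(W2, a2, b2, W3, y):
    """Forward pass from a^2 to the loss C; also returns a^3 and s for reuse."""
    z3 = [sum(w * a for w, a in zip(row, a2)) + b for row, b in zip(W2, b2)]
    a3 = [sigmoid(v) for v in z3]
    s = sum(w * a for w, a in zip(W3, a3))
    return 0.5 * (s - y) ** 2, a3, s

# Hypothetical values for the part of the network around w_22^2.
a2 = [0.6, 0.9]
W2 = [[0.1, -0.2], [0.4, 0.3]]   # W2[1][1] is w_22^2
b2 = [0.0, 0.1]
W3 = [0.7, -0.5]
y = 1.0

C, a3, s = loss(W2, a2, b2, W3, y)

# Chain rule, factor by factor.
dC_da = (s - y) * W3[1]          # dC/da_2^3 for the assumed squared-error loss
da_dz = a3[1] * (1.0 - a3[1])    # derivative of sigmoid at z_2^3
dz_dw = a2[1]                    # dz_2^3/dw_22^2 = a_2^2
grad_analytic = dC_da * da_dz * dz_dw

# Finite-difference check: nudge only w_22^2 and watch how C changes.
h = 1e-6
W2_plus = [row[:] for row in W2]
W2_minus = [row[:] for row in W2]
W2_plus[1][1] += h
W2_minus[1][1] -= h
grad_numeric = (loss(W2_plus, a2, b2, W3, y)[0] - loss(W2_minus, a2, b2, W3, y)[0]) / (2 * h)

print(grad_analytic, grad_numeric)
```

The two printed values should agree closely, which is a common sanity check for a hand-derived gradient.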

I hope this example has managed to shed some light on the math behind calculating gradients. If you want to know more, I highly recommend checking out the Stanford NLP series, where Richard Socher gives 4 great explanations of backpropagation.

Concluding remark

In this article, I explained in detail how backpropagation works under the hood, using mathematical tools such as gradient computation and the chain rule. Understanding the mechanics of this algorithm will strengthen your knowledge of neural networks and help you feel comfortable working with more complex models. Good luck on your deep learning journey!

That’s all. We invite everyone to a free webinar on the topic “Segment tree: simple and fast”.
