# Dispelling Myths About Deep Learning – How Do Neural Networks Learn?

Ahead of the start of the “Deep Learning. Basic” course, we have prepared a translation of an interesting article for you.

Deep learning has contributed immensely to the progress and surge in artificial intelligence we see around the world today. Tasks that artificial intelligence now performs, such as text and image classification, instance segmentation, question answering over text, reading comprehension, and more, were science fiction in the past, but are now becoming more useful and increasingly human-like thanks to deep neural networks.

How do neural networks cope with these complex tasks? What happens beneath the many layers of mathematical operations that fill these networks?

Simple neural network

Let’s dig deeper and conceptually understand the basics of deep neural networks.

First, let’s talk about the algorithm used by most (if not all) neural networks to learn from training data: the error backpropagation algorithm. Training data is nothing more than human-annotated data, such as labeled images in image classification or labeled sentiments in sentiment analysis.

Below is a brief overview of the structure of neural networks:

Neural networks transform input data into output in a specific way. The input can be images, text fragments, and so on. It is first converted to a numerical representation: in an image, for example, each pixel is encoded as a numerical value (such as its intensity or color), and in text, each word is represented either as a word embedding (a vector of numbers in which each number scores some characteristic of the word) or as a one-hot vector (a vector of dimension n consisting of n-1 zeros and a single one, where the position of the one indicates the selected word).
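To make the one-hot representation concrete, here is a minimal Python sketch. The vocabulary and the `one_hot` helper are hypothetical and chosen only for illustration; real systems typically use far larger vocabularies and library utilities.

```python
def one_hot(word, vocabulary):
    """Return a vector of len(vocabulary) with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


# Hypothetical toy vocabulary, used only for illustration.
vocabulary = ["cat", "dog", "bird", "fish"]

encoded = one_hot("bird", vocabulary)
print(encoded)  # [0, 0, 1, 0]
```

With a vocabulary of n = 4 words, the vector has n-1 = 3 zeros and a single one, and the position of the one identifies the word.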

This numeric input is then passed through the neural network (a technique known as forward propagation), which under the hood consists of several steps of multiplying by the network’s weights, adding biases, and passing the result through a nonlinear activation function. This forward propagation step is performed for each input in the labeled training data, and the network’s performance is measured using a function known as the loss function or cost function. The goal of the network is to minimize the loss function, which in practice means maximizing its accuracy. Initially, the network starts with random values for its parameters (weights and biases), and then gradually increases its accuracy and reduces its loss by improving these parameters at each iteration of forward propagation over the training data. The update to each weight and bias (its magnitude and its positive or negative direction) is determined by the backpropagation algorithm. Let’s look at the backpropagation algorithm and understand how it helps neural networks “learn” and minimize the loss on the training data.
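The forward pass described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sigmoid activation, the mean squared error loss, the layer sizes, and the helper names (`dense`, `forward`, `random_layer`) are all choices made here, not something the article prescribes.

```python
import math
import random


def sigmoid(z):
    """Nonlinear activation (an assumed choice; others work as well)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One layer: for each neuron, weighted sum of inputs plus bias, then activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(inputs, params):
    """Pass the input through every layer in turn."""
    activation = inputs
    for weights, biases in params:
        activation = dense(activation, weights, biases)
    return activation


def mse(prediction, target):
    """Mean squared error, used here as the loss function."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def random_layer(n_in, n_out):
    """Random initial weights and biases, as at the start of training."""
    weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [random.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases


random.seed(0)
params = [random_layer(3, 4), random_layer(4, 1)]  # 3 inputs -> 4 hidden -> 1 output

x = [0.5, -1.2, 3.0]  # a made-up numeric input
y = [1.0]             # its made-up label

prediction = forward(x, params)
loss = mse(prediction, y)
print(prediction, loss)
```

Because the parameters are random, the first prediction is essentially a guess. Training consists of repeating this forward pass and adjusting `params` so that `loss` goes down.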

Forward propagation in a deep neural network

Backpropagation is about figuring out how each parameter should change to better fit the training data (that is, to minimize the loss and maximize prediction accuracy). The method for determining these changes is quite simple:

In the picture above, the Y axis is the loss function, and the X axis is some parameter (weight) in the network. The initial value of the weight must be reduced in order to reach the local minimum. But how does the network know that the weight needs to be reduced? It relies on the slope of the function at the starting point.

How do you get the slope? If you’ve studied mathematics, you know that the slope of a function at a point is given by its derivative. Voila! Now we can calculate the slope, and therefore the direction in which the weight should change: a positive slope means the weight should decrease, and a negative slope means it should increase. The weight value is updated iteratively until we arrive at the minimum.
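As a rough illustration of this iterative update, the sketch below minimizes a made-up one-parameter loss, L(w) = (w - 3)², whose minimum is at w = 3. The loss function, the starting weight, and the `learning_rate` step size are assumptions introduced for the example; the article itself does not specify them.

```python
def loss(w):
    """A made-up loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def slope(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=100):
    """Repeatedly move w a small step against the slope."""
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * slope(w)
        history.append(w)
    return w, history


# Start to the right of the minimum, where the slope is positive,
# so the weight has to be reduced.
final_w, history = gradient_descent(8.0)
print(history[:3], final_w)
```

Starting at w = 8 the slope is positive, so each update reduces the weight. The steps shrink as the slope flattens near the minimum.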

The difficulty arises when the weights are not directly related to the loss function, as is the case with deep neural networks. This is where the familiar chain rule comes in.

For example, in the picture above, the result Y does not depend directly on the input value X. Instead, X passes through F and then through G before producing the output value Y. The chain rule lets us write the derivative of G with respect to X by expressing the dependence of G on F, where F depends on X: dG/dX = (dG/dF) × (dF/dX). This rule applies to networks of any depth: the derivative, and therefore the slope, of any output value with respect to the input is obtained as the product of the derivatives of all the steps the input passes through. This is the essence of error backpropagation: the derivative (slope) of the output with respect to each parameter is obtained by multiplying derivatives during a backward pass through the network until the direct derivative with respect to that parameter is reached. This is why the method is called backpropagation.
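The chain rule can be checked numerically on a tiny example. In the sketch below, the concrete functions F(x) = 3x + 1 and G(f) = f² are assumptions picked for illustration. The derivative obtained by multiplying the per-step derivatives is compared against a finite-difference estimate of the slope.

```python
def F(x):
    return 3.0 * x + 1.0


def dF_dx(x):
    return 3.0


def G(f):
    return f ** 2


def dG_df(f):
    return 2.0 * f


def output(x):
    """Y = G(F(x)): X passes through F, then through G."""
    return G(F(x))


def chain_rule_derivative(x):
    """dY/dX as the product of the derivatives of each step."""
    return dG_df(F(x)) * dF_dx(x)


def numeric_derivative(func, x, h=1e-5):
    """Central finite difference, used only as an independent check."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


x = 2.0
analytic = chain_rule_derivative(x)        # 2 * 7 * 3
numeric = numeric_derivative(output, x)
print(analytic, numeric)
```

The two values agree up to floating-point error. Backpropagation applies the same product of local derivatives, layer by layer, to obtain the slope of the loss with respect to every weight and bias.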