Guide to how convolutional neural networks work

Convolutional neural networks are a special type of neural network used for image and video recognition. For example, they help analyze MRI scans and X-rays to support a correct diagnosis.

Together with expert Maria Zharova, we have prepared a detailed guide to how convolutional neural networks work and what you need to know to get started with them.

Maria Zharova

Data Scientist, Alfa-Bank

What are Convolutional Neural Networks

Convolutional neural networks are a type of neural network for processing data with a grid structure: images and videos.

They analyze pixels that are close to each other and carry continuous visual information such as brightness and hue. For example, if a neural network detects part of a flower in one area of the picture, it is likely to find the flower in the adjacent pixels too.

The idea of convolutional neural networks was discussed as early as the middle of the 20th century, but it attracted wide attention only in 2012. That year, researchers Alex Krizhevsky and Geoffrey Hinton presented the AlexNet neural network at the international ImageNet competition. Compared with competing models, it made roughly 40% fewer errors when recognizing images: the error rate fell from about 26% to about 15%. Accuracy has since become even higher. For example, when recognizing faces in a crowd, it reaches 99.8%.

Structure of Convolutional Neural Networks

Convolutional neural networks have two main types of layers—convolution and pooling.

The convolutional layer, or convolution, is the key one. Here the neural network filters out unnecessary detail and keeps only the information that helps it examine and recognize the image, such as boundaries and lines. Convolutional layers can respond to any kind of feature: shapes, textures and colors. The neural network itself learns which features to pick out at each layer.

Convolutional layers use filters, also called convolution kernels. These are small matrices, usually 3×3, whose values the network learns during training. They are needed to extract features, such as edges, from the image. This happens in a few steps:

  1. The filter, like a scanner, moves sequentially across the image.

  2. Each filter value is multiplied by the image pixel beneath it. The products are then summed.

  3. The resulting number is written into a new matrix.

  4. The process is repeated for each possible filter position in the image.

  5. The result is a finished matrix, or feature map, that records where and how strongly a given feature is present in the image.

How a convolutional layer works. Source: a figure published on the Medium platform.
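The five steps above can be sketched in a few lines of NumPy. This is only an illustration under simple assumptions (no padding, a stride of 1, a single-channel image); the function name `convolve2d`, the toy image and the hand-written vertical-edge filter are made up for the example, and in a real network the filter values would be learned. As in deep learning libraries, the filter is applied without flipping it.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):            # every vertical filter position
        for j in range(out_w):        # every horizontal filter position
            patch = image[i:i + kh, j:j + kw]
            # Multiply filter values by the pixels beneath them, then sum
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Toy 5x5 image: dark on the left (0), bright on the right (9)
image = np.array([[0, 0, 9, 9, 9]] * 5, dtype=float)

# Hypothetical hand-written filter that responds to vertical edges
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = convolve2d(image, kernel)
print(feature_map)
```

The feature map is 3×3: its first two columns contain 27 (the filter overlaps the dark-to-bright edge) and its last column contains 0 (the filter sits on a uniformly bright area).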

After the convolutional layer comes a pooling layer, which reduces the size of the feature maps. It keeps only the most important data and discards the rest. This also reduces the computational load in later calculations.
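As a rough illustration of pooling, here is a sketch of max pooling with a 2×2 window, the variant that keeps the largest value in each window. The function name `max_pool2d` and the sample numbers are invented for the example.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = feature_map.shape
    out_h, out_w = h // size, w // size
    trimmed = feature_map[:out_h * size, :out_w * size]
    # Split into (out_h, size, out_w, size) blocks and take the maximum of each block
    blocks = trimmed.reshape(out_h, size, out_w, size)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [5, 4, 1, 1],
    [0, 2, 7, 6],
    [1, 1, 8, 3],
])

pooled = max_pool2d(feature_map)
print(pooled)
```

The 4×4 map shrinks to 2×2, `[[5, 2], [2, 8]]`, so later layers have four times fewer values to process.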

You can then apply the convolutional layer again and repeat the process several times. This way you can gradually identify more complex features in the image and build a hierarchy. For example, first the neural network will find the contours of a flower, and then recognize its shape and shades of the petals.

At the end come fully connected layers, which use the extracted features for classification or regression.

How Convolutional Neural Networks Work

The work of convolutional neural networks can be compared to human visual perception. First we see, for example, a car, and only then we pay attention to what color and size it is. It’s the same with a neural network: first the general is recognized, and then, layer by layer, the particular.

The work of a convolutional neural network can be divided into two stages:

  1. Preparing the image. A convolutional neural network perceives an image as an array of numbers, so the data must be prepared before it enters the model. Image processing software does this by automatically assigning each pixel a specific value:

    • In grayscale pictures, a number from 0 to 255 depending on brightness.

    • In color images, three numbers from 0 to 255 giving the intensity of red, green and blue, so the whole image forms a three-dimensional array.

  2. Applying filters to an image. During processing, the filter multiplies the value of the selected pixel and the values of its neighbors by the corresponding entries of the filter matrix. The resulting products are then added up, and the sum is written to the output in place of the center pixel's original value.

    For example, if a pixel has a value of 2 and applying the filter to it and its neighbors gives 13, then 13 replaces the original value at that position. The values of the neighboring pixels do not change at this step. The same process is repeated for each pixel in the image.
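Both stages can be illustrated with small NumPy arrays. The pixel values and the all-ones filter below are chosen purely so that the numbers match the example in the text (a center pixel of 2 producing 13); scaling pixel values to the range 0 to 1 is a common convention and an assumption here, not something the article prescribes.

```python
import numpy as np

# Grayscale image: one brightness value (0-255) per pixel, so a 2D array
gray = np.array([
    [1, 2, 1],
    [1, 2, 2],
    [1, 2, 1],
], dtype=np.uint8)

# Color image: three values (R, G, B) per pixel, so a 3D array (height, width, 3)
rgb = np.zeros((3, 3, 3), dtype=np.uint8)
rgb[..., 0] = 255  # full red intensity everywhere

# Assumption: scale to [0, 1] before feeding a model, a common convention
rgb_scaled = rgb.astype(np.float32) / 255.0

# Worked example: the center pixel is 2; an all-ones 3x3 filter sums it with its neighbors
kernel = np.ones((3, 3))
new_value = int(np.sum(gray * kernel))

print(gray.shape, rgb.shape, new_value)
```

The script prints `(3, 3) (3, 3, 3) 13`: the grayscale image is two-dimensional, the color image is three-dimensional, and the filtered value for the center position is 13.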

What does the architecture of a convolutional neural network look like?

Let's look at an example of a convolutional neural network architecture in Python using the TensorFlow library:

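# Import the Keras building blocks that ship with TensorFlow
from tensorflow.keras import layers, models

# Create a model object
model = models.Sequential()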

# First convolutional layer with 32 filters, 3×3 kernel size, ReLU activation function and 32x32x3 input shape
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))

# First subsampling layer (max pooling) with window size 2×2
model.add(layers.MaxPooling2D((2, 2)))

# Second convolutional layer with 64 filters and 3×3 kernel size, ReLU activation function
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

# Second subsampling layer (max pooling) with window size 2×2
model.add(layers.MaxPooling2D((2, 2)))

# Third convolutional layer with 64 filters and 3×3 kernel size, ReLU activation function
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

# Flatten the 3D feature maps into a 1D vector for use in fully connected layers
model.add(layers.Flatten())

# Fully connected layer with 64 neurons and ReLU activation function
model.add(layers.Dense(64, activation='relu'))

# Output fully connected layer with 10 neurons (according to the number of classes) and a softmax activation function for classification
model.add(layers.Dense(10, activation='softmax'))

This architecture is standard. It usually consists of several blocks of a convolutional layer followed by max pooling. At the end, the data is flattened into a single vector and passed to the final fully connected layers.
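To see how the data shrinks on its way through this model, the side length of the feature maps can be tracked by hand. The sketch below assumes the Keras defaults used in the listing (no padding and a stride of 1 for `Conv2D`, non-overlapping 2×2 windows for `MaxPooling2D`); the helper names `conv_out` and `pool_out` are made up for the illustration.

```python
def conv_out(size, kernel=3):
    """Side length after a convolution with no padding and stride 1."""
    return size - kernel + 1

def pool_out(size, window=2):
    """Side length after non-overlapping pooling (incomplete windows are dropped)."""
    return size // window

size = 32                 # input is 32x32x3
size = conv_out(size)     # first Conv2D, 32 filters   -> 30x30x32
size = pool_out(size)     # first MaxPooling2D         -> 15x15x32
size = conv_out(size)     # second Conv2D, 64 filters  -> 13x13x64
size = pool_out(size)     # second MaxPooling2D        -> 6x6x64
size = conv_out(size)     # third Conv2D, 64 filters   -> 4x4x64

flat_length = size * size * 64   # Flatten: length of the vector fed to the Dense layers
print(size, flat_length)
```

Under these assumptions the last convolution outputs 4×4 maps with 64 channels, so `Flatten` produces a vector of 1024 values.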

How convolutional neural networks are trained

Training convolutional models is not conceptually different from training conventional neural networks. Let's look at the general scheme:

  1. Forward pass, or forward propagation

    At this stage, the input data, represented as vectors, passes through the layers of the network. Each layer applies weights (parameters that transform the input data) and activation functions (mathematical functions that determine the layer's output). This transforms the data into more abstract representations that are closer to solving the problem. The result is the network output, that is, the final predictions of the model.

  2. Calculating the loss function

    The loss function measures the difference between the prediction and the true value, the target. Different problems use different loss functions: mean squared error measures the average squared difference between predicted and true values in regression, and cross-entropy evaluates the difference between predicted and true probability distributions in classification.

  3. Backward pass, or backpropagation

    This step adjusts the network weights. Backpropagation computes the gradients of the loss function with respect to the weights, that is, the directions and magnitudes of change that reduce the error. An optimization method such as stochastic gradient descent, which searches for the minimum of the loss function, then uses these gradients to update the weights.
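The three steps can be shown on a deliberately tiny model. The sketch below trains a single-layer softmax classifier in NumPy rather than a convolutional network, and for simplicity it uses plain full-batch gradient descent instead of the stochastic variant; the toy data, the learning rate and the number of steps are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 2D; class 1 if x0 + x1 > 0, otherwise class 0
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
Y = np.eye(2)[y]                      # one-hot targets

W = np.zeros((2, 2))                  # weights
b = np.zeros(2)                       # biases
lr = 0.5                              # learning rate (arbitrary choice)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

losses = []
for step in range(100):
    # 1. Forward pass: weights and an activation function give predicted probabilities
    probs = softmax(X @ W + b)

    # 2. Loss: cross-entropy between the predictions and the true labels
    loss = -np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1))
    losses.append(loss)

    # 3. Backward pass: gradients of the loss with respect to W and b
    grad_logits = (probs - Y) / len(X)
    grad_W = X.T @ grad_logits
    grad_b = grad_logits.sum(axis=0)

    # Gradient descent update: move the weights against the gradient
    W -= lr * grad_W
    b -= lr * grad_b

accuracy = np.mean(softmax(X @ W + b).argmax(axis=1) == y)
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, accuracy {accuracy:.2f}")
```

The loss starts at ln 2 (about 0.693, because the untrained model assigns a probability of 0.5 to each class) and decreases as the weights are updated. In a library such as TensorFlow the same loop is hidden behind the model's training call, and the gradients are computed automatically.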

Convolutional neural networks are developed by artificial intelligence researchers working in academic or corporate labs.

Computer vision engineers train models to solve specific problems. They often use pre-trained models, which significantly reduces time and resources. Pre-trained models fine-tuned on task-specific data often perform better and give more accurate results thanks to transfer learning.

Why and where are convolutional neural networks used?

Convolutional neural networks are commonly used in computer vision because grid-structured data is mostly images. Typical tasks include image classification, object detection and segmentation.

Areas where convolutional networks are used most often:

  1. Medicine. Convolutional neural networks analyze images to detect pathologies and make an accurate diagnosis. For example, they study X-rays and MRI and CT scans.

  2. Instrumentation. In autonomous vehicles, convolutional neural networks process visual data, usually frames of a video stream from an on-board camera. In this way, the models recognize road signs, lanes and other objects to make driving safer.

  3. Facial recognition systems. Convolutional neural networks are used for authentication in banks, video surveillance systems and other similar areas.

  4. Document recognition. Institutions actively use convolutional neural networks to read scans of passports, SNILS (Russian insurance account numbers) and other documents.

How a neural network analyzes the results of medical examinations.

There are areas where convolutional neural networks are of little use. For example, they are worse at processing sequential data or temporal dependencies, as in texts. In texts, it is important to consider the relationships between words and their context, so recurrent neural networks are more suitable for text processing.

Maria Zharova, Data Scientist at Alfa-Bank

How convolutional neural networks will develop in the future

Convolutional neural networks have several likely directions of development:

  1. Integration with other types of neural networks. For example, convolutional neural networks are already being actively combined with transformers, models used primarily for text processing that handle translation, text generation and question answering. The combination helps to process multimodal data, that is, information from different sources or of different types, such as audio, video, images and text.

  2. Optimization of calculations. More advanced and efficient architectures and algorithms will be developed. They will help speed up the training of neural networks and improve their accuracy.

  3. Expansion of areas of use. Convolutional neural networks perform well not only with images; they are also effective for other types of unstructured data. Therefore, the areas of application of CNNs will become more numerous.

How to learn to work with convolutional neural networks

Convolutional neural networks are a promising direction. Therefore, the demand for specialists who know how to handle them will only grow. If you want to develop in this area, you need knowledge in the following areas:

  1. Machine learning. Understand the process of training models. Know the architecture of neural networks, their parameters and problem areas, optimization, etc.

  2. Programming. Be able to write code in Python and work with libraries for neural networks and image processing: TensorFlow, PyTorch, OpenCV.

  3. Linear algebra and mathematics. Understand matrices, vectors, and convolution calculations.

For basic work with convolutional neural networks, it is enough to know the basics of machine learning and be able to write code in Python. But if you want to develop in your profession and perform more complex tasks, you cannot do without deep knowledge of mathematics.

Maria Zharova, Data Scientist at Alfa-Bank

Which convolutional neural networks should a beginner use?

  1. AlexNet. One of the first deep convolutional neural networks and the progenitor of modern convolutional architectures. It won the ImageNet 2012 competition. Datasets suitable for training it are available, which helps beginners practice.

  2. ResNet. A convolutional neural network based on residual connections, also called skip connections. They make it possible to train very deep networks.

  3. VGGNet. A model with small convolutional filters. Ranked in the top 5 for accuracy when tested on ImageNet.

  4. EfficientNet. A heavy model with many parameters. Usually shows high metrics, especially if you take pre-trained weights.


Skillfactory and the National Research Nuclear University MEPhI have created a master's program for those who want to study neural networks in depth. Students will be immersed in the process of creating intelligent systems: from design and training to implementation in real business processes. They will master the basics of mathematics and programming in Python, and will also be able to work on real ML cases in IT companies that are partners of the program.
