How a convolutional neural network works

To mark the start of our course on machine and deep learning, we are sharing a translation of an article that explains visually how CNNs work: networks modelled on the principles of the visual cortex of the human brain. Unobtrusively, as if between the lines, the author prompts the reader to think about why CNNs are so effective and uses simple examples to explain the transformations that take place inside these networks.

Let’s start from the beginning

A CNN sequence for classifying handwritten digits

A convolutional neural network (ConvNet / CNN) is a deep learning algorithm that takes an input image, assigns importance (learnable weights and biases) to various aspects or objects in the image, and learns to tell them apart. A ConvNet requires much less preprocessing than other classification algorithms. In primitive methods the filters are designed by hand, whereas a sufficiently trained CNN learns these filters / characteristics itself.

The architecture of a CNN is analogous to the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex. Individual neurons respond to stimuli only in a restricted region of the visual field, known as the receptive field. Many such receptive fields overlap to cover the entire visual field.

Why convolutional layers rather than a feedforward network

3 × 3 matrix as a 9 × 1 vector

An image is nothing more than a matrix of pixel values, right? So why not flatten it (for example, make a 3 × 3 matrix a 9 × 1 vector) and feed this vector to a multilayer perceptron to do the classification? Hmm … it’s not that simple.
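
As a rough illustration of what flattening does, here is a minimal sketch in plain Python. The pixel values 1 to 9 are made up for the example. It shows that two pixels which are vertical neighbours in the image end up several positions apart in the vector, which is one reason the spatial structure is harder for a perceptron to exploit.

```python
# A hypothetical 3 x 3 image; the values 1..9 are only labels for the pixels.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

# Flatten row by row into a 9-element vector.
flat = [pixel for row in image for pixel in row]

# Pixels 2 and 5 touch each other in the image (same column, adjacent rows),
# but in the flattened vector they are three positions apart.
distance = flat.index(5) - flat.index(2)

print(flat)      # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(distance)  # 3
```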

For extremely simple binary images this method may achieve average accuracy when predicting classes, but in practice it will be inaccurate on complex images in which pixel dependencies run throughout.

A CNN is able to capture the spatial and temporal dependencies in an image by applying relevant filters. Because the architecture involves fewer parameters and reuses weights, it fits image datasets better. In other words, the network can be trained to understand the sophistication of the image better.
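
To put rough numbers on the parameter savings, the sketch below compares a fully connected layer with a convolutional layer. The 28 × 28 grayscale input, the 100 hidden units and the 32 filters are assumptions chosen for illustration; they do not come from the article.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus one bias per output unit of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus one bias per filter. Note: independent of the image size,
    because the same kernel weights are reused at every position."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


# Hypothetical input: a 28 x 28 image with a single channel.
height, width, channels = 28, 28, 1

dense = dense_params(height * width * channels, 100)  # 100 hidden units
conv = conv_params(3, 3, channels, 32)                # 32 filters of 3 x 3

print(dense)  # 78500
print(conv)   # 320
```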

Input image

RGB image 4 × 4 × 3

In the figure we see an RGB image separated into its three color planes: red, green and blue. Images can be represented in a number of such color spaces: grayscale, RGB, HSV, CMYK, etc.

You can imagine how computationally intensive things become once images reach dimensions such as 8K (7680 × 4320). The role of a CNN is to reduce images to a form that is easier to process without losing the features that are critical for a good prediction. This is important when designing an architecture that not only learns features well but also scales to massive datasets.
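
A quick back-of-the-envelope calculation gives a sense of scale. It assumes an 8K frame with three color channels (RGB).

```python
# An 8K frame, assumed here to be RGB (three channels).
width, height, channels = 7680, 4320, 3

# Number of raw values the network would have to read for a single image.
values_per_image = width * height * channels

print(f"{values_per_image:,}")  # 99,532,800
```

A single fully connected neuron reading this image directly would already need one weight per value, that is, almost 100 million weights.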

Convolution layer – the kernel

Convolving a 5 × 5 × 1 image with a 3 × 3 × 1 kernel to obtain a 3 × 3 × 1 convolved feature

Image dimensions:

  1. 5 – height;

  2. 5 – width;

  3. 1 – number of channels (a single channel here, as in a grayscale image; an RGB image would have 3).

In the demo above, the green section represents our 5 × 5 × 1 input image. The element that carries out the convolution operation in the first part of the convolution layer is called the kernel / filter K, shown in yellow. Let K be a 3 × 3 × 1 matrix:

Kernel/Filter, K = 

1 0 1
0 1 0
1 0 1

The kernel visits 9 positions because the stride length is 1 (that is, the convolution is non-strided). At each position it multiplies K element by element with the portion P of the image it currently covers and sums the products.

Movement of the kernel

The filter moves to the right by the stride value until it has covered the entire width. It then hops down to the beginning (left edge) of the image by the same stride value and repeats the process until the entire image has been traversed.
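
The sliding-window procedure can be sketched in a few lines of plain Python. The kernel K is the one given above. The 5 × 5 input image is an assumption made up for this example, and the function name `convolve2d` is hypothetical. The sketch assumes a square image and a square kernel with no padding.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding. At every position,
    multiply element by element and sum the products."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            top, left = row * stride, col * stride
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(total)
        output.append(out_row)
    return output


# Kernel K from the text.
K = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

# Illustrative 5 x 5 x 1 binary image (assumed values).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]

feature = convolve2d(image, K)            # stride 1: 9 positions
strided = convolve2d(image, K, stride=2)  # stride 2: 4 positions

print(feature)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
print(strided)  # [[4, 4], [2, 4]]
```

With a stride of 1 the kernel visits 3 × 3 = 9 positions; with a stride of 2 it skips every other position and the output shrinks to 2 × 2.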

Convolution operation on an M × N × 3 image matrix with a 3 × 3 × 3 kernel

For images with multiple channels (e.g. RGB), the kernel has the same depth as the input image. Each kernel slice Kn is multiplied element by element with the matching image channel In ([K1, I1]; [K2, I2]; [K3, I3]), and all the results are summed together with the bias to give a squashed, one-channel convolved feature output.
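
A minimal sketch of the multi-channel case follows. The three 3 × 3 channels, the three kernel slices and the bias of 1 are made-up values, and the function names are hypothetical. Each channel is combined with its own kernel slice, and the per-channel results are summed with the bias into a single output channel.

```python
def window_sum(channel, kernel, top, left):
    """Element-wise product of one kernel slice with one image window, summed."""
    k = len(kernel)
    return sum(
        channel[top + i][left + j] * kernel[i][j]
        for i in range(k)
        for j in range(k)
    )


def convolve_multichannel(image, kernels, bias=0):
    """`image` is a list of channels (each H x W); `kernels` holds one square
    slice per channel. Returns a single-channel feature map (no padding,
    stride 1)."""
    k = len(kernels[0])
    out_h = len(image[0]) - k + 1
    out_w = len(image[0][0]) - k + 1
    return [
        [
            sum(window_sum(ch, ker, r, c) for ch, ker in zip(image, kernels)) + bias
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Assumed 3 x 3 x 3 input: channels I1, I2, I3.
red = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
green = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
blue = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]

# Assumed kernel slices K1, K2, K3.
k1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
k2 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
k3 = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

feature = convolve_multichannel([red, green, blue], [k1, k2, k3], bias=1)
print(feature)  # [[10]]  (3 + 4 + 2 from the channels, plus the bias 1)
```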

Convolution operation with stride length 2

The objective of the convolution operation is to extract features, such as edges, from the input image. The network need not be limited to a single convolutional layer. Conventionally, the first layer is responsible for capturing low-level features such as edges, color, gradient orientation, etc. With added layers, the architecture adapts to high-level features as well, giving us a network with a wholesome understanding of the images in the dataset, similar to how we would understand them.
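
To make "capturing edges" concrete, here is a small sketch with a hand-picked vertical-edge kernel (a Prewitt-style filter, not taken from the article). A trained CNN would learn its kernels rather than have them specified like this. The 5 × 5 image, dark on the left and bright on the right, is also an assumption.

```python
def convolve2d(image, kernel):
    """Unpadded, stride-1 convolution of a 2-D list with a square kernel."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hand-picked kernel: responds to a dark-to-bright change from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# Assumed image: two dark columns followed by three bright columns.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

response = convolve2d(image, vertical_edge)
print(response)  # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
```

The response is strong wherever the window straddles the edge and zero over the uniform bright region.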

The operation has two kinds of result: one in which the convolved feature is smaller than the input, and one in which its dimensions either stay the same or increase. The first is obtained by applying valid padding, the second by applying same padding (padding the input with zeros).

Same padding: a 5 × 5 × 1 image is padded with a border of zeros to create a 7 × 7 × 1 image

When we pad the 5 × 5 × 1 image with one row or column of zeros on each side, which makes it 7 × 7 × 1, and then pass the 3 × 3 × 1 kernel over it, the convolved matrix has dimensions 5 × 5 × 1, the same as the input. Hence the name: same padding. If, on the other hand, we perform the same operation without padding, we obtain a matrix with the dimensions of the kernel itself (3 × 3 × 1); this is called valid padding.
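
The effect of padding and stride on the output size can be checked with the standard relation output = floor((n + 2p − k) / s) + 1, where n is the input size, k the kernel size, p the padding on each side and s the stride. The sketch below uses hypothetical helper names `zero_pad` and `output_size`. For a 5 × 5 input and a 3 × 3 kernel, a padding of one pixel on each side (a 7 × 7 padded image) keeps the output at 5 × 5.

```python
def zero_pad(image, pad):
    """Surround a 2-D list with `pad` rows and columns of zeros on every side."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * width for _ in range(pad)]
    return padded


def output_size(n, k, stride=1, pad=0):
    """Side length of the convolved feature for an n x n input and k x k kernel."""
    return (n + 2 * pad - k) // stride + 1


image = [[1] * 5 for _ in range(5)]  # assumed 5 x 5 input of ones
padded = zero_pad(image, 1)

print(len(padded), len(padded[0]))   # 7 7
print(output_size(5, 3))             # 3  (valid padding)
print(output_size(5, 3, pad=1))      # 5  (same padding)
print(output_size(5, 3, stride=2))   # 2  (valid padding, stride 2)
```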

This repository contains many such GIFs that help you better understand how padding and stride length work together to achieve the desired results.

Pooling layer

3 × 3 pooling over a 5 × 5 convolved feature

Like the convolutional layer, the pooling layer is responsible for reducing the spatial size of the convolved feature. This dimensionality reduction decreases the computational power required to process the data. It is also useful for extracting dominant features that are rotationally and positionally invariant, which keeps the training of the model effective.

There are two types of pooling: max pooling and average pooling. Max pooling returns the maximum value from the portion of the image covered by the kernel. Average pooling returns the mean of all the values in the portion covered by the kernel.
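
Both kinds of pooling can be sketched with one helper that takes the reduction as an argument. The 4 × 4 feature map and the 2 × 2 window with stride 2 are assumptions for the example, and the names `pool2d` and `mean` are hypothetical.

```python
def pool2d(feature, size, stride, reduce_fn):
    """Apply `reduce_fn` (e.g. max or mean) to every size x size window
    of a square feature map."""
    out = (len(feature) - size) // stride + 1
    return [
        [
            reduce_fn([
                feature[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            ])
            for c in range(out)
        ]
        for r in range(out)
    ]


def mean(values):
    return sum(values) / len(values)


# Assumed 4 x 4 convolved feature.
feature = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [3, 4, 1, 8],
]

max_pooled = pool2d(feature, 2, 2, max)
avg_pooled = pool2d(feature, 2, 2, mean)

print(max_pooled)  # [[6, 4], [7, 9]]
print(avg_pooled)  # [[3.75, 2.25], [4.0, 4.5]]
```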

Max pooling also acts as a noise suppressant: it discards noisy activations altogether, so it de-noises while reducing dimensionality. Average pooling, on the other hand, suppresses noise only through the dimensionality reduction itself. So we can say that max pooling performs much better than average pooling.

Types of pooling

The convolutional layer and the pooling layer together form the i-th layer of a convolutional neural network. The number of such layers can be increased, depending on the complexity of the images, to capture finer low-level details, but at the cost of more computational power.

Going through the process above enables the model to understand the features of the image. We then flatten the result into a column vector and feed it to a regular neural network for classification.

Classification – Fully Connected Layer

Adding a fully connected layer is a (usually) cheap way of learning non-linear combinations of the high-level features represented by the output of the convolutional layer. The fully connected layer learns a possibly non-linear function in that space.

Now that the input image has been converted into a form suitable for a multilayer perceptron, we flatten it into a column vector. The flattened output is fed to a feedforward neural network, and backpropagation is applied at every training iteration. Over a series of epochs, the model learns to distinguish between dominant and certain low-level features in images and to classify them using the softmax classification technique.
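
The forward pass of this last stage can be sketched as follows: flatten, apply one fully connected layer, then softmax. The pooled feature map, the weights and the three output classes are all made-up values. In a real network the weights would be learned by backpropagation, which this sketch does not implement.

```python
import math


def flatten(feature_maps):
    """Turn a list of 2-D feature maps into one flat vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed output of the last pooling layer: a single 2 x 2 feature map.
pooled = [[[6, 4], [7, 9]]]
x = flatten(pooled)  # [6, 4, 7, 9]

# Hypothetical weights for three classes (one row per class).
weights = [
    [0.1, 0.0, -0.1, 0.0],
    [0.0, 0.1, 0.0, 0.1],
    [-0.1, 0.0, 0.1, 0.0],
]
biases = [0.0, 0.0, 0.0]

probabilities = softmax(dense(x, weights, biases))
predicted_class = probabilities.index(max(probabilities))

print(predicted_class)  # 1
```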

Various CNN architectures have played a key role in building the algorithms that power, and will continue to power for the foreseeable future, AI as a whole. Some of them are listed below:

  1. LeNet;

  2. AlexNet;

  3. VGGNet;

  4. GoogLeNet;

  5. ResNet;

  6. ZFNet.

Repository with a digit recognition project.

Today, three years after this article first appeared, CNNs, their architectures and their underlying concepts are still being actively revised: they continue to evolve and offer new, more accurate solutions to a wide range of problems, and the ever-growing volume of data can, of course, be visualized. CNNs therefore have a huge number of practical applications. If you are interested in experimenting and exploring in the field of AI, take a look at our course on machine and deep learning; if you would rather work with data more independently, take a closer look at our flagship Data Science course.
