How a neural network learned to recognize landmarks in photographs

Introduction

Hi all. This is my first post and my first write-up of a project. In a nutshell, I will describe what I did here.

The goal of the project was to recognize landmarks in photographs using machine learning, namely convolutional neural networks. This topic was chosen for the following reasons:

  • I already had some experience with computer vision tasks

  • the task sounded like something that could be done quickly, without much effort and, importantly, without many computational resources (all the networks were trained in Colab or on Kaggle)

  • the task may have some practical application (well, in theory…)

At first it was planned as a purely educational project, but then I got invested in the idea and decided to polish it as far as I could.

Next, I will talk about how I approached solving this problem, following along with the code from the notebook in which all the magic took place and trying to explain some of my decisions. Perhaps this will help someone get over the “blank slate” fear and see that this kind of thing is really quite doable!

Tools

Well, first of all, I will tell you about the tools used in the project.

  • Colab/Kaggle: used to train the networks on a GPU.

  • Weights And Biases: a service where I saved the models, their descriptions, the loss and metric values, and the training and preprocessing parameters. In general, I kept full records there. The data can be found at the link. While writing the code, the metadata section (which, in fact, contains the training and preprocessing parameters) changed slightly. In the files section, you can read the description of a network (how its layers are arranged), download its trained weights, and look at the loss and metric values.

Data for training

Well, it would probably be worth starting with the choice of data for training the neural network. To do this, I dug through the datasets on Kaggle (link), and this is the site I liked in the end.

Actually, as it turned out, there is a competition from Google devoted to exactly this task of landmark recognition. And here the first problem appeared: the dataset weighs ≈100 GB. Realizing that I was not going to train networks on my own modest hardware anyway, I had to abandon this option. After browsing a bit more, I settled on this dataset. It contains 210 classes and approximately 50 photographs per class. The pictures come in all sizes, taken from different angles and from different distances. In general, the dataset is not refined at all, but so far that is the only kind I have worked with. And you don’t need to label the data yourself, which is great! Here are a couple of “very good” photos:

This, for example, is how the “Bolshoi Theater in Moscow” class looks

And here, instead of a photograph of Central Park in New York, we were slipped a diagram of it

Data storage and processing (part 1)

In this section, I would like to talk about how the data was stored and processed.

First of all, let's check how many channels the pictures contain. Most often one works with three-channel (RGB) images, but in addition to this format, the dataset also contains black-and-white pictures and RGBA photos. Fortunately, there are very few such pictures (19), so we will delete them without remorse.
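A minimal sketch of such a check, assuming the images live in a data/<class>/<file>.jpg layout (the folder structure here is my assumption):

```python
from pathlib import Path
from PIL import Image

bad = []
for path in Path("data").rglob("*.jpg"):
    with Image.open(path) as img:
        if img.mode != "RGB":  # catches "L" (grayscale), "RGBA", etc.
            bad.append(path)

print(f"found {len(bad)} non-RGB images")
for path in bad:
    path.unlink()  # delete them without remorse
```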

To store the data, I wrote several classes inheriting from torch.utils.data.Dataset. When implementing such classes, you must override the methods __getitem__ and __len__ (that is, teach the class to return an element by index and to report the length of a class instance). Well, that’s what I did.
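A minimal skeleton of such a class (illustrative names, not the exact code from the notebook):

```python
from torch.utils.data import Dataset

class LandmarkDataset(Dataset):
    def __init__(self, items):
        self.items = items  # e.g. a list of (image, class_index) pairs

    def __len__(self):
        # the length of the dataset instance
        return len(self.items)

    def __getitem__(self, idx):
        # access to an element by index
        return self.items[idx]
```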

Dataset with fast access (FastDataset)

The first thing that came to mind: let’s just read the images, resize them to a common size, convert them to PyTorch tensors, and store the tensors. Then, when we want to iterate over the elements of the dataset, we simply fetch the tensors from memory without any further processing. Great: we have saved all the data, waited a bit during initialization, and access is (almost) instant. It would seem, what could go wrong… But the answer is obvious, in fact: storing preprocessed data is not a cheap pleasure, and you pay for it in the hard coin of memory. What to do…
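Roughly like this (a sketch; the field and parameter names, and the 224×224 size, are my assumptions):

```python
from PIL import Image
import torchvision.transforms as T
from torch.utils.data import Dataset

class FastDataset(Dataset):
    def __init__(self, paths, labels, size=(224, 224)):
        transform = T.Compose([T.Resize(size), T.ToTensor()])
        # pay the full preprocessing cost once, up front
        self.images = [transform(Image.open(p).convert("RGB")) for p in paths]
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # access is now just an indexing operation
        return self.images[idx], self.labels[idx]
```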

Dataset with slow access (CustomDataset)

The second thing that came to mind: let’s just store a list containing the paths to our pictures. This way, the memory costs become many times smaller. But with this approach, we sacrifice the time a pass over the data takes. Indeed, when storing the data as a list of paths, on each access we must read the image from its path, apply the resizing and tensor-conversion operations, and only then can we work with the resulting object. Slow, yes, but there’s nothing to be done.
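The lazy counterpart might look like this (again a sketch with guessed names):

```python
from PIL import Image
import torchvision.transforms as T
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, paths, labels, size=(224, 224)):
        self.paths = paths
        self.labels = labels
        self.transform = T.Compose([T.Resize(size), T.ToTensor()])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # read + resize + convert on every single access
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img), self.labels[idx]
```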

Network training

In this section, we will step away from the notebook a bit.

Data partitioning

So, we already have two types of datasets in our arsenal. Let’s train something. To do this, we need to write a network training loop, in which we will also compute metrics on a validation set to choose the network’s hyperparameters sensibly. But for such a set to exist, we need to learn how to split the data. To that end, I implemented in each class a method for splitting the original dataset into training and validation parts. I wrapped all of this in simple functions, and at the output I got instances of torch.utils.data.DataLoader that can be conveniently fed to the network.
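One simple way to get such loaders (a sketch; my classes implement their own split method, this version uses torch's random_split for illustration):

```python
import torch
from torch.utils.data import DataLoader, random_split

def make_loaders(dataset, val_fraction=0.2, batch_size=64):
    val_len = int(len(dataset) * val_fraction)
    train_set, val_set = random_split(
        dataset,
        [len(dataset) - val_len, val_len],
        generator=torch.Generator().manual_seed(42),  # reproducible split
    )
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader
```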

Training

For training, the Adam optimizer from PyTorch was used, and nn.CrossEntropyLoss was chosen as the loss function.
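The skeleton of one training epoch with that setup (a sketch; the real loop in the notebook also logs losses and metrics to Weights & Biases):

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, criterion, device):
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # the lr is my guess
```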

At first, I tried to write and train very simple networks consisting of two parts: a convolutional part using convolutions and poolings, and a fully connected part using linear layers and a bit of dropout (on wandb this is the zero version of CNN). It became clear that the architecture needed to be more complex. I added batch-normalization layers, and the quality jumped up very nicely. Before that, I had treated batch norm with disdain, because I didn’t really understand how it works (and still don’t, honestly). In general, by trial and error, I managed to raise the F1 metric on the validation set to 93%. At that point I thought the goal was achieved and I had gotten off easy, but no such luck. Just to make sure everything was really fine, I decided to google the metric I was using. It turned out to be not at all what I expected, and when I fixed everything, the metric on the validation set fell to 31%, while on the training set it was 96%. Now that’s more like it! The network overfits beautifully. Let’s solve this problem.
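I won't reconstruct my exact mistake here, but with 210 classes even the choice of averaging for F1 changes the number dramatically. A toy illustration with scikit-learn (which library the notebook actually used is beside the point):

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1]  # true class indices (toy data)
y_pred = [0, 2, 2, 2, 0]  # predicted class indices

# "macro": F1 is computed per class and then averaged, so rare classes
# weigh as much as frequent ones; a stricter number for 210 classes.
print(f1_score(y_true, y_pred, average="macro"))

# "micro": global counts; for single-label multiclass this equals accuracy.
print(f1_score(y_true, y_pred, average="micro"))
```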

Data storage and processing (part 2)

The first idea that came to mind: the network simply cannot learn from 45 images per class, some of which are not of the best quality either. What can be done about it? Well, let’s apply augmentation. I do not know what audience is reading this write-up, so I will give a brief explanation. Augmentation is, in essence, increasing the amount of data on which a network can be trained. Let’s try to artificially expand the set of existing images by applying some transformations to them.

The idea is as follows: let’s apply a set of transformations to each existing image, for example, rotating it by 180 degrees or rotating it just a little.

These are the pictures obtained from the top-left one.

We were able to expand our dataset by as much as 7 times! Let’s use this to train the network.
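The exact set of transformations isn't listed above, so here is just a plausible pool of that kind, sketched with torchvision.transforms:

```python
import torchvision.transforms as T

# seven illustrative transformations: applying each of them to every
# image is what multiplies the dataset size several times over
augmentations = [
    T.RandomRotation(degrees=(180, 180)),  # turn upside down
    T.RandomRotation(degrees=15),          # rotate a little
    T.RandomHorizontalFlip(p=1.0),         # mirror
    T.ColorJitter(brightness=0.3),
    T.ColorJitter(contrast=0.3),
    T.GaussianBlur(kernel_size=5),
    T.RandomResizedCrop(size=224, scale=(0.7, 1.0)),
]
```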

Next, I implemented two more classes, by analogy with the datasets from Part 1: one with fast access, AugmentedFastDataset, and one with slow access, AugmentedCustomDataset. A problem arose instantly: already at this stage, with 7 different types of transformations applied, the fast-access dataset ate all the memory, and everything crashed. Accordingly, I had to use its slower but more memory-frugal counterpart.

Well, what do we see (look at CNN.v9): the model still overfits badly. What else can we come up with…

And the following idea came to mind: why apply only one transformation (by which I mean an operation that changes the original image) at a time? They can be applied sequentially. Then, by making various combinations, we can expand the dataset even further. Let’s implement this idea in the class AdvancedCustomDataset. Briefly, the process is this: we pass an argument ex_amount to the class constructor, which specifies how many instances we want to get for each class. Then we go through each class and, until we reach the desired number of images, apply a random set of transformations to a random image. Below, you can see an example of how this idea works.

From one picture we got 44
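A sketch of the core sampling loop behind this idea (apart from ex_amount, all names here are my own):

```python
import random

def plan_class_examples(image_paths, transform_pool, ex_amount):
    """Pick (image, transformation combo) pairs until ex_amount is reached."""
    plan = []  # the combos are applied lazily, on access
    while len(plan) < ex_amount:
        path = random.choice(image_paths)  # a random image of this class
        combo = random.sample(transform_pool,
                              k=random.randint(1, len(transform_pool)))
        plan.append((path, combo))
    return plan
```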

Also, there were some minor changes related to replacing certain functions with their counterparts from other modules. The reason is simple: since the dataset has grown a lot and element access is slow, a single pass takes a long time, so it would be nice to save a couple of minutes on such trifles (a sketch of both swaps follows the list).

  • images were previously opened using the PIL library. Comparisons showed that opening images with the cv2 library is much faster, so, unlike the other classes, the last dataset uses the cv2 analogue

  • the image transformations were taken from the torchvision.transforms module. As it turned out later, the analogous functions from the albumentations module work faster.
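Both swaps might look roughly like this (a sketch; the file name and the specific transformations are placeholders):

```python
import cv2
import albumentations as A

# cv2 reads BGR, so convert to RGB to stay consistent with the rest of the code
img = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

transform = A.Compose([
    A.Rotate(limit=15, p=1.0),  # albumentations analogue of RandomRotation
    A.HorizontalFlip(p=0.5),
])
augmented = transform(image=img)["image"]  # albumentations works on numpy arrays
```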

Well, the data is done, the metric is fixed, it’s time to train. Since I was now free to choose how many pictures per class I wanted, I set the value to 2000. And after a long process of training and validation, we get a model with F1 = 60% on the validation set. Not bad already, I think.

A little about Fine Tuning

Well, we have already reached acceptable quality; let’s now step away from the hand-written architecture and try to fine-tune an existing network. For this, I took the VGG13 model with batch normalization and pretrained weights. I then froze the entire convolutional part, played a little with the classifier, and set the whole thing to train. It turned out even better than before: the metric on the validation set is 70% (link).
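The setup might look like this (a sketch: the classifier head below is my guess at “playing with the classifier”, and 210 is the number of classes in the dataset):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg13_bn(weights=models.VGG13_BN_Weights.DEFAULT)

for param in model.features.parameters():
    param.requires_grad = False  # freeze the convolutional part

model.classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 1024),  # VGG's conv output is 512x7x7 for 224x224 input
    nn.ReLU(inplace=True),
    nn.Dropout(0.5),
    nn.Linear(1024, 210),
)

# only the new head is trained:
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
```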

Afterword

So, what we got as a result: two networks that work well and really do recognize landmarks with decent quality. I even suspect that a fair share of the remaining thirty percent of errors comes from the crooked photographs in the dataset (I gave examples above).

I tried to package the whole thing into a mini-project that can be downloaded from GitHub.

Please leave any comments on what I’ve written; I will be glad to gain experience!
