YOLO, or You Only Look Once, is a convolutional neural network architecture used to detect multiple objects in an image. In 2020, against the backdrop of the pandemic, object detection in images became more relevant than ever. This article is a complete step-by-step guide for those who want to learn how to detect objects using YOLO on their own data. It assumes that you already know how object detection with deep learning works and, in particular, that you know the basics of YOLO, so let's dive into the task.
To implement the project, I used Google Colab. The first experiments with preprocessing were not computationally expensive, so they were performed on my laptop, but the model was trained on the Google Colab GPU.
LabelImg… However, both options are tedious and time-consuming, especially the latter. There is another option, the most viable for my purpose: working with a public dataset.
Face Mask Detection Kit by Kaggle and uploaded it straight to my Google Drive. Take a look here to see how to do that. The downloaded dataset consists of two folders: images, which contains 853 .png files, and annotations, which contains the 853 corresponding XML annotations.
After loading the dataset, in order to train our model, we need to convert the .xml files to .txt, or more specifically, to the YOLO format. Example:
Suppose this is an image annotation with only 3 bounding rectangles, as shown by the number of
… in the XML.
To create a suitable text file, we need 5 data types from each XML file. For each
… in the XML file, extract the class i.e. … field and bounding box coordinates – 4 attributes in … … A suitable format looks like this:
<class_name> <x_center> <y_center> <width> <height>
script which extracts the 5 attributes of every object in every XML file and creates the corresponding TXT files. You can find detailed comments on the conversion approach in my script. For example, Image1.jpg should have a corresponding Image1.txt, like this:
1 0.18359375 0.337431693989071 0.05859375 0.10109289617486339
0 0.4013671875 0.3333333333333333 0.080078125 0.12021857923497267
1 0.6689453125 0.3155737704918033 0.068359375 0.13934426229508196
This is an exact conversion of the aforementioned .xml file to suitable text.
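The conversion described above can be sketched in a few lines of Python. This is not the author's script, only a minimal illustration. It assumes Pascal VOC style annotations (`size/width`, `size/height`, and one `object` element per box with `name` and `bndbox` children), and it assumes the class mapping 0 = mask worn correctly, 1 = everything else. The sample annotation is hypothetical and was built to be consistent with the first line of the text example.

```python
import xml.etree.ElementTree as ET

# Assumed mapping: with_mask is class 0, the two other labels share class 1.
CLASS_IDS = {"with_mask": 0, "without_mask": 1, "mask_weared_incorrect": 1}

# Hypothetical annotation with a single object, for illustration only.
SAMPLE_XML = """<annotation>
  <size><width>512</width><height>366</height></size>
  <object>
    <name>without_mask</name>
    <bndbox><xmin>79</xmin><ymin>105</ymin><xmax>109</xmax><ymax>142</ymax></bndbox>
  </object>
</annotation>"""


def voc_to_yolo(xml_text):
    """Return one YOLO-format line per object in a VOC-style annotation."""
    root = ET.fromstring(xml_text)
    img_w = float(root.findtext("size/width"))
    img_h = float(root.findtext("size/height"))
    lines = []
    for obj in root.iter("object"):
        class_id = CLASS_IDS[obj.findtext("name")]
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # YOLO stores the box center and size, normalized by the image size.
        x_center = (xmin + xmax) / 2 / img_w
        y_center = (ymin + ymax) / 2 / img_h
        width = (xmax - xmin) / img_w
        height = (ymax - ymin) / img_h
        lines.append(f"{class_id} {x_center} {y_center} {width} {height}")
    return lines
```

Calling `voc_to_yolo(SAMPLE_XML)` returns a list with one line per object; writing that list to a .txt file named after the image gives the annotation file darknet expects.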
Note: It is very important to group images and corresponding TXTs into one folder.
Of course, before we start training the model, we need to be absolutely sure that the conversion was correct, because we want to feed the model valid data. To check this, I wrote a script that takes an image and its corresponding text annotation from the given folder and displays the image with its bounding boxes. Here’s what happened:
So far so good, let’s continue.
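The core of such a check is the inverse conversion, from normalized YOLO values back to pixel corners that can be drawn on the image. The sketch below covers only that step and is not the author's script; the function name is hypothetical, and the drawing itself is left to whatever image library you prefer.

```python
def yolo_to_pixel_box(line, img_w, img_h):
    """Convert one YOLO annotation line to (class_id, xmin, ymin, xmax, ymax) in pixels."""
    class_id, x_center, y_center, width, height = line.split()
    # Undo the normalization by the image width and height.
    x_center, width = float(x_center) * img_w, float(width) * img_w
    y_center, height = float(y_center) * img_h, float(height) * img_h
    xmin = round(x_center - width / 2)
    ymin = round(y_center - height / 2)
    xmax = round(x_center + width / 2)
    ymax = round(y_center + height / 2)
    return int(class_id), xmin, ymin, xmax, ymax
```

If the corners recovered this way match the values in the original XML file, the conversion for that image was correct.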
To train our model and test it during the training phase, we have to split the data into two sets – a training set and a testing set. The proportion was 90 and 10%, respectively. So I created two new folders and put 86 annotated images in test_folder and the remaining 767 images in train_folder.
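A split like this can be made reproducible with a seeded shuffle. The following is a minimal sketch rather than the procedure actually used here; the function name and the seed are assumptions, and the test-set size is rounded up so that 853 images yield 86 test images, as above.

```python
import random


def split_dataset(names, test_percent=10, seed=0):
    """Shuffle file names deterministically and return (train, test) lists."""
    names = sorted(names)
    random.Random(seed).shuffle(names)
    # Integer ceiling division avoids floating-point surprises.
    n_test = (len(names) * test_percent + 99) // 100
    return names[n_test:], names[:n_test]
```

Each image should then be moved into its folder together with its .txt annotation, since the two must stay side by side.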
Cloning the darknet framework
The next step is to clone the darknet repository using the command:
!git clone https://github.com/AlexeyAB/darknet
After that, we need to load the weights of a pretrained model, that is, apply transfer learning rather than train the model from scratch.
darknet53.conv.74 is the backbone of the YOLOv3 network, which is first trained for classification on the ImageNet dataset and acts as a feature extractor. To use it for detection, the additional weights of the YOLOv3 network are randomly initialized before training and are then learned during the training phase.
To complete the preparation and start training the model, you need to create five files.
1. face_mask.names: create a _.names file containing the classes of the task.
In our case, the original Kaggle dataset has 3 categories: with_mask, without_mask and mask_weared_incorrect (mask worn correctly, no mask, mask worn incorrectly).
To make things a little easier, I’ve combined the last two categories into one. So, there are two categories, Good and Bad, based on whether someone is wearing their mask correctly:
2. face_mask.data: create a _.data file containing the information about our task that darknet will read:
classes = 2
train = data/train.txt
valid = data/test.txt
names = data/face_mask.names
backup = backup/
Note: If there is no backup folder, create it, because the weights are saved there every 1000 iterations. These are your checkpoints: if the training is unexpectedly interrupted, you can continue training the model from the last one.
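Both small files can also be generated from code, which keeps the class count and the class list in sync. The helpers below are a hypothetical sketch (the function names are not part of darknet); the paths and class names are the ones used in this article.

```python
def render_names(class_names):
    """Content of the .names file: one class name per line."""
    return "\n".join(class_names) + "\n"


def render_data(class_names, train_list, valid_list, names_path, backup_dir):
    """Content of the .data file in the key = value layout shown above."""
    return (
        f"classes = {len(class_names)}\n"
        f"train = {train_list}\n"
        f"valid = {valid_list}\n"
        f"names = {names_path}\n"
        f"backup = {backup_dir}\n"
    )


CLASSES = ["Good", "Bad"]
DATA_TEXT = render_data(
    CLASSES, "data/train.txt", "data/test.txt", "data/face_mask.names", "backup/"
)
```

Writing `render_names(CLASSES)` to face_mask.names and `DATA_TEXT` to face_mask.data reproduces the two files described above.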
3. face_mask.cfg: This config file should be adapted to our task. Copy yolov3.cfg, rename it to _.cfg and change it as described below:
change the batch line to batch = 64;
change the subdivisions line to subdivisions = 16. If you run out of memory, increase this value to 32 or 64;
set the input size to the standard one: width = 416, height = 416;
set max_batches to (#classes * 2000), which gives 4000 iterations for our problem.
I started with a resolution of 416×416 and trained my model for 4000 iterations, but in order to achieve better accuracy, I increased the resolution and extended the training by another 3000 iterations. If you have only one category, you still shouldn’t stop at 2000 iterations: 4000 iterations is assumed to be the minimum.
change the steps line to 80% and 90% of max_batches. In our case, 0.8 * 4000 = 3200 and 0.9 * 4000 = 3600;
press Ctrl + F and search for the word “yolo”. The search leads directly to the [yolo] layers, where you change the number of classes (in our case classes = 2) and the number of filters. The filters variable is the second variable above the [yolo] line.
The line should become filters = (classes + 5) * 3, which in our case is filters = (2 + 5) * 3 = 21. There are 3 [yolo] layers in the .cfg file, so the changes mentioned above need to be made three times.
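The arithmetic behind these settings is easy to get wrong when the number of classes changes, so here is a small sketch that derives them. The function name is hypothetical, and the 4000-iteration floor simply follows the minimum stated above.

```python
def yolo_cfg_values(num_classes, min_batches=4000):
    """Derive the .cfg values that depend on the number of classes."""
    # classes * 2000 iterations, but never below the assumed minimum.
    max_batches = max(num_classes * 2000, min_batches)
    # Learning-rate steps at 80% and 90% of max_batches.
    steps = (max_batches * 80 // 100, max_batches * 90 // 100)
    # Each of the 3 anchors per [yolo] layer predicts 4 box values,
    # 1 objectness score and one score per class.
    filters = (num_classes + 5) * 3
    return {"max_batches": max_batches, "steps": steps, "filters": filters}
```

For two classes this gives max_batches = 4000, steps = 3200,3600 and filters = 21, matching the values above.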
4. Files train.txt and test.txt: These two files are referenced in the face_mask.data file and list the absolute path of each image for the model. For example, a snippet of my train.txt file looks like this:
As I said, the .png files must be located in the same folder with the corresponding text annotations.
This means that the project is structured like this:
├──annotations (contains original .xml files)
├──images (contains the original .png images)
├──mask_yolo_test (contains .png and .txt files for testing)
├──mask_yolo_train (contains .png and .txt files for training)
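With this layout, train.txt and test.txt can be produced by listing each folder. The sketch below is one way to do it, not the author's code; the function name is hypothetical, and it assumes every image in the folder should be listed.

```python
from pathlib import Path


def write_image_list(image_dir, list_path, suffix=".png"):
    """Write the absolute path of every image in image_dir, one per line."""
    paths = sorted(
        p.resolve() for p in Path(image_dir).iterdir() if p.suffix.lower() == suffix
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)
```

For example, `write_image_list("mask_yolo_train", "data/train.txt")` would list the training images; the .txt annotations are skipped because only files with the image suffix are kept.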
After compiling darknet, you need to make the darknet binary executable, like this:
!chmod +x ./darknet
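And we begin to train the model by running the following command (as written, it resumes from the last checkpoint in the backup folder; for the very first run, pass the pretrained darknet53.conv.74 weights instead):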
!./darknet detector train data/face_mask.data cfg/face_mask.cfg backup/face_mask_last.weights -dont_show -i 0 -map
We pass the -map flag so that important indicators such as average loss, precision, recall, average precision (AP) and mean average precision (mAP) are displayed in the console.
The mAP shown in the console is considered a better metric than the loss, so keep training the model as long as the mAP is increasing.
Training can take hours depending on various parameters; this is normal. It took me about 15 hours, but I got my first impressions of the model after about 7 hours, that is, 4000 iterations.
The model is ready for demonstration. Let’s try it on images that it has never seen before. To do this, run the following command:
!./darknet detector test data/face_mask.data cfg/face_mask.cfg backup/face_mask_best.weights
Did you notice that we used face_mask_best.weights and not face_mask_final.weights? Fortunately, our model keeps the best weights (mAP reached 87.16%) in the backup folder in case we train it for more epochs than we should (which might lead to overfitting).
The images below are taken from Pexels, a collection of high-resolution images, and even the naked eye can see that they differ significantly from the training and test datasets and thus have a different distribution. To see how well the model generalizes, I selected these photos:
In the images above, the model detected the masks accurately and is pretty confident in its predictions. It is noteworthy that the mask on the globe in the right-hand image did not confuse the model: this suggests that predictions are made not only on the basis of whether a mask is present, but also on the basis of the context around the mask.
The two images above obviously show that people are not wearing masks, and the model seems to be able to recognize that quite easily.
Using the two examples above, you can check the performance of the model in cases where the image contains people with and without masks. The model can identify faces even against a blurry background, and this fact is admirable.
I noticed that the model is less confident about the person in the foreground (38%, in the sharp area) than about the person immediately behind (100%, in the blurred area). This may be related to the quality of the training dataset, which seems to influence the model to some extent (though at least its predictions are not wrong).
One last test
Of course, YOLO’s big advantage is its speed. Therefore, I want to show you how it works with video:
!./darknet detector demo data/face_mask.data cfg/face_mask.cfg backup/face_mask_best.weights -dont_show vid1.mp4 -i 0 -out_filename res1.avi
This was my first step-by-step tutorial on how to make your own detector using YOLOv3 on a custom dataset. I hope it was helpful to you.