Face mask recognition with YOLOv3

YOLO, or You Only Look Once, is a convolutional neural network architecture used to recognize multiple objects in an image. In 2020, against the backdrop of a pandemic, the task of detecting objects in images became more relevant than ever. This article is a complete step-by-step guide for those who want to learn how to recognize objects using YOLO on their own data. It assumes that you already know how object recognition with deep learning works and, in particular, that you know the basics of YOLO, so let’s dive into our task.

I’m going to work with YOLOv3, one of the most popular versions of YOLO, which is a modern, accurate and fast real-time object detection system. Newer versions such as YOLOv4 and YOLOv5 can achieve even better results. You can find this project in my repositories on GitHub.

Work environment

To implement the project, I used Google Colab. The first experiments with preprocessing were not computationally expensive, so they were performed on my laptop, but the model was trained on the Google Colab GPU.

You can activate the GPU in Colab in the Edit -> Notebook Settings menu.

Data set

To begin with, to make a mask detector, you need the appropriate data. Also, due to the nature of YOLO, annotated data with bounding boxes is needed. One option is to create your own dataset, either by collecting images from the Internet or by taking pictures of friends and acquaintances, and annotating the photos manually with a tool such as LabelImg. However, both options are tedious and time consuming, especially the latter. There is another option, the most viable for my purpose: to work with a public dataset.

I chose the Face Mask Detection dataset from Kaggle and uploaded it straight to my Google Drive. Take a look here to see how you can do that. The downloaded dataset consists of two folders:

  • images, contains 853 .png files;

  • annotations, contains 853 corresponding XML annotations.

After loading the dataset, in order to train our model, we need to convert the .xml files to .txt, or more specifically, to the YOLO format. Example:


Suppose this is the annotation of an image with only 3 bounding boxes, as shown by the number of <object> elements in the XML.

To create a suitable text file, we need 5 values for each object in the XML file. For each <object>, extract the class, i.e. the <name> field, and the bounding box coordinates, i.e. the 4 attributes in <bndbox>. A suitable format looks like this:

<class_name> <x_center> <y_center> <width> <height>

I wrote a script that extracts the 5 attributes of every object in every XML file and creates the corresponding TXT files. You can find detailed comments on the conversion approach in my script. For example, Image1.jpg should have a corresponding Image1.txt, like this:

1 0.18359375 0.337431693989071 0.05859375 0.10109289617486339
0 0.4013671875 0.3333333333333333 0.080078125 0.12021857923497267
1 0.6689453125 0.3155737704918033 0.068359375 0.13934426229508196

This is an exact conversion of the aforementioned .xml file to suitable text.
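The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's xml_to_yolo.py: the function name voc_xml_to_yolo_lines is hypothetical, the tag names (size, object, name, bndbox, xmin and so on) assume the Pascal VOC layout, and the label-to-id mapping assumes 0 for a correctly worn mask and 1 for everything else.

```python
import xml.etree.ElementTree as ET

# Assumed mapping: 0 = mask worn correctly, 1 = no mask or mask worn incorrectly.
CLASS_IDS = {"with_mask": 0, "without_mask": 1, "mask_weared_incorrect": 1}


def voc_xml_to_yolo_lines(xml_text):
    """Convert one Pascal VOC style annotation into YOLO text lines."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)

    lines = []
    for obj in root.iter("object"):
        class_id = CLASS_IDS[obj.find("name").text]
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # YOLO stores the box centre and the box size,
        # each divided by the image width or height.
        x_center = (xmin + xmax) / 2 / img_w
        y_center = (ymin + ymax) / 2 / img_h
        width = (xmax - xmin) / img_w
        height = (ymax - ymin) / img_h
        lines.append(f"{class_id} {x_center} {y_center} {width} {height}")
    return lines
```

Each returned string is one line of the .txt file, so writing them out joined by newlines gives the annotation file for that image.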

Note: It is very important to group images and corresponding TXTs into one folder.

Of course, before we start training the model, we need to be absolutely sure that the conversion was correct, because we want to feed the model valid data. To check this, I wrote a script that takes an image and its corresponding text annotation from a given folder and displays the image with its bounding boxes. Here’s what happened:

So far so good, let’s continue.
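The check described above comes down to inverting the conversion: turning a normalized YOLO line back into pixel corners that can be drawn on the image. The sketch below shows only that step; the function name yolo_line_to_pixel_box is hypothetical and this is not the author's show_bb.py.

```python
def yolo_line_to_pixel_box(line, img_w, img_h):
    """Turn '<class> <x_center> <y_center> <width> <height>' into pixel corners."""
    class_id, x_center, y_center, width, height = line.split()
    # Undo the normalization by multiplying with the image size.
    xc = float(x_center) * img_w
    yc = float(y_center) * img_h
    w = float(width) * img_w
    h = float(height) * img_h
    # Corners are half a box away from the centre in each direction.
    xmin = round(xc - w / 2)
    ymin = round(yc - h / 2)
    xmax = round(xc + w / 2)
    ymax = round(yc + h / 2)
    return int(class_id), xmin, ymin, xmax, ymax
```

The resulting corners can then be passed to any drawing routine, for example OpenCV's cv2.rectangle, to overlay the boxes on the image and compare them with the original XML by eye.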

Data split

To train our model and evaluate it during the training phase, we have to split the data into two sets: a training set and a test set. The split was 90% and 10%, respectively. So I created two new folders and put 86 annotated images in test_folder and the remaining 767 images in train_folder.
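A reproducible way to make such a split is sketched below. The function name split_stems, the fixed seed and the choice to round the test share up are all assumptions made for illustration; rounding up happens to reproduce the 86 / 767 counts for 853 images.

```python
import math
import random


def split_stems(stems, test_fraction=0.1, seed=42):
    """Shuffle file stems deterministically and split them into train and test."""
    shuffled = sorted(stems)
    random.Random(seed).shuffle(shuffled)
    # Round the test share up, so 10% of 853 images becomes 86.
    n_test = math.ceil(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    train = shuffled[n_test:]
    return train, test
```

Splitting on file stems rather than on full file names makes it easy to move each image together with its .txt annotation, for example with shutil.copy, so that the pairs always stay in the same folder.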

Cloning the darknet framework

The next step is to clone the darknet repository using the command:

!git clone https://github.com/AlexeyAB/darknet

After that, we need to download the weights of a pretrained model, that is, apply transfer learning rather than train the model from scratch.

!wget https://pjreddie.com/media/files/darknet53.conv.74

darknet53.conv.74 is the backbone of the YOLOv3 network, which is first trained for classification on the ImageNet dataset and acts as a feature extractor. To use it for detection, the additional weights of the YOLOv3 network are randomly initialized before training. During the training phase, of course, the network will learn proper weights.

Last step

To complete the preparation and start training the model, you need to create five files.

  1. face_mask.names: Create a _.names file containing the task classes.

In our case, the original Kaggle dataset has 3 categories: with_mask, without_mask and mask_weared_incorrect.

To make things a little easier, I’ve combined the last two categories into one. So, there are two categories, Good and Bad, based on whether someone is wearing their mask correctly:

1. Good.
2. Bad.
  2. face_mask.data: Create a _.data file that contains the information about our task that darknet will work with:

classes = 2
train = data/train.txt
valid  = data/test.txt
names = data/face_mask.names
backup = backup/

Note: If there is no backup folder, create it, because the weights are saved there every 1000 iterations. These are your checkpoints: if training is unexpectedly interrupted, you can continue training the model from them.
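The .names and .data files and the backup folder can also be created from a script, which avoids typos in the paths. The sketch below is only an illustration: the function name write_darknet_metadata is hypothetical, and the directory layout (data/ and backup/ under one root) simply mirrors the paths shown in the .data file above.

```python
from pathlib import Path


def write_darknet_metadata(root, class_names):
    """Create data/face_mask.names, data/face_mask.data and the backup folder."""
    root = Path(root)
    data_dir = root / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    # Checkpoints are written here during training, so the folder must exist.
    (root / "backup").mkdir(exist_ok=True)

    # One class name per line; the line order defines the class ids.
    (data_dir / "face_mask.names").write_text("\n".join(class_names) + "\n")
    (data_dir / "face_mask.data").write_text(
        f"classes = {len(class_names)}\n"
        "train = data/train.txt\n"
        "valid = data/test.txt\n"
        "names = data/face_mask.names\n"
        "backup = backup/\n"
    )
```

Calling write_darknet_metadata("darknet", ["Good", "Bad"]) would produce the two-class setup used in this article, assuming darknet was cloned into a folder of that name.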

  3. face_mask.cfg: This config file must be adapted to our task. Copy yolov3.cfg, rename it to _.cfg and edit it as described below:

  • set the batch line to batch = 64;

  • set the subdivisions line to subdivisions = 16. If you run out of memory, increase this value to 32 or 64;

  • set the input size to the standard values: width = 416, height = 416;

  • set the max_batches line to (#classes * 2000), which gives 4000 iterations for our problem.

I started with a resolution of 416×416 and trained my model for 4000 iterations, but to achieve better accuracy I then increased the resolution and extended the training by another 3000 iterations. If you have only one category, you should not stop training at 2000 iterations: 4000 iterations is assumed to be the minimum.

  • change the steps line to 80% and 90% of max_batches. In our case, 80/100 * 4000 = 3200 and 90/100 * 4000 = 3600;

  • press Ctrl + F and search for the word “yolo”. The search leads directly to the [yolo] layers, where you change the number of classes (in our case classes = 2) and the number of filters. The filters setting is the second setting above the [yolo] line.

    The line should become filters = (classes + 5) * 3, which in our case is filters = (2 + 5) * 3 = 21. There are 3 [yolo] layers in the .cfg file, so the changes above need to be made three times.

  4. train.txt and test.txt: These two files are referenced in face_mask.data and tell the model the absolute path of each image. For example, a snippet of my train.txt file looks like this:


As I said, the .png files must be located in the same folder as the corresponding text annotations.

This means that the project is structured like this:

      ├── annotations       (contains the original .xml files)
      ├── images            (contains the original .png images)
      ├── mask_yolo_test    (contains .png & .txt files for testing)
      ├── mask_yolo_train   (contains .png & .txt files for training)
      ├── show_bb.py
      └── xml_to_yolo.py
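Given this layout, the train.txt and test.txt files can be generated rather than written by hand. The sketch below is one possible way to do it; the function name write_image_list is hypothetical, and skipping images that have no matching .txt annotation is a design choice of this example, not something the article prescribes.

```python
from pathlib import Path


def write_image_list(image_dir, list_path):
    """Write the absolute path of every annotated .png in image_dir, one per line."""
    image_dir = Path(image_dir)
    image_paths = sorted(
        p.resolve()
        for p in image_dir.glob("*.png")
        # Keep only images that have an annotation file next to them.
        if p.with_suffix(".txt").exists()
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in image_paths))
    return len(image_paths)
```

For example, write_image_list("mask_yolo_train", "data/train.txt") would list the training images, and the returned count gives a quick check against the expected 767.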

Starting the training

After building darknet, you need to make the darknet binary executable, like this:

!chmod +x ./darknet

Then we start training the model by running the following command:

!./darknet detector train data/face_mask.data cfg/face_mask.cfg backup/face_mask_last.weights -dont_show -i 0 -map

We pass the -map flag so that important metrics such as average loss, precision, recall, average precision (AP) and mean average precision (mAP) are displayed in the console.

The mAP shown in the console is considered a better metric than the loss, so keep training the model as long as the mAP is increasing.

Training can take hours depending on various parameters; this is normal. It took me about 15 hours, but I got my first impressions of the model after about 7 hours, that is, after 4000 iterations.


The model is ready for a demonstration. Let’s try it on images that it has never seen before. To do this, run the following command:

!./darknet detector test data/face_mask.data cfg/face_mask.cfg backup/face_mask_best.weights

Did you notice that we used face_mask_best.weights and not face_mask_final.weights? Fortunately, our model keeps the best weights (mAP reached 87.16%) in the backup folder in case we train it for more epochs than we should (which might lead to overfitting).

The images below are taken from Pexels, a collection of high-resolution images, and even the naked eye can see that they differ significantly from the training and test data and thus have a different distribution. To see how well the model generalizes, I selected these photos:

In the images above, the model detected the masks accurately and is quite confident in its predictions. Notably, the mask on the globe in the image on the right did not confuse the model: this suggests that predictions are based not only on whether a mask is present, but also on the context around the mask.

The two images above obviously show that people are not wearing masks, and the model seems to be able to recognize that quite easily.

Using the two examples above, you can check the performance of the model in cases where the image contains people with and without masks. The model can identify faces even against a blurry background, and this fact is admirable.

I noticed that the model is less confident about the person in the foreground (38%, in the sharp area) than about the person immediately behind (100%, in the blurred area). This may be related to the quality of the training dataset, which seems to influence the model to some extent (at least its prediction is not wrong).

One last test

Of course, YOLO’s big advantage is its speed. So I want to show you how it works on video:

!./darknet detector demo data/face_mask.data cfg/face_mask.cfg backup/face_mask_best.weights -dont_show vid1.mp4 -i 0 -out_filename res1.avi


This was my first step-by-step tutorial on how to make your own detector using YOLOv3 on a custom dataset. I hope it was helpful to you.
