How we control the quality of models for detecting objects in images


Hello. We are Tatyana Voronova and Elvira Dyaminova, data analysts at Center 2M. Among other things, we train neural network models that detect objects in images: people, special equipment, animals.

At the start of each project, the company agrees with the customer on an acceptable recognition quality. This level must not only be reached when the project is delivered, but also maintained while the system is in operation. That means we have to monitor and retrain the system continuously. We wanted to reduce the cost of this process and get rid of the routine work, freeing up time for new projects.

Automatic retraining is not a unique idea; many companies have similar internal pipeline tools. In this article we share our experience and show that you do not need to be a huge corporation to put such practices into use successfully.

One of our projects is counting people in queues. The customer is a large company with many branches, and people gather there at predictable hours, so a large number of objects (people's heads) is detected regularly. That is why we decided to introduce automatic retraining on this task first.

This is what our plan looked like. Every step except the annotator's work is performed automatically:

  1. Once a month, all camera images for the last week are automatically selected.
  2. The image names are added to a shared xls sheet in SharePoint, and the status of each file is set to “Not Viewed” by default.
  3. The latest (currently deployed) version of the model generates markup for the images. The xml files with the markup (coordinates of the detected objects) are added to the same folder, and the total number of objects found by the model is automatically entered in the sheet. This number is needed later to monitor the quality of the model.
  4. Once a month, annotators review the marked-up files that have the “Not Viewed” status. They correct the markup and enter the number of corrections in the xls sheet (the number of deleted tags and the number of added tags are recorded separately). The status of each reviewed file changes to “Viewed”. This shows us how much the quality of our model has degraded.

    In addition, we identify the nature of the errors: whether the model mostly marks extra objects (bags, chairs) or, conversely, misses some people (for example, because of medical masks). A graph of the model quality metrics over time is displayed on a report panel.

  5. Once a month, the xls file is checked for the number of files that have the “Viewed” status and more than 0 changes. If this number is above the threshold, the model is retrained on an extended set that includes the corrected markup. If a file was already part of the training dataset, its old markup is replaced with the new one. Files taken for training get the status “Taken in training”; without this status change, the same files would be picked up again at the next retraining. Retraining starts from the checkpoint saved during the previous training. In the future, we plan to trigger retraining not only on a schedule, but also when the number of markup corrections exceeds a threshold.
  6. If the number of files in the “Viewed” status is 0, an alert is raised: for some reason the annotator is not checking the markup.
  7. If accuracy keeps falling despite retraining and the metrics drop below a threshold value, an alert is raised. This is a sign that analysts need to investigate the problem in detail.
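
The monthly check in steps 5 to 7 can be sketched roughly as follows. This is a minimal illustration rather than our production code: the `SheetRow` record, the `decide` function, the threshold parameters and the action strings are hypothetical names, and reading the SharePoint sheet and launching the training job are left out. Step 7 is simplified to a single comparison of the F-measure with a minimum value.

```python
from dataclasses import dataclass
from typing import List, Optional

# Status values mirror the spreadsheet column described in the plan.
NOT_VIEWED = "Not Viewed"
VIEWED = "Viewed"
IN_TRAINING = "Taken in training"


@dataclass
class SheetRow:
    """One spreadsheet row (hypothetical layout)."""
    filename: str
    status: str = NOT_VIEWED
    found: int = 0     # objects found by the current model
    deleted: int = 0   # tags removed by the annotator
    added: int = 0     # tags added by the annotator


def decide(rows: List[SheetRow], retrain_threshold: int,
           f_measure: Optional[float] = None,
           min_f_measure: float = 0.9) -> List[str]:
    """Return the list of actions for this month's check."""
    actions = []
    viewed = [r for r in rows if r.status == VIEWED]

    # Step 6: nothing was reviewed, so the annotator has to be reminded.
    if not viewed:
        actions.append("alert: no files were reviewed")

    # Step 5: retrain if enough reviewed files needed corrections.
    corrected = [r for r in viewed if r.deleted + r.added > 0]
    if len(corrected) > retrain_threshold:
        for r in corrected:
            r.status = IN_TRAINING  # keeps the file out of the next cycle
        actions.append(f"retrain on {len(corrected)} corrected files")

    # Step 7 (simplified): quality is too low, an analyst has to step in.
    if f_measure is not None and f_measure < min_f_measure:
        actions.append("alert: metrics below threshold")
    return actions


rows = [
    SheetRow("a.jpg", VIEWED, found=10, deleted=1, added=2),
    SheetRow("b.jpg", VIEWED, found=8),
    SheetRow("c.jpg", NOT_VIEWED, found=5),
]
print(decide(rows, retrain_threshold=0, f_measure=0.85))
```

Returning a list of actions instead of triggering them directly keeps the decision logic easy to test separately from the spreadsheet and the training infrastructure.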

As a result, this process helped us a lot. We caught an increase in type II errors (missed detections) when many heads unexpectedly became “masked”, enriched the training dataset with the new type of head in time, and updated the current model. In addition, this approach lets us account for seasonality. We constantly adjust the dataset to the current situation: in some periods people often wear hats, in others almost everyone comes in without them, and in autumn the number of people in hoods increases. The system becomes more flexible and responds to the situation.

For example, the image below shows one of the branches (on a winter day) whose frames were not represented in the training dataset:


If we calculate the metrics for this frame (TP = 25, FN = 3, FP = 0), recall is about 89%, precision is 100%, and the harmonic mean of precision and recall is about 94.3% (more on the metrics below). That is a fairly good result for a new room.
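
The arithmetic for this frame can be checked with a few lines of Python. The function name `detection_metrics` is ours for illustration; the formulas are the standard definitions of precision, recall and the F-measure. Note that the unrounded result is about 94.3%; rounding recall to 89% first gives 94.2%.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall and F-measure from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Counts for the winter frame discussed above.
m = detection_metrics(tp=25, fp=0, fn=3)
print({name: round(value * 100, 1) for name, value in m.items()})
# {'precision': 100.0, 'recall': 89.3, 'f_measure': 94.3}
```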

Our training dataset contained both hats and hoods, so the model handled them well, but once masks became mandatory it began to make mistakes. In most cases, when the head is clearly visible, there are no problems. But if a person is far from the camera, then at certain angles the model stops detecting the head (the left image is the result of the old model). Thanks to the semi-automatic markup, we were able to catch such cases and retrain the model in time (the right image is the result of the new model).


A close-up of the woman:


To test the model, we selected frames that were not used in training (a dataset with different numbers of people per frame, different angles and different sizes). To assess the quality of the model, we used recall and precision.

Recall (completeness) shows what proportion of the objects that actually belong to the positive class we predicted correctly.

Precision shows what proportion of the objects we predicted as positive actually belong to the positive class.

When the customer needed a single number that combines precision and recall, we provided their harmonic mean, the F-measure. Read more about metrics.
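
The correction counts recorded in the xls sheet are enough to estimate these metrics over a whole batch of reviewed images. The sketch below rests on assumptions that are ours, not part of the described system: every deleted tag is treated as a false positive, every added tag as a false negative, and every remaining tag found by the model as a true positive. The function name and the tuple layout are hypothetical.

```python
from typing import Iterable, Tuple


def metrics_from_corrections(
    rows: Iterable[Tuple[int, int, int]]
) -> Tuple[float, float, float]:
    """rows: (found, deleted, added) per reviewed image.

    Counts are summed over all images before the ratios are taken,
    so crowded images weigh more than nearly empty ones.
    """
    tp = fp = fn = 0
    for found, deleted, added in rows:
        tp += found - deleted  # assumed: tags that were kept are correct
        fp += deleted          # assumed: deleted tags were false positives
        fn += added            # assumed: added tags were missed objects
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Three reviewed images: (found by the model, deleted, added).
precision, recall, f_measure = metrics_from_corrections(
    [(10, 1, 2), (8, 0, 0), (12, 2, 3)]
)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

One caveat of this estimate: if a badly placed box is fixed by deleting it and drawing a new one, it is counted both as a false positive and as a false negative.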

After one cycle, we got the following results:


The recall of 80% before any changes is explained by the large number of new branches added to the system, which brought new camera angles. In addition, the season had changed; before that, the training dataset contained “autumn-winter people”.

After the first cycle, recall rose to 96.7%. For comparison, in our first article recall reached 90%. The difference is explained by the fact that there are now fewer people in the branches, they overlap much less (the season of bulky down jackets is over), and the variety of hats has decreased.

For example, this is roughly how many people used to be the norm:


And this is the situation now:


To sum up, here are the advantages of automation:

  1. Partial automation of the markup process.
  2. Timely response to new situations (widespread wearing of medical masks).
  3. Quick response to incorrect model answers (for example, a bag starting to be detected as a head).
  4. Continuous monitoring of model accuracy. When the metrics get worse, an analyst is brought in.
  5. Minimal analyst effort when updating the model. Our analysts work full-time on different projects, so we want to pull them away from their main project as rarely as possible to collect data and retrain a model on another one.

The downside is the human factor on the annotator's side: an annotator may not treat the markup responsibly enough, so overlapping markup (the same image marked up by several annotators) or golden sets are needed. Golden sets are tasks with a predetermined answer that serve only to control the quality of the markup. In many more complex tasks, the analyst must check the markup personally, and the automatic mode will not work for them.
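
A golden-set check can be as simple as comparing the number of objects marked on control images with the known answer. The sketch below is only an illustration: the function name, the dictionary layout and the `tolerance` parameter are hypothetical, and comparing counts instead of box coordinates is a deliberate simplification.

```python
from typing import Dict


def golden_set_pass_rate(expected: Dict[str, int],
                         submitted: Dict[str, int],
                         tolerance: int = 0) -> float:
    """Share of golden images where the marked object count matches.

    expected:  image name -> known number of objects (the golden answer)
    submitted: image name -> number of objects marked during review
    A golden image missing from `submitted` counts as a failure.
    """
    if not expected:
        raise ValueError("golden set is empty")
    passed = sum(
        1 for name, count in expected.items()
        if name in submitted and abs(submitted[name] - count) <= tolerance
    )
    return passed / len(expected)


golden = {"g1.jpg": 12, "g2.jpg": 7, "g3.jpg": 0, "g4.jpg": 25}
reviewed = {"g1.jpg": 12, "g2.jpg": 6, "g3.jpg": 0}
rate = golden_set_pass_rate(golden, reviewed)
print(f"pass rate: {rate:.0%}")
```

A low pass rate on the golden images is a signal to recheck the rest of that batch before it is used for retraining.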

Overall, the practice of automatic retraining has proven viable. Such automation can be seen as an additional mechanism that helps maintain good recognition quality during the further operation of the system.

Authors of the article: Tatyana Voronova (tvoronova), Elvira Dyaminova (elviraa)
