How does this work?


You can also accumulate and average the appearance descriptor per track, which makes it possible to stitch broken tracks back together (a minimal sketch of the accumulation step follows the list below):

  • perform ReID of the object whose track was lost;

  • assign it the same ID that it had before the track was lost.
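
For illustration, here is a minimal Python sketch of the accumulation step. The Track class, the descriptor dimension, and the field names are hypothetical placeholders, not our actual pipeline's structures:

```python
import numpy as np

class Track:
    """Hypothetical track record with a running mean of appearance descriptors."""

    def __init__(self, track_id: int, dim: int = 256):
        self.track_id = track_id
        self.mean_descriptor = np.zeros(dim, dtype=np.float32)
        self.num_frames = 0

    def update(self, descriptor: np.ndarray) -> None:
        # Incremental mean: no need to store every per-frame descriptor.
        self.num_frames += 1
        self.mean_descriptor += (descriptor - self.mean_descriptor) / self.num_frames
```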

Published results (for example, for BoT-SORT) show that adding ReID to a tracker's logic improves the metrics and reduces the number of ID switches. When possible, we also do similar matching against track history by descriptor similarity.

Tracking schemes can grow to alarming proportions. An example is BoT-SORT-ReID:

Ours is also intimidating and generally close to this one, except that it has no post-processing over history: we do everything in real time.

Bestshot selection

So, we have a car track. Hurray! If we tracked it for at least two seconds, we already have 50 frames (at 25 fps) that we could use for the necessary recognition. But, as before, we don't have the time or resources for 50 frames. We only have resources for K frames, where K is a configurable parameter of at least 3. This means these frames must be the best ones, i.e., bestshots.

As with tracking, the logic for selecting bestshots can be tuned to the task and domain, depending on movement speed, traffic density, vehicle overlaps, and other specifics. The simplest logic is to select vehicle crops as bestshots if they are not too close to the edge of the frame and do not intersect the boxes of other vehicles, if any (a sketch of this rule follows below). Ideally, a separate network can be trained to evaluate crop quality: it determines whether the crop is suitable for recognition.
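
As a sketch, the simplest rule from the paragraph above might look like this in Python. The margin and the zero-overlap threshold are made-up illustrative values, not our production settings:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def is_bestshot(box, other_boxes, frame_w, frame_h, margin=10):
    """Accept a crop that stays away from the frame border and
    does not intersect any other vehicle's box."""
    x1, y1, x2, y2 = box
    near_edge = (x1 < margin or y1 < margin or
                 x2 > frame_w - margin or y2 > frame_h - margin)
    overlaps = any(iou(box, other) > 0.0 for other in other_boxes)
    return not near_edge and not overlaps
```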

The illustration of our pipeline below shows examples of bad (red) and good (green) frames. As you can see, the latter are indeed more suitable for recognition: all significant parts of the cars are visible, including the license plate:

Attribute recognition

So, we have K great frames from the track. Now we want to recognize attributes from them.

The screenshot above shows the entire set of attributes we can recognize. The numbers under the attributes are the confidence with which the network assigns an object to a specific class; in effect, the probability of that class among all classes. All of these are results from our LUNA CARS product, and the screenshot itself was taken in its interface.

Almost all attributes are recognized using classifier networks. Their task is to pick the most probable variant for a given image, a vehicle crop, from a fixed set of classes (for example, from 14 types).
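
In code, this classifier step reduces to a softmax over the network's logits and an argmax over the resulting probabilities. A toy sketch; the function and argument names are placeholders:

```python
import numpy as np

def classify(logits: np.ndarray, class_names: list) -> tuple:
    """Return (label, probability) for the most probable class."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    idx = int(np.argmax(probs))
    return class_names[idx], float(probs[idx])
```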

Classification is the best-understood and most thoroughly solved task in CV, but there are still difficulties in preparing each network. If a complex attribute (say, a vehicle model) has to be recognized under arbitrary conditions, you need to collect data: web downloads, camera footage, and data from deployments. Then you have to experiment with architectures and the training pipeline and close the domain gap, i.e. the difference in recognition accuracy between domains. For example, a network may work great on pictures from the Internet but fail on real city cameras. And it is not always possible to solve this by collecting target data, much as we would like to.

Problems may arise as early as the stage of formalizing the task and its rules. For example, vehicles may be the same model but differ in body type: should they be distinguished or not? The same model may be sold by manufacturers under different names: do we really want to tell them apart? And how do we choose the set of colors for classification? We have to wrestle with these questions, but in any case we settle them at the product level.

Number of axles

I would like to talk separately about counting the number of axles. The price of travel on toll roads depends on how many axles a vehicle has, so this attribute is important. How can you recognize it?

Of course, you could simply train a classifier with the classes "one axle", "two axles", "three axles", and so on, collecting M classes. But this raises labeling questions: if the vehicle is not fully visible and, say, only one of its axles is in the frame, which class should we predict? And if axles of other cars end up in the frame, do we label those too?

The classification approach could probably give a good result. But we decided to implement counting through detection, using a fast and lightweight detector network that we run on the crop.

But the same problem arises: it can be hard to tell whether a wheel belongs to this vehicle or not. The thing is, our detector performs no segmentation and produces no mask of a specific vehicle. This means we cannot tell whether all the wheels found belong to the car we are interested in. If we simply count them, the result will be wrong.

Therefore, we decided to exploit the fact that all the axles of a car must lie on one straight line. We can compute the coordinates of the centers of all the axle boxes found on the crop and use RANSAC regression to fit the optimal line through a subset of these centers. All points not on the line we treat as other vehicles' axles and discard.

How RANSAC works. There is a set of points A, from which a number of subsets are sampled. When there are few points, we can enumerate absolutely all subsets of four or five points. For each subset we fit a line, measure the distance from every point to that line, and determine which points fit the line well (the inliers) and which do not (the outliers). Each candidate line is scored by the distances of all points to it, and the best line is the one with the largest number of center points lying close to it.

For example, in one subset we have a line that passes perfectly through three points, while the fourth point lies far from it. That point is probably an outlier, i.e. someone else's wheel, so we do not use it in the count. As the answer for the axle count, we take the number of points that fit well on the optimal line.
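
Here is a minimal RANSAC sketch for this task. It samples two points per iteration (the minimal set that defines a line; exhaustively enumerating larger subsets, as described above, follows the same idea), and the inlier distance is an arbitrary illustrative value:

```python
import random
import numpy as np

def fit_line(p, q):
    """Line through two points as normalized (a, b, c) with a*x + b*y + c = 0."""
    a, b = q[1] - p[1], p[0] - q[0]
    c = -(a * p[0] + b * p[1])
    n = np.hypot(a, b)
    return a / n, b / n, c / n

def ransac_axle_count(centers, inlier_dist=8.0, n_iters=100):
    """Fit a line through wheel-box centers and count its inliers;
    outliers are treated as other vehicles' wheels."""
    centers = np.asarray(centers, dtype=float)
    if len(centers) < 2:
        return len(centers)
    best_inliers = 0
    for _ in range(n_iters):
        i, j = random.sample(range(len(centers)), 2)
        if np.allclose(centers[i], centers[j]):
            continue  # degenerate sample, cannot define a line
        a, b, c = fit_line(centers[i], centers[j])
        dist = np.abs(a * centers[:, 0] + b * centers[:, 1] + c)
        best_inliers = max(best_inliers, int((dist < inlier_dist).sum()))
    return best_inliers
```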

Below you can see how the RANSAC method works for a large number of points:

The results of this approach for a road camera:

This method is not perfect. It may happen that vehicles sit tightly in the frame and someone else's axle lands on our line, as in the last photo above. But this can be dealt with, because we process several bestshots per track. If the car is detected correctly in the other frames and no foreign axles get into them, the errors of individual counts are averaged out.
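
A simple way to level out those individual errors is a majority vote over the per-bestshot counts. A sketch, assuming the per-frame counts are already computed:

```python
from collections import Counter

def aggregate_axle_count(per_frame_counts):
    """Majority vote: a stray wheel from a neighboring vehicle
    rarely repeats across several bestshots."""
    return Counter(per_frame_counts).most_common(1)[0][0]

print(aggregate_axle_count([3, 3, 4, 3, 2]))  # -> 3
```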

This is what the result for the track will look like:

As you now understand, the situation where all axles are clearly visible and detected is not that common. Therefore, it is important to aggregate the axle counts sensibly.

Aggregation of results

When the attribute recognition results for all bestshots are ready, they need to be aggregated. And this, again, is a creative exercise. Among all the predictions for K frames, you can pick the one with the highest network confidence. You can average the network outputs over several frames, or take the most frequent prediction. Or provide different strategies with different restrictions on K, the number of bestshots. There are many options.
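
Two of the strategies mentioned above, sketched in Python over per-frame probability vectors (the input format is an assumption for illustration, not our actual API):

```python
import numpy as np

def aggregate_max_confidence(probs_per_frame):
    """Take the single frame whose top class probability is highest."""
    best = max(probs_per_frame, key=lambda p: p.max())
    return int(best.argmax()), float(best.max())

def aggregate_mean(probs_per_frame):
    """Average class probabilities over the K bestshots, then take argmax."""
    mean = np.mean(probs_per_frame, axis=0)
    return int(mean.argmax()), float(mean.max())
```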

We recognize attributes independently, so they can be cross-checked against each other. For example, if we have determined with high confidence that the vehicle on the track is a taxi, it is definitely not a police car (thanks, captain). You can stack many such heuristics on top of these relationships.
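
An entirely made-up consistency rule of that kind might look like this; the attribute keys, labels, and threshold are hypothetical:

```python
def cross_check(attrs):
    """If we are very confident the vehicle is a taxi,
    drop a conflicting 'police' special-vehicle verdict."""
    taxi_label, taxi_conf = attrs.get("taxi", ("no", 0.0))
    special_label, _ = attrs.get("special", ("none", 0.0))
    if taxi_label == "yes" and taxi_conf > 0.9 and special_label == "police":
        attrs["special"] = ("none", 0.0)
    return attrs
```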

Re-identification

The last feature I would like to talk about, which I already mentioned above, is re-identification (ReID). This is the repeated recognition of an object that has already been in our field of view.

I wrote that ReID is used for stitching tracks. For each track, we can accumulate an average descriptor of the vehicle's appearance (a vector of length D) and use this vector to compare how similar tracks are to each other. If we lost a track and, five frames later, a new one appeared that looks like the lost one, it is logical to stitch them together and assign the same ID.
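
Building on the hypothetical Track sketch above (extended with first_frame / last_frame fields), the stitching decision could look like this; the gap and distance thresholds are illustrative, not our production values:

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - float(np.dot(u, v) /
                       (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def try_stitch(lost, new, max_gap_frames=10, max_dist=0.3):
    """A new track inherits a lost track's ID if it appears soon enough
    and its mean appearance descriptor is close enough."""
    gap = new.first_frame - lost.last_frame
    if 0 < gap <= max_gap_frames:
        if cosine_distance(lost.mean_descriptor, new.mean_descriptor) < max_dist:
            new.track_id = lost.track_id
            return True
    return False
```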

ReID can also be used more globally, for example, to search for the same vehicle across other cameras when the license plate is missing, not visible, or fake.

In the screenshot above, on the left is the query; on the right are objects from other cameras, sorted by similarity.

How do we select similar vehicles? We look at the distance between the query vehicle's descriptor and the descriptors of all vehicles in the event database: the smaller it is, the more similar the objects. We rank the results by this distance.
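
A minimal sketch of such ranking using L2 distance (whether the real system uses L2, cosine, or something else is not specified here, so this is just one plausible choice):

```python
import numpy as np

def rank_by_similarity(query_desc, gallery_descs, top_k=10):
    """Return (index, distance) pairs for the top_k most similar vehicles."""
    gallery = np.asarray(gallery_descs)
    dists = np.linalg.norm(gallery - query_desc, axis=1)
    order = np.argsort(dists)[:top_k]
    return list(zip(order.tolist(), dists[order].tolist()))
```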

The correctly retrieved objects are highlighted in green in the visualization: the same vehicles as in the query, captured from a different viewpoint. As you can see, there are also incorrect answers, the frames that are not highlighted. They show other vehicles. In the case of the white cars, it is genuinely hard to spot the differences: we cannot see their plates, and they look absolutely identical to us. But the annotators saw the plates during labeling, so the labels themselves are correct. It was we who made the mistake by claiming that the car in the query and in the output is the same.

In general, such a search can also use the explicitly recognized attributes: model, type, and color.

Instead of a conclusion

In this post, I haven't mentioned LPR (License Plate Recognition). We continue to develop this area and can now read plates from more than 60 countries in different languages. There's a lot of interesting stuff there, too: building the pipeline itself, using synthetic data for training, fighting the domain gap, selecting the optimal network architecture, and recognizing individual plates. My next post will be about that.

In the meantime, ask questions in the comments – I'll try to answer them all.
