How to mark 3D cuboids on 2D images in CVAT? Methods of geographic information systems in data tagging

The Data Light team regularly encounters non-standard tasks, and last year we started working on one of them: our project unexpectedly grew from the usual marking of LiDAR point clouds (data from special laser scanners) to writing scripts and creating non-standard solutions for CVAT.

In this article, I, Alexey Antyushenya, want to tell you how we found this unusual solution and share a method that will allow ML specialists and colleagues in the data labeling field to solve complex 3D markup problems.

Here we are talking about cuboids that outline three-dimensional objects in 2D images, so that a model can detect an object and determine its dimensions relative to surrounding objects and structures.

This type of marking provides the most complete information about the object: height, width, length, and distance to buildings, people and other vehicles. Unlike a bounding box, 3D marking minimizes object distortion and gives a more accurate idea of the object’s location.

Today, neural networks trained on such data automate the assessment of traffic violations, control traffic on the roads, reproduce AR landscapes, build routes on maps while preserving the terrain and geometries of surrounding objects, and help with unmanned parking.

The data for such marking is usually both LiDAR clouds and ordinary images/videos from outdoor surveillance cameras. In the second case, a set of frames of the same object from all sides is necessary so that you can see the dimensions of the object and place it in a cuboid.

Problem statement and first difficulties

Let’s imagine that you and I have received the original raw images and we need to make markings in order to train the neural network to recognize cars along three coordinate axes. It should have looked like this:

After getting acquainted with the technical specifications and data for marking, the stage of developing a technical solution begins, determining the methodology and principles of working with data. Here I highlighted 3 key points:

  1. Find the optimal marking method to take into account all cases of image distortion.

Most shots are taken at 270 or 360 degrees. If you do not take distortions, angles and rotations into account, there is a high risk of conveying the object's geometry incorrectly when marking. Once the cuboid is built, part of the object may fall outside the marking, or the cuboid may extend beyond the object and include a large gap between the object and the cuboid's edges. Both errors are fatal when determining dimensions.

  2. Select universal marking methods suitable for all types of vehicles and directions of movement in relation to the camera.

There are 4 main types of vehicles: cars, buses, trolleybuses and trucks. In our project we worked with cars, trucks and buses. But the technique described in this article is suitable for any objects, regardless of their size, shape and volume.

  3. Develop a working methodology for marking cars in traffic when some of the key points are blocked and you need to “complete the image” in your mind.

This may be reminiscent of solving stereometry problems at school. Some might argue that not everyone is good at working with spatial thinking.

Example of marking: we do not see the back lower right point, so we work out its correct location from the available data and the bottom line.

Indeed, not everyone can imagine a 3D figure on a plane. Therefore, when forming a team for the project, we conducted general training for data labelers. Based on the results of testing on basic markup examples, those who showed spatial thinking abilities were selected. With them, we conducted an in-depth analysis of cases and individual work on errors to submit the project with reference markings.

Here one might think that all the preparatory work has been completed: the team has been trained, the metrics have been taken, the markings are being completed. But a problem arose.

What to do if the project does not match the pilot

After the metrics were approved based on the results of the pilot project, we received new images that were very different from the original ones. They no longer showed a single car neatly centered in the frame. Instead, we got streams of cars filmed with fisheye cameras, tilted in every plane and in every variation. Of course, the classic upright CVAT cuboids were no longer suitable for us.

Example of data in the pilot

Example of data from the main project

Limitations of CVAT and other tools

The entire plan we had drawn up for working on the project became irrelevant due to one limitation that was not noticeable at first glance: in CVAT, the vertical edges of a cuboid are always exactly parallel to the sides of the image.

What we did not need, and what it should look like

The same kind of photo had appeared in the pilot, but back then we simply rotated it to the desired plane and marked it. As a single case this would be acceptable, but with a larger volume of tasks we would not be able to meet the agreed metrics.
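
To make the limitation concrete, here is a minimal sketch of how one might check whether an object still fits an upright cuboid. It is not part of CVAT or of our project scripts: the function names and the 2-degree tolerance are assumptions chosen for illustration. It measures how far a "vertical" edge of the object (for example, the rear corner of a truck body) deviates from the vertical axis of the image.

```python
import math

def tilt_from_vertical(top, bottom):
    """Angle in degrees between the edge top->bottom and the image's vertical axis.

    Points are (x, y) pixel coordinates. 0 means a perfectly vertical edge,
    90 means a horizontal one.
    """
    dx = bottom[0] - top[0]
    dy = bottom[1] - top[1]
    if dx == 0 and dy == 0:
        raise ValueError("edge has zero length")
    return math.degrees(math.atan2(abs(dx), abs(dy)))

def fits_upright_cuboid(top, bottom, tolerance_deg=2.0):
    """True if the edge is close enough to vertical (tolerance is a hypothetical value)."""
    return tilt_from_vertical(top, bottom) <= tolerance_deg

# An almost vertical edge passes, a clearly tilted one does not.
print(fits_upright_cuboid((10, 0), (11, 100)))
print(fits_upright_cuboid((0, 0), (30, 100)))
```

On fisheye frames most vehicles fail a check like this, which is why a tool that only draws cuboids with strictly vertical edges stops being usable.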

Finding a solution: what didn’t work for us?

Our first thought was to simply change the tool. Together we looked at other tools for 3D marking on 2D images, namely:

  1. SuperAnnotate

  2. Label Studio

  3. Supervisely

  4. Hasty

But thorough research showed that at that time, not a single tool on the market supported custom cuboid rotations. We had to consider other options: changing the method or the markup type.

In the first case, it was possible to split the images into parts, independently rotate each part and mark it with straight edges. This was the only solution that took into account the built-in CVAT limitations.

Source frame

Cropped and rotated areas of interest for marking from the main frame

Completing the task using this method threatened to derail the project: splitting the images increased the number of frames, and each fragment lost the context of the full frame that the model needed for training. It would also take three times as long because there would be more steps:

  1. First, it was necessary to determine the areas that needed rotation. Sometimes there are 5-6 such areas in one frame.

  2. Then the areas had to be cut out and rotated.

  3. Finally, they had to be marked.
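
The steps above also hide extra bookkeeping: every point marked on a rotated fragment has to be mapped back to the source frame. A minimal sketch of that mapping is below. It assumes each fragment is cut out at a known top-left corner and then rotated around a known center; all function and parameter names are hypothetical and not taken from our scripts.

```python
import math

def rotate_point(point, center, angle_deg):
    """Rotate a point around a center by angle_deg, in pixel coordinates."""
    angle = math.radians(angle_deg)
    dx, dy = point[0] - center[0], point[1] - center[1]
    x = center[0] + dx * math.cos(angle) - dy * math.sin(angle)
    y = center[1] + dx * math.sin(angle) + dy * math.cos(angle)
    return (x, y)

def frame_point_to_crop(point, crop_origin, crop_center, angle_deg):
    """Source-frame point -> coordinates on the cropped and rotated fragment."""
    local = (point[0] - crop_origin[0], point[1] - crop_origin[1])
    return rotate_point(local, crop_center, angle_deg)

def crop_point_to_frame(point, crop_origin, crop_center, angle_deg):
    """Point marked on the rotated fragment -> source-frame coordinates."""
    unrotated = rotate_point(point, crop_center, -angle_deg)
    return (unrotated[0] + crop_origin[0], unrotated[1] + crop_origin[1])

# Round trip: a frame point goes to the fragment and comes back unchanged
# (up to floating-point error).
on_crop = frame_point_to_crop((350.0, 420.0), (300, 400), (50, 50), 30)
print(crop_point_to_frame(on_crop, (300, 400), (50, 50), 30))
```

The sign convention for the angle does not matter as long as the same `rotate_point` is used in both directions. With 5-6 fragments per frame, each with its own origin and angle, this is one more place where errors can creep in.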

It sounded complicated and low-tech. The second method turned out to be more interesting, although it is less common.

Second attempt: coordinate lines

Another option was marking with three coordinate lines, from which the developers would later draw the 3D cuboid. This idea was suggested to us on GitHub by people who had also run into the problem of rotating 3D markup.

An example of the final marking using the coordinate method (look closely, there is marking)

Both versions, with technical specifications and markup examples, were sent to the customer's ML team for study.

Geoinformatics! What is this?

Let’s imagine that together with you we are waiting for the engineers’ answer, and we have time to figure out what these lines are and how they replace cuboids. In short, it looks like this:

Interestingly, the roots of this method go back to the science that deals with graphic visualization of spatial data and related information – geoinformatics. And simplifying a spatial object through geometric objects is part of geoprocessing. And one of the main elements of GIS (geographic information systems) is a polyline.

In our case, the logic was completely identical. We assign geotags to vehicles, that is, points and polylines, resulting in a simplified spatial object. Then we send the images to ML specialists, who restore the cuboid using Euler angles. This is how we get a visualization like the one in the example above. However, we first checked with the customer whether they were ready to take on the task of completing the cuboids with this technique after receiving an iteration of the data from us.

The client approved the GIS solution, but without enthusiasm, because it was difficult to restore cuboids from markup that did not follow a unified scheme. The next step was to achieve a stable sequence of points with the same location for all objects.
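
To show why three lines can stand in for a cuboid, here is a simplified sketch. It is not the customer's restoration procedure, which relied on Euler angles and is not described in this article. It assumes the three lines share one corner of the vehicle and run along its length, width and height, and it ignores perspective and fisheye distortion, so opposite edges are treated as parallel. All names are hypothetical.

```python
from itertools import product

def cuboid_corners(anchor, length_end, width_end, height_end):
    """Return eight 2D corners built from a shared anchor and three edge endpoints.

    Each corner is the anchor plus some subset of the three edge vectors
    (parallel-projection approximation, an assumption of this sketch).
    """
    edges = [
        (end[0] - anchor[0], end[1] - anchor[1])
        for end in (length_end, width_end, height_end)
    ]
    corners = []
    for flags in product((0, 1), repeat=3):
        x = anchor[0] + sum(f * e[0] for f, e in zip(flags, edges))
        y = anchor[1] + sum(f * e[1] for f, e in zip(flags, edges))
        corners.append((x, y))
    return corners

# Three marked lines starting at the same corner of a car.
corners = cuboid_corners((100, 200), (180, 220), (70, 230), (100, 120))
for corner in corners:
    print(corner)
```

The first corner is the anchor itself, and the last one is the corner diagonally opposite to it. The sketch also shows why the order of the points matters: if the length and width lines are swapped, the corners come out in a different order and the faces of the restored cuboid get mixed up.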

Below is a combination of elements suitable for the project, according to the customer.

At this stage, we already realized that we were going in the right direction and only one step separated us from the final technical solution – setting up the combination. With renewed vigor we began marking.

Unfortunately, the customer's point combination suited only a minority of images. The cars in the images were blocked by other objects, and with a non-standard rotation some of the marking points were no longer visible. As we found out, having labelers place more than one point by guesswork significantly reduced the quality of the markup, so we needed to look for more combinations.

The four-combination method: what is it?

Then I started trying out combinations of point locations so that they would fit all vehicles, all directions of movement, and all the ways a car can be tilted and turned towards the camera. Over time, we arrived at four stable combinations that covered every marking case we encountered. You can see each of them in the screenshots:

The customer reacted positively to the solution with four stable combinations, ran tests on visualizing cuboids, and told us they would be happy to continue working on this principle. It was a real victory. We call this approach the method of four combinations.

We successfully completed the first iteration of 10,000 lines (or “partial cuboids” as the customer called them) with four combinations using developments for geographic information systems.

A side effect of our solution was that markup became cheaper. It is much faster and easier to draw three lines than to perform a full-fledged 3D marking with selected marking tools, because the simpler the marking, the lower the price. Thanks to this, the customer freed up resources to add classifications by color and body type to the technical specifications, which greatly influenced the training of the model.

Both we and the customer himself were so inspired by the speed, quality and affordable price that we began working on even more distorted data from Fisheye cameras. We even worked with complex cases.

So, we not only helped the customer’s ML team solve a complex problem and found a new win-win method, but also significantly reduced the cost of marking. This allowed the client to give us more data and completely close their markup gap. By the way, with this solution we were able to mark 65,000 cars.

What did this teach us?

This case allowed me to draw 4 important conclusions that I would like to share with you:

  1. Never give up in finding solutions, even if it seems that everything is against you.

  2. Think bigger, expand your horizons, go beyond your niche. You never know what science will lead to the right decision. Before you give up, explore all the forums and consider different branches of science.

  3. Communicate with your customers, solve problems together. Work as a team, not just as a disinterested service salesman.

  4. Use the GIS method of 4 combinations for similar cases or find your own unique solution.
