Six degrees of freedom: 3D object detection and more

In computer vision, one usually works with two-dimensional images, and much more rarely with 3D objects. Because of this, many ML engineers feel insecure in this area: there are lots of unfamiliar words, and it is unclear where to apply old friends like Resnet and Unet. So today I would like to talk a little about 3D using the example of the problem of determining six degrees of freedom, which in some sense is a synonym for 3D object detection. I will walk through one of the relatively recent works on this topic, with a few digressions.

My name is Arseniy, I work as an ML engineer and run the Telegram channel partially unsupervised. This article is based on my video for Data Fest 2020, section “CV in industry”.

First, let’s define what six degrees of freedom (6 DoF) are. Imagine a certain rigid object in the three-dimensional world – rigid meaning unchangeable, i.e., under any transformation all its points stay at the same distance from each other. To describe its position relative to the observer, six measurements are needed: three responsible for rotations around different axes, and three more for translation along the corresponding axes. Having these six numbers, we know how the object is positioned relative to some basis (for example, the point from which the photograph was taken). This task is classic for robotics (where is the object that the robotic arm should grab?), augmented reality (where to draw a mask in MSQRD, ears in Snapchat, or sneakers in Wanna Kicks?), self-driving cars, and other domains.

We will consider the article MobilePose: Real-Time Pose Estimation for Unseen Objects with Weak Shape Supervision (Hou et al., 2020). This paper, written by authors from Google Research, offers a reliable and, importantly, fast pipeline for solving the problem, so it makes sense to take it apart piece by piece.

The pipeline consists of three main pieces:

1. The backbone is quite classic; the hourglass architecture should be familiar to anyone who has trained a Unet at least once.

2. The network heads don’t look innovative either. Detection, regression: all familiar words! However, the shape head may raise questions. Let’s set them aside for a while.

3. The post-processing can seem mysterious to the uninitiated. What is EPnP, and why does it turn 2D points into a 3D bounding box?

3D for dummies

And here we immediately need an important digression that will help answer all these questions. Let’s take a high-level look at some of the math of the 3D world. Let there be a set of 3D points `X`: a matrix of size (n, 3), where n is the number of points. As we remember, six degrees of freedom are three rotations and three translations, i.e., a rigid transformation. If we denote the rotation matrix by `R` and the translation vector by `t`, the following equation holds:

`X’ = X @ R.T + t `

`R` and `t` are exactly what we want to find in this problem: they describe how to translate and rotate our rigid object so that it ends up where we see it now.

But `X’` is still a set of 3D coordinates. So we also need a certain projection matrix `P`. This matrix characterizes how we project the object onto a two-dimensional plane, conditionally “rendering” it. It depends on the image size, focal length, and distortion, but in our problem it can be considered constant. Knowing this matrix, you can get the 2D coordinates of the points by simply multiplying it by `X’`:

`x = X’ @ P.T`

In plain terms: we are looking for how to rotate and move a certain object so that some of its points are projected exactly where they appear in the photograph. I have simplified everything to the point of indecency, so I refer everyone who wants real enlightenment to CS231a.
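The two formulas above can be sketched in a few lines of numpy. One caveat the simplified notation glosses over: after multiplying by `P`, you still have to divide by depth to get pixel coordinates. The intrinsics below are made up for illustration.

```python
import numpy as np

# A toy pinhole camera: fx = fy = 500, principal point at (320, 240).
# These numbers are invented for the sketch, not taken from the paper.
P = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def rotation_z(theta):
    """Rotation matrix about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def project(X, R, t, P):
    """Apply the rigid transform X' = X @ R.T + t, then project to pixels."""
    X_cam = X @ R.T + t             # points in camera coordinates, (n, 3)
    x_h = X_cam @ P.T               # homogeneous image coordinates, (n, 3)
    return x_h[:, :2] / x_h[:, 2:]  # perspective divide -> pixels, (n, 2)

X = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
R = rotation_z(np.pi / 6)
t = np.array([0.0, 0.0, 5.0])  # push the object in front of the camera
x = project(X, R, t, P)
print(x.shape)  # (4, 2)
```

Note that the point at the origin lands exactly on the principal point (320, 240): it sits on the optical axis after the translation.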

The subproblem of finding `R` and `t`, knowing `X`, `x`, and `P`, is called Perspective-n-Point (PnP). That is, we know what our object looks like in 3D (this is `X`), how it is projected onto the image (via `P`), and where its points are on that image (`x`). Looks like an optimization problem! There is a whole family of algorithms that solve it, some already implemented in OpenCV:

Monocular Model-Based 3D Tracking of Rigid Objects: A Survey (Lepetit et al., 2005) – a classic overview;
EPnP: An Accurate O(n) Solution to the PnP Problem (Lepetit et al., 2008) – a strong baseline;
PnP-Net: A Hybrid Perspective-n-Point Network (Sheffer and Wiesel, 2020) – for those who want to cross a snake with a hedgehog, i.e., add a bit of deep learning to PnP.

By the way, this problem can also be approached from the other side. Deep learning adepts can find many articles where a special projection layer converts 2D and 3D points into each other. Such a layer is typically trained on synthetic data, since obtaining 3D coordinates from the real world is expensive and difficult. An example of such an article

Where to get 3D points?

So, we need `X` – remember, this is a set of 3D points. Where do we get it from?

The simplest option is to assume the same object appears in all images. We take some 3D CAD model (a ready-made one, drawn from scratch, a real object scanned with a special scanner …) and use it (more precisely, some of its points) as `X`. In other words, we make the explicit assumption “there is exactly this object in the photo”. At first glance this is downright impudent, but practice shows it is enough for estimating 6 DoF.

A more complex approach is so-called parameterized models. A canonical example is the Basel Face Model. Researchers at the University of Basel scanned many faces and built a PCA model so that changing a small number of parameters generates new 3D faces. Thus, you can twist a small number of knobs (the principal components) and get quite different models.
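The PCA idea fits in a few lines: a shape is the mean plus a parameter-weighted sum of components. Everything below (vertex counts, random components) is made up; a real morphable model would load learned components instead.

```python
import numpy as np

# Toy stand-ins for a learned morphable model: in practice mean_shape
# and components come from PCA over many scanned meshes.
n_verts, n_params = 100, 5
rng = np.random.default_rng(0)
mean_shape = rng.normal(size=(n_verts, 3))
components = rng.normal(size=(n_params, n_verts * 3))

def generate(params):
    """Turn a small parameter vector into a full 3D shape (n_verts, 3)."""
    return mean_shape + (params @ components).reshape(n_verts, 3)

# Twist the five "knobs" and get a different model each time.
shape = generate(np.array([0.5, -1.0, 0.0, 0.2, 0.0]))
print(shape.shape)  # (100, 3)
```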

A parameterized model can be much simpler. For example, if we are looking for a 3D bounding box in a photo, we can use a cube as the base model and parameterize it by its length-width-height ratios.
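Such a box model needs only three numbers. A minimal sketch (the helper name `box_vertices` is mine, not from the paper):

```python
import numpy as np

def box_vertices(w, h, d):
    """The 8 vertices of an axis-aligned box centered at the origin.

    (w, h, d) are the only parameters of this trivial "parameterized
    model": the dimensions of a 3D bounding box.
    """
    return np.array([[sx * w / 2, sy * h / 2, sz * d / 2]
                     for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])

# A "shoe box": twice as tall as it is wide, and rather flat.
X = box_vertices(1.0, 2.0, 0.5)
print(X.shape)  # (8, 3)
```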

If our 3D model is parameterized, its parameters can be selected by various iterative methods, choosing the candidate with the smallest reprojection error. That is, we take some model `X`, solve PnP, get `R` and `t`, and choose the `X` for which the difference between `x` and the projection `(X @ R.T + t) @ P.T` is minimal; for inspiration, see Procrustes analysis.

True deep learning practitioners go further and learn, in one form or another, either the 3D model itself or its parameters. A good example is the famous DensePose work from Facebook Research, which popularized the approach of learning dense coordinate maps: the model predicts, for each pixel, its relative position on the 3D model. Then you can find correspondences and get some approximation of 3D coordinates for each pixel.

In the article we set out to dissect, there is a head with the mysterious name shape. Depending on the availability of ground truth data (more on this later), the authors train it to produce either a segmentation mask of the object (so-called weak supervision, to improve convergence) or a full map of the object’s coordinates.

The question may also arise: the coordinates of which points do we want to find? The answer is simple: it actually doesn’t matter. A 3D model usually consists of thousands of vertices, and we can choose any subset we like. The only more or less important criterion is that the points be approximately equidistant from each other; with unfortunate sampling, the PnP solution becomes unstable.
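One common way to get a roughly equidistant subset (my suggestion, not something the paper prescribes) is greedy farthest-point sampling: start somewhere, then repeatedly add the vertex farthest from everything chosen so far.

```python
import numpy as np

def farthest_point_sampling(verts, k, seed=0):
    """Pick k vertices that are roughly equidistant from each other."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(verts)))]
    # Distance from every vertex to the nearest chosen vertex so far.
    dists = np.linalg.norm(verts - verts[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))        # farthest from the chosen set
        chosen.append(nxt)
        dists = np.minimum(dists,
                           np.linalg.norm(verts - verts[nxt], axis=1))
    return verts[chosen]

# A dense "model": 1000 random points on a unit sphere.
pts = np.random.default_rng(1).normal(size=(1000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
sample = farthest_point_sampling(pts, 16)
print(sample.shape)  # (16, 3)
```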

Where to get 2D points?

So, we have more or less figured out the 3D object; now let’s return to the area more familiar to most CV engineers, i.e., 2D, and think about where to get the coordinates on the plane.

To obtain 2D coordinates of points (this task is usually called keypoint detection), there are two main approaches: direct regression (the last layer of the network outputs x, y for each point) and heatmap regression (the last layer outputs a heatmap with a spot near each point). The first approach can be faster, since it does not require a full encoder-decoder architecture; the second is usually more accurate and fairly easy to train (it is almost the same as segmentation).
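For the heatmap branch, the standard way to read out a coordinate (a generic sketch, not specific to this paper) is a soft-argmax: softmax over all pixels, then the probability-weighted average of pixel coordinates, which stays differentiable unlike a hard argmax.

```python
import numpy as np

def soft_argmax(heatmap):
    """Differentiable (x, y) keypoint location from a 2D heatmap."""
    h, w = heatmap.shape
    probs = np.exp(heatmap - heatmap.max())  # stable softmax over pixels
    probs /= probs.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    # Expected coordinate under the softmax distribution.
    return (probs * xs).sum(), (probs * ys).sum()

# A synthetic heatmap with a Gaussian "spot" near (x=12, y=7).
ys, xs = np.mgrid[0:32, 0:32]
hm = -((xs - 12.0) ** 2 + (ys - 7.0) ** 2) / 4.0
x, y = soft_argmax(hm)
print(round(x, 2), round(y, 2))
```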

The authors of MobilePose took neither of these paths and came up with an interesting hybrid, inspired by modern anchor-free detection architectures like CenterNet. Remember the detection and regression heads in the diagram? The detection head predicts the center of the object, and the regression head predicts where the vertices of the object are relative to this center.
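Decoding such outputs at inference time can be sketched as follows; the tensor shapes and the single-object assumption are mine, chosen for brevity rather than taken from the paper.

```python
import numpy as np

def decode_vertices(center_heatmap, offsets):
    """Decode 2D vertices from CenterNet-style outputs.

    center_heatmap: (H, W) map with a peak at the object center.
    offsets: (H, W, 2k) regressed displacements of k vertices relative
    to the center, read out at the peak location.
    """
    cy, cx = np.unravel_index(np.argmax(center_heatmap),
                              center_heatmap.shape)
    off = offsets[cy, cx].reshape(-1, 2)  # (k, 2) per-vertex offsets
    return np.array([cx, cy]) + off       # (k, 2) vertex coordinates

# Fake network outputs: peak at (y=10, x=20), 8 vertices, each
# displaced by the same (+3, -2) offset for simplicity.
H, W, k = 32, 32, 8
heat = np.zeros((H, W)); heat[10, 20] = 1.0
offs = np.zeros((H, W, 2 * k)); offs[10, 20] = np.tile([3.0, -2.0], k)
verts = decode_vertices(heat, offs)
print(verts[0])  # [23. 8.]
```

The decoded `verts` are exactly the 2D points `x` that then go into PnP.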

There is a neat trick in this approach. I wrote a lot above about how to choose a 3D model, but here everything is deliberately simplified: a plain parallelepiped, i.e., the 3D bounding box itself, is used as the model! That is, `X` is the vertices of the box, and `x` is the projections of these vertices. This means it is enough to know the aspect ratios of the box (which we get from the detector architecture itself), and no other magic is needed.

Features of data preparation

In all near-3D tasks, the data question is even more painful than usual. For segmentation or detection, labeling, though time-consuming, has many tools, the process is more or less clear and well-trodden, and there are plenty of existing datasets. With 3D data everything is more complicated, so everyone is tempted to use synthetic data and render an endless dataset. The trouble is that models trained on synthetic data without additional tricks usually show significantly worse quality than models trained on real data.

Unlike 2D object detection, it is prohibitive to manually label data for 3D detection. Due to this difficulty of collecting sufficiently large amounts of labeled training data, such approaches are typically trained on real data that are highly correlated with the test data (e.g., same camera, same object instances, similar lighting conditions). As a result, one challenge of existing approaches is generalizing to test data that are significantly different from the training set.

Synthetic data is a promising alternative for training such deep neural networks, capable of generating an almost unlimited amount of pre-labeled training data with little effort. Synthetic data comes with its own problems, however. Chief among these is the reality gap, that is, the fact that networks trained on synthetic data usually do not perform well on real data.

In this article, the authors pulled one such trick: instead of mindlessly rendering the entire world, they combined real and synthetic data. To do this, they took videos from AR applications (it’s good to be Google and have a lot of data from ARCore). These videos come with detected planes, a visual odometry estimate of the 6 DoF camera pose, and a lighting estimate. This makes it possible to render artificial objects not just anywhere, but only on flat surfaces, with adapted lighting, which significantly reduces the reality gap between synthetic and real data. Unfortunately, repeating this trick at home seems quite difficult.

Putting it all together

Hooray, we have galloped through all the key concepts of the pipeline! This should be enough for the reader to assemble, from open-source components, for example, an application that draws a mask on a face (you don’t even need to train the models yourself; there are plenty of ready-made face alignment networks).

Of course, this will only be a prototype. When bringing such an application to production, many more questions will arise, for example:

• How to achieve consistency between frames?

• What to do with non-rigid objects?

• What if the object is partially invisible?

• What if there are many objects in the frame?

This is where the real adventure begins.