How Machine Vision for Autonomous Transport Has Evolved. Yandex Report

Today I'll talk about the evolution of machine vision in autonomous vehicles: what sensors help our cars perceive the world around them and what exactly they see.

ML in autonomous transport

Let's get back to the objects around us: in order to interact with them, we first need to be aware of them. In machine learning terms, this is called the "3D detection task." To solve it, we work with three-dimensional objects, which we divide into two large groups (a minimal sketch of such a detection follows the list):

  • Agents — objects that actively participate in traffic and can take our behavior into account. For example, people, cars or motorcycles.

  • Static obstacles — objects that participate in road traffic passively and cannot interact with us of their own free will. Nevertheless, they still need to be taken into account.
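
Below is a minimal sketch of what a single output of such a detector might look like. The field names and the class list are purely illustrative, not our actual interfaces.

    from dataclasses import dataclass
    from enum import Enum


    class ObjectKind(Enum):
        AGENT = "agent"              # people, cars, motorcycles, ...
        STATIC_OBSTACLE = "static"   # curbs, fences, cones, ...


    @dataclass
    class Detection3D:
        x: float       # box center in the vehicle frame, meters
        y: float
        z: float
        length: float  # box size, meters
        width: float
        height: float
        yaw: float     # heading in the ground plane, radians
        kind: ObjectKind
        score: float   # detector confidence in [0, 1]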

3D Detection Task: Find the Agent and the Static Obstacle

The motion planning system builds a route for an autonomous car: from the starting point to the end point. To work with the route, it is convenient to interpret it as a graph.

Remember what the navigator interface looks like for drivers or pedestrians? The terrain and the route laid through it are displayed in a top-view projection. The same works for autonomous transport – planning is simplified slightly, from three dimensions down to two.

Planning itself is described in detail in one of our previous articles – "Neural networks for planning the movement of unmanned vehicles". Now let's talk about what exactly helps the car understand what is in front of it – a plane, a bird, an agent or a static obstacle – and decide how to move forward.

The first stage of evolution: cameras and the emergence of lidar

Today's set of sensors and software is the third stage of the evolution of the “vision” of our autonomous transport. And the first stage began with cameras.

Cameras

In fact, this is the simplest way to implement technical vision for an autonomous car: mount a camera on it and run, for example, segmentation.

Example of segmentation

The road surface and static obstacles such as sidewalks and curbs are recognized well this way. But, as I said earlier, for planning we need to convert all images to a top view.

To obtain a projection from above, you need a 3D representation of the surface, but we didn't have one. The simplest thing that could be obtained quickly was a homography. A homography is a geometric transformation that reprojects one plane onto another. For our purposes, we needed to reproject the camera plane onto the road plane.

Notice how nice the safety triangle is in the picture on the left.

As a result of this transformation, all objects in the picture that lie in the road plane retain their linear dimensions – take a closer look at the safety triangle in the picture above.

But the road surface is poorly approximated by a single plane: the farther an object is from the camera, the more distortion accumulates. Therefore, homography out of the box is good for recognizing objects within a radius of no more than 40 meters from the camera. This field of view is acceptable for driving at about 20 km/h – enough for an autonomous car to drive around a closed area.
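
Here is a minimal sketch of this kind of homography-based top view using OpenCV. The pixel correspondences, file name and scale are made up for illustration; in practice they come from calibrating the camera against known points on the road plane.

    import cv2
    import numpy as np

    # Four points on the road plane as seen by the front camera (pixels)...
    src = np.float32([[420, 720], [860, 720], [700, 420], [560, 420]])
    # ...and where those same points should land in the top-view image
    # (here one pixel corresponds to roughly 2 cm of road).
    dst = np.float32([[300, 900], [500, 900], [500, 100], [300, 100]])

    H = cv2.getPerspectiveTransform(src, dst)  # 3x3 homography matrix

    frame = cv2.imread("front_camera.jpg")
    top_view = cv2.warpPerspective(frame, H, (800, 1000))
    # Anything lying in the road plane keeps its proportions in top_view;
    # anything sticking up out of that plane gets smeared, and the error
    # grows quickly with distance from the camera.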

Depth from stereo pair

To improve reprojection, we decided to add depth prediction from the image and introduced a stereo pair into the system.

Stereo pair: expectations vs reality

Expectations: excellent depth map, improved reprojection of static obstacles onto the top view, 60 km/h – no problem.

Reality: None of the algorithms available at the time helped us improve the quality of reprojection. Models trained on indoor data were not ready for real weather and live cameras.
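
For reference, the textbook recipe we were counting on looks roughly like this: compute disparity on a rectified pair with OpenCV's semi-global matcher and convert it to depth. The focal length, baseline and file names are placeholders, not our real rig; the problem was not the geometry below but the quality of the disparity itself on real outdoor data.

    import cv2
    import numpy as np

    left = cv2.imread("left_rectified.png", cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("right_rectified.png", cv2.IMREAD_GRAYSCALE)

    matcher = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=128,  # must be divisible by 16
        blockSize=5,
    )
    # compute() returns fixed-point disparities scaled by 16.
    disparity = matcher.compute(left, right).astype(np.float32) / 16.0

    focal_px = 1400.0    # focal length in pixels (placeholder)
    baseline_m = 0.30    # distance between the two cameras, meters (placeholder)

    depth_m = np.zeros_like(disparity)
    valid = disparity > 0
    depth_m[valid] = focal_px * baseline_m / disparity[valid]  # Z = f * B / d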

To retrain the depth model, we decided to add a new sensor that would allow us to collect ground-truth data for depth.

Lidar

The good thing about lidar is that it can determine the distance to an object with accuracy down to a centimeter. We hoped that if we reprojected the lidar points onto the camera image, we would get ground truth for depth.
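
A sketch of that reprojection under a pinhole camera model. T_cam_from_lidar and K stand in for the extrinsic and intrinsic calibration; handling of several points falling into the same pixel is left out for brevity.

    import numpy as np

    def lidar_to_sparse_depth(points_lidar, T_cam_from_lidar, K, height, width):
        """points_lidar: (N, 3) xyz in the lidar frame; returns an HxW sparse depth map."""
        # Move the points into the camera frame (homogeneous coordinates).
        pts = np.c_[points_lidar, np.ones(len(points_lidar))] @ T_cam_from_lidar.T
        pts = pts[pts[:, 2] > 0.1, :3]          # keep points in front of the camera

        # Pinhole projection with the 3x3 intrinsic matrix K.
        uv = (K @ pts.T).T
        u = (uv[:, 0] / uv[:, 2]).astype(int)
        v = (uv[:, 1] / uv[:, 2]).astype(int)

        depth = np.zeros((height, width), dtype=np.float32)
        inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
        # Each hit pixel receives the point's camera-frame depth (its z coordinate).
        depth[v[inside], u[inside]] = pts[inside, 2]
        return depth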

And again, reality did not match expectations. Many of our CV and ML engineers believed that a fully autonomous car could be made in two years using only cameras. In the end, we were unable to significantly improve static obstacle detection using cameras alone. Lidar, on the other hand, allowed one developer to build a static obstacle detector in two weeks that was comparable in quality to the one a whole team of specialists had worked on for four months.

So, let's sum up the first stage of evolution. On the one hand, lidar is an excellent source of ground-truth depth data for cameras. On the other hand, lidar itself is an excellent source of data on static obstacles. With its help, you can close this task and move on to others.

The Second Stage of Evolution: 3D Detectors and Sensor Locations

To begin with, let me dwell in more detail on the sensors that technical vision can be built on.

The sensor suite of our autonomous transport

Our autonomous cars are equipped with many different sensors. For example, cameras are installed so that the car's field of vision reaches 360 degrees, with no blind spots.

So, in the first approximation, an autonomous car sees the world with the help of cameras. And lidars allow it to more accurately estimate the distance to objects.

Here's an example scene from our visualization – do you recognize the car driving towards our autonomous vehicle in the cloud of gray dots in the images on the left? Lidars are so good that you can tell what's going on from the point cloud alone.

3D detectors

There are cameras, there are lidars, and there is data from both. How do we combine all of this? With ML and CV models that work with 3D data.

One of the classic approaches to working with lidar clouds is the PointNet architecture. However, it cannot handle segmentation of large lidar clouds in real time.

PointNet – The standard for lidar cloud segmentation

We called our first 3D detector YaBoxnet.

YaBoxnet operation scheme

We look at the lidar cloud from the top, cut it into cubes, which we call voxels, and run that same PointNet inside each of them, extracting features from every cube. If you then look at the processed cloud from above, it can be interpreted as a picture – only instead of RGB channels there are voxels with the features that PointNet extracted.

And then you just need to launch YOLO, SSD or any other 2D detector on this pseudo-image, and – whoosh! – you get a detector that finds objects in the top view as 2D boxes. A rough sketch of the pipeline is below.
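
Here is a rough PyTorch sketch of this idea, with made-up sizes: drop points into a bird's-eye-view grid, run a tiny PointNet-like block (a shared per-point MLP plus max pooling) in each cell, and hand the resulting multi-channel "image" to any 2D detector.

    import torch
    import torch.nn as nn

    GRID = 200   # 200 x 200 bird's-eye-view cells
    CELL = 0.5   # each cell covers 0.5 m x 0.5 m
    C = 32       # feature channels per cell

    point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, C))

    def bev_pseudo_image(points):  # points: (N, 3) xyz in the vehicle frame, float32
        # Which cell does each point fall into?
        ix = ((points[:, 0] + GRID * CELL / 2) / CELL).long().clamp(0, GRID - 1)
        iy = ((points[:, 1] + GRID * CELL / 2) / CELL).long().clamp(0, GRID - 1)
        cell_id = (ix * GRID + iy).unsqueeze(1).expand(-1, C)   # (N, C)

        feats = point_mlp(points)                               # (N, C) per-point features
        bev = torch.full((GRID * GRID, C), -1e9)
        bev.scatter_reduce_(0, cell_id, feats, reduce="amax", include_self=True)
        bev[bev == -1e9] = 0.0                                  # empty cells -> zeros
        # Channels now play the role that RGB plays for a camera image.
        return bev.view(GRID, GRID, C).permute(2, 0, 1)         # (C, H, W) for a 2D detector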

YaBoxnet does not use the camera image, so it loses most of the information. A parallel path where this image is used (and which we also explored) is frustum detectors.

Frustum Detector: SSD + PointNet

How it works: we launch a standard 2D detector on the image and get a 2D rectangle with an object in it. We take the viewing frustum that this rectangle spans, run PointNet on the lidar points inside it, and extract 3D objects. A sketch of the point-selection step is below.
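
A sketch of that point-selection step, assuming the lidar points have already been transformed into the camera frame and K is the 3x3 intrinsic matrix:

    import numpy as np

    def points_in_frustum(points_cam, box_2d, K):
        """points_cam: (N, 3); box_2d: (u_min, v_min, u_max, v_max) in pixels."""
        u_min, v_min, u_max, v_max = box_2d
        z = points_cam[:, 2]
        in_front = z > 0.1
        safe_z = np.where(in_front, z, 1.0)   # avoid division by zero behind the camera

        # Project every point into the image plane.
        u = K[0, 0] * points_cam[:, 0] / safe_z + K[0, 2]
        v = K[1, 1] * points_cam[:, 1] / safe_z + K[1, 2]

        # A point belongs to the frustum iff its projection lands inside the 2D box.
        mask = in_front & (u >= u_min) & (u <= u_max) & (v >= v_min) & (v <= v_max)
        return points_cam[mask]   # this small subset is what PointNet then works on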

Unfortunately, both approaches have drawbacks:

  • YaBoxnet works well with top views and estimates the angles of objects. However, since it has no information from the picture, it can easily confuse, say, a fir tree with two branches sticking out to the left and right with a person. Or it can't tell a square bus stop from a bus of the same shape.

  • The frustum detector works better with objects that form a compact cluster of points, like a person. But it is worse at estimating angles and degrades the quality of car detection.

LaserNet

We wanted to get a more universal detector that would solve the problem of 3D detection and at the same time have the advantages of both approaches. We were inspired by one of the existing solutions – LaserNet.

Operating principle of LaserNet

LaserNet works something like this: it takes the camera image and reprojects it into the lidar's view. The camera and lidar data are concatenated, and the result is a 2D representation on which detection and segmentation are run.
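
The fusion step itself is simple once both modalities live in the same 2D grid; roughly like this (all shapes are illustrative):

    import torch

    # One row per laser beam, one column per azimuth step.
    range_view_lidar = torch.rand(1, 3, 64, 2048)  # e.g. range, intensity, height
    range_view_rgb = torch.rand(1, 3, 64, 2048)    # camera colors reprojected into the same grid

    fused = torch.cat([range_view_lidar, range_view_rgb], dim=1)  # (1, 6, 64, 2048)
    # `fused` then goes into an ordinary 2D encoder/decoder that runs
    # detection and segmentation directly in this view.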

We figured out how to improve this scheme. Let's imagine that there is an infinite sphere around the car, onto which all cameras and lidars can be projected.

Infinite Radius Sphere in Action

Let's look at the first diagram. There is a camera. And there is a ray passing through some pixel of this camera (the first red square). It hits a pixel of the infinite sphere (the second red square). In this way, we were able to transfer data from all cameras to the infinite sphere.

It's the same with lidar: each point it returns is also a ray that can be projected onto an infinite sphere (blue square).

Then comes a standard scheme: a camera encoder/decoder, into the middle of which we add lidar data to enrich the 2D detections. And since the lidar data keeps its X, Y and Z coordinates, they, together with the features at the decoder output, can be reprojected onto the top view – giving an encoder/decoder similar to YaBoxnet, but with camera features. A sketch of the sphere projection itself is below.
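
A sketch of the projection onto that sphere: a camera ray and a lidar return both reduce to a direction from the vehicle, i.e. to two angles, so both index the same panorama. The resolution here is arbitrary.

    import numpy as np

    PANO_W, PANO_H = 4096, 1024   # azimuth x elevation resolution

    def direction_to_pano_pixel(d):
        """d: (..., 3) unit direction vectors in the vehicle frame."""
        azimuth = np.arctan2(d[..., 1], d[..., 0])             # [-pi, pi]
        elevation = np.arcsin(np.clip(d[..., 2], -1.0, 1.0))   # [-pi/2, pi/2]
        u = (azimuth + np.pi) / (2 * np.pi) * (PANO_W - 1)
        v = (np.pi / 2 - elevation) / np.pi * (PANO_H - 1)
        return u.astype(int), v.astype(int)

    # A lidar return is a point: normalize it into a direction first.
    point = np.array([10.0, 4.0, 1.5])
    u, v = direction_to_pano_pixel(point / np.linalg.norm(point))

    # A camera pixel is already a ray: rotate it into the vehicle frame
    # (using the calibrated camera orientation) and call the same function.
    # This only works cleanly if all sensors sit at (almost) the same origin,
    # which is exactly the problem discussed next.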

Once again our expectations were shattered by reality.

Unfortunately, a sphere of infinite radius is a utopia if the sensors are not positioned correctly.

The places where the panorama was stitched together well are marked in the picture above with green arrows. In other areas, everything is just terrible: pay attention to the blue arrows. On the right, the sidewalk does not line up; on the left, neither does the pedestrian crossing. Such artifacts occur when the cameras are too far from each other: for reprojection into one panorama, the focal points of all cameras must be close to each other.

Since we developed our own sensor suite, we were able to position the cameras closer to the lidar to minimize reprojection errors.

Lidar and cameras

So we have moved on to the third stage of evolution. But first, let us sum up the second.

Fusing data from related modalities is not an easy task. It can be solved with machine learning and complex models, but why bother? If you develop the sensor suite yourself, you can arrange the sensors in a way that simplifies the architecture of the ML model, and therefore helps it converge and learn better.

The third stage of evolution: GorynychNet

The architecture we discussed above has significant drawbacks:

  • All cameras must be located close to the lidar.

  • In areas where cameras overlap, there is a signal from only one camera.

  • In the lidar blind spot, camera data is useless.

The next step in evolution is GorynychNet.

GorynychNet

We abandoned the single panorama in favor of several: for each lidar, we generate its own infinite sphere and reproject onto it the data from the camera closest to that lidar. This turned out to be enough: we managed to improve recognition in the near zone of the car. For example, now we can clearly see people getting into it. The pairing idea, in a nutshell, is sketched below.
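
The pairing itself can be as simple as picking, for every lidar, the camera mounted closest to it; the sensor names and positions below are made up.

    import numpy as np

    lidars = {"lidar_front": np.array([1.8, 0.0, 1.6]),
              "lidar_left": np.array([0.5, 0.9, 1.5])}
    cameras = {"cam_front": np.array([1.7, 0.0, 1.5]),
               "cam_left": np.array([0.5, 0.8, 1.4]),
               "cam_rear": np.array([-1.0, 0.0, 1.5])}

    def nearest_camera(lidar_pos):
        return min(cameras, key=lambda name: np.linalg.norm(cameras[name] - lidar_pos))

    pairs = {name: nearest_camera(pos) for name, pos in lidars.items()}
    # {'lidar_front': 'cam_front', 'lidar_left': 'cam_left'}
    # Each pair gets its own panorama, so the "origins must coincide"
    # constraint only has to hold within a pair, not across the whole car.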

Back to the roots

Lidars are great, of course. But we were haunted by the question: is it possible to get by with just cameras?

Attempt number one: we launched mono-depth on all cameras, which allowed us to densify the data out to 30 meters and even solve some problems with it. But then we got our hands on a better lidar, which solved the same problems in a much simpler way.

A year or a year and a half passed. We figured out how to densify the lidar data with a camera in the zone up to 60 meters. We solved some problems again. And again a new lidar appeared that solved them better.

It might seem that this approach just doesn't fly. But, as our head of lidar development put it, it won't happen again: it is not yet possible to increase the speed of light. So in the zone out to 120 meters, for now you can only work with cameras. And this is not the only case where you cannot do without this approach.

Another important point is safety guarantees. Let's imagine that one of the autonomous car's systems has failed. For such cases, there should be a backup that will allow the car to be parked safely.

To ensure safety, the sensors must be redundant. The architectures we discussed above work well in two cases:

  • there is data from both lidar and cameras;

  • there is at least data from the lidar (for example, you won’t get much data from cameras at night).

But everything will break if the lidar fails, which means we'll have to rely solely on cameras to ensure safety.


Let me sum it up:

  • Initially, we weren't able to do everything with just cameras.

  • Lidar has proven to be a great sensor that has accelerated our development.

  • Correct placement of sensors also helped improve technical vision.

But unfortunately, to make a truly safe product, we still had to go back to the camera-only approach.
