How to Reconstruct a 3D Scene from a Set of Images

Hello everyone! In this article we will look at a method called Atlas: what it is, its basic concepts, and how it works.

I present a method for 3D scene reconstruction (the process of building a three-dimensional model of a scene from two-dimensional images or video) that directly regresses a truncated signed distance function (TSDF) (we will cover it in the next section) from a set of RGB images with known poses (for each image the camera parameters are known, including its position and orientation in space). Typical 3D reconstruction approaches rely on intermediate depth maps before estimating the 3D scene; we argue that direct regression in 3D is more effective. A 2D CNN extracts features from each image independently; these features are then projected and accumulated in a voxel volume using the intrinsic (focal length, principal point, distortion coefficients) and extrinsic (position, orientation) camera parameters. A 3D CNN then refines the accumulated features and predicts the TSDF values.
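To make the pipeline concrete, here is a minimal PyTorch sketch of the three stages just described (2D feature extraction, accumulation in a voxel volume, 3D refinement). The class and the `backproject` callable are hypothetical placeholders, not the authors' actual code.

```python
import torch
import torch.nn as nn

class AtlasStyleSketch(nn.Module):
    """Toy version of the pipeline: 2D CNN -> voxel accumulation -> 3D CNN."""

    def __init__(self, feat_dim=32):
        super().__init__()
        # 2D backbone: per-image feature extraction (a real system would use
        # a deeper pretrained network).
        self.backbone2d = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )
        # 3D CNN: refines the accumulated features and predicts one TSDF
        # value per voxel.
        self.cnn3d = nn.Sequential(
            nn.Conv3d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat_dim, 1, 1),
        )

    def forward(self, images, intrinsics, poses, backproject):
        # `backproject` is a hypothetical callable that lifts 2D features
        # into the voxel grid using the camera matrices (see the Method section).
        feat_sum, count = None, None
        for img, K, P in zip(images, intrinsics, poses):
            f2d = self.backbone2d(img.unsqueeze(0))      # (1, C, h, w)
            vol, mask = backproject(f2d, K, P)           # (1, C, D, H, W), (1, 1, D, H, W)
            feat_sum = vol if feat_sum is None else feat_sum + vol
            count = mask if count is None else count + mask
        feat_mean = feat_sum / count.clamp(min=1)        # average over views
        return torch.tanh(self.cnn3d(feat_mean))         # TSDF in [-1, 1]
```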

TSDF

If you are already familiar with this concept, you can skip this section and move on to the main part of the article. If not, let's look at the TSDF (Truncated Signed Distance Function). Before we discuss it, we first need the SDF (Signed Distance Function): a function that gives the distance from a point to the surface of an object in three-dimensional space, with a sign indicating which side of the surface the point is on.

Let's denote:


the Euclidean metric between two points x=\left(x_1,y_1\right) and y=\left(x_2,y_2\right): d\left(x,y\right)=\sqrt{\left(x_2-x_1\right)^2+\left(y_2-y_1\right)^2}

We also define the distance from a point x to the boundary \partial\Omega of a region \Omega:

d\left(x,\partial\Omega\right)=\inf_{y\in\partial\Omega}{d\left(x,y\right)}

That is, the distance to the object is the shortest (infimum) distance from x over all points of its boundary.

Then we define the signed distance function:

f\left(x\right)=\begin{cases}d\left(x,\partial\Omega\right), & x\in\Omega\\-d\left(x,\partial\Omega\right), & x\notin\Omega\end{cases}

That is, the function assigns a positive value if the point is inside the object and a negative value if it is outside.

So far so good. The TSDF does not complicate things much: we simply truncate the signed distance (normalize it by a truncation threshold and clamp it), so that the function only takes values in the range [-1, 1].

Formally, with truncation distance t:

TSDF\left(x\right)=sgn\left(f\left(x\right)\right)\cdot\min{\left(1,\frac{\left|f\left(x\right)\right|}{t}\right)}

where sgn(x) is the sign function, equal to 1 for positive arguments, -1 for negative ones, and 0 at zero.
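As a small illustration of both definitions (a toy example, not tied to the method itself), the snippet below computes the SDF of a sphere (positive inside, negative outside, matching the sign convention above) and truncates it. Note that sgn(f)·min(1, |f|/t) is equivalent to clipping f/t to [-1, 1].

```python
import numpy as np

def sdf_sphere(points, center, r):
    """Signed distance to a sphere: positive inside, negative outside."""
    d = np.linalg.norm(points - center, axis=-1)   # Euclidean distance to the center
    return r - d                                   # > 0 inside, < 0 outside

def tsdf(sdf_values, t):
    """Truncate the SDF at distance t and normalize to [-1, 1]."""
    return np.clip(sdf_values / t, -1.0, 1.0)

pts = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.9], [0.0, 0.0, 3.0]])
s = sdf_sphere(pts, center=np.zeros(3), r=1.0)     # [ 1.0, 0.1, -2.0]
print(tsdf(s, t=0.5))                              # [ 1.0, 0.2, -1.0]
```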

Introduction

The figure shows a graphical representation of the data processing method. On the left is the **Feature Volume**, where features from images are projected and accumulated into a volume. In the middle is the **TSDF Volume**, which goes through a refinement process using a 3D CNN (Convolutional Neural Network). On the right is a **Labeled Mesh** created from a TSDF, which can also contain semantic labels. The figure illustrates the sequence of data processing from features to the final model.

In this work, we note that depth maps are usually only intermediate representations that are subsequently fused with other depth maps into a complete 3D model. We therefore propose a method that takes a sequence of RGB images and directly predicts the full 3D model in an end-to-end manner (the model is trained as a whole, from start to finish, without being split into separate stages). This allows the network to fuse more information and better learn geometric priors about the world, resulting in significantly better reconstructions. Moreover, it reduces system complexity by eliminating steps such as frame selection, and also reduces the required computational cost by amortizing it over the entire sequence.

The method builds on two main lines of work: cost-volume-based multi-view stereo (in short, estimating a depth map) and refinement of the truncated signed distance function (TSDF). Multi-view stereo constructs a plane-sweep cost volume to reconstruct 3D geometry from 2D images taken from different viewpoints. Here the reference image (the image we start from) is warped onto the target image (that is, the reference image is transformed so that it best matches the appearance of the target image) for each of a fixed set of depth planes, and the results are stacked into a 3D cost volume. The depth is computed by taking the argmin over the planes. The procedure is made more robust by extracting image features with a CNN and filtering the cost volume with another CNN before taking the argmin.
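As a rough illustration of the plane-sweep idea (not the method's own code), the sketch below warps a source image onto the reference view for a set of fronto-parallel depth planes, stacks per-pixel absolute differences into a cost volume, and takes the argmin over planes. Real systems replace raw pixels with CNN features and filter the cost volume with another CNN.

```python
import numpy as np
import cv2  # used only for homography warping

def plane_sweep_depth(ref, src, K, R, t, depths):
    """ref, src: grayscale float32 images of the same size; K: 3x3 intrinsics;
    (R, t): pose mapping reference-camera coordinates to source-camera
    coordinates (X_src = R @ X_ref + t); depths: candidate plane depths."""
    h, w = ref.shape
    n = np.array([0.0, 0.0, 1.0])                    # fronto-parallel plane normal
    cost_volume = np.empty((len(depths), h, w), dtype=np.float32)
    for idx, d in enumerate(depths):
        # Homography induced by the plane z = d (in the reference frame),
        # mapping reference pixels to source pixels.
        H = K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)
        # WARP_INVERSE_MAP: sample src at H @ (x, y, 1) for every reference pixel.
        warped = cv2.warpPerspective(src, H, (w, h),
                                     flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
        cost_volume[idx] = np.abs(ref - warped)      # photometric cost
    # Depth per pixel = plane with the lowest (optionally filtered) cost.
    return np.asarray(depths)[np.argmin(cost_volume, axis=0)]
```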

TSDF refinement begins by fusing depth maps from a depth sensor into an initial voxel volume using TSDF fusion, where each voxel stores the truncated signed distance to the nearest surface. Note that a triangulated mesh can be extracted from this implicit representation by finding the zero-crossing surface (the object boundary) with the marching cubes algorithm. TSDF refinement methods take this noisy and incomplete TSDF as input and refine it by passing it through a 3D convolutional encoder-decoder network.
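For illustration, the following snippet (independent of the method) builds a toy TSDF volume for a sphere and extracts the zero-crossing surface with scikit-image's marching cubes.

```python
import numpy as np
from skimage import measure

# Build a toy TSDF volume: signed distance to a unit sphere, truncated to [-1, 1].
grid = np.linspace(-1.5, 1.5, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
sdf = 1.0 - np.sqrt(x**2 + y**2 + z**2)        # positive inside the sphere
tsdf = np.clip(sdf / 0.3, -1.0, 1.0)           # truncation distance 0.3

# Marching cubes finds the surface where the TSDF crosses zero.
verts, faces, normals, values = measure.marching_cubes(tsdf, level=0.0)
print(verts.shape, faces.shape)                # mesh vertices and triangles
```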

Method

Our method takes as input a sequence of RGB images of arbitrary length, each with known camera intrinsics and pose. These images are passed through a 2D CNN to extract features. The features are then projected into a 3D voxel volume and accumulated using a running average. Once the image features have been fused into 3D, we regress the TSDF directly with a 3D CNN (see figure). We also experiment with adding an additional output head to predict semantic segmentation.

Schematic of the method using 2D and 3D CNNs for image sequence analysis.


– First, a 2D CNN extracts features from the images, which are then projected into a 3D volume.

– These volumes are accumulated and passed through a 3D CNN for TSDF (Truncated Signed Distance Function) regression and joint prediction of 3D semantic segmentation of the scene.
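To illustrate the joint TSDF and semantic prediction mentioned in the last point, here is a toy two-headed 3D refinement module; layer sizes and the number of classes are illustrative only, not the authors' architecture.

```python
import torch
import torch.nn as nn

class RefinementHead(nn.Module):
    """Toy 3D CNN that predicts a TSDF and semantic logits per voxel."""

    def __init__(self, feat_dim=32, num_classes=20):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv3d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        self.tsdf_head = nn.Conv3d(feat_dim, 1, 1)            # one TSDF value per voxel
        self.sem_head = nn.Conv3d(feat_dim, num_classes, 1)   # semantic logits per voxel

    def forward(self, volume):                 # volume: (B, C, D, H, W)
        x = self.trunk(volume)
        return torch.tanh(self.tsdf_head(x)), self.sem_head(x)

# Usage on a dummy accumulated feature volume:
head = RefinementHead()
tsdf, sem = head(torch.randn(1, 32, 32, 32, 32))
print(tsdf.shape, sem.shape)    # (1, 1, 32, 32, 32), (1, 20, 32, 32, 32)
```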

Constructing the Feature Volume

Let I_t\in\mathbb{R}^{3\times h\times w} be the t-th image in a sequence of RGB images, where 3 is the number of channels (red, green, blue), h is the height and w is the width. We extract features F_t=F\left(I_t\right)\in\mathbb{R}^{c\times h\times w} using a standard 2D convolutional neural network, where F is the feature extraction network and c is the feature dimension (number of channels). These 2D features are then back-projected into a 3D voxel volume using the known camera parameters. Consider the voxel volume V\in\mathbb{R}^{c\times H\times W\times D}, where D is the depth:

V_t\left(i,j,k\right)=F_t\left(\hat{i},\hat{j}\right), with \left[\hat{i},\ \hat{j},\ 1\right]^T\sim K_tP_t\left[i,\ j,\ k,\ 1\right]^T

where t is the index of the frame in the image sequence; i and j are coordinates in voxel space corresponding to the two projected image dimensions (width and height), i.e., the position of a voxel within a particular layer of the volume; k is the depth coordinate that selects the layer in the voxel volume; and \hat{i} and \hat{j} are the 2D coordinates in the image obtained by projecting the voxel. These coordinates determine which pixels of the image I_t are used to fill the voxels.

Here P_t and K_t are the extrinsic (describes where the camera is located and how it is oriented) and intrinsic (describes how the camera "sees" the world) camera matrices for image t.
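Below is a minimal NumPy sketch of this projection step, assuming P_t is a 3×4 world-to-camera matrix and using nearest-neighbour sampling for brevity; the helper name and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def backproject_features(F_t, K, P, voxel_centers):
    """F_t: (c, h, w) image features; K: (3, 3) intrinsics; P: (3, 4)
    world-to-camera extrinsics; voxel_centers: (N, 3) world coordinates.
    Returns (c, N) per-voxel features and an (N,) validity mask."""
    c, h, w = F_t.shape
    ones = np.ones((len(voxel_centers), 1))
    cam = K @ P @ np.concatenate([voxel_centers, ones], axis=1).T   # (3, N)
    z = cam[2]
    valid = z > 1e-6                                    # voxel in front of the camera
    u = np.zeros(len(voxel_centers), dtype=int)
    v = np.zeros(len(voxel_centers), dtype=int)
    u[valid] = np.round(cam[0, valid] / z[valid]).astype(int)   # î (image column)
    v[valid] = np.round(cam[1, valid] / z[valid]).astype(int)   # ĵ (image row)
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)    # projects inside the image
    out = np.zeros((c, len(voxel_centers)), dtype=F_t.dtype)
    out[:, valid] = F_t[:, v[valid], u[valid]]          # copy features into voxels
    return out, valid
```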
