Creating a 3D mesh from an image with Python
A few years ago, generating a 3D mesh from a single 2D image was a difficult task. Today, thanks to advances in deep learning, there are many monocular depth estimation models that produce an accurate depth map of an image. From this map, you can create a mesh by performing surface reconstruction.
Monocular depth estimation is the task of estimating the depth value (distance from the camera) of each pixel of a single (monocular) RGB image. The result of this estimation is a depth map, which is essentially a matrix in which each element holds the predicted depth of the corresponding pixel in the input image.
A depth map can be viewed as a set of points with coordinates along three axes. Since the map is a matrix, each of its elements has the components x and y (column and row, respectively) and z, the value of the predicted depth at the point (x, y). In 3D data processing, a list of (x, y, z) points is called a point cloud.
Point cloud. Original file from Open3D
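To make the matrix-to-points idea concrete, here is a minimal sketch that turns a tiny depth map into a list of (x, y, z) points. The 2x3 array and its values are made up for illustration, and the sketch assumes NumPy is installed.

```python
import numpy as np

# A made-up 2x3 depth map: rows are y, columns are x, values are the depth z.
depth_map = np.array([[1.0, 1.5, 2.0],
                      [1.2, 1.7, 2.2]])

height, width = depth_map.shape

# Pixel coordinate grids: xs holds the column index of every element,
# ys holds the row index.
xs, ys = np.meshgrid(np.arange(width), np.arange(height))

# Stack into an (N, 3) array of (x, y, z) points, one point per pixel.
points = np.stack([xs.ravel(), ys.ravel(), depth_map.ravel()], axis=1)

print(points.shape)  # (6, 3)
print(points[4])     # the pixel at column 1, row 1, with depth 1.7
```

Here x and y are still pixel indices. Turning them into metric coordinates requires the camera parameters, which is what step 2 of this guide deals with.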
You can start from an unstructured point cloud and obtain a mesh, that is, a 3D representation of an object made up of vertices and polygons. The most common type of mesh is the triangle mesh, which consists of a set of 3D triangles connected by shared edges or vertices. In the literature, you will find several methods for obtaining a triangle mesh from a point cloud; the most popular are alpha shapes¹, ball pivoting² and Poisson surface reconstruction³. These methods are called surface reconstruction algorithms.
Triangle mesh. Original file from Open3D
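As a rough illustration of what a triangle mesh stores, the sketch below describes a unit square with four vertices and two triangles that share an edge. The arrays are made up for illustration and only NumPy is assumed; mesh libraries typically keep a similar pair of arrays (vertex positions and vertex indices per triangle).

```python
import numpy as np

# Four vertices of a unit square lying in the z = 0 plane.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [1.0, 1.0, 0.0],
                     [0.0, 1.0, 0.0]])

# Two triangles that share the edge between vertices 0 and 2.
# Each row lists the indices of one triangle's three vertices.
triangles = np.array([[0, 1, 2],
                      [0, 2, 3]])

# Area of each triangle: half the norm of the cross product of two edges.
a = vertices[triangles[:, 0]]
b = vertices[triangles[:, 1]]
c = vertices[triangles[:, 2]]
areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)

total_area = float(areas.sum())
print(total_area)  # 1.0, the area of the unit square
```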
The procedure for creating a mesh from an image in this tutorial consists of three steps:
- Depth estimation: a monocular depth estimation model generates a depth map of the input image.
- Point cloud generation: the depth map is converted to a point cloud.
- Mesh generation: a surface reconstruction algorithm generates a mesh from the point cloud.
To follow this procedure, you will need an image. If you don't have one, download this one:
Bedroom. Image from NYU Depth V2
1. Depth estimation
The monocular depth estimation model chosen for this guide is GLPN⁴. It is available on the Hugging Face Model Hub and can be loaded with the Hugging Face Transformers library.
To do this, install the latest version of Transformers from PyPI:
pip install transformers
The code below estimates the depth of the input image:
import matplotlib
matplotlib.use('TkAgg')
from matplotlib import pyplot as plt
from PIL import Image
import torch
from transformers import GLPNFeatureExtractor, GLPNForDepthEstimation

feature_extractor = GLPNFeatureExtractor.from_pretrained("vinvino02/glpn-nyu")
model = GLPNForDepthEstimation.from_pretrained("vinvino02/glpn-nyu")

# load and resize the input image
image = Image.open("image.jpg")
new_height = 480 if image.height > 480 else image.height
new_height -= (new_height % 32)
new_width = int(new_height * image.width / image.height)
diff = new_width % 32
new_width = new_width - diff if diff < 16 else new_width + 32 - diff
new_size = (new_width, new_height)
image = image.resize(new_size)

# prepare image for the model
inputs = feature_extractor(images=image, return_tensors="pt")

# get the prediction from the model
with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# remove borders
pad = 16
output = predicted_depth.squeeze().cpu().numpy() * 1000.0
output = output[pad:-pad, pad:-pad]
image = image.crop((pad, pad, image.width - pad, image.height - pad))

# visualize the prediction: input image on the left, depth map on the right
fig, ax = plt.subplots(1, 2)
ax[0].imshow(image)
ax[0].tick_params(left=False, bottom=False, labelleft=False, labelbottom=False)
ax[1].imshow(output, cmap='plasma')
ax[1].tick_params(left=False, bottom=False, labelleft=False, labelbottom=False)
plt.tight_layout()
plt.pause(5)
To work with GLPN, the Transformers library provides two classes: GLPNFeatureExtractor, for preprocessing the input data, and the model class, GLPNForDepthEstimation.
Due to the architecture, the output size of the model is:
Output size. Image generated with CodeCogs
So the image is resized so that its height and width are multiples of 32; otherwise, the model output will be smaller than the input. This is necessary because the point cloud will be drawn using the image pixels, which requires the input image and the output depth map to have the same size.
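The resizing rule can be checked with plain arithmetic. The sketch below repeats the calculation from the depth estimation code as a standalone function; the name glpn_input_size and its parameters are illustrative only and are not part of Transformers.

```python
def glpn_input_size(width, height, max_height=480, multiple=32):
    """Return (new_width, new_height), both multiples of `multiple`.

    Mirrors the resizing arithmetic used in this guide; the function
    name and parameters are hypothetical.
    """
    # Cap the height, then round it down to a multiple of 32.
    new_height = min(height, max_height)
    new_height -= new_height % multiple
    # Scale the width to keep the aspect ratio.
    new_width = int(new_height * width / height)
    # Round the width to the nearest multiple of 32.
    diff = new_width % multiple
    if diff < multiple // 2:
        new_width -= diff
    else:
        new_width += multiple - diff
    return new_width, new_height


print(glpn_input_size(640, 480))   # (640, 480): already a valid size
print(glpn_input_size(1000, 700))  # (672, 480)
print(glpn_input_size(500, 333))   # (480, 320)
```

Note that the width is rounded to the nearest multiple while the height is rounded down, so the aspect ratio can change slightly.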
Monocular depth estimation models struggle to produce high-quality predictions near the image borders, so the output is center-cropped (output = output[pad:-pad, pad:-pad]). To keep the same dimensions, the image is center-cropped as well (the image.crop call).
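The effect of the crop on the array shapes is easy to verify. This is a small sketch with zero-filled stand-in arrays (the 480x640 size is just an example, not a requirement of the model), assuming NumPy is installed:

```python
import numpy as np

pad = 16

# Stand-ins for the model output and the resized image (values are arbitrary).
depth = np.zeros((480, 640), dtype=np.float32)
rgb = np.zeros((480, 640, 3), dtype=np.uint8)

# Drop `pad` pixels from every side of both arrays.
depth_cropped = depth[pad:-pad, pad:-pad]
rgb_cropped = rgb[pad:-pad, pad:-pad]

print(depth_cropped.shape)  # (448, 608)
print(rgb_cropped.shape)    # (448, 608, 3)
```

Both arrays lose 2 * pad pixels along each spatial axis, so they stay aligned pixel for pixel.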
Here are some predictions:
Bedroom depth prediction. Input image from NYU Depth V2
Game room depth prediction. Input image from NYU Depth V2
Office depth prediction. Input image from NYU Depth V2
2. Building a point cloud
For the 3D part, we will use Open3D⁵. It is probably the best Python library for this kind of task.
Install the latest Open3D from PyPI:
pip install open3d
The code below converts the estimated depth map into an Open3D point cloud object:
import numpy as np
import open3d as o3d

width, height = image.size

# scale the depth map to the 0-255 range and convert the image to an array
depth_image = (output * 255 / np.max(output)).astype('uint8')
image = np.array(image)

# create rgbd image
depth_o3d = o3d.geometry.Image(depth_image)
image_o3d = o3d.geometry.Image(image)
rgbd_image = o3d.geometry.RGBDImage.create_from_color_and_depth(image_o3d, depth_o3d, convert_rgb_to_intensity=False)

# camera settings
camera_intrinsic = o3d.camera.PinholeCameraIntrinsic()
camera_intrinsic.set_intrinsics(width, height, 500, 500, width/2, height/2)

# create point cloud
pcd = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd_image, camera_intrinsic)
An RGBD image is simply the combination of an RGB image and its corresponding depth image. The PinholeCameraIntrinsic class stores the so-called intrinsic matrix of the camera. With this matrix, Open3D can create a point cloud from an RGBD image with the correct spacing between points. Leave the intrinsic parameters as they are. For more information, see the additional resources at the end of the guide.
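To give an idea of what the intrinsic matrix is used for, here is a minimal sketch of the standard pinhole back-projection, written with NumPy only. It is not Open3D's implementation (Open3D additionally applies a depth scale and a depth truncation), and the helper name backproject and the 4x4 depth map are made up for illustration. The parameters fx, fy are the focal lengths and cx, cy the principal point, in the same order as in the set_intrinsics call.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Convert a depth map (H, W) into an (H*W, 3) array of 3D points."""
    height, width = depth.shape
    # u is the column index and v is the row index of each pixel.
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    z = depth
    # Pinhole model: a pixel's offset from the principal point, scaled by
    # depth over focal length, gives the metric x and y coordinates.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# A made-up 4x4 depth map where every pixel is 2 units from the camera.
depth = np.full((4, 4), 2.0)
points = backproject(depth, fx=500, fy=500, cx=2.0, cy=2.0)

print(points.shape)  # (16, 3)
print(points[0])     # top-left pixel: x and y are negative
print(points[10])    # pixel at the principal point (u=2, v=2): x = y = 0
```

Larger focal lengths pull the points closer together in x and y, which is why the intrinsics determine the spacing between points.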
To visualize the point cloud, run this line:
o3d.visualization.draw_geometries([pcd])
3. Mesh generation
Among the various methods for this task found in the literature, this guide uses the Poisson surface reconstruction algorithm³, because it usually gives better and smoother results than the others.
This code generates a mesh from the point cloud obtained in the previous step, using the Poisson algorithm:
# remove outliers
cl, ind = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=20.0)
pcd = pcd.select_by_index(ind)

# estimate normals
pcd.estimate_normals()
pcd.orient_normals_to_align_with_direction()

# surface reconstruction (the call returns the mesh and the vertex densities)
mesh = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=10, n_threads=1)[0]

# rotate the mesh
rotation = mesh.get_rotation_matrix_from_xyz((np.pi, 0, 0))
mesh.rotate(rotation, center=(0, 0, 0))

# save the mesh
o3d.io.write_triangle_mesh('./mesh.obj', mesh)
First, the code removes outliers from the point cloud. A cloud can contain noise and artifacts for various reasons. In this scenario, the model may have predicted some depths that differ too much from those of the neighboring pixels.
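The idea behind statistical outlier removal can be sketched in a few lines of NumPy. This is a brute-force illustration under my own assumptions, not Open3D's implementation: the function name statistical_outlier_mask and the toy cloud are made up, and the sketch uses a much stricter std_ratio of 2.0 than the guide's code so that the effect is visible on toy data.

```python
import numpy as np

def statistical_outlier_mask(points, nb_neighbors=20, std_ratio=2.0):
    """Return a boolean mask that is True for the points to keep."""
    # Pairwise Euclidean distances, shape (N, N). Brute force: small N only.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Sort each row; column 0 is the point itself (distance 0), so skip it.
    nearest = np.sort(dists, axis=1)[:, 1:nb_neighbors + 1]
    mean_dist = nearest.mean(axis=1)
    # Keep points whose mean neighbor distance is not far above the average.
    threshold = mean_dist.mean() + std_ratio * mean_dist.std()
    return mean_dist <= threshold

rng = np.random.default_rng(0)
cloud = rng.normal(size=(200, 3))                  # a dense blob of points
cloud = np.vstack([cloud, [[50.0, 50.0, 50.0]]])   # one obvious outlier

keep = statistical_outlier_mask(cloud, nb_neighbors=20, std_ratio=2.0)
print(keep[-1])          # False: the far-away point is rejected
print(int(keep.sum()))   # number of points that survive the filter
```

A lower std_ratio makes the filter more aggressive, while a high value such as 20.0 only removes points that are extremely isolated.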
The next step is to estimate the normals. A normal is a vector (and therefore has magnitude and direction) perpendicular to a surface or an object. The Poisson algorithm requires normals, so they must be estimated first. For more information about these vectors, see the additional resources at the end of the guide.
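A common way to estimate a normal is to fit a plane to the neighbors of a point and take the direction in which they vary least. The sketch below shows this idea with NumPy on a made-up flat patch; it illustrates the technique and is not a copy of what estimate_normals does internally, and the function name estimate_normal is hypothetical.

```python
import numpy as np

def estimate_normal(neighborhood):
    """Estimate the unit normal of a small patch of points, shape (K, 3)."""
    centered = neighborhood - neighborhood.mean(axis=0)
    # Covariance of the patch. The eigenvector with the smallest eigenvalue
    # is the direction in which the points vary least: the surface normal.
    cov = centered.T @ centered / len(neighborhood)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending eigenvalues
    return eigenvectors[:, 0]

# Made-up patch: 50 random points on the plane z = 0.
rng = np.random.default_rng(0)
patch = np.column_stack([rng.uniform(-1, 1, 50),
                         rng.uniform(-1, 1, 50),
                         np.zeros(50)])

normal = estimate_normal(patch)
# The sign is ambiguous: (0, 0, 1) and (0, 0, -1) are both valid normals.
print(np.abs(normal))  # approximately [0, 0, 1]
```

Because the sign of each estimated normal is arbitrary, the normals have to be oriented consistently afterwards, which is the job of the orient_normals_to_align_with_direction call in the guide's code.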
Finally, the algorithm is executed. The level of detail of the mesh is determined by the depth value. Besides improving mesh quality, a higher depth value also increases the size of the output.
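To see why depth has such a strong effect, recall that in Poisson reconstruction it is the maximum depth of the octree: running at depth d corresponds to solving on a grid of at most 2^d cells per axis. The short sketch below just tabulates that upper bound; the variable names are my own.

```python
# Upper bound on the reconstruction grid resolution for each depth value.
resolutions = {depth: 2 ** depth for depth in range(5, 11)}

for depth, cells_per_axis in resolutions.items():
    print(f"depth={depth}: at most {cells_per_axis} cells per axis")
```

Each extra level doubles the maximum resolution along every axis, so the mesh can capture finer detail at the cost of more triangles.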
To visualize the mesh, I suggest downloading MeshLab, because some 3D viewers cannot display colors.
Here is the final result:
Mesh from a different angle
Since the final result varies with the depth value, here is a comparison of several values:
Comparison of different depth values
- depth=5 resulted in a 375 KB mesh,
- depth=6 in 1.2 MB,
- depth=7 in 5 MB,
- depth=8 in 19 MB,
- depth=9 in 70 MB, and
- depth=10 in 86 MB.
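The growth between consecutive depth values is easier to judge as a ratio. The snippet below only does arithmetic on the file sizes reported above, assuming that 375 KB is 0.375 MB and that the depth=9 figure is in MB:

```python
# Reported mesh sizes in MB for depth = 5..10 (375 KB taken as 0.375 MB).
sizes_mb = {5: 0.375, 6: 1.2, 7: 5.0, 8: 19.0, 9: 70.0, 10: 86.0}

depths = sorted(sizes_mb)
ratios = [round(sizes_mb[d] / sizes_mb[d - 1], 2) for d in depths[1:]]

for d, ratio in zip(depths[1:], ratios):
    print(f"depth {d - 1} -> {d}: x{ratio}")
```

From depth 5 to 9 the file grows by a factor of roughly 3 to 4 per level, while the last step adds comparatively little.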
Even though only one image is used, the result is quite good. With additional 3D data processing, you can achieve even better results. This guide cannot cover all the details of 3D data processing, so I encourage you to read other resources (listed below) to better understand all the aspects involved.
Thanks for reading. I hope you found the material useful.
¹ H. Edelsbrunner and E. P. Mücke, Three-dimensional Alpha Shapes (1994)
² F. Bernardini, J. Mittleman, H. Rushmeier, C. Silva, and G. Taubin, The ball-pivoting algorithm for surface reconstruction (1999)
³ M. Kazhdan, M. Bolitho, and H. Hoppe, Poisson Surface Reconstruction (2006)
⁴ D. Kim, W. Ga, P. Ahn, D. Joo, S. Chun, and J. Kim, Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth (2022)
⁵ Q. Zhou, J. Park, and V. Koltun, Open3D: A Modern Library for 3D Data Processing (2018)
⁶ N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, Indoor Segmentation and Support Inference from RGBD Images (2012)