Creating a 3D mesh from an image with Python

A few years ago, generating a 3D mesh from a single 2D image was a difficult task. Today, thanks to advances in deep learning, many monocular depth estimation models can produce an accurate depth map from an image. With this map, you can create a mesh by performing surface reconstruction.


Monocular depth estimation is the task of estimating the depth value (distance from the camera) of each pixel in a single (monocular) RGB image. The result of this estimation is a depth map, which is essentially a matrix where each element holds the predicted depth of the corresponding pixel in the input image:

Depth Map

The elements of a depth map can be viewed as a set of points with three coordinates. Because the map is a matrix, each element has x and y components (its column and row, respectively) and a z component, the predicted depth at (x, y). In 3D data processing, a list of (x, y, z) points is called a point cloud.

Point cloud. original file
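
As a minimal sketch of this idea, the snippet below turns a depth matrix into a list of (x, y, z) points. The tiny 2x3 depth map and the helper name depth_map_to_points are made up for illustration and are not part of the guide's pipeline.

```python
# Hypothetical 2x3 depth map: rows are y, columns are x, values are depths.
depth_map = [
    [1.0, 1.5, 2.0],
    [1.2, 1.7, 2.2],
]


def depth_map_to_points(depth_map):
    """Return one (x, y, z) tuple per element of a 2D depth matrix."""
    return [
        (x, y, z)
        for y, row in enumerate(depth_map)
        for x, z in enumerate(row)
    ]


points = depth_map_to_points(depth_map)
print(points[:3])
```

Here x and y are plain pixel indices. The point cloud built in section 2 goes one step further and uses camera parameters to place the points in space.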

You can start with an unstructured point cloud and obtain a mesh, that is, a three-dimensional representation of an object made of vertices and polygons. The most common type of mesh is a triangle mesh, which consists of many 3D triangles connected by common edges or vertices. In the literature, you will find several methods for obtaining a triangle mesh from a point cloud; the most popular are alpha shapes¹, ball pivoting², and Poisson surface reconstruction³. These methods are called surface reconstruction algorithms.

Triangle mesh. Original file from Open3D
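
To make the "vertices plus triangles" description concrete, here is a small sketch of how a triangle mesh is commonly stored: a list of vertex coordinates and a list of index triples. The square and the helper edge_counts are illustrative assumptions, not Open3D structures.

```python
from collections import Counter

# Four corners of a unit square in the z = 0 plane (illustrative values).
vertices = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (1.0, 1.0, 0.0),
    (0.0, 1.0, 0.0),
]

# Each triangle is a triple of indices into `vertices`.
triangles = [
    (0, 1, 2),
    (0, 2, 3),
]


def edge_counts(triangles):
    """Count how many triangles use each undirected edge."""
    counts = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return counts


# Edges used by more than one triangle are the "common edges" of the mesh.
shared_edges = [edge for edge, n in edge_counts(triangles).items() if n > 1]
print(shared_edges)
```

The two triangles share the diagonal between vertices 0 and 2, which is what connects them into a single surface.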

The procedure for creating a mesh from an image in this tutorial consists of three steps:

  1. Depth estimation: a monocular depth estimation model generates a depth map of the input image.
  2. Point cloud generation: the depth map is converted to a point cloud.
  3. Mesh generation: a surface reconstruction algorithm generates a mesh from the point cloud.

To complete this procedure, you will need an image. If you don’t have one, download this one:

Bedroom. Image from NYU Depth V2

1. Depth estimation

The monocular depth estimation model chosen for this guide is GLPN⁴. It is available on the Hugging Face Model Hub and can be loaded with the Hugging Face Transformers library.

To do this, install the latest version of Transformers from PyPI:

pip install transformers

The code below estimates the depth of the input image:

import matplotlib
from matplotlib import pyplot as plt
from PIL import Image
import torch
from transformers import GLPNFeatureExtractor, GLPNForDepthEstimation

feature_extractor = GLPNFeatureExtractor.from_pretrained("vinvino02/glpn-nyu")
model = GLPNForDepthEstimation.from_pretrained("vinvino02/glpn-nyu")

# load and resize the input image
image = Image.open("image.jpg")
new_height = 480 if image.height > 480 else image.height
new_height -= (new_height % 32)
new_width = int(new_height * image.width / image.height)
diff = new_width % 32
new_width = new_width - diff if diff < 16 else new_width + 32 - diff
new_size = (new_width, new_height)
image = image.resize(new_size)

# prepare image for the model
inputs = feature_extractor(images=image, return_tensors="pt")

# get the prediction from the model
with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# remove borders
pad = 16
output = predicted_depth.squeeze().cpu().numpy() * 1000.0
output = output[pad:-pad, pad:-pad]
image = image.crop((pad, pad, image.width - pad, image.height - pad))

# visualize the prediction
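fig, ax = plt.subplots(1, 2)
# left: the resized and cropped input image, right: the predicted depth map
ax[0].imshow(image)
ax[0].tick_params(left=False, bottom=False, labelleft=False, labelbottom=False)
ax[1].imshow(output, cmap='plasma')
ax[1].tick_params(left=False, bottom=False, labelleft=False, labelbottom=False)
plt.tight_layout()
plt.show()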

To work with GLPN, the Transformers library provides two classes: GLPNFeatureExtractor – for preprocessing the input data, and the model class – GLPNForDepthEstimation.

Because of the architecture, the output size of the model is:

$$H_{out} = 32 \left\lfloor \frac{H_{in}}{32} \right\rfloor, \qquad W_{out} = 32 \left\lfloor \frac{W_{in}}{32} \right\rfloor$$

So the image is resized so that its height and width are multiples of 32; otherwise the model output would be smaller than the input. This is necessary because the point cloud is drawn using the image pixels, which requires the input image and the output depth map to have the same size.
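
The resizing arithmetic from the code above can be isolated into a small helper to see what it does for different inputs. This is only a sketch that mirrors the guide's own logic; the function name size_for_glpn and its parameters are hypothetical.

```python
def size_for_glpn(width, height, max_height=480, multiple=32):
    """Return (new_width, new_height), both multiples of `multiple`.

    Assumes the image is at least `multiple` pixels high.
    """
    new_height = min(height, max_height)
    new_height -= new_height % multiple           # round the height down
    new_width = int(new_height * width / height)  # keep the aspect ratio
    diff = new_width % multiple
    # round the width to the nearest multiple
    if diff < multiple // 2:
        new_width -= diff
    else:
        new_width += multiple - diff
    return new_width, new_height


for size in [(640, 480), (1920, 1080), (500, 300)]:
    print(size, "->", size_for_glpn(*size))
```

Note that the height is always rounded down while the width is rounded to the nearest multiple, so the aspect ratio can change slightly.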

Monocular depth estimation models struggle to produce high-quality predictions near the image borders, so the output is center-cropped (the line output = output[pad:-pad, pad:-pad]). To keep the same dimensions, the image is center-cropped as well (the image.crop(...) line).
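
The center crop itself is simple to reproduce on a plain 2D list. The sketch below uses a made-up 6x6 grid and a hypothetical helper center_crop to show which elements survive.

```python
def center_crop(grid, pad):
    """Remove `pad` rows and columns from every side of a 2D list."""
    height, width = len(grid), len(grid[0])
    # Explicit end indices keep pad=0 working; grid[0:-0] would be empty.
    return [row[pad:width - pad] for row in grid[pad:height - pad]]


# Hypothetical 6x6 "depth map" whose values encode (row, column).
grid = [[10 * r + c for c in range(6)] for r in range(6)]
cropped = center_crop(grid, pad=2)
print(cropped)
```

With pad = 16, as in the guide, both the depth map and the image lose 32 pixels in each dimension.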

Here are some predictions:

Bedroom depth prediction. Input image from NYU Depth V2

Game room depth prediction. Input image from NYU Depth V2

Office depth prediction. Input image from NYU Depth V2

2. Building a point cloud

The 3D processing part uses Open3D⁵, which is probably the best Python library for this kind of task.

Install the latest version of Open3D from PyPI:

pip install open3d

The code below converts the estimated depth map into an Open3D point cloud object:

import numpy as np
import open3d as o3d

width, height = image.size

depth_image = (output * 255 / np.max(output)).astype('uint8')
image = np.array(image)

# create rgbd image
depth_o3d = o3d.geometry.Image(depth_image)
image_o3d = o3d.geometry.Image(image)
rgbd_image = o3d.geometry.RGBDImage.create_from_color_and_depth(image_o3d, depth_o3d, convert_rgb_to_intensity=False)

# camera settings
camera_intrinsic = o3d.camera.PinholeCameraIntrinsic()
camera_intrinsic.set_intrinsics(width, height, 500, 500, width/2, height/2)

# create point cloud
pcd = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd_image, camera_intrinsic)

An RGBD image is simply the combination of an RGB image and its corresponding depth image. The PinholeCameraIntrinsic class stores the so-called intrinsic matrix of the camera. With this matrix, Open3D can create a point cloud from an RGBD image with the correct spacing between points. Leave the intrinsic parameters as they are. For more information, see the additional resources at the end of the guide.
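
To give a feel for what the camera matrix does, here is a minimal sketch of the standard pinhole back-projection, which maps a pixel and its depth to a 3D point. It is written in plain Python under the usual pinhole-model assumptions and is not Open3D's actual implementation; the helper backproject and the 640x480 image size are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with depth `depth` to a 3D point in camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Same focal lengths and principal point rule as in the code above,
# applied to a hypothetical 640x480 image.
width, height = 640, 480
fx = fy = 500.0
cx, cy = width / 2, height / 2

# The 3x3 intrinsic matrix defined by these four numbers.
K = [
    [fx, 0.0, cx],
    [0.0, fy, cy],
    [0.0, 0.0, 1.0],
]

center = backproject(320, 240, 2.0, fx, fy, cx, cy)
corner = backproject(0, 0, 2.0, fx, fy, cx, cy)
print(center, corner)
```

The pixel at the principal point lands on the optical axis, while pixels farther from it spread out in proportion to their depth. This is how the spacing between points is derived.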

To visualize the point cloud, run this line:

o3d.visualization.draw_geometries([pcd])


3. Mesh generation

Among the various methods for this task that you will find in the literature, this guide uses the Poisson surface reconstruction algorithm³, because it usually gives better and smoother results than the others.

The code below uses the Poisson algorithm to generate a mesh from the point cloud obtained in the previous step:

# outliers removal
cl, ind = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=20.0)
pcd = pcd.select_by_index(ind)

# estimate normals and orient them consistently
pcd.estimate_normals()
pcd.orient_normals_to_align_with_direction()

# surface reconstruction
mesh = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=10, n_threads=1)[0]

# rotate the mesh
rotation = mesh.get_rotation_matrix_from_xyz((np.pi, 0, 0))
mesh.rotate(rotation, center=(0, 0, 0))

# save the mesh
o3d.io.write_triangle_mesh('./mesh.obj', mesh)

First, the code removes outliers from the point cloud. A cloud can contain noise and artifacts for various reasons. In this scenario, the model could predict some depths that differ too much from neighboring depths.
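
The idea behind statistical outlier removal can be sketched in a few lines: compute each point's mean distance to its nearest neighbors and drop the points whose value is far above the average. The brute-force version below is an illustration under that assumption, not Open3D's implementation; the function name and the toy data are made up.

```python
import math
import statistics


def remove_statistical_outliers(points, nb_neighbors, std_ratio):
    """Return indices of points whose mean k-NN distance is not unusually large."""
    mean_dists = []
    for i, p in enumerate(points):
        # Brute force: distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_dists.append(sum(dists[:nb_neighbors]) / nb_neighbors)
    threshold = (statistics.mean(mean_dists)
                 + std_ratio * statistics.stdev(mean_dists))
    return [i for i, d in enumerate(mean_dists) if d <= threshold]


# Ten points on a line plus one far-away point (made-up data).
points = [(float(i), 0.0, 0.0) for i in range(10)] + [(100.0, 0.0, 0.0)]
kept = remove_statistical_outliers(points, nb_neighbors=2, std_ratio=1.0)
print(kept)
```

A larger std_ratio makes the filter more lenient. In this toy example, std_ratio=20.0 (the value used in the guide) keeps all eleven points, so only extreme outliers are removed with that setting.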

The next step is to estimate the normals. A normal is a vector (and therefore has a magnitude and a direction) perpendicular to a surface or object. The Poisson algorithm requires the normals of the point cloud, so they must be estimated first. For more information about these vectors, see the additional resources at the end of the guide.
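
For three points on a surface, a normal can be computed with a cross product. The sketch below only illustrates what a normal is. Normal estimation on a point cloud typically fits a local plane to each point's neighbors instead, and the helper triangle_normal is hypothetical.

```python
import math


def triangle_normal(p0, p1, p2):
    """Unit normal of the plane through three points (right-hand rule)."""
    ux, uy, uz = (p1[i] - p0[i] for i in range(3))
    vx, vy, vz = (p2[i] - p0[i] for i in range(3))
    # Cross product u x v is perpendicular to both edge vectors.
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)


normal = triangle_normal((0, 0, 0), (1, 0, 0), (0, 1, 0))
flipped = triangle_normal((0, 0, 0), (0, 1, 0), (1, 0, 0))
print(normal, flipped)
```

Swapping two points flips the sign of the result. The same ambiguity exists for estimated point cloud normals, which is why they usually need to be oriented consistently before surface reconstruction.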

Finally, the algorithm is executed. The level of detail of the mesh is determined by the depth parameter. A higher depth value improves mesh quality but also increases the size of the output.
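
As I understand the Open3D documentation, depth is the maximum depth of the octree used for the reconstruction, and running at depth d corresponds to solving on a grid of at most 2^d cells per axis. The short sketch below (with a hypothetical helper name) shows how quickly that bound grows:

```python
def max_grid_resolution(depth):
    """Upper bound on the number of cells per axis for a given octree depth."""
    return 2 ** depth


for depth in range(5, 11):
    side = max_grid_resolution(depth)
    print(f"depth={depth}: at most {side} x {side} x {side} cells")
```

Each extra level doubles the resolution along every axis, which helps explain why the output grows so quickly with depth.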

To visualize the mesh, I advise you to download MeshLab, because some 3D viewers cannot display colors.

Here is the final result:

Generated Mesh

Mesh from a different angle

Since the final result varies with the depth value, here is a comparison of several values:

Comparison of different depth values

With depth=5 the algorithm produced a 375 KB mesh, with depth=6 a 1.2 MB mesh, with depth=7 5 MB, with depth=8 19 MB, with depth=9 70 MB, and with depth=10 86 MB.


Despite using only a single image, the result is quite good. With some additional 3D editing, you can achieve even better results. This guide cannot fully cover all the details of 3D data processing, so I encourage you to read other resources (listed below) to better understand all aspects.

Additional resources:

Thanks for reading. I hope you found the material useful.


[1] H. Edelsbrunner and E. P. Mücke, Three-dimensional Alpha Shapes (1994)

[2] F. Bernardini, J. Mittleman, H. Rushmeier, C. Silva, and G. Taubin, The Ball-Pivoting Algorithm for Surface Reconstruction (1999)

[3] M. Kazhdan, M. Bolitho, and H. Hoppe, Poisson Surface Reconstruction (2006)

[4] D. Kim, W. Ga, P. Ahn, D. Joo, S. Chun, and J. Kim, Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth (2022)

[5] Q. Zhou, J. Park, and V. Koltun, Open3D: A Modern Library for 3D Data Processing (2018)

[6] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, Indoor Segmentation and Support Inference from RGBD Images (2012)
