Convert Minecraft worlds into 3D photorealistic scenes using neural networks

We present GANcraft, an unsupervised neural rendering framework for generating photorealistic images of large 3D block worlds, such as those created in Minecraft. Our system takes a semantic block world as input, in which each block is assigned a label such as “dirt”, “tree”, “grass”, “sand” or “water”.

We represent the world as a continuous volumetric function and train our model to render photorealistic images that stay consistent across arbitrary viewpoints, without paired real images of the block world.

Besides the camera position, GANcraft allows the user to define the semantics and style of the scene.

Output from our model. In the lower left corner, block worlds are shown as input.

Short description video

Technology overview

What specific problem is GANcraft trying to solve?

GANcraft addresses the problem of world-to-world translation. Given a semantically labeled world, such as those found in the popular game Minecraft, GANcraft transforms it into a new world that has the same structure but adds photorealism. The new world can then be rendered from arbitrary viewpoints to produce photorealistic images and videos that stay consistent as the viewpoint changes. GANcraft simplifies the 3D modeling of complex landscape scenes, which would otherwise take years of practice.

Basically, GANcraft turns any Minecraft player into a 3D artist!

Why not just use im2im translation?

Since there are no ground truth photorealistic renderings of a user-created block world, we have to train the models with indirect supervision.

Some existing approaches are strong candidates. For example, image-to-image (im2im) translation methods such as MUNIT and SPADE, originally trained only on 2D data, can convert segmentation masks projected from the block world into realistic-looking images frame by frame.

You can also use wc-vid2vid, a 3D-aware technique that generates view-consistent images through 2D inpainting and 3D warping, using the voxel surfaces as the 3D geometry.

These models have to be trained to translate real segmentation maps into real images, and are then applied to Minecraft.

Another alternative is to train NeRF-W, which learns a 3D radiance field from a collection of images that is posed and 3D consistent but not photometrically consistent. It can be trained on images predicted by an im2im method (pseudo-ground truth, discussed in the next section), which is the closest available data to what it requires.

From left to right: MUNIT, SPADE, wc-vid2vid, NSVF-W (NSVF + NeRF-W), GANcraft (our method)

Comparing the results of the different methods, we can immediately spot several problems:

  • Im2im techniques such as MUNIT and SPADE are not view-consistent, because they have no knowledge of the 3D geometry and generate each frame independently.
  • wc-vid2vid produces view-consistent video, but the image quality deteriorates quickly over time because errors caused by the blocky geometry accumulate and the test data differs from the training data.
  • NSVF-W (our implementation of NeRF-W with added NSVF-style voxel conditioning) also produces view-consistent results, but they look dull and lack fine detail.

The last column shows GANcraft results, which are both view-consistent and of high quality. Neural rendering ensures view consistency, while innovations in the model architecture and the training scheme lead to unprecedented photorealism.

Distribution mismatch and pseudo-ground truth

Let’s say we have a suitable voxel-conditional neural rendering model capable of representing a photorealistic world. We still need some way to train it without posed ground truth images. Adversarial training has had some success on small-scale, unconditional neural rendering tasks when posed images are not available. For GANcraft, however, the problem is even harder. Unlike the real world, Minecraft block worlds usually have a much more varied distribution of labels. For example, some scenes may be completely covered in snow, sand, or water. There are also scenes in which many biomes meet within a small area. In addition, when viewpoints are sampled randomly for the neural rendering model, it is impossible to match the sampled camera distribution to the camera distribution of photographs from the Internet.

Examples of images generated without pseudo-ground truth:

Examples of images generated with pseudo-ground truth:

As the first set of examples shows, adversarial training on photographs from the Internet alone leads to unrealistic results because of the complexity of the task.

Producing and using pseudo-ground truth for training is one of the main contributions of our work and significantly improves the results (second set of examples).

Generating pseudo-ground truth

Pseudo-ground truth images are photorealistic images generated from segmentation masks using a pretrained SPADE model. Since the segmentation masks are sampled from the block world, the pseudo-ground truth has the same labels and camera poses as the images generated for the same viewpoints. This not only reduces the mismatch in the label and camera distributions, but also allows us to use stronger losses, such as the perceptual loss and the L2 loss, for faster and more stable training.
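Because the generated image and its pseudo-ground truth share the same viewpoint, they can be compared directly. The following is a minimal sketch of such a combined loss, not the authors' implementation: the function names, the loss weights and the `toy_features` extractor (2 × 2 average pooling standing in for a pretrained network such as VGG) are all assumptions made for illustration.

```python
import numpy as np

def l2_loss(generated, pseudo_gt):
    """Mean squared pixel difference between two images of shape (H, W, C)."""
    return float(np.mean((generated - pseudo_gt) ** 2))

def perceptual_loss(generated, pseudo_gt, feature_fn):
    """Mean absolute difference between feature maps of the two images."""
    return float(np.mean(np.abs(feature_fn(generated) - feature_fn(pseudo_gt))))

def reconstruction_loss(generated, pseudo_gt, feature_fn, w_l2=1.0, w_perc=1.0):
    """Weighted sum of both terms; the weights are illustrative only."""
    return (w_l2 * l2_loss(generated, pseudo_gt)
            + w_perc * perceptual_loss(generated, pseudo_gt, feature_fn))

def toy_features(image):
    """Hypothetical stand-in for a pretrained network: 2x2 average pooling."""
    h, w = image.shape[:2]
    cropped = image[:h - h % 2, :w - w % 2]
    return cropped.reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

generated = np.ones((4, 4, 3))
pseudo_gt = np.zeros((4, 4, 3))
loss = reconstruction_loss(generated, pseudo_gt, toy_features)
```

In a real training setup the feature extractor would be a fixed pretrained network, and these terms would be used alongside the adversarial loss rather than instead of it.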

Hybrid voxel-conditional neural rendering

In GANcraft, we represent the photorealistic scene by combining a 3D volumetric renderer with a 2D image-space renderer. We define a neural radiance field bounded by voxels: given a block world, we assign a learnable feature vector to every corner of the blocks and use trilinear interpolation to define the location code at arbitrary positions inside a voxel. The radiance field can then be defined implicitly with an MLP, which takes the location code, the semantic label and a shared style code as input and outputs a point feature (similar to radiance) and its volume density. Given the camera parameters, we render the radiance field to obtain a 2D feature map, which a CNN converts into an image.
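The trilinear interpolation step can be sketched as follows. This is an illustrative example for a single voxel, not the GANcraft code: the function name, the `(2, 2, 2, C)` array layout and the use of local coordinates in [0, 1] are assumptions.

```python
import numpy as np

def trilinear_location_code(corner_features, local_xyz):
    """Interpolate the eight corner feature vectors of one voxel.

    corner_features: array of shape (2, 2, 2, C), indexed [ix, iy, iz].
    local_xyz: position inside the voxel, each coordinate in [0, 1].
    Returns the location code, an array of shape (C,).
    """
    x, y, z = local_xyz
    c = np.asarray(corner_features, dtype=float)
    cx = c[0] * (1.0 - x) + c[1] * x         # blend along x -> (2, 2, C)
    cxy = cx[0] * (1.0 - y) + cx[1] * y      # blend along y -> (2, C)
    return cxy[0] * (1.0 - z) + cxy[1] * z   # blend along z -> (C,)

rng = np.random.default_rng(0)
corners = rng.normal(size=(2, 2, 2, 4))      # random stand-ins for learned features
code_center = trilinear_location_code(corners, (0.5, 0.5, 0.5))
```

At a voxel corner the location code equals that corner's feature vector, and at the voxel center it is the average of all eight, so the code varies continuously across the shared faces of neighboring voxels.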

Complete GANcraft architecture

This two-stage architecture dramatically improves image quality while reducing computation and memory footprint, because the radiance field can be modeled with a simpler MLP. Our proposed architecture is capable of handling very large worlds. In our experiments, we used 512 × 512 × 256 voxel grids, which is equivalent to 0.26 square kilometers.
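The area figure can be checked with a short calculation. It assumes that each voxel covers 1 m × 1 m of ground (the nominal size of a Minecraft block) and that the two 512 dimensions are the horizontal ones; neither assumption is stated explicitly in the text.

```python
# Assumption: one voxel covers 1 m x 1 m of ground, like a nominal Minecraft block.
VOXEL_SIZE_M = 1.0

# Assumption: the two 512 dimensions are horizontal, and 256 is the height.
grid_x, grid_y, grid_height = 512, 512, 256

area_m2 = (grid_x * VOXEL_SIZE_M) * (grid_y * VOXEL_SIZE_M)
area_km2 = area_m2 / 1_000_000

print(f"{area_km2:.2f} km^2")  # prints 0.26 km^2
```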

Neural sky dome

Previous voxel-based neural rendering techniques cannot model the sky, which is infinitely far away from the scene. However, the sky is an essential ingredient of photorealism. To add it, GANcraft uses an additional MLP that converts the direction of a camera ray into a feature vector with the same dimension as the point features from the radiance field. This feature vector serves as a fully opaque final sample on the ray and is blended into the pixel feature according to the ray’s remaining transmittance.
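The blending of the sky feature can be illustrated with the standard volume rendering quadrature used in NeRF-style methods. This is a sketch under that assumption, not the GANcraft code; the function name and argument layout are hypothetical, and the sky feature is passed in directly instead of being produced by an MLP.

```python
import numpy as np

def composite_ray_with_sky(point_features, densities, deltas, sky_feature):
    """Accumulate features along one ray and append the sky as the last sample.

    point_features: (N, C) features of the N samples along the ray.
    densities:      (N,) non-negative volume densities.
    deltas:         (N,) distances between consecutive samples.
    sky_feature:    (C,) feature for this ray direction.
    """
    alphas = 1.0 - np.exp(-densities * deltas)        # opacity of each sample
    # Transmittance reaching each sample, plus what is left after the last one.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)])
    weights = trans[:-1] * alphas                     # contribution of each sample
    remaining = trans[-1]                             # light that reaches the sky
    return weights @ point_features + remaining * sky_feature

features = np.array([[1.0, 0.0], [0.0, 1.0]])
sky = np.array([5.0, 5.0])
empty_ray = composite_ray_with_sky(features, np.zeros(2), np.ones(2), sky)
blocked_ray = composite_ray_with_sky(features, np.full(2, 1e3), np.ones(2), sky)
```

A ray that passes only through empty space returns the sky feature unchanged, while a ray blocked by a dense first sample returns that sample's feature and none of the sky.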

Generating images with diverse appearances

The GANcraft generation process is conditioned on a style image. During training, we use the pseudo-ground truth as the style image, which helps explain away the inconsistency between the generated image and its pseudo-ground truth in the reconstruction loss. At inference time, we can control the style of the output by feeding GANcraft different style images. In the example below, we linearly interpolate the style between six different style images.

Interpolation between multiple styles
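Linear interpolation between several styles can be sketched as a piecewise-linear blend. The example assumes that each style image has already been encoded into a style code vector and that the blend happens on those codes; the function name and the parameter `t` are hypothetical.

```python
import numpy as np

def interpolate_styles(style_codes, t):
    """Piecewise-linear blend through a sequence of style codes.

    style_codes: (K, D) array with K >= 2 style codes.
    t: position along the sequence, from 0 (first code) to K - 1 (last code).
    """
    style_codes = np.asarray(style_codes, dtype=float)
    k = len(style_codes)
    t = float(np.clip(t, 0.0, k - 1))
    i = min(int(np.floor(t)), k - 2)   # index of the segment's first code
    w = t - i                          # blend weight inside the segment
    return (1.0 - w) * style_codes[i] + w * style_codes[i + 1]

codes = np.eye(6)                      # six toy style codes, one per style image
halfway = interpolate_styles(codes, 2.5)
```

Sweeping `t` from 0 to 5 while rendering a fixed viewpoint would move the output through all six styles in turn.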

Other results


  • GANcraft is a powerful tool for transforming semantic block worlds into photorealistic worlds without the need for ground truth data.
  • Existing methods perform poorly because they lack view consistency or photorealism.
  • GANcraft performs well on the difficult task of world-to-world translation, where ground truth is not available and the gap between the Minecraft world and photos from the Internet is large.
  • We have presented a new training scheme that uses pseudo-ground truth. It significantly improves the quality of the results.
  • We have introduced a hybrid neural rendering pipeline that can efficiently render large and complex scenes.
  • We can control the appearance of the GANcraft results using style images.

