How to build an AI that finds diamonds in Minecraft
Reinforcement Learning and Behavior Cloning with MineRL

In a randomly generated Minecraft world, let's find diamonds with the help of AI. How will a reinforcement-learning agent perform in one of the game's toughest challenges?
Minecraft is a massive game with many mechanics and complex action sequences; an entire 8,000-page encyclopedia was written just to teach people how to play it.

The discussion is not limited to Minecraft: the same approach can be applied to other, similarly complex environments. Specifically, we will implement two different methods that will form the basis of our intelligent agent.
But before training an agent, we need to understand how to interact with the environment. Let's start with a scripted bot to get acquainted with the syntax. We will work with MineRL, an excellent library for building AI applications in Minecraft.
The code from this article is available on Google Colab. It is a simplified and streamlined version of the excellent notebooks written by the organizers of the MineRL 2021 competition (MIT license).
I. Scripted bot
MineRL allows you to run Minecraft in Python and interact with the game. The interaction is implemented through the popular gym library:
import gym
import minerl  # importing minerl registers the MineRL environments with gym

env = gym.make('MineRLObtainDiamond-v0')
env.seed(21)

We are standing in front of a tree. As you can see, the resolution is quite low. Low resolution means fewer pixels to process, which speeds things up. Luckily for us, neural networks don't need 4K resolution to understand what's going on.
We want to interact with the game. So what can our agent do? MineRL exposes every possible action as an entry in a dictionary.
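One quick way to see the full list is to print a no-op action, which contains every action key with its default value (a small sketch using the env created above):

# Print every available action with its default (no-op) value
action_space = env.action_space.noop()
for name, value in action_space.items():
    print(f'{name}: {value}')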

The first step when looking for diamonds is to get wood to craft a workbench and a wooden pickaxe.
Let's try to get closer to the tree. To do that, we need to hold the Forward button for a little under a second. MineRL processes 20 actions per second: we don't need a whole second, so let's repeat the forward action 5 times and then wait for another 40 ticks:

# Define the sequence of actions
script = ['forward'] * 5 + [''] * 40

env = gym.make('MineRLObtainDiamond-v0')
env = Recorder(env, './video', fps=60)
env.seed(21)
obs = env.reset()

for action in script:
    # Get the action space (dict of possible actions)
    action_space = env.action_space.noop()
    # Activate the selected action in the script
    action_space[action] = 1
    # Update the environment with the new action space
    obs, reward, done, _ = env.step(action_space)

env.release()
env.play()

Okay, now let’s cut down the tree. In total, we need 4 actions:
Forward – stand in front of a tree;
Attack – cut down a tree;
Camera – look up or down;
Jump – get the final piece of wood.

Controlling the camera can be tricky. To simplify the syntax, we use the str_to_act function from this GitHub repository (MIT license). The new script now looks like this:
from tqdm import tqdm

script = []
script += [''] * 20
script += ['forward'] * 5
script += ['attack'] * 61
script += ['camera:[-10,0]'] * 7  # Look up
script += ['attack'] * 240
script += ['jump']
script += ['forward'] * 10        # Jump forward
script += ['camera:[-10,0]'] * 2  # Look up
script += ['attack'] * 150
script += ['camera:[10,0]'] * 7   # Look down
script += [''] * 40

for action in tqdm(script):
    obs, reward, done, _ = env.step(str_to_act(env, action))

env.release()
env.play()
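For reference, here is a minimal sketch of what such a helper could look like (the real implementation lives in the linked repository; this simplified version only handles the plain actions and camera:[pitch,yaw] strings used in this article):

import json

def str_to_act(env, actions):
    # Simplified sketch: convert a script string such as 'attack' or
    # 'camera:[-10,0]' into a MineRL action dict. The full helper in the
    # linked repository also supports craft/equip-style actions.
    act = env.action_space.noop()
    for token in actions.split():
        if token.startswith('camera:'):
            act['camera'] = json.loads(token[len('camera:'):])  # e.g. [-10, 0]
        else:
            act[token] = 1
    return act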
The agent successfully chopped down an entire tree. Not bad for a start, but we'd like the AI to be a bit more autonomous…
II. Deep Learning
Our bot works well in a fixed environment, but what happens if we change the seed or the starting point? Everything is hard-coded in the script, so the agent would most likely try to chop down a tree that isn't there.
This approach is too static for our purposes: we need something that can adapt to new environments. Instead of scripted commands, we want an AI that knows how to chop trees on its own. Reinforcement learning is a natural framework for training such an agent; more precisely, deep reinforcement learning, since the agent has to pick the right actions from raw image observations.
There are two ways to train such an agent:
Pure deep reinforcement learning: the agent learns from scratch by interacting with the environment and is rewarded every time it chops down a tree.
Imitation learning: the agent learns how to chop trees from a dataset, in this case recordings of tree-chopping actions performed by humans.
Both approaches lead to the same result, but they are not equivalent: according to the organizers of the MineRL 2021 competition, pure reinforcement learning needs about 8 hours to reach the level of performance that imitation learning reaches in 15 minutes.
We don't have that much time, so we're going with imitation learning. This technique is also called behavior cloning, the simplest form of imitation.
Note that imitation learning is not always better than reinforcement learning. If you want to dig deeper, Kumar et al. wrote an excellent blog post on the subject.

The problem boils down to a multi-class classification task. The dataset consists of mp4 videos, so we will use a convolutional neural network (CNN) to map image observations to actions. In addition, we want to limit the number of possible actions (classes): with fewer options to choose from, the CNN will train more efficiently:
import torch
import torch.nn as nn
import numpy as np

class CNN(nn.Module):
    def __init__(self, input_shape, output_dim):
        super().__init__()
        n_input_channels = input_shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, output_dim)
        )

    def forward(self, observations):
        return self.cnn(observations)

def dataset_action_batch_to_actions(dataset_actions, camera_margin=5):
    ...

class ActionShaping(gym.ActionWrapper):
    ...
In this example, we manually defined 7 relevant actions: attack, move forward, jump, and turning the camera left, right, up and down. Another popular approach is to use k-means to automatically cluster the most relevant human actions. Either way, the point is to discard the actions that are least useful for our task, which is ultimately crafting items.
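For completeness, here is a condensed sketch of what these two helpers can look like, modeled on the competition baseline (the exact index order, camera margin and edge-case handling in the notebook may differ; it assumes the gym and numpy imports from the earlier blocks):

class ActionShaping(gym.ActionWrapper):
    # Map a discrete index (0-6) to a full MineRL action dict.
    def __init__(self, env, camera_angle=10):
        super().__init__(env)
        self._actions = [
            [('attack', 1)],
            [('forward', 1)],
            [('jump', 1)],
            [('camera', [-camera_angle, 0])],  # look up
            [('camera', [camera_angle, 0])],   # look down
            [('camera', [0, camera_angle])],   # turn right
            [('camera', [0, -camera_angle])],  # turn left
        ]
        self.actions = []
        for act_list in self._actions:
            act = self.env.action_space.noop()
            for name, value in act_list:
                act[name] = value
            self.actions.append(act)
        self.action_space = gym.spaces.Discrete(len(self.actions))

    def action(self, action):
        return self.actions[action]

def dataset_action_batch_to_actions(dataset_actions, camera_margin=5):
    # Map a batch of recorded human actions to the 7 classes above;
    # -1 means "no relevant action" and gets filtered out during training.
    camera = dataset_actions['camera'].squeeze()
    attack = dataset_actions['attack'].squeeze()
    forward = dataset_actions['forward'].squeeze()
    jump = dataset_actions['jump'].squeeze()
    actions = np.zeros(len(camera), dtype=int)
    for i in range(len(camera)):
        if camera[i][0] < -camera_margin:    # look up
            actions[i] = 3
        elif camera[i][0] > camera_margin:   # look down
            actions[i] = 4
        elif camera[i][1] > camera_margin:   # turn right
            actions[i] = 5
        elif camera[i][1] < -camera_margin:  # turn left
            actions[i] = 6
        elif attack[i]:
            actions[i] = 0
        elif forward[i]:
            actions[i] = 1
        elif jump[i]:
            actions[i] = 2
        else:
            actions[i] = -1
    return actions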
Let’s train our CNN on the MineRLTreechop-v0 dataset. Other datasets can be found here. We chose a learning rate of 0.0001 and 6 epochs with a batch size of 32:
# Get data
minerl.data.download(directory='data', environment="MineRLTreechop-v0")
data = minerl.data.make("MineRLTreechop-v0", data_dir="data", num_workers=2)

# Model
model = CNN((3, 64, 64), 7).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss()

# Training loop
step = 0
losses = []
for state, action, _, _, _ \
        in tqdm(data.batch_iter(num_epochs=6, batch_size=32, seq_len=1)):
    # Get pov observations
    obs = state['pov'].squeeze().astype(np.float32)
    # Transpose and normalize
    obs = obs.transpose(0, 3, 1, 2) / 255.0

    # Translate batch of actions for the ActionShaping wrapper
    actions = dataset_action_batch_to_actions(action)

    # Remove samples with no corresponding action
    mask = actions != -1
    obs = obs[mask]
    actions = actions[mask]

    # Update weights with backprop
    logits = model(torch.from_numpy(obs).float().cuda())
    loss = criterion(logits, torch.from_numpy(actions).long().cuda())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print loss
    step += 1
    losses.append(loss.item())
    if (step % 4000) == 0:
        mean_loss = sum(losses) / len(losses)
        tqdm.write(f'Step {step:>5} | Training loss = {mean_loss:.3f}')
        losses.clear()
Step 4000 | Training loss = 0.878
Step 8000 | Training loss = 0.826
Step 12000 | Training loss = 0.805
Step 16000 | Training loss = 0.773
Step 20000 | Training loss = 0.789
Step 24000 | Training loss = 0.816
Step 28000 | Training loss = 0.769
Step 32000 | Training loss = 0.777
Step 36000 | Training loss = 0.738
Step 40000 | Training loss = 0.751
Step 44000 | Training loss = 0.764
Step 48000 | Training loss = 0.732
Step 52000 | Training loss = 0.748
Step 56000 | Training loss = 0.765
Step 60000 | Training loss = 0.735
Step 64000 | Training loss = 0.716
Step 68000 | Training loss = 0.710
Step 72000 | Training loss = 0.693
Step 76000 | Training loss = 0.695
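Before evaluating the agent, the trained weights should be saved to disk, since the evaluation code below reloads them from model.pth:

# Persist the trained weights so the evaluation code can reload them
torch.save(model.state_dict(), 'model.pth')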
The model is trained. Now we can create a new instance of the environment and see how the model behaves. If training was successful, it should keep chopping down every tree in its field of view.
This time we'll use the ActionShaping wrapper, which maps the action indices produced by dataset_action_batch_to_actions to discrete actions in MineRL.
Our model takes a first-person (pov) observation in the correct format and outputs logits. The softmax function turns these logits into a probability distribution over the 7 actions, and we then randomly sample an action according to its probability. The chosen action is executed in MineRL via env.step(action).
This process can be repeated any number of times. Let’s repeat it 1,000 times and see what happens:
model = CNN((3, 64, 64), 7).cuda()
model.load_state_dict(torch.load('model.pth'))

env = gym.make('MineRLObtainDiamond-v0')
env1 = Recorder(env, './video', fps=60)
env = ActionShaping(env1)

action_list = np.arange(env.action_space.n)

obs = env.reset()
for step in tqdm(range(1000)):
    # Get input in the correct format
    obs = torch.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).cuda()
    # Turn logits into probabilities
    probabilities = torch.softmax(model(obs), dim=1)[0].detach().cpu().numpy()
    # Sample action according to the probabilities
    action = np.random.choice(action_list, p=probabilities)

    obs, reward, _, _ = env.step(action)

env1.release()
env1.play()
Our agent is rather chaotic, but it still manages to chop down trees in this new, unseen environment. Now… how do we find diamonds?
III. Script + imitation learning
A simple but effective approach is to combine scripted actions with the trained agent: the AI handles the part that requires adapting to the environment (chopping wood), while everything we already know how to do is hard-coded in a script.
In this setup, the convolutional network is used to gather a large amount of wood (3,000 steps). A script then takes over with a sequence of actions to craft planks, sticks, a crafting table and a wooden pickaxe, and to start mining stone (it should be right under our feet). This stone can be used to craft a stone pickaxe, which in turn can mine iron ore.
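To give an idea of what this scripted part can look like, here is a rough sketch of such a crafting sequence in the same string syntax consumed by str_to_act (the action names come from the MineRLObtainDiamond-v0 action space; the repetition counts are illustrative guesses rather than tuned values, and the craft/place/equip strings rely on the full str_to_act helper from the linked repository):

# Rough sketch of the scripted crafting part (counts are illustrative guesses)
script = []
script += ['craft:planks'] * 4            # logs -> planks
script += ['craft:stick'] * 2             # planks -> sticks
script += ['craft:crafting_table']
script += ['camera:[10,0]'] * 18          # look down at our feet
script += ['attack'] * 20                 # clear a spot
script += ['place:crafting_table', '']
script += ['nearbyCraft:wooden_pickaxe']
script += ['equip:wooden_pickaxe', '']
script += ['attack'] * 600                # dig down to collect cobblestone
script += ['nearbyCraft:stone_pickaxe']
script += ['equip:stone_pickaxe', '']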

This is where things start to get complicated: iron ore is fairly rare, so we would have to run the game for quite a while to find a deposit. We would then need to craft a furnace and smelt the ore to get an iron pickaxe. And finally, we would have to descend even deeper and manage to extract a diamond without falling into lava.
As you can see, this is doable, but the outcome is unpredictable. We could train a second agent to look for diamonds and a third to craft the iron pickaxe. If you are interested in more sophisticated approaches, have a look at the results of the MineRL Diamond 2021 competition: Kanervisto et al. describe several solutions based on different clever methods, including end-to-end deep learning architectures. It is a hard problem, and no team managed to find diamonds consistently, if they found any at all.
For this reason, the example below stops at crafting a stone pickaxe, but the code can be modified to go further:
obs = env_script.reset()
done = False

# 1. Get wood with the CNN
for i in tqdm(range(3000)):
    obs = torch.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).cuda()
    probabilities = torch.softmax(model(obs), dim=1)[0].detach().cpu().numpy()
    action = np.random.choice(action_list, p=probabilities)
    obs, reward, done, _ = env_script.step(action)

# 2. Craft stone pickaxe with scripted actions
for action in tqdm(script):
    obs, reward, done, _ = env_cnn.step(str_to_act(env_cnn, action))

env_cnn.release()
env_cnn.play()
In the first 3,000 steps the agent chops trees like crazy, and then the script takes over and finishes the job. It may not be obvious from the video, but print(obs['inventory']) shows a stone pickaxe. Note that this is a cherry-picked example: most runs do not end this well.
There are several reasons why an agent may fail: it can end up in a hostile environment (water, lava, etc.), in an area without trees, or even fall and die. Experimenting with different seeds will give you a better sense of the complexity of the problem and, hopefully, ideas for building even more capable agents.
Conclusion
I hope you enjoyed this little guide to reinforcement learning in Minecraft. It is not just a popular game but also an interesting environment for testing reinforcement learning agents. As in NetHack, it requires a thorough understanding of the game mechanics to plan precise sequences of actions in a procedurally generated world. In this article we:
learned how to use MineRL;
considered 2 approaches (script, behavior cloning) and their combination;
showed the actions of the agent in short clips.
The main problem with this environment is its slow processing speed. Minecraft is not as lightweight a game as NetHack or Pong, so agents need a lot of time to train. If that is an issue for you, I recommend looking at lighter environments such as Gym Retro.
Thank you for your attention! If you are interested in how AI can be applied to video games, follow me on Twitter.
