Simulating DOOM via a neural network

Dozens, if not hundreds, of 3D artists and developers work on a modern game, and by now everyone has a rough picture of that pipeline. Today we are not talking about a neural network that will churn out Uncharted 5 or Dark Souls 4; rather, this is one of the more interesting cases among the neural networks of recent years. Usually, when we talk about generative AI, we picture Ideogram, Stable Diffusion or Sora.

But developers at Google have built an engine that generates the gameplay of an existing game. For now, that game is the venerable DOOM from the 90s. Below, we explain how such a neural network works.

As the developers themselves describe it, the first step was to train an RL agent. To collect training data, in effect to act as a teacher, Google researchers first train a separate model to interact with the environment. The gameplay that this RL agent produces after training becomes the dataset for the generative model, which, by the way, is built on top of the open Stable Diffusion.

How does the RL agent work?

It makes decisions about what actions to take at any given time based on the current state of the environment.

The environment, in turn, responds to the agent's actions by changing its state and providing the agent with new observations and a reward.

The main task of the agent is to develop a strategy (or policy) that maximizes the total reward in the long term. Cycle after cycle, the neural network improves its actions. The agent operates on states, which capture information about the current situation in the environment.

The agent performs actions that change the state of the environment, and for each action it receives a reward, which can be positive or negative depending on how it affects progress toward the goal. Finally, there is a transition function that describes how the agent's actions change the state of the environment.
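If you have never seen one, this agent–environment loop is only a few lines of code. Here is a minimal sketch using the Gymnasium-style API; the environment is a generic placeholder rather than DOOM, purely for illustration:

```python
import gymnasium as gym

# Generic agent–environment loop; the environment here is a placeholder, not DOOM.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
for _ in range(1_000):
    action = env.action_space.sample()   # a trained agent would query its policy here
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward               # the quantity the agent learns to maximize
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```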

In short, a typical RL agent operates on a simple principle: get rewards, and learn to get more of them by doing the right thing. But Google's “student” is a source of future data for the generative neural network, which means it must produce gameplay varied enough to resemble real human play; pure reward maximization is not the point.

The training principle, however, is based on establishing links between actions and frames. Since classic DOOM is hardly a 144 FPS game, the researchers have no real trouble learning these links. The agent trains, receives feedback from the environment and, accordingly, learns to interact with it using different game skills.

It all looks something like this:

S is the space of hidden (latent) states of the environment. In the case of DOOM, this is, for example, data stored in the program's RAM.

O are the observable projections of these hidden states, the visual data that the player sees on the screen as rendered pixels.

V is a function that transforms hidden states into visible ones, i.e. the game's rendering logic (the process of drawing graphics).

A is a set of actions that the player can perform, such as pressing keys or moving the mouse.

p is a function that describes the probability of transition from one state to another, based on the chosen actions and the current state of the game.

Example for DOOM: when the player presses a key, something changes in the program's dynamic memory (S), which is then reflected as a change on the screen (O), for example the player turning their head. The game's rendering logic (V) is responsible for showing this change correctly on the screen, the player's actions (A) affect the program's further behavior, and the transition probability function (p) describes how the game reacts to particular commands.
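Putting the notation above together (this is just a compact restatement of the definitions, not extra detail from the paper):

```latex
% Interactive environment as a tuple, with observations rendered from hidden states
\mathcal{E} = \langle S, O, V, A, p \rangle, \qquad
o_n = V(s_n), \qquad
s_{n+1} \sim p(\,\cdot \mid s_n, a_n)
```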

The model predicts what the next observation (for example, the next frame on the screen) will look like based on previous actions and observations.

This is done using a simulation function that predicts observations based on data already obtained from interactions with the environment.

This function is described as a probability distribution of future observations, given past actions and observations. The main goal of the model is to minimize the difference between predicted frames and the actual frames that the game would display under normal conditions.
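In symbols, with q standing for the learned simulation distribution and D for some distance between frames, the setup described above is roughly:

```latex
% The simulator predicts the next observation from past observations and actions,
% and training minimizes the gap between predicted and real frames.
\hat{o}_{n} \sim q\bigl(o_{n} \mid o_{<n},\, a_{<n}\bigr),
\qquad
\min_{q}\; \mathbb{E}\bigl[\, D\bigl(\hat{o}_{n},\, o_{n}\bigr) \,\bigr]
```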

Stable Diffusion was originally trained on text prompts, but here it is adapted to generate frames (images) conditioned on a sequence of the agent's actions and observations.

The process begins with the model learning from agent data, which includes actions (e.g. keystrokes) and observations (e.g. previous frames).

The text-conditioning component of the original model is removed; a sequence of agent actions is used in its place.

Each action (such as a specific keystroke) is converted into a numerical representation, called an embedding. This process replaces textual control with control over a sequence of actions.

To take into account observations (previous frames), the model first encodes them into a “latent space” using an auto-encoder, compressing the data.

These encoded frames are then concatenated (in the channel dimension) with the “noisy” versions of the current frames. The researchers also tested so-called “cross-attention” for processing past observations, but this approach did not yield a significant improvement.
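Here is a rough PyTorch-style sketch of how such conditioning could be wired up: learned action embeddings take the place of the text embeddings, and the latents of past frames are stacked channel-wise with the noisy latent being denoised. Module names and sizes are our own illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

NUM_ACTIONS, EMB_DIM, LATENT_C = 18, 768, 4   # assumed sizes, for illustration only

# Actions replace the text prompt: each discrete action gets a learned embedding.
action_embedding = nn.Embedding(NUM_ACTIONS, EMB_DIM)

def build_unet_inputs(noisy_latent, past_latents, past_actions):
    """noisy_latent:  (B, 4, H, W)    - noised latent of the frame being denoised
       past_latents:  (B, T, 4, H, W) - autoencoder latents of previous frames
       past_actions:  (B, T)          - indices of previous actions"""
    B, T, C, H, W = past_latents.shape
    # Past-frame latents are concatenated in the channel dimension.
    unet_in = torch.cat([noisy_latent, past_latents.reshape(B, T * C, H, W)], dim=1)
    # The action sequence plays the role the text-embedding sequence used to play.
    cond = action_embedding(past_actions)          # (B, T, EMB_DIM)
    return unet_in, cond                           # fed to the U-Net via cross-attention
```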

The main goal of training a model is to minimize the error—the difference between predicted and actual values—using a specially chosen loss function.

This loss function is defined in terms of the “velocity” at which values change during the denoising process (the v-prediction parameterization discussed below), and the model learns to adjust its predictions to make the frames as accurate as possible.
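In standard diffusion notation, that velocity target and the corresponding loss (the v-prediction parameterization mentioned again in the training details below) look roughly like this, with α_t and σ_t the usual noise-schedule coefficients:

```latex
% v-prediction: the network regresses the "velocity" instead of the raw noise
v_t = \alpha_t\,\epsilon - \sigma_t\,x_0,
\qquad
\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t}\bigl\|\, v_\theta(x_t,\, t,\, c) - v_t \,\bigr\|^2
```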

When the model is trained on real data using teacher-forcing (where it is shown real footage during training), it performs well.

But when the model itself begins to predict the next frames based on previous predictions (so-called autoregression), the quality of the predictions quickly deteriorates.

This happens because errors accumulate: with each subsequent step the predictions become less accurate. To prevent this deterioration, the context frames are deliberately “distorted” during training by adding Gaussian noise (random corruption), and the level of this noise is also fed to the model as an input.

This approach lets the model learn to correct errors in the previous frames. At inference time, the noise level can then be tuned to optimize prediction quality, and even with no noise added at all, image quality after such training improves significantly.
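A hedged sketch of that trick: corrupt the context frames with a random amount of Gaussian noise during training and tell the model how strong the corruption was. The discretization into buckets mirrors the training details quoted later; everything else is illustrative:

```python
import torch

MAX_NOISE = 0.7        # maximum corruption level (matches the value quoted below)
NUM_BUCKETS = 10       # the noise level is discretized into embedding buckets

def corrupt_context(past_latents):
    """past_latents: (B, T, C, H, W) latents of previous frames."""
    B = past_latents.shape[0]
    # Sample a noise level per example and add Gaussian noise of that magnitude.
    level = torch.rand(B) * MAX_NOISE                          # (B,)
    noisy = past_latents + level.view(B, 1, 1, 1, 1) * torch.randn_like(past_latents)
    # The model is also told how corrupted its context is, as a discrete bucket id.
    bucket = (level / MAX_NOISE * (NUM_BUCKETS - 1)).round().long()
    return noisy, bucket   # bucket goes through an embedding layer inside the model
```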

The Stable Diffusion v1.4 model uses an autoencoder that compresses images by converting them into a so-called latent space (a special representation of the data in a reduced form).

This autoencoder compresses 8×8 pixel regions into 4 channels. However, when predicting game frames, this compression causes artifacts – small distortions that are especially noticeable on small details, such as the HUD at the bottom of the game screen.

To improve image quality, the authors of the model decided to tune only the autoencoder decoder, which is responsible for restoring the compressed data back into an image.

To do this, they used the MSE (mean squared error) loss, which measures the difference between predicted and real frames; this noticeably reduced artifacts and improved visual quality. The researchers also suggest that even better results could be achieved with a different loss, LPIPS, which better reflects image quality as humans perceive it, but this is left for future research 🙂

The decoder tuning is done separately from the main part of the model (U-Net), which is responsible for generating images. Autoregression (predicting the next frames based on the previous ones) is not affected by this tuning, since it works with latent representations (compressed data), and not with pixels directly.
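For the curious, “tune only the decoder” boils down to something like the following, assuming the usual diffusers split of the VAE into an encoder and a decoder; the model id, learning rate and other details are placeholders, not the authors' setup:

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL

# Load the SD v1.4 VAE (model id assumed; any SD 1.x VAE has the same structure).
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# Freeze everything, then unfreeze only the decoder.
vae.requires_grad_(False)
vae.decoder.requires_grad_(True)
opt = torch.optim.Adam(vae.decoder.parameters(), lr=1e-4)   # lr is an assumption

def decoder_step(frames):            # frames: (B, 3, H, W) real game frames in [-1, 1]
    with torch.no_grad():            # the encoder stays fixed
        latents = vae.encode(frames).latent_dist.sample()
    recon = vae.decode(latents).sample
    loss = F.mse_loss(recon, frames)  # pixel-space MSE against the real frame
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss
```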

Typically, it takes many steps to create a quality image, but in this study they found that only 4 steps were needed, allowing them to generate images at 20 frames per second.

Each denoising step takes about 10 milliseconds, and with the autoencoder the total latency per frame is about 50 milliseconds, i.e. 20 frames per second, which is enough to simulate DOOM in real time.

When they tried to use just one denoising step, the quality deteriorated noticeably, so the authors also experimented with model distillation. Distillation lets the model work in a single step, which raises the frame rate to 50 FPS, but image quality still suffers, so they prefer 4 steps without distillation.
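The latency budget is simple arithmetic based on the numbers above (the roughly 10 ms left over is the implied cost of the autoencoder decode):

```python
steps = 4
denoise_ms = 10                    # per denoising step, as quoted above
autoencoder_ms = 10                # implied: 50 ms total minus 4 * 10 ms of denoising
frame_ms = steps * denoise_ms + autoencoder_ms   # 50 ms per frame
fps = 1000 / frame_ms                            # 20 frames per second
print(frame_ms, fps)
```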

The agent is trained using Proximal Policy Optimization (PPO), proposed by Schulman and colleagues in 2017.

A simple convolutional neural network (CNN), similar to the one proposed by Mnih et al. in 2015, is used as the underlying architecture for feature extraction.

During training, the agent receives downscaled game frames and the in-game map, each at a resolution of 160×120. In addition, the agent has access to the last 32 actions it performed.

The feature network transforms each image into a vector representation of size 512. The PPO actor and critic are then two-layer fully connected networks (MLPs) that operate on the combination of the image features and the sequence of previous actions.
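A hedged PyTorch sketch of that agent: a small CNN turns the frame into a 512-dimensional feature vector, and two-layer MLP heads for the actor and critic consume those features together with an embedding of the last 32 actions. Everything except the 512 feature width and the 32-action history is our assumption (the in-game map input is omitted for brevity):

```python
import torch
import torch.nn as nn

class DoomAgent(nn.Module):
    def __init__(self, num_actions, history=32, feat_dim=512, hidden=512):
        super().__init__()
        # Nature-DQN-style CNN feature extractor (Mnih et al., 2015); shapes illustrative.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(feat_dim), nn.ReLU(),
        )
        self.action_emb = nn.Embedding(num_actions, 16)
        in_dim = feat_dim + history * 16
        # Two-layer MLP heads for the PPO actor and critic.
        self.actor = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, num_actions))
        self.critic = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

    def forward(self, frames, last_actions):
        # frames: (B, 3, 120, 160) downscaled screen; last_actions: (B, 32) action ids
        feats = self.cnn(frames)
        hist = self.action_emb(last_actions).flatten(1)
        x = torch.cat([feats, hist], dim=1)
        return self.actor(x), self.critic(x)     # action logits and value estimate
```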

The agent is trained to play in the ViZDoom environment, a standard research testbed built on DOOM.

In summary:

There are 8 game sessions running simultaneously, each using a replay buffer of size 512.

Other important hyperparameters are the discount factor (γ = 0.99) and the entropy coefficient (0.1), which regulates the randomness of the agent's actions.

During each iteration, the model is trained for 10 epochs using batches of size 64 and a constant learning rate of 1e-4. The total number of interaction steps with the environment is 10 million.
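For a feel of the scale, here is how those hyperparameters would map onto an off-the-shelf PPO implementation such as Stable-Baselines3. This is our rough analogue, not the authors' code: make_doom_env is a hypothetical factory for a Gym-compatible ViZDoom environment, and reading the paper's "replay buffer of size 512" as n_steps is our assumption:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_doom_env():
    # Hypothetical placeholder: return a Gym-compatible ViZDoom environment here.
    raise NotImplementedError

env = SubprocVecEnv([make_doom_env for _ in range(8)])   # 8 parallel game sessions

model = PPO(
    "CnnPolicy", env,
    n_steps=512,          # rollout buffer per environment
    batch_size=64,
    n_epochs=10,          # epochs per iteration
    gamma=0.99,           # discount factor
    ent_coef=0.1,         # entropy coefficient
    learning_rate=1e-4,   # constant learning rate
)
model.learn(total_timesteps=10_000_000)   # 10M environment interaction steps
```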

As for the simulation models, they are trained based on the already pre-trained Stable Diffusion model version 1.4.

During the training process, all parameters of the U-Net architecture are unfrozen, which allows the model to adapt to a new task.

This is done using a batch size of 128 with a constant learning rate of 2e-5, and optimization is performed using the Adafactor method without weight decay and with gradients clipped to 1.

The parameterization of the loss in the diffusion model is changed to velocity prediction (v-prediction). During training, the probability of removing context from previous frames is 10%, which allows the use of the Classifier-Free Guidance (CFG) technique during model inference.
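What that 10% context dropout might look like in the training loop is sketched below; the null-token convention is our assumption, but the idea is standard Classifier-Free Guidance: occasionally train without conditioning so that, at inference, conditional and unconditional predictions can be mixed:

```python
import torch

P_DROP = 0.10   # probability of removing the past-frame/action context

def maybe_drop_context(action_ids, past_latents, null_action_id=0):
    """With probability 10%, replace the conditioning with a 'null' context."""
    if torch.rand(()) < P_DROP:
        action_ids = torch.full_like(action_ids, null_action_id)  # null action tokens
        past_latents = torch.zeros_like(past_latents)             # blank past frames
    return action_ids, past_latents

# At inference, CFG mixes the two predictions with a guidance weight w:
#   v = v_uncond + w * (v_cond - v_uncond)
```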

Training is performed on 128 TPU-v5e devices with data parallelism. Most of the results in the paper were obtained after 700 thousand training steps. To improve training quality, the noise-augmentation method described earlier is used, with a maximum noise level of 0.7 and 10 embedding buckets to encode it.

The same parameters are used to train the latent feature decoder model as the main denoiser, except that a larger data batch of size 2048 is used for the decoder.

The training data includes all the trajectories that the agent learned during the reinforcement learning process, as well as the data collected during the model evaluation process. In total, about 900 million frames were generated for training the model.

All images, both in training and inference, have a resolution of 320×240 and are padded to 320×256. The model is trained with a context of length 64, meaning the model receives its own last 64 predictions and the agent's last 64 actions as input.
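To make the bookkeeping concrete, here is how the shapes work out under those settings (height padded from 240 to 256 so it divides evenly by the autoencoder's 8× downscaling; everything else follows from the numbers above):

```python
H, W = 240, 320
pad_rows = 256 - H                          # 16 rows of padding -> 320x256 input
latent_h, latent_w = 256 // 8, 320 // 8     # 32 x 40 after the 8x autoencoder
context = 64                                # past frames / past actions per example

# Per training example (batch dimension omitted):
#   past frame latents: (context, 4, 32, 40) -> concatenated to (context * 4, 32, 40)
#   past actions:       (context,) integer ids -> embedded to (context, emb_dim)
print(pad_rows, latent_h, latent_w)
```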

In the end, the researchers get an excellent result that reproduces the game about 80% of the time, though with some caveats. Unfortunately, over long play sessions the neural network accumulates artifacts. Still, the approach is quite revolutionary, and who knows… perhaps, coupled with the newer NeRF technology, we will see something truly outstanding in the next five years.
