Games by spec: the flip side of AI ingenuity

Specification games are behavior that satisfies a literal specification of a goal without achieving the intended outcome. We all have experience playing by spec, even if not by that name. Readers may have heard the myth of King Midas and the golden touch, in which the king asks that everything he touches turn to gold, but soon discovers that even food and drinks turn to metal in his hands. In real life, when a student is rewarded for doing a good homework assignment, they might copy another student to get the correct answers instead of studying the material — and thus exploit a loophole in the assignment specification.


This problem also arises when designing artificial agents. For example, a reinforcement learning agent can find the shortest route to receiving a large amount of reward without completing an assignment as intended by a human designer. This behavior is common, and today we collected about 60 examples (combining existing the lists and current contribution from the AI ​​community). In this post, we will look at the possible causes of the game according to the specification, share examples of where it happens in practice, and also argue the need for further work on principled approaches to overcoming the specification problems.

Let’s look at an example. In the assignment building from Lego blocks the desired result was for the red block to be above the blue. The agent was rewarded for the height of the bottom surface of the red block at the moment when he did not touch this block. Instead of going through the relatively difficult maneuver of picking up the red block and placing it on top of the blue one, the agent simply flipped the red block over to collect the reward. This behavior allowed us to achieve our goal (the bottom surface of the red box was high) at the expense of what the designer really cares about (building on the top of the blue box).


Deep reinforcement learning for dexterous data manipulation.

We can look at the specification game from two perspectives. As part of the development of Reinforcement Learning (RL) algorithms, the goal is to create agents that learn to achieve a given goal. For example, when we use Atari games as a benchmark for teaching RL algorithms, the goal is to assess whether our algorithms are capable of solving complex problems. Whether an agent solves a problem, using a loophole or not, is not important in this context. From this point of view, playing by spec is a good sign: the agent has found a new way to achieve this goal. This behavior demonstrates the ingenuity and power of algorithms to find ways to do exactly what we tell them to do.

However, when we want an agent to actually connect Lego blocks, that same ingenuity can create a problem. In a broader framework of construction targeted agentsthat achieve the desired result in the world, the specification game is problematic as it involves the agent exploiting a specification loophole at the expense of the desired result. This behavior is caused by incorrect problem setting, and not by any flaw in the RL algorithm. In addition to designing algorithms, another necessary component of building targeted agents is reward design.

Designing task specifications (reward functions, environment, etc.) that accurately reflect the intent of the human designer is usually difficult. Even with a slight misunderstanding, a very good RL algorithm can find a complex solution that is very different from what is intended; even if a weaker algorithm cannot find this solution and thus get a solution closer to the intended result. This means that the correct definition of the desired result may become more important to achieve it as RL algorithms improve. Therefore, it is important that the ability of researchers to correctly define problems should not lag behind the ability of agents to find new solutions.

We use the term task specification broadly to encompass many aspects of the agent development process. When setting up an RL, the task specification includes not only the reward design, but also the choice of learning environment and supporting rewards. The correctness of the problem statement can determine whether the agent’s ingenuity corresponds to the intended result or not. If the specification is correct, the agent’s creativity yields the desired new solution. This is what allowed AlphaGo to make the famous 37 stroke, who caught the experts in Go by surprise, but played a key role in the second match against Lee Sedol. If the specification is incorrect, it can lead to undesirable game behavior, such as flipping a block. Such solutions are possible, and we have no objective way to notice them.

Now let’s look at the possible reasons for the spec game. One source of misidentification of the reward function is poorly designed reward generation. The formation of rewards makes it easier to assimilate certain goals by giving the agent some reward on the way to solving the problem, instead of rewarding only for the final result. However, the formation of rewards can change optimal policies if they are not are based on perspective… Consider an agent driving a boat in a game Coast runnerswhere the intended goal is to finish the race as quickly as possible. The agent received a formative reward for colliding with green blocks along the race track, which changed the optimal policy to go around circles and collide with the same green blocks over and over again.


Erroneous reward functions in action.

Determining a reward that accurately reflects the desired end result can be a daunting task in itself. In the problem of connecting Lego blocks, it is not enough to indicate that the bottom edge of the red block should be high off the floor, since the agent can simply turn the red block over to achieve this goal. A more complete specification of the desired result would also include that the top face of the red box should be higher than the bottom face and that the bottom face is aligned with the top face of the blue box. It is easy to overlook one of these criteria when determining the result, which makes the specification too broad and potentially easier to satisfy with a degenerate solution.

Rather than trying to create a BOM covering all possible corner cases, we could learn the reward function from the person’s feedback… It is often easier to assess whether a result has been achieved than to state it explicitly. However, this approach can also run into game specification issues if the reward model does not study the true reward function that reflects designer preferences. One possible source of inaccuracies could be human feedback used to train the reward model. For example, an agent running capture task learned to deceive the evaluator by hovering between the camera and the object.


Reinforce deep learning based on human preference.

The trained reward model can also be mis-defined for other reasons, such as poor generalization. Additional feedback can be used to correct the agent’s attempts to exploit inaccuracies in the reward model.

Another class of game by specification comes from an agent exploiting simulator bugs. For example, simulated robot, who had to learn to walk, invented to lock his legs together and slide on the ground.


AI learns to walk.

At first glance, these examples may seem funny, but less interesting and have nothing to do with deploying agents in the real world, where there are no simulator errors. However, the main problem is not the error itself, but the failure of the abstraction that can be used by the agent. In the example above, the robot’s task was incorrectly defined due to incorrect assumptions about the physics of the simulator. Likewise, real-world traffic optimization can be misidentified if it is assumed that the traffic routing infrastructure does not contain software bugs or security vulnerabilities that a sufficiently smart agent might detect. These assumptions don’t need to be made explicitly – rather, they are details that simply never crossed the mind of the designer. And as tasks become too complex to take into account every detail, researchers are more likely to introduce incorrect assumptions when developing a specification. This begs the question: Is it possible to design agent architectures that correct such false assumptions instead of using them?

One of the assumptions commonly used in task specification is that the specification cannot be influenced by the actions of the agent. This is true for an agent operating in an isolated simulator, but not for an agent operating in the real world. Any task specification has a physical manifestation: a reward function stored in a computer, or a person’s preference. An agent deployed in the real world can potentially manipulate these notions of purpose, creating a problem counterfeiting remuneration… For our hypothetical traffic optimization system, there is no clear distinction between satisfying user preferences (for example, by providing useful guidance) and impact on users so that they have preferences that are easier to satisfy (for example, by encouraging them to choose destinations that are easier to reach). The former satisfies the task, while the latter manipulates the world view of the goal (user preferences), and both lead to high rewards for the AI ​​system. As another, more extreme example, a very advanced AI system can take over the computer it is running on, setting its own reward to a high value.

To summarize, there are at least three challenges to overcome when solving a game spec problem:

  • How do we accurately capture the human concept of a given task as a reward function?
  • How can we avoid mistakes in our implicit assumptions about the domain or design agents that correct erroneous assumptions instead of playing with them?
  • How do we avoid falsifying our remuneration?

Many approaches have been proposed, ranging from modeling rewards to developing incentives for agents, the problem of playing by specification is far from being solved. List of possible spec behavior demonstrates the scale of the problem and the sheer number of ways an agent can play by specification These problems are likely to become more complex in the future as AI systems become more capable of satisfying task specification at the expense of their intended outcome. As we build more advanced agents, we will need design principles that specifically address specification issues and ensure that these agents reliably achieve the outcomes intended by the developers.


If you want to know more about machine and deep learning – check out our corresponding course, it will not be easy, but exciting. A promo code HABR – will help in the desire to learn new things by adding 10% to the discount on the banner.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *