Training intelligent game opponents in Unity with self-play using ML-Agents

Hello, Habr!

As our regular readers know, we have long and successfully published books on Unity. As part of exploring the topic, we became interested in, among other things, the ML-Agents Toolkit. Today we bring you a translation of an article from the Unity blog about how to effectively train game agents using self-play; in particular, the article helps explain why this method is more effective than traditional reinforcement learning.

Enjoy reading!

This article gives an overview of self-play and demonstrates how to use it for stable and effective training in the Soccer demo environment from the ML-Agents Toolkit.

In the Tennis and Soccer demo environments from the Unity ML-Agents Toolkit, agents compete against each other as adversaries. Training agents in such a competitive scenario can be a very non-trivial task. In fact, in previous releases of the ML-Agents Toolkit, considerable reward engineering was needed for an agent to learn reliably. In version 0.14, a feature was added that lets users train agents with reinforcement learning (RL) based on self-play, a mechanism of fundamental importance in some of the most high-profile reinforcement learning results, such as OpenAI Five and DeepMind's AlphaStar. Self-play pits the current and past incarnations of the agent against each other. This provides our agent with an opponent that can gradually improve using traditional reinforcement learning algorithms. A fully trained agent can successfully compete with advanced human players.

Self-play provides a learning environment built on the same principles as human competition. For example, a person learning to play tennis will choose to spar with opponents of roughly their own level, since an opponent who is too strong or too weak is less useful for mastering the game. From the standpoint of developing skills, it can be far more valuable for a novice tennis player to play against fellow beginners than against, say, a preschooler or Novak Djokovic. The former will not even be able to return the ball, and the latter will never serve a ball you can return. Once a beginner has built up enough skill, they can move on to the next level or enter a more serious tournament to face more skilled opponents.

In this article, we will consider some technical subtleties of the dynamics of self-play, and look at the Tennis and Soccer demo environments, refactored to illustrate self-play.

A history of self-play in games

Self-play has a long history in the practice of building artificial game agents designed to compete with humans. One of the first to use this approach was Arthur Samuel, who developed a checkers player in the 1950s and published this work in 1959. That system was a forerunner of a landmark reinforcement learning result achieved by Gerald Tesauro with TD-Gammon, published in 1995. TD-Gammon used the temporal-difference learning algorithm TD(λ) together with self-play to train an agent to play backgammon well enough to compete with professional human players. In some cases it was observed that TD-Gammon had better positional judgment than world-class players.

Self-play features in many of the landmark results associated with RL. Notably, self-play helped produce agents that play chess and Go with superhuman ability, elite Dota 2 agents, as well as complex strategies and counter-strategies in games such as wrestling and hide-and-seek. In results achieved with self-play, it is often noted that the agents discover strategies that surprise human experts.

Self-play gives agents a certain creativity that is independent of the creativity of their programmers. The agent receives only the rules of the game and, afterwards, information about whether it won or lost. From these basic principles, the agent must work out competent behavior on its own. In the words of the creator of TD-Gammon, such an approach to learning is liberating "... in the sense that the program is not constrained by human biases and prejudices, which may turn out to be erroneous and unreliable." Thanks to this freedom, agents discover brilliant game strategies that completely change how experts think about certain games.

Competitive reinforcement learning

In the traditional reinforcement learning setting, an agent tries to learn a policy that maximizes its cumulative reward. The reward signal encodes the agent's task, which may be, for example, navigating a course or collecting items. The agent's behavior is subject to the constraints of the environment: gravity, obstacles, and the relative effect of the agent's own actions, such as the force it applies to move itself. These factors limit the agent's possible behaviors and are the external forces it must learn to deal with in order to receive a high reward. In other words, the agent contends with the dynamics of the environment and must move from state to state in a way that maximizes its reward.

The typical reinforcement learning scenario is shown on the left: the agent acts in the environment, transitions to the next state and receives a reward. On the right is the scenario in which the agent competes against an adversary which, from the agent's point of view, is effectively part of the environment.

In the case of competitive games, the agent contends not only with the dynamics of the environment, but also with another (possibly intelligent) agent. We can think of the opponent as embedded in the environment: its actions directly affect the next state the agent "sees", as well as the reward it receives.
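To make this concrete, here is a minimal sketch (not ML-Agents code; the two-player environment and the opponent policy are hypothetical stand-ins) of how an opponent can be folded into an environment's step function, so that from the learner's point of view the opponent is just part of the dynamics:

```python
# Minimal sketch: the opponent is "baked into" the environment step.
# The underlying two-player env and the opponent policy are hypothetical stand-ins.

class CompetitiveEnvWrapper:
    def __init__(self, two_player_env, opponent_policy):
        self.env = two_player_env          # underlying two-player game
        self.opponent = opponent_policy    # fixed policy acting for the rival

    def step(self, agent_action):
        # The opponent chooses its move from its own observation...
        opponent_action = self.opponent(self.env.observation(player=1))
        # ...and the game advances using both actions at once. From the
        # learner's perspective this looks like ordinary environment dynamics.
        next_obs, reward = self.env.step(agent_action, opponent_action)
        return next_obs, reward
```

The learning algorithm on the agent's side stays a standard single-agent RL loop; all the adversarial structure is hidden behind the wrapper.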

The Tennis example from the ML-Agents Toolkit

Consider the Tennis demo from ML-Agents. The blue racket (left) is the learning agent, and the purple one (right) is its opponent. To get the ball over the net, the agent must take into account the trajectory of the ball coming from the opponent and adjust the angle and speed of its return, given the environmental conditions (gravity). However, in a match against an opponent, getting the ball over the net is only half the battle. A strong opponent may respond with an unreturnable shot, and the agent loses. A weak opponent may hit the ball into the net. An equally matched opponent may return the serve, and the rally continues. In any case, both the next state and the corresponding reward depend on the opponent as well as on the environment, even though in all these situations the agent made the same shot. This is what makes learning in competitive games, and training adversarial behaviors, a hard problem.

Choosing a suitable opponent is not trivial. As the discussion above makes clear, the relative strength of the opponent significantly affects the outcome of a given game. If the opponent is too strong, the agent may struggle to learn to play at all. On the other hand, if the opponent is too weak, the agent may learn to win, but those skills may be useless against a stronger or simply different opponent. Therefore, we need an opponent roughly equal in strength to the agent (challenging, but not insurmountable). Moreover, since our agent's skills improve with every game played, we must raise the opponent's strength accordingly.

In self-play, a snapshot of the agent's past self, or the agent in its current state, is the opponent embedded in the environment.

This is where self-play comes in! The agent itself satisfies both requirements for the desired opponent: it is certainly roughly equal in strength to itself, and its skills improve over time. In this setup, the agent's own policy is embedded in the environment (see the figure). For those familiar with curriculum learning, this can be viewed as a naturally evolving curriculum in which the agent learns to fight ever stronger opponents. In other words, self-play lets us use the environment itself to train competitive agents for competitive games!

In the next two sections, we will look at more technical details of training competitive agents, in particular the implementation and use of self-play in the ML-Agents Toolkit.

Practical considerations

Some practical issues arise with the self-play framework. In particular, there is the risk of overfitting, where the agent learns to win only against a certain style of play, and there is instability inherent in the training process, which can arise because the transition function is non-stationary (that is, because the opponent keeps changing). The first problem matters because we want our agents to be robust and able to fight opponents of different types.
The second problem can be illustrated in the Tennis environment: different opponents will hit the ball at different speeds and at different angles. From the learning agent's point of view, this means that, over the course of training, the same decisions lead to different outcomes and, accordingly, to different subsequent states. Traditional reinforcement learning assumes a stationary transition function. Unfortunately, by supplying the agent with a diverse selection of opponents to address the first problem, we can, if we are careless, aggravate the second.

To cope with this, we maintain a buffer of the agent's past policies, from which we sample prospective rivals for our "student" over a longer horizon. Sampling from past policies gives the agent a diverse set of opponents. Moreover, by letting the agent train against a fixed opponent for an extended period, we stabilize the transition function and create a more consistent learning environment. Finally, these algorithmic aspects can be controlled via hyperparameters, discussed in the next section.
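A minimal sketch of this bookkeeping might look like the following (this is not the ML-Agents implementation; the policy object, window size and sampling rule are illustrative assumptions):

```python
import copy
import random

# Sketch of a snapshot buffer for self-play. The "policy" passed in is a
# stand-in for whatever object holds the agent's trainable parameters.

class OpponentPool:
    def __init__(self, window=10):
        self.window = window       # how many past snapshots to keep
        self.snapshots = []        # buffer of frozen past policies

    def save_snapshot(self, policy):
        # Freeze a copy of the current policy for later use as an opponent.
        self.snapshots.append(copy.deepcopy(policy))
        if len(self.snapshots) > self.window:
            self.snapshots.pop(0)  # discard the oldest snapshot

    def sample_opponent(self, current_policy, play_against_current_prob=0.5):
        # With some probability play against the current self; otherwise
        # pick a random past snapshot, which stays fixed until the next swap.
        if not self.snapshots or random.random() < play_against_current_prob:
            return current_policy
        return random.choice(self.snapshots)
```

The sampled opponent is then plugged into the environment (as in the earlier wrapper sketch) and left in place for a while before the next swap, which is exactly what stabilizes the transition function.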

Implementation and Usage Details

When choosing self-play hyperparameters, we primarily keep in mind the trade-off between the skill level and generality of the final policy and the stability of training. Training against a set of opponents that change slowly or not at all, and therefore produce less variance in outcomes, is a more stable process than training against many diverse opponents that change quickly. The available hyperparameters control how often the current agent policy is saved for later use as an opponent, how often a new sparring opponent is swapped in, the number of opponents kept, and the probability that the student plays against its own current self rather than an opponent sampled from the pool.
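Purely as an illustration (in ML-Agents these knobs live in the trainer configuration file; the names below are hypothetical stand-ins, not the toolkit's actual keys), the set of controls described above can be pictured like this:

```python
# Hypothetical self-play settings mirroring the controls described above;
# these names are illustrative, not the ML-Agents Toolkit's actual config keys.
self_play_settings = {
    "save_snapshot_every_n_steps": 50_000,   # how often the current policy is snapshotted
    "swap_opponent_every_n_steps": 50_000,   # how often a new sparring opponent is swapped in
    "snapshot_window": 10,                   # how many past snapshots are kept in the pool
    "play_against_current_self_prob": 0.5,   # chance of facing the current self vs. a sampled snapshot
}
```

These map directly onto the window and sampling probability of the pool sketch above, plus the two snapshot/swap frequencies that determine how quickly the opponent set drifts.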

In competitive games, the cumulative reward issued by the environment is perhaps not the most informative metric for tracking learning progress, because the cumulative reward depends entirely on the skill of the opponent. An agent at a given skill level will receive a larger or smaller reward depending on whether it faces a less or more skilled opponent, respectively. We provide an implementation of the ELO rating system, which computes the relative skill of two players from a given population in a zero-sum game. Over the course of a single training run, this value should steadily increase. You can track it, along with other training metrics such as cumulative reward, in TensorBoard.
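For intuition, the standard Elo update works roughly as follows (a generic sketch of the rating formula, not the toolkit's implementation; K is the usual update step size):

```python
# Generic Elo rating update (not ML-Agents' implementation).
# result is 1.0 for a win, 0.5 for a draw, 0.0 for a loss, from player A's side.

def elo_update(rating_a, rating_b, result, k=32.0):
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (result - expected_a)
    new_b = rating_b + k * ((1.0 - result) - (1.0 - expected_a))
    return new_a, new_b

# Example: the learner (1200) beats an equally rated snapshot -> its rating rises.
print(elo_update(1200.0, 1200.0, 1.0))  # (1216.0, 1184.0)
```

Because beating a stronger opponent moves the rating more than beating a weaker one, the metric keeps rising as the agent outgrows its own past snapshots, even though the raw reward per game does not.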

Self-play in Soccer

Until now, releases of the ML-Agents Toolkit have not included agent policies for the Soccer learning environment, since it could not be trained reliably. However, with self-play and some refactoring, we can train agents with non-trivial behavior. The most significant change is the removal of "player positions" from the agents' characteristics. Previously, the Soccer environment had explicit "goalie" and "striker" roles to make the gameplay look more sensible. In this video, a new environment is shown in which role behavior emerges spontaneously, with some agents acting as strikers and others as goalies. Now the agents learn to play these positions themselves! The reward function for all four agents is defined as +1.0 for scoring a goal and -1.0 for conceding one, with an additional penalty of -0.0003 per step, intended to encourage the agents to attack.
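The reward structure stated above is simple enough to spell out (a schematic restatement in Python of the numbers given in the text, not the environment's actual implementation):

```python
# Schematic restatement of the Soccer reward described above
# (not the environment's actual code).

EXISTENTIAL_PENALTY = -0.0003  # per-step nudge that discourages stalling

def step_reward(scored_goal: bool, conceded_goal: bool) -> float:
    reward = EXISTENTIAL_PENALTY
    if scored_goal:
        reward += 1.0
    if conceded_goal:
        reward -= 1.0
    return reward
```

Note that all four agents share this signal: there is no per-role shaping, which is what makes the emergent division into attackers and goalies interesting.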

We emphasize once again that the agents in the Soccer learning environment learn cooperative behavior on their own; no explicit multi-agent algorithm or role assignment is involved. This result demonstrates that agents can be trained to perform complex behaviors with relatively simple algorithms, provided the task is well formulated. The key condition is that agents can observe their teammates, that is, they receive information about a teammate's relative position. By aggressively pressing for the ball, an agent implicitly signals its teammate to fall back into defense; conversely, by dropping back into defense, it nudges its teammate to attack.

What’s next

If you try any of the new features from this release, tell us about them. See the ML-Agents GitHub issues page, where you can report any bugs you find, and the Unity ML-Agents Forums, where general issues and questions are discussed.
