Learn to run

About the author of the article

Alexandra Malysheva holds a bachelor's degree in Applied Mathematics and Computer Science from St. Petersburg Academic University and a master's degree in Programming and Data Analysis from HSE St. Petersburg. She is also a researcher in the Agent Systems and Reinforcement Learning laboratory at JetBrains Research and a teaching assistant for the reinforcement learning course in the HSE bachelor's program.

Project Motivation

Modern technology is often used to create more engaging games or beautiful virtual worlds. However, technical progress can also serve medicine. For example, developing prostheses is currently very expensive, because it requires trying out a large number of different designs. Each design also has to be tested by real people, because how a person interacts with a prosthesis is still not well understood.

In 2017, scientists at Stanford University built computer simulations of the human musculoskeletal system to predict how a person adapts to a particular prosthesis design and to test each design virtually in seconds. Once the Stanford team of physicists and biologists had developed a model of the human musculoskeletal system that is accurate in terms of physics and biomechanics, what remained was to write an algorithm that could control this system the way a human does.

NeurIPS 2018: Artificial Intelligence for Prosthetics challenge

One way to obtain such an algorithm is to use reinforcement learning methods, since they can solve complex problems by adapting to a specific environment.

To motivate people to create such an algorithm, the organizers of NeurIPS, one of the largest machine learning conferences, decided to hold a competition in which participants had to teach a three-dimensional skeleton to run at a strictly specified speed. For this, the Stanford University laboratory adapted its environment from the Learn to Run 2017 competition.

In a previous article, we described our participation in the Learn to Walk competition in 2016. In 2018, our laboratory decided to take part in the AI for Prosthetics competition to test existing ideas.

OpenSim Simulator

Let me describe the environment in which the competition took place in more detail. In 2016, Stanford's Neuromuscular Biomechanics Laboratory provided participants with an OpenSim simulator of the human musculoskeletal system, adapted for reinforcement learning. They also defined a metric for evaluating the quality of the model.

The task for the competitors is to develop a controller that simulates the movements of a person with a prosthetic leg: it controls the model by changing muscle tension and has to make the model walk or run at a given speed, which can change over time.

Interaction with the OpenSim environment is divided into episodes. Each episode consists of environment steps and is limited to 1000 of them, which corresponds to ten seconds of real time. The episode ends early if the model of the person has fallen, i.e. the pelvis has dropped below 60 centimeters above ground level. As the observed state, the environment returns the absolute coordinates, velocities, accelerations, rotation angles, angular velocities and angular accelerations of body parts, as well as the muscle tension and the speed at which the agent should move.
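The episode structure described above can be sketched as a simple loop. This is only an illustration: `ToyEnv` is a hypothetical stand-in that tracks nothing but pelvis height, not the OpenSim interface, and the names `MAX_STEPS`, `MIN_PELVIS_HEIGHT` and `run_episode` are assumptions made for the example.

```python
import random

MAX_STEPS = 1000          # episode length limit from the text
MIN_PELVIS_HEIGHT = 0.6   # metres; below this the model counts as fallen


class ToyEnv:
    """Hypothetical stand-in for the simulator, not the OpenSim API."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.pelvis_height = 0.9

    def reset(self):
        self.pelvis_height = 0.9
        return self.pelvis_height

    def step(self, action):
        # The action shifts the pelvis height; a small random drift is added.
        self.pelvis_height += action + self.rng.uniform(-0.02, 0.02)
        return self.pelvis_height


def run_episode(env, policy):
    """Run one episode and return (steps taken, whether the model fell)."""
    height = env.reset()
    for step in range(1, MAX_STEPS + 1):
        height = env.step(policy(height))
        if height < MIN_PELVIS_HEIGHT:
            return step, True   # early termination: the model fell
    return MAX_STEPS, False
```

A policy that keeps the pelvis near its starting height runs for the full 1000 steps, while one that steadily lowers it ends the episode after a few steps.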

The reward function is as follows:
$ R(s, a, s') = 10 - \|v_{target, s'} - v_{pelvis, s'}\|^2 - 0.001 \cdot \|a_{s'}\|^2 $,
where $ v_{target, s'} $ is the speed at which the agent should move in the new state, $ v_{pelvis, s'} $ is the speed at which the agent actually moves, and $ a_{s'} $ is the muscle tension in the new state. The target speed changes approximately every 300 environment steps.
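As a minimal sketch, the reward formula can be written directly in Python. The function and argument names are illustrative assumptions, and velocities and tensions are passed as plain lists of floats.

```python
def squared_norm(vec):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(x * x for x in vec)


def reward(v_target, v_pelvis, tensions):
    """10 minus the squared velocity error minus 0.001 times the squared tension norm."""
    diff = [t - p for t, p in zip(v_target, v_pelvis)]
    return 10.0 - squared_norm(diff) - 0.001 * squared_norm(tensions)
```

For example, matching the target velocity exactly with zero tension gives the maximum per-step reward of 10.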

Base implementation

As in the previous competition, which we described in the previous article, we used the DDPG algorithm, because an earlier article noted that it shows better results than alternatives such as TRPO and PPO.
First, we implemented basic modifications that helped us get started. For example, using relative coordinates, angles and speeds instead of absolute ones in the observed state greatly increases the learning speed.
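One way to picture the switch to relative coordinates is to express every body part position relative to a reference body part. The sketch below assumes the pelvis as the reference and a dictionary of positions; both the data layout and the name `to_relative` are assumptions for illustration, not the observation format of the environment.

```python
def to_relative(body_positions, reference="pelvis"):
    """Return the positions of all other body parts relative to the reference part."""
    ref = body_positions[reference]
    return {
        name: tuple(c - r for c, r in zip(pos, ref))
        for name, pos in body_positions.items()
        if name != reference
    }
```

With this representation the observation no longer depends on where in the world the model currently stands, so the same pose looks the same to the agent at any point along the track.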

The first modification we tested replaces the reward observed by the agent with a simplified one. We decided to drop the agent's tension term from the reward, since computing it took a long time, and to normalize the difference in speeds. The original reward
$ R(s, a, s') = 10 - \|v_{target, s'} - v_{pelvis, s'}\|^2 - 0.001 \cdot \|a_{s'}\|^2 $

is replaced with
$ r = 1 - \frac{\|v_{target, s'} - v_{pelvis, s'}\|^2}{\|v_{target, s'}\|^2} $
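The simplified reward can be sketched the same way. The function name is an illustrative assumption, and the explicit check for a zero target speed is an addition of this example, since the formula is undefined in that case.

```python
def shaped_reward(v_target, v_pelvis):
    """1 minus the squared velocity error normalized by the squared target speed."""
    target_sq = sum(t * t for t in v_target)
    if target_sq == 0.0:
        raise ValueError("the normalized reward is undefined for a zero target speed")
    diff_sq = sum((t - p) ** 2 for t, p in zip(v_target, v_pelvis))
    return 1.0 - diff_sq / target_sq
```

The value is 1 when the pelvis moves exactly at the target velocity and 0 when the model stands still, regardless of how large the target speed is.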

The figure shows that the new agent trained with the modified reward (Reward Shaping) performs statistically significantly better than the agent observing the original reward (Baseline). This means that we were able to save computing resources without losing performance in terms of the original reward.

Feature engineering and potential functions

The next modification we tested adds new features, such as the acceleration of body parts and their turning speed.

Almost the entire previous article is devoted to potential functions, and you can read about them there, so we will not dwell on them here.

I will only note that this time we could not find enough videos of people running with a single prosthesis, so we had to record them ourselves. In addition, this year the agent had to run at a given speed (unlike last year, when the goal was to reach the highest possible speed), so we used different videos for different target speeds.

The graph shows that the agent with additional features (Feature Engineering) significantly outperforms the base implementation of the agent.
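For readers unfamiliar with potential functions, here is a sketch of the standard potential-based reward shaping scheme, in which the term $ \gamma \Phi(s') - \Phi(s) $ is added to the reward. The specific potential below (negative squared distance to a reference pose, such as one taken from a video) and the discount value are assumptions for illustration; the post does not specify the exact potentials we used.

```python
GAMMA = 0.99  # discount factor; an assumed value


def potential(pose, reference_pose):
    """Hypothetical potential: higher when the pose is closer to the reference."""
    return -sum((p - r) ** 2 for p, r in zip(pose, reference_pose))


def shaped(reward, pose, next_pose, ref_pose, next_ref_pose, gamma=GAMMA):
    """Add the shaping term gamma * Phi(s') - Phi(s) to the environment reward."""
    return (reward
            + gamma * potential(next_pose, next_ref_pose)
            - potential(pose, ref_pose))
```

When the agent follows the reference poses exactly, the shaping term is zero and the reward is unchanged; moving away from the reference lowers it.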

Multiprocess Distributed Learning

OSim is an accurate physical simulator of the human musculoskeletal system. To achieve high simulation accuracy, the environment performs a significant amount of computation after each action of the controller, which makes it slow. Moreover, the simulator does not use parallel computing, so it cannot use all available computing resources.

One possible solution to this problem is to run several environments at once. Since each OSim environment works independently of the others and does not use parallel computing, this does not affect the performance of any single running environment. However, all running environments must interact with the agent being trained. To meet these requirements, we designed a framework for distributed multithreaded training. The framework is divided into two parts, a client and a server, which communicate over the Internet using HTTP.

The server executes client requests in the real environments: restarting an environment and performing an action chosen by the agent. After each request, the server returns the current state of the environment to the client.

The client, in turn, is also divided into several parts: training processes, data sampling processes and interaction with the server. For each real environment running on the server, the client creates a virtual environment that provides an interface to the real one. This allows the client to interact with remote environments in the same way as with local ones. Since the agent needs to act in the environment to gain experience from it, processes responsible for the agent's actions (Model Workers) were added to the scheme. They receive states from the currently running environments and return the action that the agent wants to perform in each received state.
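The idea of a virtual environment can be sketched as a thin proxy that turns `reset` and `step` calls into requests. Everything here is a hypothetical illustration: the JSON message format, the class names, and `FakeServer`, which replaces the HTTP server with an in-process function call so that the example runs on its own.

```python
import json


class FakeServer:
    """In-process stand-in for the server that owns the real environments."""

    def __init__(self, n_envs):
        self.states = [0.0] * n_envs   # toy one-number state per environment

    def handle(self, request_json):
        request = json.loads(request_json)
        env_id = request["env_id"]
        if request["command"] == "reset":
            self.states[env_id] = 0.0
        elif request["command"] == "step":
            self.states[env_id] += request["action"]
        else:
            raise ValueError("unknown command")
        # Every response carries the current state of the environment.
        return json.dumps({"state": self.states[env_id]})


class VirtualEnv:
    """Client-side proxy exposing the same interface as a local environment."""

    def __init__(self, env_id, send):
        self.env_id = env_id
        self._send = send   # in a real setup this would perform an HTTP request

    def _call(self, command, **payload):
        request = {"env_id": self.env_id, "command": command, **payload}
        return json.loads(self._send(json.dumps(request)))["state"]

    def reset(self):
        return self._call("reset")

    def step(self, action):
        return self._call("step", action=action)
```

Because the transport is passed in as `send`, the training code only ever sees `reset` and `step` and does not need to know whether the environment is local or remote.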

Sampling processes store the experience gained from OSim so that it can be reused during training. Each of them keeps a buffer of the transitions that the agent observed and randomly assembles batches of data from it. The collected batches are passed to the training processes, which update the neural network according to the selected algorithm.
This approach can significantly speed up the training of an agent in a slow environment by increasing the amount of experience gathered from it.
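A transition buffer with random batch sampling can be sketched in a few lines. This is a minimal single-process illustration with assumed names, not the code of the framework, and it leaves out the communication between sampling and training processes.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer of transitions with uniform random batch sampling."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)   # oldest transitions are dropped first
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Draw distinct transitions uniformly at random.
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)
```

Off-policy algorithms such as DDPG can learn from such stored transitions many times, which is why collecting more experience from several environments in parallel pays off.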


Below is a table with the results at the end of the episodes and the fall rate of the trained agents. DDPG corresponds to the basic implementation, Feature Engineering to the added features, Reward Shaping to the modified reward, and Ensembles to combining several versions of a single agent.

These techniques helped us take sixth place in the overall competition.
Separately, I want to note that, perhaps thanks to the potential functions, our agent learned to walk in a human-like way. For comparison, below are the results of the projects that took the first two places, and ours 🙂

Opponents' results:

Our result:

I hope this post motivates you to take part in reinforcement learning competitions! Good luck 🙂
