A guide to Reinforcement Learning for beginners. Implementing a simple task
Disclaimer: The author of the translation is new to ML/DS, so he is open to any edits/additions to the article
Content:
What is reinforcement learning?
Most of you have probably heard that artificial intelligence has learned to play computer games on its own. A very popular example is the company Deepmind, which made news and took the world by storm when its program AlphaGo defeated the world Go champion from South Korea in 2016.
So what is the secret of this monumental breakthrough? Hold your horses! You'll understand this in a couple of minutes.
Let's look at the analogy of teaching a dog new tricks. In this scenario, we simulate a situation and the dog tries to react to it in different ways. If the dog's reaction is desired, we reward it with food. Otherwise, we make it clear in one way or another that her reaction is not the desired one.
Now, every time the dog finds itself in the same situation, it performs the same action with even more enthusiasm, expecting to receive more food. Essentially, she learns what to do based on positive experiences. Likewise, she will learn what not to do when faced with negative experiences.
That's exactly how it works Reinforcement Learning in a broad sense
The dog is an agent (agent), which is exposed to our environment (environment). The environment can be considered your home and you.
The situations that arise are similar to the state (state). An example of a condition might be your dog standing and you using a certain word in a certain tone in your living room.
Our agent reacts by performing the action of transitioning from one state to another. For example, your dog goes from standing to running to catch a stick.
After the transition, she can receive a reward (reward) or punishment (penalty). You give a treat as a reward or simply say “No” as a punishment.
Policy is a strategy for choosing an action under a certain condition in anticipation of the best result.
And if you put it all together…
Task Reinforcement Learning (RL) is learning agentwhich interacts with surroundings. The agent moves between different environmental scenarios called statesperforming actions. Actions, in turn, bring rewardwhich can be positive, negative or zero.
The agent's sole goal is to maximize the concept of cumulative reward per episode, that is, everything that happens between the initial and final state, where we determine the reward corresponding to the tasks we want to solve.
Thus, we induce the agent to perform certain actions by providing him with a positive reward, and to refuse other actions by providing him with a negative reward. This is how the agent learns to develop a strategy or policy.
R.L. – one of three main machine learning paradigmsalong with supervised and unsupervised learning.
Two important points to note…
Greed doesn't always help
There are things that are easy to do for instant gratification, and there are those that provide long-term rewards. The goal is not to be greedy, looking only for quick immediate rewards, but to optimize them for maximum rewards throughout the training.Consistency matters in reinforcement learning
The agent's reward depends not only on the current state, but also on the entire history of states. Unlike supervised and unsupervised learning, time is of the essence here.
In a sense, RL is the science of making optimal decisions based on experience. If you break it down, the process includes the following simple steps:
Environmental monitoring
Deciding how to act using a particular strategy
Act accordingly
Receiving reward or punishment
Learn from experience and improve strategy
Iterate until the optimal strategy is found
returning to AlphaGo…
AlphaGo – this is a classic example RL agent, who learned to play the game of Go to maximize his reward. In this lesson we will look at RL using the example of developing an agent that will independently learn andplay the game automatically.
RL is not limited to just games! It is used for managing stock portfolios and finances, for creating humanoid robots, for manufacturing and inventory management, for developing general purpose AI agents, etc…
Development of a self-driving car from scratch (Smart Cab)
Let's develop a simulation of a self-driving taxi (Smart Cab). The main goal is to demonstrate in a simplified environment how RL techniques can be used to develop an efficient and safe approach to solve this problem.
The task of Smart Cab is to pick up a passenger in one place and drop him off at another. What we want our Smart Cab to take care of:
Drop off the passenger at the desired location.
Save passenger time by allowing minimal time for boarding.
Take care of passenger safety and compliance with traffic rules.
When modeling an RL solution to this problem, various aspects need to be taken into account: rewards, states and actions.
1. Rewards
Because the agent (imaginary driver) is reward motivated and is going to learn how to operate the cab through trial and error, we need to determine reward and/or punishment and their size accordingly. Here are a few points to consider:
The agent must receive large positive reward for a successful landing, since such behavior is highly desirable.
The agent must be punishedif he tries to drop off a passenger in the wrong place.
The agent should receive a slightly negative reward for not reaching the destination after each time step.
“Slightly” is negative because we would rather our agent arrive late rather than make false moves to try to get to their destination as quickly as possible.
2. State Space
In RL, an agent encounters a state and then takes action according to the state it is in.
The state space is the set of all possible situations in which our taxi may find itself. The state must contain useful information necessary for the agent to perform the correct action.
Let's say we have a training ground for a Smart Cab where we teach it to transport people in a parking lot to four different locations (R, G, Y, B):
Let's assume that Smart Cab is the only car in this parking lot. We can divide the parking lot into a 5×5 grid, giving us 25 possible taxi locations. These 25 places are one part of our state space. The current location state of our taxi is (3, 1).
There are 4 places where we can pick up and drop off passengers: R, G, Y, B or [(0,0), (0,4), (4,0), (4,3)]
in coordinates (row, col). Our passenger is in place Y and wants to go to place R.
If we have 1 additional state of a passenger being in a taxi, we can take all combinations of passenger locations and destination locations to get the total number of states for our taxi environment; there are 4 destinations and five (4+1) passenger locations.
Thus our taxi environment has 5×5×5×4=500 possible states. (Taxi location – 5×5, passenger location – 5, destination location – 4)
3. Action Space
The action space is the set of all actions that our agent can take in a given state. The agent encounters one of 500 states and performs an action. In our case, this could be movement in a certain direction or a decision to pick up/drop off passenger.
In other words, we have six possible actions:
south
north
east
west
pickup
(passenger selection)dropoff
(passenger disembarkation)
In the illustration above, the taxi is unable to perform certain actions in certain states due to walls. In the environment code, we will provide a -1 penalty for each collision with a wall, which will increase the penalties and make the taxi think about going around the wall.
Fortunately, in OpenAI Gym there is precisely such an environment that has already been created for us.
OpenAI Gym provides various game environments that we can plug into our code and test the agent. The library will take care of the API to provide all the information the agent needs, such as possible actions, score, and current state. We only need to focus on the algorithmic part for our agent.
We will be using the Gym environment called Taxi-V2
from which all the details described above are taken.
Gym interface
First we need to install gym
. Running the following code should work:
!pip install cmake 'gym[atari]' scipy
After installation, we can load the game environment and visualize its appearance:
import gym
env = gym.make("Taxi-v2").env
env.render()
The main interface of gym is envwhich represents unified environment interface.
Below are the methods env
which we will encounter in the code.
env.reset
: Resets the environment and returns a random initial state.
env.step(action)
: Moves the environment one time step. Returns:
observation: Environmental Observations
reward: Whether your action was beneficial or not
done: Shows whether we successfully picked up and dropped off a passenger, also called one episode
info: Additional information such as performance and latency for debugging purposes
env.render
: Renders one frame of the environment (useful for rendering the environment)
Here is our restructured problem statement (from Gym documentation):
There are 4 locations (marked with different letters) and our job is to pick up the passenger at one location and drop him off at another. For a successful landing we receive +20 points and lose 1 point for each time step. For illegal actions in picking up and disembarking passengers, 10 penalty points are awarded.
Let's take a closer look at the environment…
env.reset() # сбросить окружение на новое, случайное состояние
env.render()
print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))
Action Space Discrete(6)
State Space Discrete(500)
Filled square represents a taxi that has yellow color without passenger and green with a passenger.
A vertical bar ('|') indicates a wall that a taxi cannot cross.
R, G, Y, B – these are possible places pick-up/drop-off passengers. Blue letter denotes the current location for picking up passengers, and purple – destination.
We have an Action Space of size 6 and a State Space of size 500. RL learns to select an action number from possible actions 0-5, where:
0-south
1-north
2-east
3-west
4-pickup
5-dropoff
RL studies mappings states on optimal actionwhich must be performed in this state, by research, i.e. the agent explores the environment and takes actions based on the rewards defined in that environment.
The optimal action for each state is the action that has highest cumulative long-term reward.
Let's return to our illustration…
Let's take our illustration, encode its state and pass it to the environment to render in gym
. Let us remember that our taxi is located inrow 3, column 1, our passenger is at location 2, and our destination is at location 0. Using the state encoding method Taxi-V2
we can do the following:
state = env.encode(3, 1, 2, 0)
#(строка такси, столбец такси, индекс пассажира, индекс пункта назначения)
print("State:", state)
env.s = state
env.render()
State: 328
We use the coordinates of our illustration to generate a number corresponding to the state between 0 and 499which for the state of our illustration turns out to be 328.
We can then set the environment state manually with env.env.s
, using this encoded number. You can play with the numbers and see how the taxi, the passenger and the destination move.
Reward table
When creating an environment Taxi
an initial reward table is created, called P
. We can represent it as a matrix in which the number of states are rows and the number of actions are columnsthat is state matrix × action matrix.
Since each state is in this matrix, we can see the default reward values assigned to the state of our illustration:
env.P[328]
Element output:
{0: [(1.0, 428, -1, False)],
1: [(1.0, 228, -1, False)],
2: [(1.0, 348, -1, False)],
3: [(1.0, 328, -1, False)],
4: [(1.0, 328, -10, False)],
5: [(1.0, 328, -10, False)]}
This dictionary has the structure {action: [(probability, nextstate, reward, done)]}
.
A few points to note:
0-5 corresponds to actions (south, north, east, west, pickup, dropoff) that the taxi can perform in its current state.
In this environment
probability
always equals 1.0.nextstate
is the state in which we will find ourselves if we perform an action on a given dictionary index.All movement actions have a reward -1and actions on pickup/dropoff have an award -10 in this particular state. If we are in a state where the taxi has a passenger and it is on top of the desired destination, we will receive a reward 20 for the landing action (5).
done
used to let us know that we have successfully dropped off the passenger at the correct location. Every successful disembarkation is the end of the episode.
Note that if our agent had chosen action two (2) in this state, he would have gone east, crashing into the wall. The source code made it impossible for the taxi to actually move through the wall, so if the taxi chooses this action, it will simply continue to apply -1 penalties, which affects for a long term reward.
Implementing Q-Learning in Python
We will use a simple RL algorithm called Q-learningwhich will give our agent some memory.
The original author has separate article on Q-learning
Agent training
First we initialize the Q-table with a 500x6500x6 matrix of zeros:
import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])
We can now create a learning algorithm that will update this Q-table as the agent explores the environment over time. thousand episodes.
%%time
"""Training the agent"""
import random
from IPython.display import clear_output
# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1
# For plotting metrics
all_epochs = []
all_penalties = []
for i in range(1, 100001):
state = env.reset()
epochs, penalties, reward, = 0, 0, 0
done = False
while not done:
if random.uniform(0, 1) < epsilon:
action = env.action_space.sample() # Explore action space
else:
action = np.argmax(q_table[state]) # Exploit learned values
next_state, reward, done, info = env.step(action)
old_value = q_table[state, action]
next_max = np.max(q_table[next_state])
new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
q_table[state, action] = new_value
if reward == -10:
penalties += 1
state = next_state
epochs += 1
if i % 100 == 0:
clear_output(wait=True)
print(f"Episode: {i}")
print("Training finished.\n")
In the first part while not done
we're deciding, whether to choose a random action or use already calculated Q-values. The value used for this is epsilon
and compares with the function random.uniform(0, 1)
which returns an arbitrary number between 0 and 1.
We perform the selected action in the environment to get next_state
And reward
from performing an action. After this, we calculate the maximum Q-value for the actions corresponding to next_state
and with its help we can easily update our Q-value to new_q_value
Output of the result:
Episode: 100000
Training finished.
Wall time: 30.6 s
Now that the Q-table is created based on 100,000 episodes, let's see what the Q-value is in the state of our illustration:
q_table[328]
Conclusion:
array([ -2.30108105, -1.97092096, -2.30357004, -2.20591839,
-10.3607344 , -8.5583017 ])
The maximum Q-value is “north” (-1.971), so it looks like Q-learning has effectively learned the best action to take in our illustration condition!
Agent rating
It's time to evaluate the work of our agent…
"""Evaluate agent's performance after Q-learning"""
total_epochs, total_penalties = 0, 0
episodes = 100
for _ in range(episodes):
state = env.reset()
epochs, penalties, reward = 0, 0, 0
done = False
while not done:
action = np.argmax(q_table[state])
state, reward, done, info = env.step(action)
if reward == -10:
penalties += 1
epochs += 1
total_penalties += penalties
total_epochs += epochs
print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")
Output of the result:
Results after 100 episodes:
Average timesteps per episode: 12.3
Average penalties per episode: 0.0
We can see that the agent's performance has improved significantly and he has not incurred any penalties, which means that he performed the correct steps to board/disembark 100 different passengers.
Comparison of our Q-Learning agent with an agent without RL
Let's see how much better our Q-learning solution is compared to the agent, just making random moves.
The original author has an article and about implementation without RL
In Q-learning, the agent initially makes mistakes during exploration, but after it has learned enough about a question, it can act intelligently, maximizing rewards by making smart moves.
We evaluate our agents based on the following indicators:
Average number of fines per episode (the fewer the better)
Average number of time steps per turn (the fewer the better)
Average number of rewards per turn (the higher the better)
Looks like our Q-Learning agent got the job done!
Tuning Hyperparameters
Values alpha, gamma, And epsilon were based mostly on intuition and trial runs, but there are better ways to get good values.
Ideally, all three values should decrease over time because the agent creates more stable judgments as it learns. Therefore, it is extremely important to play and experiment with hyperparameters.
Conclusion
Great! We started with understanding Reinforcement Learning using real analogies. We then dived into the basics of reinforcement learning and formulated the self-driving taxi problem as a reinforcement learning problem using OpenAI's Gym on python
to provide us with an appropriate environment in which we can develop our agent and evaluate it.
We then noticed how terrible our agent was without using any algorithm to play the game, so we proceeded to implement the algorithm Q-Learning from scratch. Agent performance has improved significantly since implementation Q-Learning.
Q-Learning is one of the simplest Reinforcement Learning algorithms. However, the problem with Q-Learning is that when the number of states in the environment becomes very large, it becomes difficult to implement them using a Q-table as its size becomes very, very large.
Modern methods use deep neural networks instead of Q-table (Deep Reinforcement Learning). The neural network receives state and action information to the input layer and learns to output the correct actions over time.
If you want to continue working on this project and make it better, here are a few things you can add –
Turn this code into a module of functions that can be used by multiple environments
Tune alpha, gamma and/or epsilon using episode decay
Implement grid search to identify the best hyperparameters
It turns out that Reinforcement Learning is a type of machine learning that is data hungry even more than Supervised Learning. Getting enough data for Reinforcement Learning algorithms is very difficult.
—Andrew Ng
Since you have read to the end, I will be glad to see you in my tg channel. Actively extracted from essay by Paul Graham (located by #pg) 1-2 daily. I also write about my EdTech startup and various technical materials that I find in the process of work. I don't force anyone 🙂