A guide to Reinforcement Learning for beginners: implementing a simple task

Disclaimer: the author of this translation is new to ML/DS, so he welcomes any corrections or additions to the article.

Content:

  1. What is reinforcement learning?

  2. Development of a self-driving car from scratch (Smart Cab)

    1. Rewards

    2. State Space

    3. Action Space

  3. Gym interface

  4. Reward table

  5. Implementing Q-Learning in Python

    1. Agent training

    2. Agent evaluation

  6. Comparison of a Q-Learning agent with an agent without RL

  7. Conclusion


What is reinforcement learning?

Most of you have probably heard that artificial intelligence has learned to play computer games on its own. A very popular example is DeepMind, which made headlines and took the world by storm in 2016 when its AlphaGo program defeated the world Go champion from South Korea.

So what is the secret of this monumental breakthrough? Hold your horses! You'll understand this in a couple of minutes.

Reinforcement learning analogy

Let's look at the analogy of teaching a dog new tricks. In this scenario, we simulate a situation, and the dog tries to react to it in different ways. If the dog's reaction is the desired one, we reward it with food. Otherwise, we make it clear in one way or another that its reaction is not the desired one.

Now, every time the dog finds itself in the same situation, it performs the same action with even more enthusiasm, expecting to receive more food. Essentially, it learns what to do from positive experiences. Likewise, it learns what not to do when faced with negative experiences.

That, in a broad sense, is exactly how Reinforcement Learning works.

And if you put it all together…

The task of Reinforcement Learning (RL) is to train an agent that interacts with an environment. The agent moves between different scenarios of the environment, called states, by performing actions. Actions, in turn, bring a reward, which can be positive, negative, or zero.

The agent's sole goal is to maximize the cumulative reward per episode, i.e. over everything that happens between the initial and the final state. We define the rewards so that they correspond to the task we want the agent to solve.

Thus, we induce the agent to perform certain actions by giving it a positive reward, and to avoid other actions by giving it a negative reward. This is how the agent learns to develop a strategy, or policy.

Illustration of the RL process

RL is one of the three main machine learning paradigms, along with supervised and unsupervised learning.

Two important points to note…

  1. Greed doesn't always help
    Some actions bring instant gratification, while others lead to long-term rewards. The goal is not to be greedy and chase only quick immediate rewards, but to optimize for the maximum total reward over the whole run.

  2. Sequence matters in reinforcement learning
    The agent's reward depends not only on the current state, but also on the entire history of states. Unlike supervised and unsupervised learning, the order in which things happen matters here.

In a sense, RL is the science of making optimal decisions based on experience. If you break it down, the process consists of the following simple steps (a minimal code sketch of this loop follows the list):

  1. Observing the environment

  2. Deciding how to act using some strategy

  3. Acting accordingly

  4. Receiving a reward or penalty

  5. Learning from experience and refining the strategy

  6. Iterating until an optimal strategy is found
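As a rough sketch, these steps map onto a Gym-style loop like the one below. This is our own minimal illustration with a random placeholder strategy, not yet the agent we build later:

import gym

env = gym.make("Taxi-v2")                  # the environment we will use later in this guide
state = env.reset()                        # 1. observe the environment
done = False

while not done:
    action = env.action_space.sample()               # 2-3. decide on an action and act (random placeholder)
    state, reward, done, info = env.step(action)     # 4. receive a reward or penalty
    # 5. a learning algorithm would update its strategy here using (state, action, reward)
    # 6. repeat over many episodes until the strategy stops improving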

Returning to AlphaGo…

AlphaGo is a classic example of an RL agent that learned to play the game of Go in order to maximize its reward. In this lesson we will look at RL through the example of developing an agent that independently learns to play a game automatically.

RL is not limited to just games! It is used for managing stock portfolios and finances, for creating humanoid robots, for manufacturing and inventory management, for developing general purpose AI agents, etc…

Development of a self-driving car from scratch (Smart Cab)

Let's develop a simulation of a self-driving taxi (Smart Cab). The main goal is to demonstrate, in a simplified environment, how RL techniques can be used to develop an efficient and safe approach to solving this problem.

The task of the Smart Cab is to pick up a passenger at one location and drop them off at another. Judging by the rewards defined later in the environment, we want our Smart Cab to:

  • drop the passenger off at the correct destination,

  • get the passenger there in as few time steps as possible,

  • avoid illegal pickup and drop-off actions along the way.

When modeling an RL solution to this problem, various aspects need to be taken into account: rewards, states and actions.

1. Rewards

Because the agent (our imaginary driver) is motivated by reward and is going to learn how to drive the cab through trial and error, we need to define the rewards and/or penalties and their magnitudes accordingly. Here are a few points to consider:

  • The agent should receive a high positive reward for a successful drop-off at the right location.

  • The agent should be penalized for trying to pick up or drop off a passenger in the wrong place.

  • The agent should receive a slight negative reward for every time step it takes, so that it delivers the passenger as quickly as possible.

2. State Space

In RL, an agent encounters a state and then takes action according to the state it is in.

The state space is the set of all possible situations in which our taxi may find itself. The state must contain useful information necessary for the agent to perform the correct action.

Let's say we have a training ground for a Smart Cab where we teach it to transport people in a parking lot to four different locations (R, G, Y, B):

Smart Cab presentation in a parking lot

Let's assume that Smart Cab is the only car in this parking lot. We can divide the parking lot into a 5×5 grid, giving us 25 possible taxi locations. These 25 places are one part of our state space. The current location state of our taxi is (3, 1).

There are 4 places where we can pick up and drop off passengers: R, G, Y, B or [(0,0), (0,4), (4,0), (4,3)] in coordinates (row, col). Our passenger is in place Y and wants to go to place R.

If we also count the one additional passenger state of being inside the taxi, we can take all combinations of passenger locations and destination locations to get the total number of states for our taxi environment: there are 4 destinations and five (4 + 1) passenger locations.

Thus our taxi environment has 5×5×5×4=500 possible states. (Taxi location – 5×5, passenger location – 5, destination location – 4)
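A quick sanity check of that arithmetic:

# Sanity check of the state-space size described above
taxi_positions = 5 * 5       # 5x5 grid of taxi locations
passenger_locations = 4 + 1  # R, G, Y, B, or inside the taxi
destinations = 4             # R, G, Y, B
print(taxi_positions * passenger_locations * destinations)  # 500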

3. Action Space

The action space is the set of all actions that our agent can take in a given state. The agent encounters one of the 500 states and performs an action. In our case, this can be movement in a certain direction or a decision to pick up or drop off a passenger.

In other words, we have six possible actions:

  1. south

  2. north

  3. east

  4. west

  5. pickup (pick up the passenger)

  6. dropoff (drop off the passenger)

In the illustration above, the taxi cannot perform certain actions in certain states because of walls. In the environment code, every such move still costs a -1 penalty (and the taxi doesn't move), so collisions with walls simply accumulate penalties and push the taxi to learn to go around them.

Fortunately, OpenAI Gym already provides exactly such an environment for us.

OpenAI Gym provides various game environments that we can plug into our code and test the agent. The library will take care of the API to provide all the information the agent needs, such as possible actions, score, and current state. We only need to focus on the algorithmic part for our agent.

We will be using the Gym environment called Taxi-v2, from which all the details described above are taken.

Gym interface

First we need to install gym. Running the following code should work:

!pip install cmake 'gym[atari]' scipy

After installation, we can load the game environment and visualize its appearance:

import gym  
env = gym.make("Taxi-v2").env  
env.render()

The core interface of Gym is env, which represents a unified environment interface.

Below are the env methods we will encounter in the code.

env.reset: Resets the environment and returns a random initial state.

env.step(action): Advances the environment by one time step. Returns:

  • observation: the agent's observation of the environment (the new state)

  • reward: whether the action was beneficial

  • done: whether it is time to reset the environment, i.e. the passenger has been successfully dropped off and the episode is over

  • info: auxiliary information, useful for debugging
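For example, here is a single random step (a minimal illustration to run after the env = gym.make(...) call above):

state = env.reset()                            # start a new episode
action = env.action_space.sample()             # pick a random action
state, reward, done, info = env.step(action)   # unpack the values env.step returns
print(state, reward, done, info)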

Here is our restructured problem statement (from Gym documentation):

There are 4 locations (marked with different letters), and our job is to pick up the passenger at one location and drop them off at another. We receive +20 points for a successful drop-off and lose 1 point for every time step. Illegal pickup and drop-off actions cost 10 penalty points.

Let's take a closer look at the environment…

env.reset() # reset the environment to a new, random state
env.render()  
print("Action Space {}".format(env.action_space)) 
print("State Space {}".format(env.observation_space))
Action Space Discrete(6)
State Space Discrete(500)
  • The filled square represents the taxi: it is yellow without a passenger and green with a passenger.

  • A vertical bar ('|') indicates a wall that a taxi cannot cross.

  • R, G, Y, B are the possible pickup/drop-off locations. The blue letter marks the passenger's current pickup location, and the purple letter marks the destination.

We have an Action Space of size 6 and a State Space of size 500. Our RL agent learns to select an action number from the possible actions 0-5, where:

  • 0-south

  • 1-north

  • 2-east

  • 3-west

  • 4-pickup

  • 5-dropoff
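For readability in your own experiments, you could keep this mapping in a small dictionary (a hypothetical helper of ours, not part of Gym):

# Hypothetical helper: map Gym's action indices to readable names
ACTION_NAMES = {0: "south", 1: "north", 2: "east", 3: "west", 4: "pickup", 5: "dropoff"}
print(ACTION_NAMES[1])  # north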

RL learns a mapping from states to the optimal action to perform in each state through exploration, i.e. the agent explores the environment and takes actions guided by the rewards defined in that environment.

The optimal action for each state is the action with the highest cumulative long-term reward.

Let's return to our illustration…

Let's take our illustration, encode its state, and pass it to the environment for rendering in Gym. Recall that our taxi is at row 3, column 1, our passenger is at location index 2, and our destination is location index 0. Using the state encoding method of Taxi-v2, we can do the following:

state = env.encode(3, 1, 2, 0) 
# (taxi row, taxi column, passenger index, destination index)
print("State:", state)
env.s = state
env.render()
State: 328

We use the coordinates of our illustration to generate a state number between 0 and 499, which for our illustration turns out to be 328.

We can then set the environment state manually with env.env.s, using this encoded number. You can play with the numbers and see how the taxi, the passenger and the destination move.
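The exact packing of the four components into one number is an implementation detail of the environment, but a simple row-major packing reproduces the value above, so it works as a mental model:

# Assumed row-major packing of (taxi_row, taxi_col, passenger_index, destination_index)
taxi_row, taxi_col, pass_idx, dest_idx = 3, 1, 2, 0
state = ((taxi_row * 5 + taxi_col) * 5 + pass_idx) * 4 + dest_idx
print(state)  # 328, the same number env.encode(3, 1, 2, 0) gave us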

Reward table

When the Taxi environment is created, an initial reward table called P is also created. We can think of it as a matrix with one row per state and one column per action, i.e. a states × actions matrix.

Since every state is represented in this matrix, we can look at the default reward values assigned to the state from our illustration:

env.P[328]

Element output:

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

This dictionary has the structure {action: [(probability, nextstate, reward, done)]}.

A few points to note:

  • The keys 0-5 correspond to the actions (south, north, east, west, pickup, dropoff) the taxi can perform in the current state.

  • In this environment, probability is always 1.0.

  • nextstate is the state we would end up in if we took the given action.

  • All movement actions have a -1 reward in this state, while the pickup/dropoff actions have a -10 reward, because they are illegal here.

  • done indicates when we have successfully dropped off a passenger at the right location, i.e. the end of an episode.

Note that if our agent had chosen action three (3) in this state, it would try to go west straight into a wall, which is why the next state for action 3 is still 328. The source code makes it impossible for the taxi to actually move through the wall, so if the taxi keeps choosing this action, it will simply keep accumulating -1 penalties, which hurts the long-term reward.
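To pull a single transition out of the table, index P by state and action and unpack the tuple (using the values shown above):

# Unpack the transition for action 1 (north) from state 328
prob, next_state, reward, done = env.P[328][1][0]
print(prob, next_state, reward, done)  # 1.0 228 -1 False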

Implementing Q-Learning in Python

We will use a simple RL algorithm called Q-learning, which will give our agent some memory.

The original author has a separate article on Q-learning.
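Before the full training loop, here is the update rule that Q-learning applies at every step, written as a small standalone function. This is a sketch of the same formula the training loop below uses; the name q_update is ours:

import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """One Q-learning update: blend the old estimate with the reward
    plus the discounted best Q-value of the next state."""
    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])
    q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)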

Agent training

First we initialize the Q-table as a 500×6 matrix of zeros:

import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])

We can now create the training algorithm that will update this Q-table as the agent explores the environment over thousands of episodes.

%%time
"""Training the agent"""
import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

In the first part of while not done, we decide whether to choose a random action or to use the already computed Q-values. This is controlled by epsilon, which is compared against random.uniform(0, 1), a function that returns an arbitrary number between 0 and 1.

We execute the chosen action in the environment to obtain next_state and the reward for performing it. After that, we compute the maximum Q-value over the actions available in next_state and use it to update our Q-value to new_value.

Output of the result:

Episode: 100000
Training finished.
Wall time: 30.6 s

Now that the Q-table has been built over 100,000 episodes, let's see what the Q-values are for the state in our illustration:

q_table[328]

Output:

array([ -2.30108105,  -1.97092096,  -2.30357004,  -2.20591839,
       -10.3607344 ,  -8.5583017 ])

The maximum Q-value is for “north” (-1.971), so it looks like Q-learning has indeed learned the best action to take in the state from our illustration!
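You can confirm this programmatically: the index of the largest Q-value is 1, which corresponds to “north” in the action list above:

best_action = np.argmax(q_table[328])
print(best_action)  # 1, i.e. "north"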

Agent evaluation

It's time to evaluate the work of our agent…

"""Evaluate agent's performance after Q-learning"""
total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

Output of the result:

Results after 100 episodes:
Average timesteps per episode: 12.3
Average penalties per episode: 0.0

We can see that the agent's performance has improved significantly and it incurred no penalties, which means it performed the correct pickup and drop-off actions for 100 different passengers.

Comparison of our Q-Learning agent with an agent without RL

Let's see how much better our Q-learning solution is compared to an agent that just makes random moves.

The original author also has an article about the implementation without RL.
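For reference, such a baseline can be sketched by reusing our evaluation loop and sampling random actions instead of reading the Q-table (our own minimal version, not the original author's code; purely random episodes can take a long time to finish):

"""Evaluate a baseline agent that takes random actions (no RL)"""
total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    env.reset()
    epochs, penalties = 0, 0
    done = False

    while not done:
        action = env.action_space.sample()  # random move instead of np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Random agent, results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")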

In Q-learning, the agent makes mistakes at first while exploring, but once it has explored enough, it can act intelligently, maximizing the reward by making smart moves.

We evaluate our agents on the following metrics:

  • Average number of penalties per episode (the fewer the better)

  • Average number of time steps per episode (the fewer the better)

  • Average reward per move (the higher the better)

Comparison of both of our agents (with and without RL)

Looks like our Q-Learning agent got the job done!

Tuning Hyperparameters

The values of alpha, gamma, and epsilon were chosen mostly by intuition and trial runs, but there are better ways to arrive at good values.

Ideally, all three values should decrease over time, because as the agent learns, its estimates become more stable. It is therefore worth playing with and experimenting with the hyperparameters; a simple decay schedule is sketched below.
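One common approach is to multiply epsilon (and, if you like, alpha) by a factor slightly below 1 after every episode, down to some floor. The numbers below are illustrative assumptions, not tuned values:

# Hypothetical decay schedule for the hyperparameters (illustrative values)
epsilon, alpha = 1.0, 0.5            # start with lots of exploration and large updates
epsilon_min, alpha_min = 0.05, 0.05  # never stop exploring/learning entirely
decay = 0.9995

for episode in range(1, 100001):
    # ... run one training episode using the current epsilon and alpha ...
    epsilon = max(epsilon_min, epsilon * decay)  # explore less as the agent learns
    alpha = max(alpha_min, alpha * decay)        # make smaller updates as estimates stabilize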

Conclusion

Great! We started by building an understanding of Reinforcement Learning through a real-world analogy. We then dived into the basics of reinforcement learning and framed the self-driving taxi problem as a reinforcement learning problem, using OpenAI's Gym in Python to provide an environment in which we can develop and evaluate our agent.

We then saw how poorly the agent performs without any algorithm, so we implemented Q-Learning from scratch. The agent's performance improved significantly once Q-Learning was in place.

Q-Learning is one of the simplest Reinforcement Learning algorithms. However, its problem is that once the number of states in the environment becomes very large, it becomes hard to handle them with a Q-table, because the table itself grows enormously.

Modern methods use deep neural networks instead of a Q-table (Deep Reinforcement Learning). The neural network receives the state information at its input layer and learns over time to output the right actions.

If you want to continue working on this project and make it better, here are a few things you can add –

It turns out that Reinforcement Learning is a kind of machine learning whose appetite for data is even greater than that of Supervised Learning. It is really difficult to get enough data for Reinforcement Learning algorithms.

—Andrew Ng


Since you have read to the end, I will be glad to see you in my Telegram channel, where I post 1-2 excerpts from Paul Graham's essays daily (tagged #pg). I also write about my EdTech startup and about various technical materials I come across in my work. No pressure 🙂
