A brief overview of the Q-learning algorithm and how to implement it in Python, aimed at beginners

Principle of operation

The agent selects actions based on the data in the Q-table, in which all values are initially zero, so at each step… Stop… what exactly is a Q-table?

A Q-table is a matrix whose rows correspond to the possible states of the environment and whose columns correspond to the possible actions the agent can take in those states. The values in the Q-table reflect the expected total reward for performing an action in a given state and following the optimal policy thereafter.

At the start of training, the Q-table is initialized with zeros or random values. This represents an initial guess about how good different actions are in different states.
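For instance, for an environment with four states and two actions, a zero-initialized table is just a small NumPy array (a minimal sketch; the shape here is purely illustrative):

import numpy as np

num_states, num_actions = 4, 2
Q = np.zeros((num_states, num_actions))  # no knowledge yet: every (state, action) estimate is 0
# alternatively, small random values can serve as the initial guess:
# Q = np.random.uniform(-0.01, 0.01, size=(num_states, num_actions))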

After each agent step, the Q-values in the table are updated using the Bellman equation (more on that later).

The agent selects actions based on the values in the Q-table, often using an ε-greedy strategy, which balances exploring new actions against exploiting the already known action with the maximum Q-value.
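In code, an ε-greedy choice typically looks like the sketch below (a hypothetical helper; Q, state, num_actions and epsilon are assumed to be defined elsewhere):

import numpy as np

def epsilon_greedy(Q, state, num_actions, epsilon):
    # with probability epsilon explore: take a random action
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    # otherwise exploit: take the action with the highest Q-value for this state
    return np.argmax(Q[state])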

Let's say there is a simple environment with four possible states S1, S2, S3, S4 and two possible actions A1, A2 in each state. A Q-table for this environment would look something like this:

State/Action    A1      A2
S1               0.5     0.2
S2               0.1    -0.4
S3              -0.2     0.0
S4               0.6     0.3

The values in the table indicate the expected total reward for choosing action A1 or A2 in the corresponding state. For example, the value 0.5 in cell (S1, A1) means that the expected total reward for choosing action A1 in state S1 is 0.5. Over time, as the agent learns, these values are updated to reflect the experience gained and refine the estimates of expected rewards.
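As a quick illustration, the same table can be stored and queried with NumPy (a sketch using the numbers above; rows are S1–S4, columns are A1 and A2):

import numpy as np

Q = np.array([[ 0.5,  0.2],   # S1
              [ 0.1, -0.4],   # S2
              [-0.2,  0.0],   # S3
              [ 0.6,  0.3]])  # S4

state = 0                           # S1
best_action = np.argmax(Q[state])   # 0, i.e. A1, because 0.5 > 0.2
print(best_action, Q[state, best_action])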

So…

At each step, the agent updates the Q-table using the Bellman equation, taking into account the rewards received.

The Bellman equation is used to compute Q-values, which are estimates of the quality, or utility, of performing a particular action from a particular state. These Q-values are then used to determine the agent's optimal strategy for maximizing the total expected reward.

The Bellman equation for Q-values can be expressed as:

Q(s, a) = r + \gamma \max_{a'}Q(s', a')

Where:

  • Q(s, a) — the Q-value of taking action a in state s,

  • r — the immediate reward received after performing action a in state s,

  • γ — the discount factor, which represents the importance of future rewards (a value between 0 and 1),

  • \max_{a'} Q(s', a') — the maximum Q-value over all possible actions a' in the next state s'.

The Bellman equation essentially allows an agent to learn to estimate the long-term consequences of its actions, taking into account the current reward and expected future rewards.

In this way, the relationship between the Q-values of the current and future states can be expressed recursively.
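Note that the code examples below do not overwrite Q(s, a) with this target directly; they mix it with the old estimate through a learning rate α. That is the standard tabular Q-learning update rule, and it is exactly the line used in all three examples:

Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha \left(r + \gamma \max_{a'} Q(s', a')\right)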

Implementation of the algorithm in Python

The first example uses pure NumPy, for a 3×3 grid problem:

import numpy as np

# initialize the Q-table
Q = np.zeros([3, 3])

# algorithm parameters
alpha = 0.1  # learning rate
gamma = 0.9  # discount factor
epsilon = 0.1  # exploration vs. exploitation parameter

# main loop
for episode in range(1000):  # number of training episodes
    state = np.random.randint(0, 3)  # starting state
    
    while state != 2:  # until the goal is reached
        # action selection
        if np.random.rand() < epsilon:
            action = np.random.randint(0, 3)  # random action
        else:
            action = np.argmax(Q[state]) 
        
        # transition to the new state and receiving the reward
        new_state = action  # in this toy problem the chosen action is also the index of the next state
        reward = -1 if new_state != 2 else 0  # reward of -1 for each step, 0 for reaching the goal
        
        # Q-value update
        Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * np.max(Q[new_state]))
        
        state = new_state

print("Q-table:")
print(Q)
Q-table:
[[-0.6861894  -0.71757046  0.        ]
 [-0.74581342 -0.81469798  0.        ]
 [ 0.          0.          0.        ]]

Q-learning for a simple Gridworld environment:

import numpy as np

# Define the environment (Gridworld)
class Gridworld:
    def __init__(self):
        self.grid_size = (4, 4)
        self.num_actions = 4
        self.goal_state = (3, 3)
    
    def reset(self):
        self.agent_pos = (0, 0)
        return self.agent_pos
    
    def step(self, action):
        if action == 0:  # Up
            self.agent_pos = (max(0, self.agent_pos[0] - 1), self.agent_pos[1])
        elif action == 1:  # Down
            self.agent_pos = (min(self.grid_size[0] - 1, self.agent_pos[0] + 1), self.agent_pos[1])
        elif action == 2:  # Left
            self.agent_pos = (self.agent_pos[0], max(0, self.agent_pos[1] - 1))
        elif action == 3:  # Right
            self.agent_pos = (self.agent_pos[0], min(self.grid_size[1] - 1, self.agent_pos[1] + 1))
        
        reward = -1 if self.agent_pos != self.goal_state else 0
        done = self.agent_pos == self.goal_state
        return self.agent_pos, reward, done

# Initialize the Q-table (rows x columns x actions)
Q = np.zeros((4, 4, 4))

# Training parameters
alpha = 0.1
gamma = 0.99
epsilon = 0.1
num_episodes = 1000

# Training
env = Gridworld()
for i in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        if np.random.rand() < epsilon:
            action = np.random.randint(0, env.num_actions)  # random action
        else:
            action = np.argmax(Q[state[0], state[1], :])  # action with the maximum Q-value
        
        next_state, reward, done = env.step(action)
        Q[state[0], state[1], action] = (1 - alpha) * Q[state[0], state[1], action] + alpha * (reward + gamma * np.max(Q[next_state[0], next_state[1], :]))
        
        state = next_state

print("Q-table:")
print(Q)
Q-table:
[[[-5.63733656 -4.90099172 -5.78723972 -4.90099167]
  [-4.53658863 -3.9403968  -5.34673741 -3.94039681]
  [-3.58282726 -2.97009883 -3.97937951 -2.97009888]
  [-2.37438071 -1.98999995 -2.93539914 -2.21899042]]

 [[-5.27339855 -3.9403967  -4.51348278 -3.94039674]
  [-3.86562705 -2.97009964 -4.10058156 -2.97009964]
  [-2.75481593 -1.98999997 -2.33378391 -1.98999997]
  [-2.12960968 -1.         -2.06496437 -1.56168583]]

 [[-4.18589727 -2.97009877 -3.28672302 -2.97009878]
  [-3.13003944 -1.98999995 -3.10226658 -1.98999994]
  [-2.51335353 -1.         -2.29706595 -1.        ]
  [-1.5997738   0.         -1.40803817 -0.81469798]]

 [[-2.87489189 -2.15624595 -2.15161794 -1.98999995]
  [-1.23354402 -1.45941412 -2.08135222 -1.        ]
  [-1.25371027 -0.77123208 -1.45474141  0.        ]
  [ 0.          0.          0.          0.        ]]]
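Once training is finished, it is often useful to read the greedy policy out of the Q-table. A minimal sketch (it assumes the Q array from the code above is still in scope; the labels simply follow the action order used in step):

# extract the greedy policy: the best action for every cell of the grid
actions = ['up', 'down', 'left', 'right']  # same order as in Gridworld.step
policy = np.argmax(Q, axis=2)              # index of the best action per (row, column)
for row in policy:
    print(' '.join(actions[a] for a in row))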

Q-learning is also often implemented with the gym library from OpenAI, for example with the Taxi environment. Note that the code below uses the classic gym API, where reset() returns a state and step() returns four values; in gym 0.26+ and in gymnasium the return signatures are different.

import gym
import numpy as np

env = gym.make('Taxi-v3')

alpha = 0.1 
gamma = 0.99  
epsilon = 0.1  

# Q-table
num_states = env.observation_space.n
num_actions = env.action_space.n
Q = np.zeros((num_states, num_actions))

# train the agent
num_episodes = 1000
for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        # action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()  # random action for exploration
        else:
            action = np.argmax(Q[state])  # action with the highest Q-value
            
        # take the action and receive the reward
        next_state, reward, done, _ = env.step(action)
        
        # Q-value update using the Bellman equation
        Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * np.max(Q[next_state]))
        
        # move to the next state
        state = next_state
    
print("Q-table:")
print(Q)
Q-table:
[[ 0.          0.          0.          0.          0.          0.        ]
 [-4.37743896 -4.30379883 -4.52970509 -4.37685397  1.29064633 -7.46319523]
 [-1.66879694 -0.44507461 -2.07826517 -1.23688895 13.8078767  -2.43668637]
 ...
 [-1.07785638 -0.74885187 -0.99551198 -0.99053731 -1.         -1.        ]
 [-2.74269569 -2.73474617 -2.73781084 -2.76714114 -3.95870558 -4.68360176]
 [-0.1999     -0.19       -0.1         6.47055995 -1.9        -1.        ]]
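To check what the agent has learned, one can run a single greedy episode with the trained table. A minimal sketch, again assuming the classic gym API and the Q and env objects from the training code above; the step cap is just a safeguard in case the policy is not fully trained:

# run one greedy episode (no exploration) with the trained Q-table
state = env.reset()
total_reward = 0
for _ in range(200):                  # cap the episode length as a safeguard
    action = np.argmax(Q[state])      # always take the best-known action
    state, reward, done, _ = env.step(action)
    total_reward += reward
    if done:
        break
print("Total reward for the greedy episode:", total_reward)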

That's all. As a reminder, within online courses from OTUS experts you can master practical tools for working with data and more, and in the calendar of events you can sign up for any free webinar you like.
