A brief introduction to the Q-learning algorithm and how to implement it in Python, for beginners
Principle of operation
The agent selects actions based on data from the Q-table, in which all values are initially zero, and so at each step… Stop. What exactly is a Q-table?
Q-table is a matrix, where the rows indicate possible states of the environment, and the columns indicate possible actions of the agent in these states. The values in the Q-table reflect the expected total reward for performing an action in a given state and following the optimal policy thereafter.
At the start of training, the Q-table is initialized with zeros or random values. This represents an initial guess about how good different actions are in different states.
After each agent step, the Q-values in the table are updated using the Bellman equation (more on that later).
The agent selects actions based on values from the Q-table, often using ε-greedy strategy, which balances between exploring new actions and exploiting already known actions with maximum Q-value.
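As a quick sketch of the ε-greedy idea (the function name `select_action` is illustrative, not from any particular library): with probability ε the agent picks a random action, otherwise it takes the action with the highest Q-value for the current state.

```python
import numpy as np

def select_action(Q, state, epsilon, num_actions):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)  # explore: random action
    return int(np.argmax(Q[state]))            # exploit: best known action

# toy Q-table: 4 states x 2 actions
Q = np.array([[0.5, 0.2], [0.1, -0.4], [-0.2, 0.0], [0.6, 0.3]])

# with epsilon = 0 the choice is always greedy
print(select_action(Q, state=0, epsilon=0.0, num_actions=2))  # -> 0
```

With ε = 0.1 the agent would take the greedy action roughly 90% of the time and a random one otherwise, which is what keeps it exploring.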
Let's say there is a simple environment with four possible states S1, S2, S3, S4 and two possible actions A1, A2 in each state. A Q-table for this environment would look something like this:
State/Action | A1 | A2 |
---|---|---|
S1 | 0.5 | 0.2 |
S2 | 0.1 | -0.4 |
S3 | -0.2 | 0.0 |
S4 | 0.6 | 0.3 |
The values in the table indicate the expected total reward for choosing action A1 or A2 in the corresponding state. For example, a value of 0.5 in cell S1, A1 means that the expected total reward for choosing action A1 in state S1 is 0.5. Over time, as the agent learns, these values will be updated to reflect experience gained and refine estimates of expected rewards.
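For illustration, the example table above can be stored as a NumPy array, and the greedy action for each state is simply the argmax over that state's row (a sketch of the data structure, not tied to any library API):

```python
import numpy as np

# rows: states S1..S4, columns: actions A1, A2
Q = np.array([
    [ 0.5,  0.2],   # S1
    [ 0.1, -0.4],   # S2
    [-0.2,  0.0],   # S3
    [ 0.6,  0.3],   # S4
])

# greedy action per state: index of the column with the largest Q-value
greedy = Q.argmax(axis=1)
print(greedy)  # -> [0 0 1 0]: A1 in S1, S2, S4; A2 in S3
```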
So…
At each step, the agent updates the Q-table using the Bellman equation, taking into account the rewards received.
The Bellman equation is used to calculate Q-values, which are estimates of the quality or utility of performing a particular action in a particular state. These Q-values are then used to determine the agent's optimal strategy for maximizing the total expected reward.
The Bellman update for Q-values can be expressed as:

Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · max_{a'} Q(s', a'))

Where:
Q(s, a) — the expected Q-value of taking action a in state s,
α — the learning rate, which controls how strongly new information overwrites the old estimate,
r — the immediate reward received after performing action a in state s,
γ — the discount factor, which represents the importance of future rewards (usually a value between 0 and 1),
max_{a'} Q(s', a') — the maximum Q-value over all possible actions a' in the next state s'.
The Bellman equation essentially allows an agent to learn to estimate the long-term consequences of its actions, taking into account the current reward and expected future rewards.
In this way, the Q-value of the current state is defined recursively through the Q-values of future states.
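To make the update concrete, here is one hand-computed step of the rule Q(s, a) ← (1 − α)·Q(s, a) + α·(r + γ·max_{a'} Q(s', a')), using the example table from above and illustrative values α = 0.1, γ = 0.9 (the transition and reward are made up for the example):

```python
import numpy as np

Q = np.array([[0.5, 0.2], [0.1, -0.4], [-0.2, 0.0], [0.6, 0.3]])
alpha, gamma = 0.1, 0.9

s, a = 0, 0   # the agent is in S1 and takes A1
r = -1.0      # immediate reward (illustrative)
s_next = 3    # the environment moved to S4

# target = r + gamma * max over actions in S4 = -1 + 0.9 * 0.6 = -0.46
target = r + gamma * Q[s_next].max()
# new Q = 0.9 * 0.5 + 0.1 * (-0.46) = 0.404
Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
print(round(Q[s, a], 3))  # -> 0.404
```

So a single bad-looking reward only nudges the estimate: the learning rate α keeps most of the old value and blends in a fraction of the new target.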
Implementation of the algorithm in Python
A first example in pure NumPy, for a 3×3 grid problem:
```python
import numpy as np

# initialize the Q-table (3 states x 3 actions)
Q = np.zeros([3, 3])

# algorithm parameters
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.1  # exploration vs. exploitation parameter

# main loop
for episode in range(1000):  # number of training episodes
    state = np.random.randint(0, 3)  # starting state
    while state != 2:  # until the goal is reached
        # action selection
        if np.random.rand() < epsilon:
            action = np.random.randint(0, 3)  # random action
        else:
            action = np.argmax(Q[state])
        # transition to the new state and reward
        new_state = action  # in this toy setup, the action index is the next state
        reward = -1 if new_state != 2 else 0  # -1 per step, 0 for reaching the goal
        # Q-value update
        Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * np.max(Q[new_state]))
        state = new_state

print("Q-table:")
print(Q)
```

```
Q-table:
[[-0.6861894  -0.71757046  0.        ]
 [-0.74581342 -0.81469798  0.        ]
 [ 0.          0.          0.        ]]
```
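Once training is done, the table is typically used by always taking the argmax in each state. A small sketch using the (rounded) values printed above:

```python
import numpy as np

# learned Q-table from the run above (state 2 is the goal, so its row stays zero)
Q = np.array([
    [-0.686, -0.718, 0.0],
    [-0.746, -0.815, 0.0],
    [ 0.0,    0.0,   0.0],
])

# greedy policy: from states 0 and 1 the best "action" is 2, i.e. go to the goal
policy = Q.argmax(axis=1)
print(policy)  # -> [2 2 0]
```

The negative values in the non-goal columns reflect the −1 per-step penalty, which is what pushes the greedy policy toward the goal state.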
Q-learning for a simple Gridworld environment:
```python
import numpy as np

# environment definition (Gridworld)
class Gridworld:
    def __init__(self):
        self.grid_size = (4, 4)
        self.num_actions = 4
        self.goal_state = (3, 3)

    def reset(self):
        self.agent_pos = (0, 0)
        return self.agent_pos

    def step(self, action):
        if action == 0:    # up
            self.agent_pos = (max(0, self.agent_pos[0] - 1), self.agent_pos[1])
        elif action == 1:  # down
            self.agent_pos = (min(self.grid_size[0] - 1, self.agent_pos[0] + 1), self.agent_pos[1])
        elif action == 2:  # left
            self.agent_pos = (self.agent_pos[0], max(0, self.agent_pos[1] - 1))
        elif action == 3:  # right
            self.agent_pos = (self.agent_pos[0], min(self.grid_size[1] - 1, self.agent_pos[1] + 1))
        reward = -1 if self.agent_pos != self.goal_state else 0
        done = self.agent_pos == self.goal_state
        return self.agent_pos, reward, done

# initialize the Q-table (row x column x action)
Q = np.zeros((4, 4, 4))

# training parameters
alpha = 0.1
gamma = 0.99
epsilon = 0.1
num_episodes = 1000

# training
env = Gridworld()
for i in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        if np.random.rand() < epsilon:
            action = np.random.randint(0, env.num_actions)  # random action
        else:
            action = np.argmax(Q[state[0], state[1], :])  # action with the highest Q-value
        next_state, reward, done = env.step(action)
        Q[state[0], state[1], action] = (1 - alpha) * Q[state[0], state[1], action] + alpha * (reward + gamma * np.max(Q[next_state[0], next_state[1], :]))
        state = next_state

print("Q-table:")
print(Q)
```

```
Q-table:
[[[-5.63733656 -4.90099172 -5.78723972 -4.90099167]
  [-4.53658863 -3.9403968  -5.34673741 -3.94039681]
  [-3.58282726 -2.97009883 -3.97937951 -2.97009888]
  [-2.37438071 -1.98999995 -2.93539914 -2.21899042]]
 [[-5.27339855 -3.9403967  -4.51348278 -3.94039674]
  [-3.86562705 -2.97009964 -4.10058156 -2.97009964]
  [-2.75481593 -1.98999997 -2.33378391 -1.98999997]
  [-2.12960968 -1.         -2.06496437 -1.56168583]]
 [[-4.18589727 -2.97009877 -3.28672302 -2.97009878]
  [-3.13003944 -1.98999995 -3.10226658 -1.98999994]
  [-2.51335353 -1.         -2.29706595 -1.        ]
  [-1.5997738   0.         -1.40803817 -0.81469798]]
 [[-2.87489189 -2.15624595 -2.15161794 -1.98999995]
  [-1.23354402 -1.45941412 -2.08135222 -1.        ]
  [-1.25371027 -0.77123208 -1.45474141  0.        ]
  [ 0.          0.          0.          0.        ]]]
```
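A 3-D Q-table like the one above is easiest to read by taking the argmax over the action axis for every cell, which turns it into a policy grid. A minimal sketch with a hand-made 2×2 table (the values are illustrative, not taken from the run above):

```python
import numpy as np

# map action indices to arrows: 0 = up, 1 = down, 2 = left, 3 = right
arrows = {0: "↑", 1: "↓", 2: "←", 3: "→"}

# illustrative 2x2 grid with 4 actions per cell; the goal is the bottom-right cell
Q = np.array([
    [[-3.0, -2.0, -3.0, -1.9], [-3.0, -1.0, -3.0, -2.0]],
    [[-3.0, -2.0, -3.0, -1.0], [ 0.0,  0.0,  0.0,  0.0]],
])

policy = Q.argmax(axis=2)  # best action index for every cell
for row in policy:
    print(" ".join(arrows[a] for a in row))
```

This prints an arrow map pointing toward the goal (the goal cell itself shows "↑" only because argmax breaks the all-zero tie by picking index 0).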
Q-learning is also often implemented with the gym library from OpenAI, for example with the Taxi environment:
```python
import gym
import numpy as np

env = gym.make('Taxi-v3')

alpha = 0.1
gamma = 0.99
epsilon = 0.1

# Q-table
num_states = env.observation_space.n
num_actions = env.action_space.n
Q = np.zeros((num_states, num_actions))

# agent training
# note: this uses the classic gym API (gym < 0.26), where reset() returns the
# state and step() returns 4 values; in newer gym/gymnasium versions reset()
# returns (state, info) and step() returns 5 values
num_episodes = 1000
for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        # action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()  # random action for exploration
        else:
            action = np.argmax(Q[state])  # action with the highest Q-value
        # perform the action and receive the reward
        next_state, reward, done, _ = env.step(action)
        # update the Q-value using the Bellman equation
        Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * np.max(Q[next_state]))
        # move to the next state
        state = next_state

print("Q-table:")
print(Q)
```

```
Q-table:
[[ 0.          0.          0.          0.          0.          0.        ]
 [-4.37743896 -4.30379883 -4.52970509 -4.37685397  1.29064633 -7.46319523]
 [-1.66879694 -0.44507461 -2.07826517 -1.23688895 13.8078767  -2.43668637]
 ...
 [-1.07785638 -0.74885187 -0.99551198 -0.99053731 -1.         -1.        ]
 [-2.74269569 -2.73474617 -2.73781084 -2.76714114 -3.95870558 -4.68360176]
 [-0.1999     -0.19       -0.1         6.47055995 -1.9        -1.        ]]
```
That's all. Let me remind you that in online courses from OTUS experts you can master practical tools for working with data and much more. You can also use the events calendar to sign up for any free webinar you like.