1. The RL setting¶
Reinforcement learning trains an agent to take actions in an environment to maximise cumulative reward. Unlike supervised learning, there are no labels: only delayed scalar feedback.
Formal framework (Markov Decision Process):
State space $\mathcal{S}$, action space $\mathcal{A}$
Transition $P(s' \mid s, a)$: environment dynamics
Reward $R(s, a)$: scalar signal
Policy $\pi(a \mid s)$: agent's decision rule
Return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$: discounted sum of future rewards, with discount factor $\gamma \in [0, 1)$
(Markov chains: ch258. Expected value: ch249. Dynamic programming: intuition from ch079.)
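A quick numeric check of the return definition, using a hypothetical four-step reward sequence: each future reward is down-weighted by one extra factor of $\gamma$, so a large reward three steps away contributes $\gamma^3$ of its value.

```python
# Discounted return G_t for a hypothetical reward sequence (gamma = 0.9).
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 10.0]  # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(round(G, 3))  # 1 + 0.9**3 * 10 = 8.29
```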
2. The value function¶
The value function $V^\pi(s)$ is the expected return starting from state $s$ under policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid s_t = s \right] = \mathbb{E}_\pi\left[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \right]$$

The second form is the Bellman equation: the value of a state equals the immediate reward plus the discounted value of the next state. This recursive structure enables dynamic programming.
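When the dynamics are known, the Bellman equation can be applied directly as an iterative update (value iteration). A minimal sketch on a hypothetical 2-state, 2-action MDP; the transition probabilities and rewards are made-up numbers for illustration:

```python
import numpy as np

# Value iteration on a toy MDP: repeatedly apply the Bellman update
# until V stops changing, yielding the optimal value function.
P = np.array([              # P[s, a, s'] — transition probabilities
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.5, 0.5]],
])
R = np.array([[1.0, 0.0],   # R[s, a] — expected immediate reward
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
    Q = R + gamma * P @ V
    V_new = Q.max(axis=1)   # greedy backup over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)
```

The fixed point satisfies the Bellman optimality equation: recomputing the backup from the converged `V` returns `V` itself.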
3. Q-Learning and DQN¶
The Q-function (action-value function) is the expected return after taking action $a$ in state $s$ and following $\pi$ thereafter:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s, a_t = a \right]$$

The optimal policy takes the action that maximises $Q^*(s, a)$. Q-learning learns $Q^*$ directly via the temporal-difference update

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

DQN (Mnih et al., 2015) uses a neural network to approximate $Q(s, a)$. Two key tricks: experience replay (store transitions and resample them during training) and a target network (a separate, slowly updated copy of the Q-network that provides stable targets).
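The experience-replay idea can be sketched with a bounded deque. This is a simplified stand-in, not the DQN paper's implementation; the class name and capacity are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions.
    Sampling uniformly at random breaks the temporal correlation
    between consecutive environment steps."""
    def __init__(self, capacity: int = 10_000):
        self.buf = deque(maxlen=capacity)  # old transitions evicted automatically

    def push(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size: int):
        return random.sample(self.buf, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(500):                     # only the last 100 survive
    buf.push((t, 0, 0.0, t + 1, False))
batch = buf.sample(8)
print(len(buf.buf), len(batch))  # 100 8
```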
import numpy as np
import matplotlib.pyplot as plt

class GridWorld:
    """5x5 grid. Agent starts at (0,0), goal at (4,4), hole at (2,2)."""
    SIZE = 5
    GOAL = (4, 4)
    HOLE = (2, 2)
    ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up

    def reset(self):
        self.pos = [0, 0]
        return tuple(self.pos)

    def step(self, action: int):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.SIZE - 1)
        c = min(max(self.pos[1] + dc, 0), self.SIZE - 1)
        self.pos = [r, c]
        state = tuple(self.pos)
        if state == self.GOAL:
            return state, 1.0, True
        if state == self.HOLE:
            return state, -1.0, True
        return state, -0.01, False  # small step penalty

class TabularQAgent:
    """Q-learning with a table (states are discrete)."""

    def __init__(self, n_states: int, n_actions: int,
                 lr: float = 0.1, gamma: float = 0.99, eps: float = 1.0):
        self.Q = np.zeros((n_states, n_actions))
        self.lr = lr
        self.gamma = gamma
        self.eps = eps
        self.eps_min = 0.05

    def state_to_idx(self, state: tuple) -> int:
        return state[0] * GridWorld.SIZE + state[1]

    def act(self, state: tuple, rng: np.random.Generator) -> int:
        if rng.random() < self.eps:
            return int(rng.integers(0, 4))
        return int(np.argmax(self.Q[self.state_to_idx(state)]))

    def update(self, s, a, r, s_next, done):
        si, sni = self.state_to_idx(s), self.state_to_idx(s_next)
        td_target = r if done else r + self.gamma * np.max(self.Q[sni])
        self.Q[si, a] += self.lr * (td_target - self.Q[si, a])

    def decay_eps(self, decay: float = 0.995):
        self.eps = max(self.eps_min, self.eps * decay)

rng = np.random.default_rng(0)
env = GridWorld()
agent = TabularQAgent(25, 4)
episode_returns = []

for ep in range(800):
    s = env.reset()
    G = 0.0
    done = False
    steps = 0
    while not done and steps < 100:
        a = agent.act(s, rng)
        s_next, r, done = env.step(a)
        agent.update(s, a, r, s_next, done)
        s = s_next
        G += r
        steps += 1
    agent.decay_eps()
    episode_returns.append(G)

# Plot: smoothed returns and the learned value function
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

window = 30
smoothed = np.convolve(episode_returns, np.ones(window) / window, mode='valid')
ax1.plot(episode_returns, alpha=0.3, color='#3498db', lw=1)
ax1.plot(smoothed, color='#e74c3c', lw=2, label=f'{window}-ep moving avg')
ax1.set_title('Episode returns during Q-learning')
ax1.legend()
ax1.set_xlabel('Episode')
ax1.set_ylabel('Return')

# Value function heatmap
V = agent.Q.max(axis=1).reshape(5, 5)
im = ax2.imshow(V, cmap='RdYlGn', vmin=V.min(), vmax=V.max())
ax2.set_title('Learned V(s) = max_a Q(s,a)')
plt.colorbar(im, ax=ax2)
ax2.set_xticks([])
ax2.set_yticks([])
ax2.text(4, 4, 'G', ha='center', va='center', fontsize=14, fontweight='bold')
ax2.text(2, 2, 'H', ha='center', va='center', fontsize=14, fontweight='bold', color='white')
plt.tight_layout()
plt.savefig('ch329_rl.png', dpi=120)
plt.show()

# Evaluate the greedy (no-exploration) policy
successes = 0
for _ in range(100):
    s = env.reset()
    done = False
    steps = 0
    while not done and steps < 30:
        a = int(np.argmax(agent.Q[agent.state_to_idx(s)]))
        s, r, done = env.step(a)
        steps += 1
    if s == GridWorld.GOAL:
        successes += 1
print(f"Greedy policy success rate: {successes}% (over 100 eval episodes)")

4. The exploration-exploitation tradeoff¶
With probability $\varepsilon$ the agent takes a random action (exploration); otherwise it takes the greedy action (exploitation). Decaying $\varepsilon$ during training gradually shifts the balance from exploration to exploitation.
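A quick look at the ε schedule the training loop uses (same constants as `decay_eps`: multiplicative decay of 0.995 per episode, floored at 0.05). The floor keeps a small amount of exploration alive even late in training:

```python
# Epsilon schedule: eps *= 0.995 per episode, floored at 0.05.
eps, eps_min, decay = 1.0, 0.05, 0.995
schedule = []
for ep in range(800):
    schedule.append(eps)
    eps = max(eps_min, eps * decay)

# Start fully exploratory; roughly e^-1 after ~200 episodes;
# pinned to the floor from around episode 600 onward.
print(round(schedule[0], 3), round(schedule[200], 3), round(schedule[799], 3))
# 1.0 0.367 0.05
```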
5. Key RL algorithms¶
| Algorithm | Policy type | Value function | Notes |
|---|---|---|---|
| Q-learning | Implicit (greedy) | Tabular Q | Off-policy |
| DQN | Implicit (greedy) | Neural Q | Experience replay, target net |
| REINFORCE | Explicit | None | High variance |
| Actor-Critic | Explicit | Learned critic | Reduced variance |
| PPO | Explicit (clipped updates) | Learned critic | Safe, stable, widely used |
6. Summary¶
RL: agent maximises expected cumulative reward by learning a policy.
MDP framework: states, actions, transitions, rewards, discount factor.
Q-learning: tabular Bellman updates converge to $Q^*$ for discrete finite MDPs.
DQN: neural Q-function approximation for high-dimensional state spaces.
Exploration (ε-greedy) is essential; no exploration → no learning.
7. Forward and backward references¶
Used here: Markov chains (ch258), expected value (ch249), Bellman recursion connects to dynamic programming (ch079), neural function approximation (ch301).
This will reappear in ch330 — Policy Gradients, where the policy itself is parameterised and trained directly with gradient ascent on expected return.