
ch329 — Reinforcement Learning Foundations

1. The RL setting

Reinforcement learning trains an agent to take actions in an environment to maximise cumulative reward. Unlike supervised learning, there are no labels — only delayed scalar feedback.

Formal framework (Markov Decision Process):

  • State space $\mathcal{S}$, action space $\mathcal{A}$

  • Transition: $P(s' \mid s, a)$ — environment dynamics

  • Reward: $r(s, a)$ — scalar signal

  • Policy: $\pi(a \mid s)$ — agent’s decision rule

  • Return: $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ — discounted sum of future rewards

(Markov chains: ch258. Expected value: ch249. Dynamic programming: intuition from ch079.)
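The return defined above is easy to compute directly. A minimal sketch, with a made-up reward sequence for illustration (three step penalties, then a terminal reward of 1):

```python
def discounted_return(rewards, gamma: float = 0.99) -> float:
    """G_0 = sum_k gamma^k r_k for a finite reward sequence."""
    return float(sum(gamma ** k * r for k, r in enumerate(rewards)))

print(discounted_return([-0.01, -0.01, -0.01, 1.0], gamma=0.9))  # ≈ 0.7019
```

Note how the terminal reward dominates: discounting shrinks it by $\gamma^3$, while the small penalties barely register.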


2. The value function

The value function $V^\pi(s)$ is the expected return starting from state $s$ under policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s] = \mathbb{E}_\pi\left[r_t + \gamma V^\pi(s_{t+1}) \mid s_t = s\right]$$

The second form is the Bellman equation — the value of a state equals immediate reward plus discounted value of the next state. This recursive structure enables dynamic programming.
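To see the recursion in action, here is a minimal sketch of iterated Bellman backups on a hypothetical two-state chain (the transition matrix and rewards are invented for illustration, not taken from the GridWorld below):

```python
import numpy as np

# Two-state chain under a fixed policy: state 0 always moves to state 1
# (reward 0); state 1 is absorbing (reward 1 per step).
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])   # row s, column s': P(s' | s) under the policy
r = np.array([0.0, 1.0])
gamma = 0.9

V = np.zeros(2)
for _ in range(200):          # iterate the Bellman backup V <- r + gamma * P @ V
    V = r + gamma * P @ V

print(V)  # converges to the fixed point [9, 10]
```

The fixed point checks out by hand: $V(1) = 1 + \gamma V(1) = 1/(1-\gamma) = 10$, and $V(0) = 0 + \gamma V(1) = 9$.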


3. Q-Learning and DQN

The Q-function (action-value function):

$$Q^\pi(s,a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$$

The optimal policy takes the action that maximises $Q^*(s,a)$. Q-learning learns $Q^*$ directly:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$

DQN (Mnih et al., 2015) uses a neural network to approximate $Q(s,a;\theta)$. Two key tricks: experience replay (store and resample past transitions) and a target network (a separate, slowly updated copy that provides stable targets).

import numpy as np
import matplotlib.pyplot as plt


class GridWorld:
    """5x5 grid. Agent starts at (0,0), goal at (4,4), hole at (2,2)."""
    SIZE = 5
    GOAL = (4, 4)
    HOLE = (2, 2)
    ACTIONS = [(0,1),(0,-1),(1,0),(-1,0)]  # right, left, down, up

    def reset(self):
        self.pos = [0, 0]
        return tuple(self.pos)

    def step(self, action: int):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0]+dr, 0), self.SIZE-1)
        c = min(max(self.pos[1]+dc, 0), self.SIZE-1)
        self.pos = [r, c]
        state = tuple(self.pos)
        if state == self.GOAL: return state, 1.0, True
        if state == self.HOLE: return state, -1.0, True
        return state, -0.01, False  # small step penalty


class TabularQAgent:
    """Q-learning with a table (states are discrete)."""
    def __init__(self, n_states: int, n_actions: int,
                 lr: float = 0.1, gamma: float = 0.99, eps: float = 1.0):
        self.Q = np.zeros((n_states, n_actions))
        self.lr = lr; self.gamma = gamma; self.eps = eps; self.eps_min = 0.05

    def state_to_idx(self, state: tuple) -> int:
        return state[0] * 5 + state[1]

    def act(self, state: tuple, rng: np.random.Generator) -> int:
        if rng.random() < self.eps:
            return rng.integers(0, 4)
        return int(np.argmax(self.Q[self.state_to_idx(state)]))

    def update(self, s, a, r, s_next, done):
        si, sni = self.state_to_idx(s), self.state_to_idx(s_next)
        td_target = r if done else r + self.gamma * np.max(self.Q[sni])
        self.Q[si, a] += self.lr * (td_target - self.Q[si, a])

    def decay_eps(self, decay: float = 0.995):
        self.eps = max(self.eps_min, self.eps * decay)


rng = np.random.default_rng(0)
env = GridWorld()
agent = TabularQAgent(25, 4)

episode_returns = []
for ep in range(800):
    s = env.reset()
    G = 0.0; done = False; steps = 0
    while not done and steps < 100:
        a = agent.act(s, rng)
        s_next, r, done = env.step(a)
        agent.update(s, a, r, s_next, done)
        s = s_next; G += r; steps += 1
    agent.decay_eps()
    episode_returns.append(G)

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Smoothed returns
window = 30
smoothed = np.convolve(episode_returns, np.ones(window)/window, mode='valid')
ax1.plot(episode_returns, alpha=0.3, color='#3498db', lw=1)
ax1.plot(smoothed, color='#e74c3c', lw=2, label=f'{window}-ep moving avg')
ax1.set_title('Episode returns during Q-learning'); ax1.legend()
ax1.set_xlabel('Episode'); ax1.set_ylabel('Return')

# Value function heatmap
V = agent.Q.max(axis=1).reshape(5, 5)
im = ax2.imshow(V, cmap='RdYlGn', vmin=V.min(), vmax=V.max())
ax2.set_title('Learned V(s) = max_a Q(s,a)')
plt.colorbar(im, ax=ax2)
ax2.set_xticks([]); ax2.set_yticks([])
ax2.text(4, 4, 'G', ha='center', va='center', fontsize=14, fontweight='bold')
ax2.text(2, 2, 'H', ha='center', va='center', fontsize=14, fontweight='bold', color='white')

plt.tight_layout()
plt.savefig('ch329_rl.png', dpi=120)
plt.show()

# Evaluate greedy policy
successes = 0
for _ in range(100):
    s = env.reset()
    done = False; steps = 0
    while not done and steps < 30:
        a = int(np.argmax(agent.Q[agent.state_to_idx(s)]))
        s, r, done = env.step(a)
        steps += 1
    if s == GridWorld.GOAL: successes += 1
print(f"Greedy policy success rate: {successes}% (over 100 eval episodes)")
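The tabular agent above fits the whole problem into a 25×4 array; DQN replaces that table with a network. Its two tricks can be sketched without any deep-learning framework — `ReplayBuffer` below and the plain arrays standing in for network weights are illustrative stand-ins, not DQN's actual implementation:

```python
import numpy as np
from collections import deque

class ReplayBuffer:
    """Experience replay: store transitions, sample decorrelated minibatches."""
    def __init__(self, capacity: int = 10_000):
        self.buf = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size: int, rng: np.random.Generator):
        idx = rng.choice(len(self.buf), size=batch_size, replace=False)
        return [self.buf[i] for i in idx]

# Target network: a lagged copy of the parameters gives stable TD targets.
rng = np.random.default_rng(0)
theta = np.zeros(4)              # stand-in for the online network's weights
theta_target = theta.copy()
for step in range(1, 501):
    theta += 0.01 * rng.normal(size=4)   # stand-in for a gradient step
    if step % 100 == 0:                  # periodic hard update of the target
        theta_target = theta.copy()
```

Sampling from the buffer breaks the temporal correlation between consecutive transitions; the lagged target keeps the bootstrap target $r + \gamma \max_{a'} Q(s', a'; \theta^-)$ from chasing a moving network.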

4. The exploration-exploitation tradeoff

With probability $\varepsilon$ the agent takes a random action (exploration); otherwise it takes the greedy action (exploitation). Decaying $\varepsilon$ during training gradually shifts the agent from exploration to exploitation.
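With the same constants as the agent above (start 1.0, floor 0.05, per-episode decay 0.995), the schedule looks like:

```python
eps, eps_min, decay = 1.0, 0.05, 0.995
schedule = []
for episode in range(800):
    schedule.append(eps)               # eps used during this episode
    eps = max(eps_min, eps * decay)

print(schedule[0])    # 1.0: fully random at the start
print(schedule[-1])   # 0.05: the floor is reached around episode 600
```

Since $0.995^n = 0.05$ at $n \approx 598$, roughly the last quarter of training runs at the exploration floor.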


5. Key RL algorithms

| Algorithm | Policy type | Value function | Notes |
| --- | --- | --- | --- |
| Q-learning | Implicit (greedy) | $Q(s,a)$ | Off-policy, tabular |
| DQN | Implicit | Neural $Q$ | Experience replay, target net |
| REINFORCE | Explicit $\pi_\theta$ | None | High variance |
| Actor-Critic | Explicit | $V_\theta(s)$ | Reduced variance |
| PPO | Explicit + clipped | $V_\theta$ | Safe, stable, widely used |

6. Summary

  • RL: agent maximises expected cumulative reward by learning a policy.

  • MDP framework: states, actions, transitions, rewards, discount factor.

  • Q-learning: tabular Bellman updates converge to $Q^*$ for discrete finite MDPs.

  • DQN: neural Q-function approximation for high-dimensional state spaces.

  • Exploration (ε-greedy) is essential; no exploration → no learning.


7. Forward and backward references

Used here: Markov chains (ch258), expected value (ch249), Bellman recursion connects to dynamic programming (ch079), neural function approximation (ch301).

This will reappear in ch330 — Policy Gradients, where the policy itself is parameterised and trained directly with gradient ascent on expected return.