
ch329 — Reinforcement Learning Foundations

1. The RL setting

Reinforcement learning trains an agent to take actions in an environment to maximise cumulative reward. Unlike supervised learning, there are no labels — only delayed scalar feedback.

Formal framework (Markov Decision Process):

  • State space $\mathcal{S}$, action space $\mathcal{A}$

  • Transition: $P(s' \mid s, a)$ — environment dynamics

  • Reward: $r(s, a)$ — scalar signal

  • Policy: $\pi(a \mid s)$ — agent’s decision rule

  • Return: $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ — discounted sum of future rewards

(Markov chains: ch258. Expected value: ch249. Dynamic programming: intuition from ch079.)
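The return defined above is easy to compute directly. A minimal sketch, with a made-up reward sequence for illustration (three step penalties, then a terminal reward of 1):

```python
def discounted_return(rewards, gamma: float = 0.99) -> float:
    """G_0 = sum_k gamma^k r_k for a finite reward sequence."""
    return float(sum(gamma ** k * r for k, r in enumerate(rewards)))

print(discounted_return([-0.01, -0.01, -0.01, 1.0], gamma=0.9))  # ≈ 0.7019
```

Note how the terminal reward dominates: discounting shrinks it by $\gamma^3$, while the small penalties barely register.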


2. The value function

The value function $V^\pi(s)$ is the expected return starting from state $s$ under policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s] = \mathbb{E}_\pi\left[r_t + \gamma V^\pi(s_{t+1}) \mid s_t = s\right]$$

The second form is the Bellman equation — the value of a state equals immediate reward plus discounted value of the next state. This recursive structure enables dynamic programming.
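To see the recursion in action, here is a minimal sketch of iterated Bellman backups on a hypothetical two-state chain (the transition matrix and rewards are invented for illustration, not taken from the GridWorld below):

```python
import numpy as np

# Two-state chain under a fixed policy: state 0 always moves to state 1
# (reward 0); state 1 is absorbing (reward 1 per step).
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])   # row s, column s': P(s' | s) under the policy
r = np.array([0.0, 1.0])
gamma = 0.9

V = np.zeros(2)
for _ in range(200):          # iterate the Bellman backup V <- r + gamma * P @ V
    V = r + gamma * P @ V

print(V)  # converges to the fixed point [9, 10]
```

The fixed point checks out by hand: $V(1) = 1 + \gamma V(1) = 1/(1-\gamma) = 10$, and $V(0) = 0 + \gamma V(1) = 9$.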


3. Q-Learning and DQN

The Q-function (action-value function):

$$Q^\pi(s,a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$$

The optimal policy takes the action that maximises $Q^*(s,a)$. Q-learning learns $Q^*$ directly:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$

DQN (Mnih et al., 2015) uses a neural network to approximate $Q(s,a;\theta)$. Two key tricks: experience replay (store and resample past transitions) and a target network (a separate, slowly updated copy that provides stable targets).

import numpy as np
import matplotlib.pyplot as plt


class GridWorld:
    """5x5 grid. Agent starts at (0,0), goal at (4,4), hole at (2,2)."""
    SIZE = 5
    GOAL = (4, 4)
    HOLE = (2, 2)
    ACTIONS = [(0,1),(0,-1),(1,0),(-1,0)]  # right, left, down, up

    def reset(self):
        self.pos = [0, 0]
        return tuple(self.pos)

    def step(self, action: int):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0]+dr, 0), self.SIZE-1)
        c = min(max(self.pos[1]+dc, 0), self.SIZE-1)
        self.pos = [r, c]
        state = tuple(self.pos)
        if state == self.GOAL: return state, 1.0, True
        if state == self.HOLE: return state, -1.0, True
        return state, -0.01, False  # small step penalty


class TabularQAgent:
    """Q-learning with a table (states are discrete)."""
    def __init__(self, n_states: int, n_actions: int,
                 lr: float = 0.1, gamma: float = 0.99, eps: float = 1.0):
        self.Q = np.zeros((n_states, n_actions))
        self.lr = lr; self.gamma = gamma; self.eps = eps; self.eps_min = 0.05

    def state_to_idx(self, state: tuple) -> int:
        return state[0] * 5 + state[1]

    def act(self, state: tuple, rng: np.random.Generator) -> int:
        if rng.random() < self.eps:
            return rng.integers(0, 4)
        return int(np.argmax(self.Q[self.state_to_idx(state)]))

    def update(self, s, a, r, s_next, done):
        si, sni = self.state_to_idx(s), self.state_to_idx(s_next)
        td_target = r if done else r + self.gamma * np.max(self.Q[sni])
        self.Q[si, a] += self.lr * (td_target - self.Q[si, a])

    def decay_eps(self, decay: float = 0.995):
        self.eps = max(self.eps_min, self.eps * decay)


rng = np.random.default_rng(0)
env = GridWorld()
agent = TabularQAgent(25, 4)

episode_returns = []
for ep in range(800):
    s = env.reset()
    G = 0.0; done = False; steps = 0
    while not done and steps < 100:
        a = agent.act(s, rng)
        s_next, r, done = env.step(a)
        agent.update(s, a, r, s_next, done)
        s = s_next; G += r; steps += 1
    agent.decay_eps()
    episode_returns.append(G)

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Smoothed returns
window = 30
smoothed = np.convolve(episode_returns, np.ones(window)/window, mode='valid')
ax1.plot(episode_returns, alpha=0.3, color='#3498db', lw=1)
ax1.plot(smoothed, color='#e74c3c', lw=2, label=f'{window}-ep moving avg')
ax1.set_title('Episode returns during Q-learning'); ax1.legend()
ax1.set_xlabel('Episode'); ax1.set_ylabel('Return')

# Value function heatmap
V = agent.Q.max(axis=1).reshape(5, 5)
im = ax2.imshow(V, cmap='RdYlGn', vmin=V.min(), vmax=V.max())
ax2.set_title('Learned V(s) = max_a Q(s,a)')
plt.colorbar(im, ax=ax2)
ax2.set_xticks([]); ax2.set_yticks([])
ax2.text(4, 4, 'G', ha='center', va='center', fontsize=14, fontweight='bold')
ax2.text(2, 2, 'H', ha='center', va='center', fontsize=14, fontweight='bold', color='white')

plt.tight_layout()
plt.savefig('ch329_rl.png', dpi=120)
plt.show()

# Evaluate greedy policy
successes = 0
for _ in range(100):
    s = env.reset()
    done = False; steps = 0
    while not done and steps < 30:
        a = int(np.argmax(agent.Q[agent.state_to_idx(s)]))
        s, r, done = env.step(a)
        steps += 1
    if s == GridWorld.GOAL: successes += 1
print(f"Greedy policy success rate: {successes}% (over 100 eval episodes)")
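The tabular agent above fits the whole problem into a 25×4 array; DQN replaces that table with a network. Its two tricks can be sketched without any deep-learning framework — `ReplayBuffer` below and the plain arrays standing in for network weights are illustrative stand-ins, not DQN's actual implementation:

```python
import numpy as np
from collections import deque

class ReplayBuffer:
    """Experience replay: store transitions, sample decorrelated minibatches."""
    def __init__(self, capacity: int = 10_000):
        self.buf = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size: int, rng: np.random.Generator):
        idx = rng.choice(len(self.buf), size=batch_size, replace=False)
        return [self.buf[i] for i in idx]

# Target network: a lagged copy of the parameters gives stable TD targets.
rng = np.random.default_rng(0)
theta = np.zeros(4)              # stand-in for the online network's weights
theta_target = theta.copy()
for step in range(1, 501):
    theta += 0.01 * rng.normal(size=4)   # stand-in for a gradient step
    if step % 100 == 0:                  # periodic hard update of the target
        theta_target = theta.copy()
```

Sampling from the buffer breaks the temporal correlation between consecutive transitions; the lagged target keeps the bootstrap target $r + \gamma \max_{a'} Q(s', a'; \theta^-)$ from chasing a moving network.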

4. The exploration-exploitation tradeoff

With probability $\varepsilon$ the agent takes a random action (exploration); otherwise it takes the greedy action (exploitation). Decaying $\varepsilon$ during training gradually shifts the agent from exploration to exploitation.
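With the same constants as the agent above (start 1.0, floor 0.05, per-episode decay 0.995), the schedule looks like:

```python
eps, eps_min, decay = 1.0, 0.05, 0.995
schedule = []
for episode in range(800):
    schedule.append(eps)               # eps used during this episode
    eps = max(eps_min, eps * decay)

print(schedule[0])    # 1.0: fully random at the start
print(schedule[-1])   # 0.05: the floor is reached around episode 600
```

Since $0.995^n = 0.05$ at $n \approx 598$, roughly the last quarter of training runs at the exploration floor.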


5. Key RL algorithms

| Algorithm | Policy type | Value function | Notes |
| --- | --- | --- | --- |
| Q-learning | Implicit (greedy) | $Q(s,a)$ | Off-policy, tabular |
| DQN | Implicit | Neural $Q$ | Experience replay, target net |
| REINFORCE | Explicit $\pi_\theta$ | None | High variance |
| Actor-Critic | Explicit | $V_\theta(s)$ | Reduced variance |
| PPO | Explicit + clipped | $V_\theta$ | Safe, stable, widely used |

6. Summary

  • RL: agent maximises expected cumulative reward by learning a policy.

  • MDP framework: states, actions, transitions, rewards, discount factor.

  • Q-learning: tabular Bellman updates converge to $Q^*$ for discrete finite MDPs.

  • DQN: neural Q-function approximation for high-dimensional state spaces.

  • Exploration (ε-greedy) is essential; no exploration → no learning.


7. Forward and backward references

Used here: Markov chains (ch258), expected value (ch249), Bellman recursion connects to dynamic programming (ch079), neural function approximation (ch301).

This will reappear in ch330 — Policy Gradients, where the policy itself is parameterised and trained directly with gradient ascent on expected return.