Advanced Calculus Experiment 8.
Why does gradient descent converge even when we only compute the gradient on a tiny random batch? This experiment builds the theoretical intuition: SGD (ch227) is a stochastic process, and its convergence guarantees come from probability theory (Part VIII) as much as from calculus.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
# Ground truth: sum of quadratics, simulating many loss terms
n_data = 10000
A = np.random.randn(n_data) * 2 + 1  # each 'data item' contributes a_i*(w-b_i)^2; some a_i are negative, but mean(a_i) ~ 1, so the averaged loss stays convex
B = np.random.randn(n_data)
# Full loss: L(w) = (1/n) sum_i a_i * (w - b_i)^2
# Minimiser: w* = sum(a_i * b_i) / sum(a_i)
# Full gradient: dL/dw = (2/n) sum_i a_i * (w - b_i)
def full_loss(w): return np.mean(A * (w - B)**2)
def full_grad(w): return 2 * np.mean(A * (w - B))
def stochastic_grad(w, batch_idx): return 2 * np.mean(A[batch_idx] * (w - B[batch_idx]))
w_star = np.sum(A * B) / np.sum(A)
print(f"True minimum: w* = {w_star:.6f}")
print(f"Full loss at w*: {full_loss(w_star):.8f}")
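Before comparing batch sizes, it is worth verifying the property that makes SGD work at all: a mini-batch gradient drawn uniformly at random is an unbiased estimator of the full gradient. A minimal self-contained sketch (re-creating `A`, `B`, and the gradient functions defined above, with a fresh seed):

```python
import numpy as np

rng = np.random.default_rng(0)
n_data = 10000
A = rng.normal(size=n_data) * 2 + 1
B = rng.normal(size=n_data)

def full_grad(w):
    return 2 * np.mean(A * (w - B))

def stochastic_grad(w, batch_idx):
    return 2 * np.mean(A[batch_idx] * (w - B[batch_idx]))

# Average many independent mini-batch gradients at a fixed point w;
# the average should converge to the full gradient (unbiasedness).
w = 3.0
est = np.mean([stochastic_grad(w, rng.choice(n_data, 16, replace=False))
               for _ in range(20000)])
print(est, full_grad(w))
```

Unbiasedness is what lets the convergence proofs treat each SGD step as "the true gradient step plus zero-mean noise".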
# Compare batch sizes: 1, 16, 256, full
batch_sizes = [1, 16, 256, n_data]
n_iters = 2000
lr = 0.1
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
for bs, color in zip(batch_sizes, ['red', 'orange', 'green', 'blue']):
    w = 3.0  # start far from the minimum
    losses, errors = [], []
    for i in range(n_iters):
        batch = np.random.choice(n_data, min(bs, n_data), replace=False)
        g = stochastic_grad(w, batch)
        w = w - lr * g
        losses.append(full_loss(w))
        errors.append(abs(w - w_star))
    label = f'batch={bs}'
    axes[0].semilogy(losses, color=color, lw=1.2, alpha=0.8, label=label)
    axes[1].semilogy(errors, color=color, lw=1.2, alpha=0.8, label=label)
for ax, title in zip(axes, ['Full Loss vs Iteration', '|w - w*| vs Iteration']):
    ax.set_xlabel('Iteration'); ax.set_ylabel('Value (log scale)')
    ax.set_title(title); ax.legend(); ax.grid(True, which='both', alpha=0.3)
plt.suptitle('SGD Convergence by Batch Size', fontsize=12)
plt.tight_layout(); plt.savefig('ch238_sgd_theory.png', dpi=100); plt.show()
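With a constant learning rate, the small-batch curves above do not converge to zero error: they settle into a stationary "noise ball" around w* whose radius shrinks with the learning rate. A hedged sketch of this effect (self-contained, with a fresh seed; `tail_error` is a helper introduced here, not part of the experiment above):

```python
import numpy as np

rng = np.random.default_rng(1)
n_data = 10000
A = rng.normal(size=n_data) * 2 + 1
B = rng.normal(size=n_data)
w_star = np.sum(A * B) / np.sum(A)

def stochastic_grad(w, batch_idx):
    return 2 * np.mean(A[batch_idx] * (w - B[batch_idx]))

def tail_error(lr, bs=16, n_iters=4000):
    # Run SGD and average |w - w*| over the second half of the run,
    # after the iterate has settled into its stationary noise ball.
    w = 3.0
    errs = []
    for i in range(n_iters):
        batch = rng.choice(n_data, bs, replace=False)
        w -= lr * stochastic_grad(w, batch)
        if i >= n_iters // 2:
            errs.append(abs(w - w_star))
    return np.mean(errs)

e_big, e_small = tail_error(0.1), tail_error(0.01)
print(e_big, e_small)  # the smaller learning rate should give a smaller floor
```

This is why practical SGD schedules decay the learning rate: shrinking the step size shrinks the noise floor, trading per-step progress for final accuracy.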
Gradient Noise and Generalisation
Smaller batch sizes inject more noise into gradient estimates. This noise, counterintuitively, often leads to better generalisation — SGD with small batches tends to find flatter minima, which generalise better than the sharp minima found by full-batch GD.
This is still an active research area. The empirical observation is consistent but the theoretical explanation is disputed.
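The 1/b variance scaling that the next cell measures follows directly from the variance of a sample mean. For a batch $\mathcal{B}$ of size $b$ drawn uniformly without replacement from $n$ examples, with per-example gradients $g_i(w) = 2a_i(w - b_i)$ and per-example gradient variance $\sigma^2(w)$:

$$\operatorname{Var}\!\left[\hat g_b(w)\right] = \operatorname{Var}\!\left[\frac{1}{b}\sum_{i\in\mathcal{B}} g_i(w)\right] = \frac{\sigma^2(w)}{b}\cdot\frac{n-b}{n-1} \approx \frac{\sigma^2(w)}{b} \quad (b \ll n)$$

The factor $(n-b)/(n-1)$ is the finite-population correction for sampling without replacement; it is negligible here since every batch size tested is far smaller than $n = 10000$.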
# Simulate gradient noise variance vs batch size
w0 = w_star + 0.5 # off minimum
grad_variances = {}
for bs in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
    samples_of_grad = []
    for _ in range(500):
        batch = np.random.choice(n_data, bs, replace=False)
        samples_of_grad.append(stochastic_grad(w0, batch))
    grad_variances[bs] = np.var(samples_of_grad)
bs_vals = list(grad_variances.keys())
var_vals = list(grad_variances.values())
plt.figure(figsize=(8, 4))
plt.loglog(bs_vals, var_vals, 'ro-', lw=2)
# Expected: variance ~ sigma^2 / batch_size (CLT)
clt_ref = var_vals[0] * bs_vals[0] / np.array(bs_vals)  # anchor the 1/b reference line at the smallest batch
plt.loglog(bs_vals, clt_ref, 'b--', lw=1.5, label='O(1/batch_size) from CLT')
plt.xlabel('Batch size'); plt.ylabel('Gradient variance')
plt.title('Gradient Noise Scales as 1/batch_size (Central Limit Theorem)')
plt.legend(); plt.grid(True, which='both', alpha=0.3)
plt.tight_layout(); plt.savefig('ch238_noise.png', dpi=100); plt.show()
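The log-log plot should look like a straight line of slope −1; that can be checked numerically rather than by eye. A quick self-contained sketch (fresh seed, fewer batch sizes) that fits the slope with `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(2)
n_data = 10000
A = rng.normal(size=n_data) * 2 + 1
B = rng.normal(size=n_data)
w_star = np.sum(A * B) / np.sum(A)
w0 = w_star + 0.5  # evaluate gradients off the minimum

def stochastic_grad(w, batch_idx):
    return 2 * np.mean(A[batch_idx] * (w - B[batch_idx]))

batch_sizes = [1, 4, 16, 64, 256]
variances = [np.var([stochastic_grad(w0, rng.choice(n_data, b, replace=False))
                     for _ in range(2000)])
             for b in batch_sizes]

# Fit log(var) = slope * log(b) + c; the CLT predicts slope close to -1
slope = np.polyfit(np.log(batch_sizes), np.log(variances), 1)[0]
print(slope)
```

A fitted slope near −1 confirms that quadrupling the batch size buys only a 4× variance reduction, i.e. a 2× reduction in gradient noise standard deviation.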
Summary
| Batch size | Gradient noise | Steps to converge | Generalisation |
|---|---|---|---|
| 1 (pure SGD) | High | Many (but fast per step) | Often best |
| Small (32–256) | Medium | Moderate | Good |
| Full batch | Zero | Few (slow per step) | Sometimes overfit |
Forward reference: ch280 — Hypothesis Testing (Part IX) provides formal tools for reasoning about whether observed differences in model performance are statistically significant.