Advanced Calculus Experiment 8.
Why does gradient descent converge even when we only compute the gradient on a tiny random batch? This experiment builds the theoretical intuition: SGD (ch227) is a stochastic process, and its convergence guarantees come from probability theory (Part VIII) as much as from calculus.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
# Ground truth: sum of quadratics, simulating many loss terms
n_data = 10000
A = np.random.randn(n_data) * 2 + 1  # each 'data item' contributes a_i*(w-b_i)^2; some a_i are negative, but mean(a_i) ~ 1, so the averaged loss stays convex
B = np.random.randn(n_data)
# Full loss: L(w) = (1/n) sum_i a_i * (w - b_i)^2
# Minimiser: w* = sum(a_i * b_i) / sum(a_i)
# Full gradient: dL/dw = (2/n) sum_i a_i * (w - b_i)
def full_loss(w): return np.mean(A * (w - B)**2)
def full_grad(w): return 2 * np.mean(A * (w - B))
def stochastic_grad(w, batch_idx): return 2 * np.mean(A[batch_idx] * (w - B[batch_idx]))
w_star = np.sum(A * B) / np.sum(A)
print(f"True minimum: w* = {w_star:.6f}")
print(f"Full loss at w*: {full_loss(w_star):.8f}")
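Before comparing batch sizes, it is worth verifying the property that makes SGD work at all: a mini-batch gradient drawn uniformly at random is an unbiased estimator of the full gradient. A minimal self-contained sketch (re-creating `A`, `B`, and the gradient functions defined above, with a fresh seed):

```python
import numpy as np

rng = np.random.default_rng(0)
n_data = 10000
A = rng.normal(size=n_data) * 2 + 1
B = rng.normal(size=n_data)

def full_grad(w):
    return 2 * np.mean(A * (w - B))

def stochastic_grad(w, batch_idx):
    return 2 * np.mean(A[batch_idx] * (w - B[batch_idx]))

# Average many independent mini-batch gradients at a fixed point w;
# the average should converge to the full gradient (unbiasedness).
w = 3.0
est = np.mean([stochastic_grad(w, rng.choice(n_data, 16, replace=False))
               for _ in range(20000)])
print(est, full_grad(w))
```

Unbiasedness is what lets the convergence proofs treat each SGD step as "the true gradient step plus zero-mean noise".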
# Compare batch sizes: 1, 16, 256, full
batch_sizes = [1, 16, 256, n_data]
n_iters = 2000
lr = 0.1
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
for bs, color in zip(batch_sizes, ['red', 'orange', 'green', 'blue']):
    w = 3.0  # start far from the minimum
    losses, errors = [], []
    for i in range(n_iters):
        batch = np.random.choice(n_data, min(bs, n_data), replace=False)
        g = stochastic_grad(w, batch)
        w = w - lr * g
        losses.append(full_loss(w))
        errors.append(abs(w - w_star))
    label = f'batch={bs}'
    axes[0].semilogy(losses, color=color, lw=1.2, alpha=0.8, label=label)
    axes[1].semilogy(errors, color=color, lw=1.2, alpha=0.8, label=label)
for ax, title in zip(axes, ['Full Loss vs Iteration', '|w - w*| vs Iteration']):
    ax.set_xlabel('Iteration'); ax.set_ylabel('Value (log scale)')
    ax.set_title(title); ax.legend(); ax.grid(True, which='both', alpha=0.3)
plt.suptitle('SGD Convergence by Batch Size', fontsize=12)
plt.tight_layout(); plt.savefig('ch238_sgd_theory.png', dpi=100); plt.show()
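With a constant learning rate, the small-batch curves above do not converge to zero error: they settle into a stationary "noise ball" around w* whose radius shrinks with the learning rate. A hedged sketch of this effect (self-contained, with a fresh seed; `tail_error` is a helper introduced here, not part of the experiment above):

```python
import numpy as np

rng = np.random.default_rng(1)
n_data = 10000
A = rng.normal(size=n_data) * 2 + 1
B = rng.normal(size=n_data)
w_star = np.sum(A * B) / np.sum(A)

def stochastic_grad(w, batch_idx):
    return 2 * np.mean(A[batch_idx] * (w - B[batch_idx]))

def tail_error(lr, bs=16, n_iters=4000):
    # Run SGD and average |w - w*| over the second half of the run,
    # after the iterate has settled into its stationary noise ball.
    w = 3.0
    errs = []
    for i in range(n_iters):
        batch = rng.choice(n_data, bs, replace=False)
        w -= lr * stochastic_grad(w, batch)
        if i >= n_iters // 2:
            errs.append(abs(w - w_star))
    return np.mean(errs)

e_big, e_small = tail_error(0.1), tail_error(0.01)
print(e_big, e_small)  # the smaller learning rate should give a smaller floor
```

This is why practical SGD schedules decay the learning rate: shrinking the step size shrinks the noise floor, trading per-step progress for final accuracy.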
Gradient Noise and Generalisation
Smaller batch sizes inject more noise into gradient estimates. This noise, counterintuitively, often leads to better generalisation — SGD with small batches tends to find flatter minima, which generalise better than the sharp minima found by full-batch GD.
This is still an active research area. The empirical observation is consistent but the theoretical explanation is disputed.
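The 1/b variance scaling that the next cell measures follows directly from the variance of a sample mean. For a batch $\mathcal{B}$ of size $b$ drawn uniformly without replacement from $n$ examples, with per-example gradients $g_i(w) = 2a_i(w - b_i)$ and per-example gradient variance $\sigma^2(w)$:

$$\operatorname{Var}\!\left[\hat g_b(w)\right] = \operatorname{Var}\!\left[\frac{1}{b}\sum_{i\in\mathcal{B}} g_i(w)\right] = \frac{\sigma^2(w)}{b}\cdot\frac{n-b}{n-1} \approx \frac{\sigma^2(w)}{b} \quad (b \ll n)$$

The factor $(n-b)/(n-1)$ is the finite-population correction for sampling without replacement; it is negligible here since every batch size tested is far smaller than $n = 10000$.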
# Simulate gradient noise variance vs batch size
w0 = w_star + 0.5 # off minimum
grad_variances = {}
for bs in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
    samples_of_grad = []
    for _ in range(500):
        batch = np.random.choice(n_data, bs, replace=False)
        samples_of_grad.append(stochastic_grad(w0, batch))
    grad_variances[bs] = np.var(samples_of_grad)
bs_vals = list(grad_variances.keys())
var_vals = list(grad_variances.values())
plt.figure(figsize=(8, 4))
plt.loglog(bs_vals, var_vals, 'ro-', lw=2)
# Expected: variance ~ sigma^2 / batch_size (CLT)
clt_ref = var_vals[0] * bs_vals[0] / np.array(bs_vals)  # anchor the 1/b reference line at the smallest batch
plt.loglog(bs_vals, clt_ref, 'b--', lw=1.5, label='O(1/batch_size) from CLT')
plt.xlabel('Batch size'); plt.ylabel('Gradient variance')
plt.title('Gradient Noise Scales as 1/batch_size (Central Limit Theorem)')
plt.legend(); plt.grid(True, which='both', alpha=0.3)
plt.tight_layout(); plt.savefig('ch238_noise.png', dpi=100); plt.show()
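The log-log plot should look like a straight line of slope −1; that can be checked numerically rather than by eye. A quick self-contained sketch (fresh seed, fewer batch sizes) that fits the slope with `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(2)
n_data = 10000
A = rng.normal(size=n_data) * 2 + 1
B = rng.normal(size=n_data)
w_star = np.sum(A * B) / np.sum(A)
w0 = w_star + 0.5  # evaluate gradients off the minimum

def stochastic_grad(w, batch_idx):
    return 2 * np.mean(A[batch_idx] * (w - B[batch_idx]))

batch_sizes = [1, 4, 16, 64, 256]
variances = [np.var([stochastic_grad(w0, rng.choice(n_data, b, replace=False))
                     for _ in range(2000)])
             for b in batch_sizes]

# Fit log(var) = slope * log(b) + c; the CLT predicts slope close to -1
slope = np.polyfit(np.log(batch_sizes), np.log(variances), 1)[0]
print(slope)
```

A fitted slope near −1 confirms that quadrupling the batch size buys only a 4× variance reduction, i.e. a 2× reduction in gradient noise standard deviation.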
Summary
| Batch size | Gradient noise | Steps to converge | Generalisation |
|---|---|---|---|
| 1 (pure SGD) | High | Many (but fast per step) | Often best |
| Small (32–256) | Medium | Moderate | Good |
| Full batch | Zero | Few (slow per step) | Sometimes overfit |
Forward reference: ch280 — Hypothesis Testing (Part IX) provides formal tools for reasoning about whether observed differences in model performance are statistically significant.