Part VII: Calculus
1. The Problem with Numerical Derivatives at Scale¶
Numerical differentiation (ch207) requires 2N function evaluations to estimate the gradient of a function with N parameters (using centered differences). A neural network with 10 million parameters would need 20 million forward passes per gradient step.
Automatic differentiation (AD) computes the exact gradient at the cost of a small constant multiple of one forward pass, regardless of the number of parameters.
This is not a numerical approximation. It is exact differentiation (up to floating-point rounding), obtained by applying the chain rule to the elementary operations the computer actually executes.
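The 2N cost is easy to see in code: a centered-difference gradient loops over every parameter and evaluates the function twice per iteration. A minimal sketch (the helper name `numerical_grad` and the toy quadratic loss are illustrative, not from ch207):

```python
import numpy as np

def numerical_grad(f, w, h=1e-6):
    """Centered-difference gradient: 2 evaluations of f per parameter."""
    g = np.zeros_like(w)
    for i in range(w.size):                      # N iterations...
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)   # ...2 evaluations each
    return g

f = lambda w: np.sum(w**2)          # toy loss; true gradient is 2w
w = np.array([0.5, -1.0, 2.0])
g = numerical_grad(f, w)            # 6 evaluations for 3 parameters
print(g)                            # ≈ [1.0, -2.0, 4.0]
```

For 3 parameters this is harmless; for 10 million, each of those `f` calls is a full forward pass.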
2. The Key Insight: Computation Graphs and the Chain Rule¶
Every computation is a composition of elementary operations: +, ×, exp, sin, etc. Each has a known derivative. The chain rule (introduced in ch205, deepened in ch215) tells us how to combine them.
Consider: y = sin(x² + 1)
Decompose into elementary steps:
```
a = x²        da/dx = 2x
b = a + 1     db/da = 1
y = sin(b)    dy/db = cos(b)
```

By the chain rule:

```
dy/dx = (dy/db) · (db/da) · (da/dx)
      = cos(b) · 1 · 2x
      = cos(x² + 1) · 2x
```

AD automates this bookkeeping.
```python
import numpy as np

# ---------------------------------------------------------------
# Minimal forward-mode automatic differentiation using dual numbers.
# A dual number is a (value, derivative) pair; every operation
# propagates both components simultaneously.
# ---------------------------------------------------------------

class Dual:
    """Dual number for forward-mode automatic differentiation.

    x = Dual(value, derivative)
    Arithmetic propagates derivatives via the chain rule.
    """

    def __init__(self, val, deriv=0.0):
        self.val = float(val)
        self.deriv = float(deriv)

    def __add__(self, other):
        if isinstance(other, (int, float)):
            other = Dual(other, 0.0)
        return Dual(self.val + other.val, self.deriv + other.deriv)

    def __radd__(self, other):
        return self.__add__(other)

    def __mul__(self, other):
        if isinstance(other, (int, float)):
            other = Dual(other, 0.0)
        # Product rule: (f·g)' = f'g + fg'
        return Dual(
            self.val * other.val,
            self.deriv * other.val + self.val * other.deriv
        )

    def __rmul__(self, other):
        return self.__mul__(other)

    def __pow__(self, n):
        # Power rule: (x^n)' = n·x^(n-1)
        return Dual(self.val**n, n * self.val**(n - 1) * self.deriv)

    def __sub__(self, other):
        if isinstance(other, (int, float)):
            other = Dual(other, 0.0)
        return Dual(self.val - other.val, self.deriv - other.deriv)

    def __truediv__(self, other):
        if isinstance(other, (int, float)):
            other = Dual(other, 0.0)
        # Quotient rule: (f/g)' = (f'g - fg') / g²
        return Dual(
            self.val / other.val,
            (self.deriv * other.val - self.val * other.deriv) / other.val**2
        )

    def __repr__(self):
        return f'Dual(val={self.val:.6f}, deriv={self.deriv:.6f})'


# Extend numpy-style functions to dual numbers
def dual_sin(x):
    return Dual(np.sin(x.val), np.cos(x.val) * x.deriv)

def dual_cos(x):
    return Dual(np.cos(x.val), -np.sin(x.val) * x.deriv)

def dual_exp(x):
    return Dual(np.exp(x.val), np.exp(x.val) * x.deriv)

def dual_log(x):
    return Dual(np.log(x.val), (1.0 / x.val) * x.deriv)


# Test: f(x) = sin(x^2 + 1), so f'(x) = cos(x^2 + 1) * 2x
def f(x):
    return dual_sin(x**2 + 1)

x_val = 1.5
x = Dual(x_val, 1.0)  # seed derivative = 1 to compute df/dx
result = f(x)

# Analytical derivative for comparison
analytical = np.cos(x_val**2 + 1) * 2 * x_val

print(f'f(x) = sin(x² + 1) at x = {x_val}')
print(f'  AD value:       {result.val:.8f}')
print(f'  Analytical val: {np.sin(x_val**2 + 1):.8f}')
print(f'  AD derivative:  {result.deriv:.8f}')
print(f'  Analytical der: {analytical:.8f}')
print(f'  Error:          {abs(result.deriv - analytical):.2e}')
```

```
f(x) = sin(x² + 1) at x = 1.5
  AD value:       -0.10819513
  Analytical val: -0.10819513
  AD derivative:  -2.98238903
  Analytical der: -2.98238903
  Error:          0.00e+00
```
```python
# More complex example: MSE loss for a 1-layer network,
#   L(w) = mean((sigmoid(x*w) - y)^2)
# Compare the AD gradient to a numerical one.

def dual_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x.val))
    return Dual(s, s * (1 - s) * x.deriv)

# One data point
x_data = 2.0
y_data = 0.8

def loss_fn_ad(w_dual):
    y_hat = dual_sigmoid(w_dual * x_data)
    residual = y_hat - y_data
    return residual * residual  # MSE for 1 point

def loss_fn_scalar(w):
    y_hat = 1 / (1 + np.exp(-w * x_data))
    return (y_hat - y_data)**2

w_test = 0.5
w_dual = Dual(w_test, 1.0)
result = loss_fn_ad(w_dual)

# Numerical gradient via centered differences
h = 1e-7
num_grad = (loss_fn_scalar(w_test + h) - loss_fn_scalar(w_test - h)) / (2 * h)

print('MSE loss for 1-layer sigmoid network:')
print(f'  w = {w_test}, x = {x_data}, y = {y_data}')
print(f'  Loss:           {result.val:.8f}')
print(f'  AD gradient:    {result.deriv:.8f}')
print(f'  Numerical grad: {num_grad:.8f}')
print(f'  Error:          {abs(result.deriv - num_grad):.2e}')
```

```
MSE loss for 1-layer sigmoid network:
  w = 0.5, x = 2.0, y = 0.8
  Loss:           0.00475292
  AD gradient:    -0.05421882
  Numerical grad: -0.05421882
  Error:          3.14e-11
```
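The 3.14e-11 discrepancy above is entirely the numerical gradient's fault: centered differences must pick a step size h, and both too-large and too-small choices hurt. A quick sweep (a sketch reusing the same scalar loss, with illustrative h values):

```python
import numpy as np

def loss(w, x=2.0, y=0.8):
    y_hat = 1.0 / (1.0 + np.exp(-w * x))
    return (y_hat - y)**2

# Analytic gradient at w = 0.5 for reference:
# dL/dw = 2*(s - y) * s*(1 - s) * x,  with s = sigmoid(w*x)
w = 0.5
s = 1.0 / (1.0 + np.exp(-w * 2.0))
exact = 2.0 * (s - 0.8) * s * (1.0 - s) * 2.0

errors = {}
for h in [1e-2, 1e-5, 1e-11]:
    num = (loss(w + h) - loss(w - h)) / (2 * h)
    errors[h] = abs(num - exact)
    print(f'h = {h:.0e}  error = {errors[h]:.2e}')
# Large h: truncation error dominates; tiny h: float cancellation dominates.
```

AD has no h to tune; its only error is ordinary floating-point rounding.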
3. Forward vs Reverse Mode¶
The dual number approach above is forward-mode AD: we propagate derivatives forward from inputs to outputs.
| Mode | Best for | Cost |
|---|---|---|
| Forward | Few inputs, many outputs | O(N inputs) passes |
| Reverse | Many inputs, few outputs | O(M outputs) passes |
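The forward-mode cost in the table is concrete: to recover a full gradient, you rerun the forward pass once per input, seeding that input's derivative to 1 and the others to 0. A standalone sketch with bare (value, derivative) tuples (toy helper names, not the `Dual` class API):

```python
import numpy as np

# A dual number as a bare (value, derivative) tuple.
def d_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def d_mul(a, b):                     # product rule
    return (a[0] * b[0], a[1] * b[0] + a[0] * b[1])

def d_sin(a):                        # chain rule through sin
    return (np.sin(a[0]), np.cos(a[0]) * a[1])

# f(x1, x2) = sin(x1 * x2) + x1  -- two inputs, one output
def f(x1, x2):
    return d_add(d_sin(d_mul(x1, x2)), x1)

# Two inputs => two forward passes, one seed per input.
df_dx1 = f((1.5, 1.0), (2.0, 0.0))[1]   # seed x1' = 1
df_dx2 = f((1.5, 0.0), (2.0, 1.0))[1]   # seed x2' = 1

# Analytical: df/dx1 = cos(x1*x2)*x2 + 1,  df/dx2 = cos(x1*x2)*x1
print(df_dx1, np.cos(3.0) * 2.0 + 1.0)
print(df_dx2, np.cos(3.0) * 1.5)
```

Reverse mode flips this: one backward pass from the single output reaches every input at once, which is why it wins when the number of inputs is large.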
Neural networks have millions of inputs (weights) and typically one output (scalar loss). So reverse-mode AD — also called backpropagation — is the standard. It computes gradients w.r.t. all weights in a single backward pass.
(Backpropagation as a specific algorithm is the subject of ch216.)
```python
# Minimal reverse-mode AD: a computation graph with a backward pass.
# Each node stores its value and knows how to accumulate gradient
# into its children.

class Value:
    """A scalar value in a computation graph with reverse-mode AD."""

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = float(data)
        self.grad = 0.0                 # filled during the backward pass
        self._backward = lambda: None   # no-op by default
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad   # d(self+other)/d(self) = 1
            other.grad += 1.0 * out.grad  # d(self+other)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad  # product rule
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, n):
        out = Value(self.data**n, (self,), f'**{n}')
        def _backward():
            self.grad += n * self.data**(n - 1) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        e = np.exp(self.data)
        out = Value(e, (self,), 'exp')
        def _backward():
            self.grad += e * out.grad
        out._backward = _backward
        return out

    def sigmoid(self):
        s = 1.0 / (1.0 + np.exp(-self.data))
        out = Value(s, (self,), 'sigmoid')
        def _backward():
            self.grad += s * (1 - s) * out.grad
        out._backward = _backward
        return out

    def __radd__(self, other):
        return self.__add__(other)

    def __rmul__(self, other):
        return self.__mul__(other)

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other if isinstance(other, Value) else Value(-other))

    def backward(self):
        """Topological sort, then reverse-mode gradient accumulation."""
        topo = []
        visited = set()
        def build_topo(v):
            if id(v) not in visited:
                visited.add(id(v))
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

    def __repr__(self):
        return f'Value(data={self.data:.4f}, grad={self.grad:.4f})'


# Example: y_hat = sigmoid(w * x + b), loss = (y_hat - target)^2
x = Value(2.0, label='x')
w = Value(0.5, label='w')
b = Value(-0.1, label='b')
target = Value(0.8, label='target')

y_hat = (w * x + b).sigmoid()
loss = (y_hat - target) ** 2
loss.backward()

print('Reverse-mode AD: 1-layer sigmoid network')
print(f'  y_hat = sigmoid(w*x + b) = {y_hat.data:.6f}')
print(f'  loss = (y_hat - target)^2 = {loss.data:.6f}')
print()
print(f'  grad w.r.t. w: {w.grad:.8f}')
print(f'  grad w.r.t. b: {b.grad:.8f}')

# Verify with centered differences
h = 1e-7
def scalar_loss(wv, bv):
    s = 1 / (1 + np.exp(-(wv * 2.0 + bv)))
    return (s - 0.8)**2

num_grad_w = (scalar_loss(0.5 + h, -0.1) - scalar_loss(0.5 - h, -0.1)) / (2 * h)
num_grad_b = (scalar_loss(0.5, -0.1 + h) - scalar_loss(0.5, -0.1 - h)) / (2 * h)

print()
print(f'  Numerical grad w: {num_grad_w:.8f}  error={abs(w.grad - num_grad_w):.2e}')
print(f'  Numerical grad b: {num_grad_b:.8f}  error={abs(b.grad - num_grad_b):.2e}')
```

```
Reverse-mode AD: 1-layer sigmoid network
  y_hat = sigmoid(w*x + b) = 0.710950
  loss = (y_hat - target)^2 = 0.007930

  grad w.r.t. w: -0.07319962
  grad w.r.t. b: -0.03659981

  Numerical grad w: -0.07319962  error=6.97e-12
  Numerical grad b: -0.03659981  error=3.48e-12
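These gradients are what a training loop consumes: after the backward pass, a gradient-descent step just subtracts a multiple of each parameter's grad. A minimal sketch of that loop, recomputing the same analytic sigmoid gradients by hand so the cell stands alone rather than reusing the Value class (the learning rate and step count are illustrative):

```python
import numpy as np

# Gradient descent on loss = (sigmoid(w*x + b) - target)^2
x, target = 2.0, 0.8
w, b, lr = 0.5, -0.1, 1.0

for step in range(200):
    y_hat = 1.0 / (1.0 + np.exp(-(w * x + b)))
    # dL/dz with z = w*x + b:  2*(y_hat - target) * y_hat*(1 - y_hat)
    dz = 2.0 * (y_hat - target) * y_hat * (1.0 - y_hat)
    w -= lr * dz * x   # dz/dw = x
    b -= lr * dz       # dz/db = 1

print(f'final y_hat = {y_hat:.6f}')   # converges toward target = 0.8
```

With the Value class, the hand-derived `dz` lines disappear: each step is a fresh forward graph, `loss.backward()`, then updates from `w.grad` and `b.grad`.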
4. Summary¶
- Numerical differentiation: O(N) function evaluations per gradient — too slow for large N
- Automatic differentiation: exact gradients via the chain rule, at the cost of a single backward pass regardless of N
- Forward-mode AD: dual numbers; efficient when there are few inputs
- Reverse-mode AD: computation graph plus backward pass; efficient when there are few outputs (the ML case)
- PyTorch, JAX, and TensorFlow all implement reverse-mode AD — it is the engine running your training loop
5. Forward References¶
The Value class above is a stripped-down version of PyTorch’s autograd engine. The full algorithm — applied to multi-layer networks — is backpropagation, the subject of ch216 — Backpropagation Intuition. The chain rule that makes AD work is formalized in ch215 — Chain Rule. The computation graph structure connects to ch177 — Matrix Calculus Introduction (Part VI), where Jacobians appear.