Part VII: Calculus
1. The Problem with Numerical Derivatives at Scale¶
Numerical differentiation (ch207) requires 2N function evaluations to estimate the gradient of a function with N parameters (using centered differences). A neural network with 10 million parameters would need 20 million forward passes per gradient step.
Automatic differentiation (AD) computes the exact gradient at the cost of a small constant multiple of one forward pass, regardless of the number of parameters.
This is not a numerical approximation. It is exact differentiation (up to floating-point rounding), obtained by applying the chain rule to the elementary operations the computer actually executes.
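The 2N cost is easy to see in code: a centered-difference gradient loops over every parameter and evaluates the function twice per iteration. A minimal sketch (the helper name `numerical_grad` and the toy quadratic loss are illustrative, not from ch207):

```python
import numpy as np

def numerical_grad(f, w, h=1e-6):
    """Centered-difference gradient: 2 evaluations of f per parameter."""
    g = np.zeros_like(w)
    for i in range(w.size):                      # N iterations...
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)   # ...2 evaluations each
    return g

f = lambda w: np.sum(w**2)          # toy loss; true gradient is 2w
w = np.array([0.5, -1.0, 2.0])
g = numerical_grad(f, w)            # 6 evaluations for 3 parameters
print(g)                            # ≈ [1.0, -2.0, 4.0]
```

For 3 parameters this is harmless; for 10 million, each of those `f` calls is a full forward pass.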
2. The Key Insight: Computation Graphs and the Chain Rule¶
Every computation is a composition of elementary operations: +, ×, exp, sin, etc. Each has a known derivative. The chain rule (introduced in ch205, deepened in ch215) tells us how to combine them.
Consider: y = sin(x² + 1)
Decompose into elementary steps:
```
a = x²        da/dx = 2x
b = a + 1     db/da = 1
y = sin(b)    dy/db = cos(b)
```

By the chain rule:

```
dy/dx = (dy/db) · (db/da) · (da/dx)
      = cos(b) · 1 · 2x
      = cos(x² + 1) · 2x
```

AD automates this bookkeeping.
```python
import numpy as np

# ---------------------------------------------------------------
# Minimal forward-mode automatic differentiation using dual numbers.
# A dual number is a (value, derivative) pair; every operation
# propagates both components simultaneously.
# ---------------------------------------------------------------

class Dual:
    """Dual number for forward-mode automatic differentiation.

    x = Dual(value, derivative)
    Arithmetic propagates derivatives via the chain rule.
    """

    def __init__(self, val, deriv=0.0):
        self.val = float(val)
        self.deriv = float(deriv)

    def __add__(self, other):
        if isinstance(other, (int, float)):
            other = Dual(other, 0.0)
        return Dual(self.val + other.val, self.deriv + other.deriv)

    def __radd__(self, other):
        return self.__add__(other)

    def __mul__(self, other):
        if isinstance(other, (int, float)):
            other = Dual(other, 0.0)
        # Product rule: (f·g)' = f'g + fg'
        return Dual(
            self.val * other.val,
            self.deriv * other.val + self.val * other.deriv
        )

    def __rmul__(self, other):
        return self.__mul__(other)

    def __pow__(self, n):
        # Power rule: (x^n)' = n·x^(n-1)
        return Dual(self.val**n, n * self.val**(n - 1) * self.deriv)

    def __sub__(self, other):
        if isinstance(other, (int, float)):
            other = Dual(other, 0.0)
        return Dual(self.val - other.val, self.deriv - other.deriv)

    def __truediv__(self, other):
        if isinstance(other, (int, float)):
            other = Dual(other, 0.0)
        # Quotient rule: (f/g)' = (f'g - fg') / g²
        return Dual(
            self.val / other.val,
            (self.deriv * other.val - self.val * other.deriv) / other.val**2
        )

    def __repr__(self):
        return f'Dual(val={self.val:.6f}, deriv={self.deriv:.6f})'


# Extend numpy-style functions to dual numbers
def dual_sin(x):
    return Dual(np.sin(x.val), np.cos(x.val) * x.deriv)

def dual_cos(x):
    return Dual(np.cos(x.val), -np.sin(x.val) * x.deriv)

def dual_exp(x):
    return Dual(np.exp(x.val), np.exp(x.val) * x.deriv)

def dual_log(x):
    return Dual(np.log(x.val), (1.0 / x.val) * x.deriv)


# Test: f(x) = sin(x^2 + 1), so f'(x) = cos(x^2 + 1) * 2x
def f(x):
    return dual_sin(x**2 + 1)

x_val = 1.5
x = Dual(x_val, 1.0)  # seed derivative = 1 to compute df/dx
result = f(x)

# Analytical derivative for comparison
analytical = np.cos(x_val**2 + 1) * 2 * x_val

print(f'f(x) = sin(x² + 1) at x = {x_val}')
print(f'  AD value:       {result.val:.8f}')
print(f'  Analytical val: {np.sin(x_val**2 + 1):.8f}')
print(f'  AD derivative:  {result.deriv:.8f}')
print(f'  Analytical der: {analytical:.8f}')
print(f'  Error:          {abs(result.deriv - analytical):.2e}')
```

```
f(x) = sin(x² + 1) at x = 1.5
  AD value:       -0.10819513
  Analytical val: -0.10819513
  AD derivative:  -2.98238903
  Analytical der: -2.98238903
  Error:          0.00e+00
```
```python
# More complex example: MSE loss for a 1-layer network,
#   L(w) = mean((sigmoid(x*w) - y)^2)
# Compare the AD gradient to a numerical one.

def dual_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x.val))
    return Dual(s, s * (1 - s) * x.deriv)

# One data point
x_data = 2.0
y_data = 0.8

def loss_fn_ad(w_dual):
    y_hat = dual_sigmoid(w_dual * x_data)
    residual = y_hat - y_data
    return residual * residual  # MSE for 1 point

def loss_fn_scalar(w):
    y_hat = 1 / (1 + np.exp(-w * x_data))
    return (y_hat - y_data)**2

w_test = 0.5
w_dual = Dual(w_test, 1.0)
result = loss_fn_ad(w_dual)

# Numerical gradient via centered differences
h = 1e-7
num_grad = (loss_fn_scalar(w_test + h) - loss_fn_scalar(w_test - h)) / (2 * h)

print('MSE loss for 1-layer sigmoid network:')
print(f'  w = {w_test}, x = {x_data}, y = {y_data}')
print(f'  Loss:           {result.val:.8f}')
print(f'  AD gradient:    {result.deriv:.8f}')
print(f'  Numerical grad: {num_grad:.8f}')
print(f'  Error:          {abs(result.deriv - num_grad):.2e}')
```

```
MSE loss for 1-layer sigmoid network:
  w = 0.5, x = 2.0, y = 0.8
  Loss:           0.00475292
  AD gradient:    -0.05421882
  Numerical grad: -0.05421882
  Error:          3.14e-11
```
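The 3.14e-11 discrepancy above is entirely the numerical gradient's fault: centered differences must pick a step size h, and both too-large and too-small choices hurt. A quick sweep (a sketch reusing the same scalar loss, with illustrative h values):

```python
import numpy as np

def loss(w, x=2.0, y=0.8):
    y_hat = 1.0 / (1.0 + np.exp(-w * x))
    return (y_hat - y)**2

# Analytic gradient at w = 0.5 for reference:
# dL/dw = 2*(s - y) * s*(1 - s) * x,  with s = sigmoid(w*x)
w = 0.5
s = 1.0 / (1.0 + np.exp(-w * 2.0))
exact = 2.0 * (s - 0.8) * s * (1.0 - s) * 2.0

errors = {}
for h in [1e-2, 1e-5, 1e-11]:
    num = (loss(w + h) - loss(w - h)) / (2 * h)
    errors[h] = abs(num - exact)
    print(f'h = {h:.0e}  error = {errors[h]:.2e}')
# Large h: truncation error dominates; tiny h: float cancellation dominates.
```

AD has no h to tune; its only error is ordinary floating-point rounding.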
3. Forward vs Reverse Mode¶
The dual number approach above is forward-mode AD: we propagate derivatives forward from inputs to outputs.
| Mode | Best for | Cost |
|---|---|---|
| Forward | Few inputs, many outputs | O(N inputs) passes |
| Reverse | Many inputs, few outputs | O(M outputs) passes |
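The forward-mode cost in the table is concrete: to recover a full gradient, you rerun the forward pass once per input, seeding that input's derivative to 1 and the others to 0. A standalone sketch with bare (value, derivative) tuples (toy helper names, not the `Dual` class API):

```python
import numpy as np

# A dual number as a bare (value, derivative) tuple.
def d_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def d_mul(a, b):                     # product rule
    return (a[0] * b[0], a[1] * b[0] + a[0] * b[1])

def d_sin(a):                        # chain rule through sin
    return (np.sin(a[0]), np.cos(a[0]) * a[1])

# f(x1, x2) = sin(x1 * x2) + x1  -- two inputs, one output
def f(x1, x2):
    return d_add(d_sin(d_mul(x1, x2)), x1)

# Two inputs => two forward passes, one seed per input.
df_dx1 = f((1.5, 1.0), (2.0, 0.0))[1]   # seed x1' = 1
df_dx2 = f((1.5, 0.0), (2.0, 1.0))[1]   # seed x2' = 1

# Analytical: df/dx1 = cos(x1*x2)*x2 + 1,  df/dx2 = cos(x1*x2)*x1
print(df_dx1, np.cos(3.0) * 2.0 + 1.0)
print(df_dx2, np.cos(3.0) * 1.5)
```

Reverse mode flips this: one backward pass from the single output reaches every input at once, which is why it wins when the number of inputs is large.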
Neural networks have millions of inputs (weights) and typically one output (scalar loss). So reverse-mode AD — also called backpropagation — is the standard. It computes gradients w.r.t. all weights in a single backward pass.
(Backpropagation as a specific algorithm is the subject of ch216.)
```python
# Minimal reverse-mode AD: a computation graph with a backward pass.
# Each node stores its value and knows how to accumulate gradient
# into its children.

class Value:
    """A scalar value in a computation graph with reverse-mode AD."""

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = float(data)
        self.grad = 0.0                 # filled during the backward pass
        self._backward = lambda: None   # no-op by default
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad   # d(self+other)/d(self) = 1
            other.grad += 1.0 * out.grad  # d(self+other)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad  # product rule
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, n):
        out = Value(self.data**n, (self,), f'**{n}')
        def _backward():
            self.grad += n * self.data**(n - 1) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        e = np.exp(self.data)
        out = Value(e, (self,), 'exp')
        def _backward():
            self.grad += e * out.grad
        out._backward = _backward
        return out

    def sigmoid(self):
        s = 1.0 / (1.0 + np.exp(-self.data))
        out = Value(s, (self,), 'sigmoid')
        def _backward():
            self.grad += s * (1 - s) * out.grad
        out._backward = _backward
        return out

    def __radd__(self, other):
        return self.__add__(other)

    def __rmul__(self, other):
        return self.__mul__(other)

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other if isinstance(other, Value) else Value(-other))

    def backward(self):
        """Topological sort, then reverse-mode gradient accumulation."""
        topo = []
        visited = set()
        def build_topo(v):
            if id(v) not in visited:
                visited.add(id(v))
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

    def __repr__(self):
        return f'Value(data={self.data:.4f}, grad={self.grad:.4f})'


# Example: y_hat = sigmoid(w * x + b), loss = (y_hat - target)^2
x = Value(2.0, label='x')
w = Value(0.5, label='w')
b = Value(-0.1, label='b')
target = Value(0.8, label='target')

y_hat = (w * x + b).sigmoid()
loss = (y_hat - target) ** 2
loss.backward()

print('Reverse-mode AD: 1-layer sigmoid network')
print(f'  y_hat = sigmoid(w*x + b) = {y_hat.data:.6f}')
print(f'  loss = (y_hat - target)^2 = {loss.data:.6f}')
print()
print(f'  grad w.r.t. w: {w.grad:.8f}')
print(f'  grad w.r.t. b: {b.grad:.8f}')

# Verify with centered differences
h = 1e-7
def scalar_loss(wv, bv):
    s = 1 / (1 + np.exp(-(wv * 2.0 + bv)))
    return (s - 0.8)**2

num_grad_w = (scalar_loss(0.5 + h, -0.1) - scalar_loss(0.5 - h, -0.1)) / (2 * h)
num_grad_b = (scalar_loss(0.5, -0.1 + h) - scalar_loss(0.5, -0.1 - h)) / (2 * h)

print()
print(f'  Numerical grad w: {num_grad_w:.8f}  error={abs(w.grad - num_grad_w):.2e}')
print(f'  Numerical grad b: {num_grad_b:.8f}  error={abs(b.grad - num_grad_b):.2e}')
```

```
Reverse-mode AD: 1-layer sigmoid network
  y_hat = sigmoid(w*x + b) = 0.710950
  loss = (y_hat - target)^2 = 0.007930

  grad w.r.t. w: -0.07319962
  grad w.r.t. b: -0.03659981

  Numerical grad w: -0.07319962  error=6.97e-12
  Numerical grad b: -0.03659981  error=3.48e-12
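These gradients are what a training loop consumes: after the backward pass, a gradient-descent step just subtracts a multiple of each parameter's grad. A minimal sketch of that loop, recomputing the same analytic sigmoid gradients by hand so the cell stands alone rather than reusing the Value class (the learning rate and step count are illustrative):

```python
import numpy as np

# Gradient descent on loss = (sigmoid(w*x + b) - target)^2
x, target = 2.0, 0.8
w, b, lr = 0.5, -0.1, 1.0

for step in range(200):
    y_hat = 1.0 / (1.0 + np.exp(-(w * x + b)))
    # dL/dz with z = w*x + b:  2*(y_hat - target) * y_hat*(1 - y_hat)
    dz = 2.0 * (y_hat - target) * y_hat * (1.0 - y_hat)
    w -= lr * dz * x   # dz/dw = x
    b -= lr * dz       # dz/db = 1

print(f'final y_hat = {y_hat:.6f}')   # converges toward target = 0.8
```

With the Value class, the hand-derived `dz` lines disappear: each step is a fresh forward graph, `loss.backward()`, then updates from `w.grad` and `b.grad`.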
4. Summary¶
- Numerical differentiation: O(N) function evaluations per gradient — too slow for large N
- Automatic differentiation: exact gradients via the chain rule, at the cost of a single backward pass regardless of N
- Forward-mode AD: dual numbers; efficient when there are few inputs
- Reverse-mode AD: computation graph plus backward pass; efficient when there are few outputs (the ML case)
- PyTorch, JAX, and TensorFlow all implement reverse-mode AD — it is the engine running your training loop
5. Forward References¶
The Value class above is a stripped-down version of PyTorch’s autograd engine. The full algorithm — applied to multi-layer networks — is backpropagation, the subject of ch216 — Backpropagation Intuition. The chain rule that makes AD work is formalized in ch215 — Chain Rule. The computation graph structure connects to ch177 — Matrix Calculus Introduction (Part VI), where Jacobians appear.