
What this Part covers and why it matters

You have spent nine Parts building the mathematical infrastructure that makes deep learning possible. Every concept in this Part has an explicit debt to what came before:

Deep Learning concept   Mathematical foundation
Forward pass            Matrix multiplication (ch153–155), function composition (ch054)
Backpropagation         Chain rule (ch215), partial derivatives (ch210)
Gradient descent        Optimisation landscapes (ch213), gradient intuition (ch209)
Loss functions          KL divergence (ch295), entropy (ch294)
Embeddings              Vector spaces (ch137), PCA (ch178)
Attention               Dot product (ch131), softmax (ch065)
Generative models       Probability distributions (ch249), Bayes (ch246)

Part X is where the mathematics becomes a system. You will build neural networks from scratch, derive every equation, then scale up to the architectures that power modern AI.


The mental shift

In Parts I–IX, functions were things you analysed. In Part X, functions are things you optimise.

A neural network is a parameterised function family. Training is the process of searching that family for the member that minimises a loss. Every “learning” algorithm is ultimately a statement about how to move through parameter space.
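
This picture can be made concrete with a deliberately tiny example (a hypothetical one-parameter family, not from any chapter): fit f_w(x) = w·x by repeatedly stepping w against the gradient of a squared loss.

```python
import numpy as np

# A one-parameter function family f_w(x) = w * x.
# "Training" = searching parameter space for the member minimising the loss.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(0, 0.1, size=100)  # data generated with true w = 3.0

w = 0.0      # starting point in parameter space
lr = 0.1     # step size
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)  # dL/dw for L(w) = mean((w*x - y)^2)
    w -= lr * grad                       # move against the gradient

print(f"learned w = {w:.3f}")  # converges near 3.0
```

Every network in this Part is this loop scaled up: more parameters, a composed function, and the chain rule to compute the gradient.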

The shift:

  • From solving equations → approximating functions with gradient-guided search

  • From fixed transformations → learned transformations

  • From one model → compositions of millions of simple operations


Part map

FOUNDATIONS (301–309)
  Perceptron → MLP → Forward Pass → Loss → Backprop → SGD → Init → Activations
        ↓
TRAINING ENGINEERING (310–313)
  BatchNorm → Dropout → Optimisers → LR Schedules
        ↓
ARCHITECTURES (314–323)
  CNNs → RNNs → LSTMs → Seq2Seq → Embeddings → Attention → Transformers → Positional Encoding
        ↓
ADVANCED MODELS (324–330)
  Self-Supervised → VAE → GAN → Diffusion → GNN → RL → Policy Gradients
        ↓
SYSTEMS (331–333)
  Interpretability → Scaling Laws → Training Infrastructure
        ↓
PROJECTS (334–339)
  NumPy Net → CNN → Char-LM → Transformer Block → VAE → RL CartPole
        ↓
CAPSTONE II (340)
  End-to-End Deep Learning System

Prerequisites from prior Parts

Required — you must be comfortable with these before proceeding:

  • Matrix multiplication and shapes (ch153–ch155)

  • Chain rule and partial derivatives (ch209–ch211, ch215)

  • Gradient descent (ch212–ch213)

  • Sigmoid and activation functions (ch064–ch065)

  • Probability distributions and expectation (ch248–ch250)

  • Entropy and KL divergence (ch294–ch295)

  • NumPy vectorised operations (ch147)

Useful but not blocking:

  • SVD and PCA (ch177–ch178)

  • Markov chains (ch258)

  • Information theory (ch293)
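
If you want a two-minute self-check on the first two required items, the sketch below (illustrative, not taken from the referenced chapters) verifies matrix shapes and confirms a chain-rule derivative against a numerical one:

```python
import numpy as np

# Shape check: (batch, d_in) @ (d_in, d_out) -> (batch, d_out)
X = np.random.default_rng(1).normal(size=(4, 3))
W = np.random.default_rng(2).normal(size=(3, 2))
assert (X @ W).shape == (4, 2)

# Chain rule check: f(w) = sigmoid(w * x), so df/dw = sigmoid'(w*x) * x
sigmoid = lambda z: 1 / (1 + np.exp(-z))
w, x = 0.7, 1.3
s = sigmoid(w * x)
analytic = s * (1 - s) * x                # sigmoid'(z) = s * (1 - s)
h = 1e-6                                  # central-difference step
numeric = (sigmoid((w + h) * x) - sigmoid((w - h) * x)) / (2 * h)
print(abs(analytic - numeric))  # tiny: the two derivations agree
```

If either check surprises you, revisit the chapters listed above before continuing.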


Motivating problem: Can you solve this right now?

Below, we generate a dataset of 500 points in 2D space sampled from two interleaved spirals. No linear classifier can separate them, and no low-degree polynomial feature expansion captures the boundary.

By the end of Part X, you will build the model that solves this from mathematical first principles.

import numpy as np
import matplotlib.pyplot as plt

def make_spirals(n=250, noise=0.4, seed=42):
    rng = np.random.default_rng(seed)
    def spiral_arm(n, delta_theta, label):
        theta = np.linspace(0, 4 * np.pi, n) + delta_theta
        r = theta / (4 * np.pi)
        x = r * np.cos(theta) + rng.normal(0, noise * 0.1, n)
        y = r * np.sin(theta) + rng.normal(0, noise * 0.1, n)
        return np.stack([x, y], axis=1), np.full(n, label)
    X0, y0 = spiral_arm(n, 0, 0)
    X1, y1 = spiral_arm(n, np.pi, 1)
    X = np.vstack([X0, X1])
    y = np.hstack([y0, y1])
    perm = rng.permutation(len(y))  # use the seeded generator, not the global state
    return X[perm], y[perm]

X, y = make_spirals()

fig, ax = plt.subplots(figsize=(6, 6))
colors = ['#e74c3c', '#3498db']
for cls in [0, 1]:
    mask = y == cls
    ax.scatter(X[mask, 0], X[mask, 1], s=15, alpha=0.7,
               color=colors[cls], label=f'Class {cls}')
ax.set_title('Two-Spiral Problem\n(not linearly separable)', fontsize=13)
ax.set_aspect('equal')
ax.legend()
plt.tight_layout()
plt.savefig('spiral_problem.png', dpi=120)
plt.show()
print(f"Dataset: {X.shape[0]} points, {X.shape[1]} features, 2 classes")
print("Logistic regression accuracy on this:", end=" ")

# Attempt with logistic regression to show it fails
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
lr = LogisticRegression(max_iter=1000)
scores = cross_val_score(lr, X, y, cv=5)
print(f"{scores.mean():.1%} ± {scores.std():.1%}  ← near chance")
print("\nBy ch334, you will build a network that achieves >97% on this.")

Logistic regression sits at roughly 50%: chance level. The spiral boundary is nonlinear in a way that no fixed, hand-written feature expansion of low degree captures.

The solution is a learned hierarchy of features — which is exactly what a multilayer network learns. Part X builds that understanding from the perceptron up.
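
As a preview of what "learned hierarchy" means mechanically, here is a minimal sketch (hypothetical sizes and hyperparameters, with XOR, the smallest linearly inseparable problem, standing in for the spirals): a hidden layer learns features under which a linear readout becomes sufficient.

```python
import numpy as np

# XOR: the smallest dataset no linear model can separate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1 = rng.normal(0, 1.0, (2, 8)); b1 = np.zeros(8)   # layer 1: learned features
W2 = rng.normal(0, 1.0, (8, 1)); b2 = np.zeros(1)   # layer 2: linear readout

sigmoid = lambda z: 1 / (1 + np.exp(-z))
for _ in range(10000):
    h = np.tanh(X @ W1 + b1)           # hidden features (learned, not hand-written)
    p = sigmoid(h @ W2 + b2)           # predicted probability of class 1
    # Backprop: d(BCE)/d(logit) = p - y, then the chain rule through each layer
    d_logit = (p - y) / len(X)
    dW2 = h.T @ d_logit; db2 = d_logit.sum(0)
    d_h = d_logit @ W2.T * (1 - h**2)  # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_h; db1 = d_h.sum(0)
    for P, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        P -= 0.5 * g                   # gradient step, lr = 0.5

pred = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int).ravel()
print(pred)  # should recover XOR: [0 1 1 0]
```

The forward pass, loss, backprop, and update rule used here are exactly the topics of ch303–ch306, derived in full.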

Proceed to ch301 — What is Deep Learning?