
What this Part covers and why it matters

You have spent nine Parts building the mathematical infrastructure that makes deep learning possible. Every concept in this Part has an explicit debt to what came before:

Deep Learning concept   Mathematical foundation
Forward pass            Matrix multiplication (ch153–155), function composition (ch054)
Backpropagation         Chain rule (ch215), partial derivatives (ch210)
Gradient descent        Optimisation landscapes (ch213), gradient intuition (ch209)
Loss functions          KL divergence (ch295), entropy (ch294)
Embeddings              Vector spaces (ch137), PCA (ch178)
Attention               Dot product (ch131), softmax (ch065)
Generative models       Probability distributions (ch249), Bayes (ch246)

Part X is where the mathematics becomes a system. You will build neural networks from scratch, derive every equation, then scale up to the architectures that power modern AI.


The mental shift

In Parts I–IX, functions were things you analysed. In Part X, functions are things you optimise.

A neural network is a parameterised function family. Training is the process of searching that family for the member that minimises a loss. Every “learning” algorithm is ultimately a statement about how to move through parameter space.
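
This picture can be made concrete with a deliberately tiny example (a hypothetical one-parameter family, not from any chapter): fit f_w(x) = w·x by repeatedly stepping w against the gradient of a squared loss.

```python
import numpy as np

# A one-parameter function family f_w(x) = w * x.
# "Training" = searching parameter space for the member minimising the loss.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(0, 0.1, size=100)  # data generated with true w = 3.0

w = 0.0      # starting point in parameter space
lr = 0.1     # step size
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)  # dL/dw for L(w) = mean((w*x - y)^2)
    w -= lr * grad                       # move against the gradient

print(f"learned w = {w:.3f}")  # converges near 3.0
```

Every network in this Part is this loop scaled up: more parameters, a composed function, and the chain rule to compute the gradient.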

The shift:

  • From solving equations → approximating functions with gradient-guided search

  • From fixed transformations → learned transformations

  • From one model → compositions of millions of simple operations


Part map

FOUNDATIONS (301–309)
  Perceptron → MLP → Forward Pass → Loss → Backprop → SGD → Init → Activations
        ↓
TRAINING ENGINEERING (310–313)
  BatchNorm → Dropout → Optimisers → LR Schedules
        ↓
ARCHITECTURES (314–323)
  CNNs → RNNs → LSTMs → Seq2Seq → Embeddings → Attention → Transformers → Positional Encoding
        ↓
ADVANCED MODELS (324–330)
  Self-Supervised → VAE → GAN → Diffusion → GNN → RL → Policy Gradients
        ↓
SYSTEMS (331–333)
  Interpretability → Scaling Laws → Training Infrastructure
        ↓
PROJECTS (334–339)
  NumPy Net → CNN → Char-LM → Transformer Block → VAE → RL CartPole
        ↓
CAPSTONE II (340)
  End-to-End Deep Learning System

Prerequisites from prior Parts

Required — you must be comfortable with these before proceeding:

  • Matrix multiplication and shapes (ch153–ch155)

  • Chain rule and partial derivatives (ch209–ch211, ch215)

  • Gradient descent (ch212–ch213)

  • Sigmoid and activation functions (ch064–ch065)

  • Probability distributions and expectation (ch248–ch250)

  • Entropy and KL divergence (ch294–ch295)

  • NumPy vectorised operations (ch147)

Useful but not blocking:

  • SVD and PCA (ch177–ch178)

  • Markov chains (ch258)

  • Information theory (ch293)
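
If you want a two-minute self-check on the first two required items, the sketch below (illustrative, not taken from the referenced chapters) verifies matrix shapes and confirms a chain-rule derivative against a numerical one:

```python
import numpy as np

# Shape check: (batch, d_in) @ (d_in, d_out) -> (batch, d_out)
X = np.random.default_rng(1).normal(size=(4, 3))
W = np.random.default_rng(2).normal(size=(3, 2))
assert (X @ W).shape == (4, 2)

# Chain rule check: f(w) = sigmoid(w * x), so df/dw = sigmoid'(w*x) * x
sigmoid = lambda z: 1 / (1 + np.exp(-z))
w, x = 0.7, 1.3
s = sigmoid(w * x)
analytic = s * (1 - s) * x                # sigmoid'(z) = s * (1 - s)
h = 1e-6                                  # central-difference step
numeric = (sigmoid((w + h) * x) - sigmoid((w - h) * x)) / (2 * h)
print(abs(analytic - numeric))  # tiny: the two derivations agree
```

If either check surprises you, revisit the chapters listed above before continuing.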


Motivating problem: Can you solve this right now?

Below, we generate a dataset of 500 points in 2D space sampled from two interleaved spirals. No linear classifier can separate them, and no low-degree polynomial feature expansion captures the boundary.

By the end of Part X, you will build the model that solves this from mathematical first principles.

import numpy as np
import matplotlib.pyplot as plt

def make_spirals(n=250, noise=0.4, seed=42):
    rng = np.random.default_rng(seed)
    def spiral_arm(n, delta_theta, label):
        theta = np.linspace(0, 4 * np.pi, n) + delta_theta
        r = theta / (4 * np.pi)
        x = r * np.cos(theta) + rng.normal(0, noise * 0.1, n)
        y = r * np.sin(theta) + rng.normal(0, noise * 0.1, n)
        return np.stack([x, y], axis=1), np.full(n, label)
    X0, y0 = spiral_arm(n, 0, 0)
    X1, y1 = spiral_arm(n, np.pi, 1)
    X = np.vstack([X0, X1])
    y = np.hstack([y0, y1])
    perm = rng.permutation(len(y))  # use the seeded generator, not the global state
    return X[perm], y[perm]

X, y = make_spirals()

fig, ax = plt.subplots(figsize=(6, 6))
colors = ['#e74c3c', '#3498db']
for cls in [0, 1]:
    mask = y == cls
    ax.scatter(X[mask, 0], X[mask, 1], s=15, alpha=0.7,
               color=colors[cls], label=f'Class {cls}')
ax.set_title('Two-Spiral Problem\n(not linearly separable)', fontsize=13)
ax.set_aspect('equal')
ax.legend()
plt.tight_layout()
plt.savefig('spiral_problem.png', dpi=120)
plt.show()
print(f"Dataset: {X.shape[0]} points, {X.shape[1]} features, 2 classes")
print("Logistic regression accuracy on this:", end=" ")

# Attempt with logistic regression to show it fails
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
lr = LogisticRegression(max_iter=1000)
scores = cross_val_score(lr, X, y, cv=5)
print(f"{scores.mean():.1%} ± {scores.std():.1%}  ← near chance")
print("\nBy ch334, you will build a network that achieves >97% on this.")

Logistic regression sits at roughly 50%: chance level. The spiral boundary is nonlinear in a way that no fixed, hand-written feature expansion of low degree captures.

The solution is a learned hierarchy of features — which is exactly what a multilayer network learns. Part X builds that understanding from the perceptron up.
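
As a preview of what "learned hierarchy" means mechanically, here is a minimal sketch (hypothetical sizes and hyperparameters, with XOR, the smallest linearly inseparable problem, standing in for the spirals): a hidden layer learns features under which a linear readout becomes sufficient.

```python
import numpy as np

# XOR: the smallest dataset no linear model can separate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1 = rng.normal(0, 1.0, (2, 8)); b1 = np.zeros(8)   # layer 1: learned features
W2 = rng.normal(0, 1.0, (8, 1)); b2 = np.zeros(1)   # layer 2: linear readout

sigmoid = lambda z: 1 / (1 + np.exp(-z))
for _ in range(10000):
    h = np.tanh(X @ W1 + b1)           # hidden features (learned, not hand-written)
    p = sigmoid(h @ W2 + b2)           # predicted probability of class 1
    # Backprop: d(BCE)/d(logit) = p - y, then the chain rule through each layer
    d_logit = (p - y) / len(X)
    dW2 = h.T @ d_logit; db2 = d_logit.sum(0)
    d_h = d_logit @ W2.T * (1 - h**2)  # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_h; db1 = d_h.sum(0)
    for P, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        P -= 0.5 * g                   # gradient step, lr = 0.5

pred = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int).ravel()
print(pred)  # should recover XOR: [0 1 1 0]
```

The forward pass, loss, backprop, and update rule used here are exactly the topics of ch303–ch306, derived in full.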

Proceed to ch301 — What is Deep Learning?