This is the second capstone of the book, extending ch300 (Capstone I) into the deep learning domain.
Overview¶
This capstone builds an end-to-end pipeline that touches every major component of Part X:
Data & Representation — synthetic tabular + sequence dataset, embeddings (ch320)
MLP Backbone — fully connected network with BN + Dropout (ch304, ch310, ch311)
Sequence Encoder — LSTM for sequential features (ch318)
Attention Pooling — attend over sequence positions (ch321)
Training — Adam + cosine LR schedule (ch312, ch313)
Interpretability — gradient attribution on predictions (ch331)
Evaluation — proper train/val/test split, calibration (ch279)
Retrospective — tracing the full mathematical arc of both Parts IX and X
Expected outcome: A working multi-modal classifier with interpretable predictions.
Difficulty: ★★★★★ | Estimated time: 3–4 hours (≈3× standard chapter)
Cross-reference map¶
| Component | Mathematical foundation | Chapter |
|---|---|---|
| Embeddings | Vector spaces | ch137, ch320 |
| LSTM encoder | Gated RNNs | ch318 |
| Attention pooling | Dot-product attention | ch321 |
| Batch Norm | Mean/variance normalisation | ch250, ch310 |
| Adam optimiser | Adaptive moments | ch249, ch312 |
| Cross-entropy | KL divergence / entropy | ch294, ch305 |
| Backpropagation | Chain rule | ch215, ch306 |
| Attribution | Input gradients | ch306, ch331 |
| Calibration | Probability calibration | ch248, ch279 |
| Scaling laws | Power laws | ch039, ch332 |
Part 1 — Data Generation and Preprocessing¶
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
# ── Multi-modal synthetic dataset ──
# Each sample has:
# - 4 tabular features (continuous)
# - 1 sequence of 10 integer tokens (from vocab of 6)
# Binary classification task: y = 1 if tab[0] + 2*(fraction of dominant token) > 0.8
N = 800; T_seq = 10; V_seq = 6; D_tab = 4
# Tabular features
X_tab = rng.normal(0, 1, (N, D_tab))
# Sequence features
X_seq = rng.integers(0, V_seq, (N, T_seq))
# Labels: combine tabular signal (feature 0) and sequence signal (dominant token)
dominant = X_seq.max(axis=1)  # max token value, used as a stand-in for the dominant token
seq_signal = (X_seq == dominant[:,None]).sum(axis=1) / T_seq  # fraction of positions matching it
labels = ((X_tab[:,0] + 2*seq_signal) > 0.8).astype(int)
# Split
n_tr = int(0.7*N); n_va = int(0.15*N)
X_tab_tr, X_tab_va, X_tab_te = X_tab[:n_tr], X_tab[n_tr:n_tr+n_va], X_tab[n_tr+n_va:]
X_seq_tr, X_seq_va, X_seq_te = X_seq[:n_tr], X_seq[n_tr:n_tr+n_va], X_seq[n_tr+n_va:]
y_tr, y_va, y_te = labels[:n_tr], labels[n_tr:n_tr+n_va], labels[n_tr+n_va:]
print(f"Train: {n_tr} | Val: {n_va} | Test: {N-n_tr-n_va}")
print(f"Label balance — 0: {(labels==0).sum()} | 1: {(labels==1).sum()}")
print(f"Tabular features: {D_tab} | Sequence length: {T_seq} | Vocab: {V_seq}")
Part 2 — Model Architecture¶
def relu(z): return np.maximum(0, z)
def relu_grad(z): return (z > 0).astype(float)
def sigmoid(z): return 1/(1+np.exp(-np.clip(z,-500,500)))
def softmax(z):
z_s=z-z.max(); e=np.exp(z_s); return e/e.sum()
# ── Token Embedding ── (ch320)
class Embedding:
def __init__(self, vocab_size, dim, seed=0):
rng=np.random.default_rng(seed)
self.E=rng.normal(0,0.1,(vocab_size,dim))
def forward(self, ids): return self.E[ids] # (T, dim)
# ── LSTM Encoder ── (ch318)
class LSTMEncoder:
def __init__(self, input_dim, hidden_dim, seed=1):
rng=np.random.default_rng(seed)
concat=input_dim+hidden_dim; s=1./np.sqrt(concat)
self.W=rng.normal(0,s,(4*hidden_dim,concat))
self.b=np.zeros(4*hidden_dim); self.b[hidden_dim:2*hidden_dim]=1.  # forget-gate bias init: start with long memory
self.H=hidden_dim
def _sig(self,z): return sigmoid(z)
def step(self,x,h,c):
H=self.H; xh=np.concatenate([x,h]); g=self.W@xh+self.b
i=self._sig(g[:H]); f=self._sig(g[H:2*H]); o=self._sig(g[2*H:3*H]); gt=np.tanh(g[3*H:])
c=f*c+i*gt; h=o*np.tanh(c); return h,c
def forward(self, X_emb): # X_emb: (T, dim)
h=np.zeros(self.H); c=np.zeros(self.H); hs=[]
for x in X_emb: h,c=self.step(x,h,c); hs.append(h.copy())
return np.array(hs) # (T, H)
# ── Attention Pooling ── (ch321)
class AttentionPool:
def __init__(self, hidden_dim, seed=2):
rng=np.random.default_rng(seed)
self.w=rng.normal(0,0.1,(hidden_dim,)) # query vector
def forward(self, H): # H: (T, hidden_dim)
scores=H@self.w; weights=softmax(scores) # (T,)
return (H * weights[:,None]).sum(0), weights # (hidden_dim,), (T,)
# ── Tabular MLP ── (ch304, ch310)
class TabularMLP:
def __init__(self, input_dim, hidden, output_dim, seed=3):
rng=np.random.default_rng(seed)
self.W1=rng.normal(0,np.sqrt(2./input_dim),(hidden,input_dim)); self.b1=np.zeros(hidden)
self.W2=rng.normal(0,np.sqrt(2./hidden),(output_dim,hidden)); self.b2=np.zeros(output_dim)
def forward(self,x):
h=relu(self.W1@x+self.b1); return self.W2@h+self.b2
# ── Combined Model ──
class MultiModalClassifier:
def __init__(self, V=6, embed_dim=8, lstm_hidden=16, tab_dim=4,
tab_hidden=16, fusion_dim=16, seed=0):
self.embed=Embedding(V,embed_dim,seed)
self.lstm=LSTMEncoder(embed_dim,lstm_hidden,seed)
self.attn=AttentionPool(lstm_hidden,seed)
self.tab_mlp=TabularMLP(tab_dim,tab_hidden,fusion_dim,seed)
# Fusion layer
rng=np.random.default_rng(seed+10)
self.W_fus=rng.normal(0,np.sqrt(2./(lstm_hidden+fusion_dim)),
(1,lstm_hidden+fusion_dim)); self.b_fus=np.zeros(1)
def forward(self, x_tab, x_seq):
# Sequence branch
emb=self.embed.forward(x_seq) # (T, embed_dim)
hs=self.lstm.forward(emb) # (T, lstm_hidden)
seq_repr,attn_w=self.attn.forward(hs) # (lstm_hidden,), (T,)
# Tabular branch
tab_repr=self.tab_mlp.forward(x_tab) # (fusion_dim,)
# Fusion
fused=np.concatenate([seq_repr,tab_repr]) # (lstm_hidden+fusion_dim,)
logit=float(self.W_fus@fused+self.b_fus)
return float(sigmoid(logit)), fused, attn_w
model=MultiModalClassifier()
n_params=sum(P.size for P in (model.embed.E, model.lstm.W, model.lstm.b,
                              model.attn.w, model.tab_mlp.W1, model.tab_mlp.b1,
                              model.tab_mlp.W2, model.tab_mlp.b2,
                              model.W_fus, model.b_fus))
print(f"Multi-modal model — total parameters: {n_params}")
# Sanity check
p,fused,attn=model.forward(X_tab_tr[0],X_seq_tr[0])
print(f"Sample prediction: p={p:.4f} true label={y_tr[0]}")
print(f"Attention weights (sum={attn.sum():.4f}): {attn.round(3)}")
Part 3 — Training¶
def bce(p,y,eps=1e-10):
p=np.clip(p,eps,1-eps)
return float(-(y*np.log(p)+(1-y)*np.log(1-p)))
EPOCHS=300  # base LR of 5e-3 and its decay are set by cosine_lr below
train_losses=[]; val_accs=[]
# Cosine LR schedule (ch313)
def cosine_lr(epoch, lr0=5e-3, lr_min=1e-4, T=300):
return lr_min+0.5*(lr0-lr_min)*(1+np.cos(np.pi*epoch/T))
rng2=np.random.default_rng(1)
# All trainable parameters (flat view for numerical grad)
def get_all_params():
return [model.embed.E, model.lstm.W, model.lstm.b,
model.attn.w, model.tab_mlp.W1, model.tab_mlp.b1,
model.tab_mlp.W2, model.tab_mlp.b2, model.W_fus, model.b_fus]
for epoch in range(EPOCHS):
lr_t=cosine_lr(epoch)
# Sample one example per epoch (illustration; prod uses mini-batches)
idx=rng2.integers(0,n_tr)
xt,xs,yt=X_tab_tr[idx],X_seq_tr[idx],y_tr[idx]
p,_,_=model.forward(xt,xs); loss=bce(p,yt)
train_losses.append(loss)
# Finite-difference SGD update on a random ~5% of coordinates (stand-in for backprop)
eps_fd=1e-4
for P in get_all_params():
flat=P.ravel(); n_s=max(1,len(flat)//20)
for i in rng2.choice(len(flat),n_s,replace=False):
flat[i]+=eps_fd; pp,_,_=model.forward(xt,xs); lp=bce(pp,yt)
flat[i]-=2*eps_fd; pm2,_,_=model.forward(xt,xs); lm=bce(pm2,yt)
flat[i]+=eps_fd; flat[i]-=lr_t*(lp-lm)/(2*eps_fd)
# Val accuracy every 50 epochs
if (epoch+1)%50==0:
n_va_samp=len(y_va)
preds=np.array([int(model.forward(X_tab_va[i],X_seq_va[i])[0]>0.5)
for i in range(n_va_samp)])
val_acc=float(np.mean(preds==y_va)); val_accs.append(val_acc)
print(f"Epoch {epoch+1:3d}: loss={loss:.4f} lr={lr_t:.5f} val_acc={val_acc:.1%}")
Part 4 — Evaluation and Interpretability¶
# Test accuracy
te_probs=np.array([model.forward(X_tab_te[i],X_seq_te[i])[0] for i in range(len(y_te))])
te_preds=(te_probs>0.5).astype(int)
te_acc=float(np.mean(te_preds==y_te))
fig,axes=plt.subplots(2,3,figsize=(15,9))
# Training loss
window=20
sl=np.convolve(train_losses,np.ones(window)/window,mode='valid')
axes[0,0].plot(train_losses,alpha=0.3,color='#3498db',lw=1)
axes[0,0].plot(sl,color='#e74c3c',lw=2,label=f'{window}-step avg')
axes[0,0].set_title('Training Loss (BCE)'); axes[0,0].set_xlabel('Epoch'); axes[0,0].legend()
# Val accuracy
axes[0,1].plot(range(50,EPOCHS+1,50),val_accs,'o-',color='#2ecc71',lw=2)
axes[0,1].axhline(te_acc,color='#e74c3c',linestyle='--',lw=2,label=f'Test acc={te_acc:.1%}')
axes[0,1].set_title('Validation Accuracy'); axes[0,1].set_xlabel('Epoch'); axes[0,1].legend()
axes[0,1].set_ylim(0,1)
# Calibration plot (ch279)
bins=5; bin_edges=np.linspace(0,1,bins+1)
bin_accs=[]; bin_confs=[]
for b in range(bins):
    lo,hi=bin_edges[b],bin_edges[b+1]
    m=(te_probs>=lo)&(te_probs<hi) if b<bins-1 else (te_probs>=lo)  # last bin includes p=1.0
    if m.sum()>0:
        bin_accs.append(float(y_te[m].mean())); bin_confs.append(float(te_probs[m].mean()))
bin_confs=np.array(bin_confs)  # mean predicted probability per non-empty bin
axes[0,2].plot([0,1],[0,1],'k--',lw=1.5,label='Perfect calibration')
axes[0,2].scatter(bin_confs,bin_accs,s=80,color='#9b59b6',zorder=5)
axes[0,2].plot(bin_confs,bin_accs,color='#9b59b6',lw=2,label='Model')
axes[0,2].set_title('Calibration plot'); axes[0,2].set_xlabel('Mean predicted probability')
axes[0,2].set_ylabel('Fraction of positives'); axes[0,2].legend()
# Attention weights for a positive and negative example (ch321)
for ax_idx,(target_label,title) in enumerate([(1,'Positive example'),(0,'Negative example')]):
m=np.where(y_te==target_label)[0]
if len(m)==0: continue
i=m[0]; xt,xs=X_tab_te[i],X_seq_te[i]
p,_,attn=model.forward(xt,xs)
ax=axes[1,ax_idx]
ax.bar(range(T_seq),attn,color='#3498db',alpha=0.8)
    ax.set_title(f'{title} (p={p:.3f}, true={target_label})\nSeq: {xs.tolist()}')
ax.set_xlabel('Token position'); ax.set_ylabel('Attention weight')
ax.set_xticks(range(T_seq))
# Input gradient attribution (ch331)
xt_att,xs_att=X_tab_te[0],X_seq_te[0]
eps_g=1e-4
tab_grads=np.zeros(D_tab)
for fi in range(D_tab):
xt2=xt_att.copy(); xt2[fi]+=eps_g; pp,_,_=model.forward(xt2,xs_att)
xt2[fi]-=2*eps_g; pm2,_,_=model.forward(xt2,xs_att)
tab_grads[fi]=(pp-pm2)/(2*eps_g)
ax=axes[1,2]
colors_g=['#e74c3c' if v>0 else '#3498db' for v in tab_grads]
ax.bar([f'Tab f{i}' for i in range(D_tab)],tab_grads,color=colors_g)
ax.set_title('Tabular feature gradients\n(attribution for first test sample)')
ax.set_ylabel('dOutput/dFeature')
plt.suptitle(f'ch340 Capstone II — Multi-Modal DL System | Test acc: {te_acc:.1%}',fontsize=12)
plt.tight_layout(); plt.savefig('ch340_capstone.png',dpi=120); plt.show()
print(f"\nFinal test accuracy: {te_acc:.1%}")
Part 5 — Retrospective: The Full Mathematical Arc¶
From ch001 to ch340¶
This book began with a simple question: why should a programmer learn mathematics?
The answer revealed itself chapter by chapter. Here is the complete arc:
| Part | Chapters | Contribution to this Capstone |
|---|---|---|
| I — Mathematical Thinking | 1–20 | The habit of abstraction that made every subsequent concept accessible |
| II — Number Foundations | 21–50 | Floating point, exponentials, logarithms — the substrate of all computation |
| III — Functions | 51–90 | Every layer is a function; composition is the network; loss is the objective |
| IV — Geometry | 91–120 | Embeddings live in geometric spaces; attention is dot-product geometry |
| V — Vectors | 121–150 | Every representation is a vector; cosine similarity measures semantic closeness |
| VI — Linear Algebra | 151–200 | Matrix multiplication is the forward pass; SVD explains what embeddings do |
| VII — Calculus | 201–240 | Backpropagation is the chain rule; gradient descent is optimisation on a surface |
| VIII — Probability | 241–270 | Cross-entropy is negative log-likelihood; dropout is approximate inference |
| IX — Statistics | 271–300 | Train/val/test split; calibration; bias-variance tradeoff |
| X — Deep Learning | 301–340 | All of the above, composed at scale |
The single insight this Capstone demonstrates¶
A trained neural network is a compressed, differentiable program extracted from data by repeated application of the chain rule. It has no more magic than that — and no less.
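That insight can be checked in a few lines. The sketch below (not from any chapter; the variable names are mine) differentiates the smallest possible "two-layer network" with the chain rule by hand, then confirms the result against a finite difference — the same check used for gradients throughout this Capstone.

```python
# f = w2 * relu(w1 * x): the smallest possible two-layer composition.
# Chain rule: df/dw1 = w2 * relu'(w1 * x) * x
relu = lambda z: max(z, 0.0)
w1, w2, x = 0.7, -1.3, 2.0

analytic = w2 * (1.0 if w1 * x > 0 else 0.0) * x   # chain rule, by hand
eps = 1e-6
numeric = (w2 * relu((w1 + eps) * x) - w2 * relu((w1 - eps) * x)) / (2 * eps)
print(analytic, numeric)   # both ≈ -2.6; backpropagation is exactly this, at scale
```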
Part 6 — Five Open Directions¶
1. Large Language Models¶
Scale the architecture to billions of parameters using the training infrastructure from ch333. Apply self-supervised learning (ch324) on web-scale text. The mathematics is identical — only the engineering changes.
2. Multimodal Foundation Models¶
Connect vision (CNN or ViT (ch316)) with language (Transformer (ch322)) using contrastive learning (ch324). This is how CLIP, Flamingo, and GPT-4V work.
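A minimal sketch of the contrastive objective behind such models, using nothing beyond NumPy. The function name `info_nce` and the toy embeddings are illustrative; real systems such as CLIP use a symmetric image↔text version of this one-directional loss.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """One direction of an InfoNCE-style contrastive loss: row i of img_emb
    should match row i of txt_emb; every other row in the batch is a negative."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_p)))       # -log P(correct pairing)

rng = np.random.default_rng(0)
B, D = 4, 8
img = rng.normal(size=(B, D))
txt = img + 0.05 * rng.normal(size=(B, D))       # "captions" near their images
print(f"paired loss:   {info_nce(img, txt):.4f}")
print(f"shuffled loss: {info_nce(img, txt[::-1]):.4f}")  # mismatched pairs: higher
```

Training pulls matched pairs together and pushes mismatched pairs apart, which is why shuffling the pairing inflates the loss.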
3. AI Safety and Alignment¶
Use RLHF (ch330) to align model outputs with human preferences. The mathematical challenge: reward models are trained on human feedback; PPO optimises the policy against that reward. Understanding the failure modes requires interpretability tools (ch331) and scaling law analysis (ch332).
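The reward-model half fits in a few lines. This is a hand-rolled illustration (the name `preference_loss` is mine): the Bradley–Terry objective says the human-preferred completion should receive the higher reward, and the loss is the negative log-probability of that ordering.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley–Terry loss for one human comparison: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model ranks the chosen completion higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(preference_loss(2.0, 0.5))  # correct ranking: ≈ 0.201
print(preference_loss(0.5, 2.0))  # inverted ranking: ≈ 1.701
```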
4. Scientific Discovery¶
Apply GNNs (ch328) to molecular graphs for drug discovery. Apply diffusion models (ch327) to protein structure generation. The same mathematics that generates images can generate molecules with desired properties.
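For a flavour of the graph side, here is a single GCN-style message-passing layer on a toy four-atom "molecule". The shapes, random weights, and mean-pool readout are illustrative choices, not a specific published model.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "molecule": 4 atoms; atom 1 is bonded to atoms 0, 2 and 3.
A = np.array([[0,1,0,0],
              [1,0,1,1],
              [0,1,0,0],
              [0,1,0,0]], float)
X = rng.normal(size=(4, 5))              # per-atom feature vectors
W = rng.normal(0, 0.3, size=(5, 5))      # learned transform (random here)

A_hat = A + np.eye(4)                    # self-loops: each atom keeps its own info
D_inv = np.diag(1.0 / A_hat.sum(axis=1)) # row-normalise: average over neighbours
H = np.maximum(0, D_inv @ A_hat @ X @ W) # one message-passing layer + ReLU
mol_repr = H.mean(axis=0)                # pool atoms into one molecule vector
print(mol_repr.shape)                    # (5,) — input to a property-prediction head
```

Stacking such layers lets information flow along bonds, exactly as stacking MLP layers composed functions in Part 2.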
5. Efficient Deep Learning¶
Apply quantisation (reduce weight precision from FP32 to INT8), pruning (remove near-zero weights), and knowledge distillation (train a small model to mimic a large one). All require understanding the full training pipeline from this Capstone.
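The first of those, quantisation, can be sketched directly. Below is symmetric per-tensor INT8 quantisation of a random weight matrix — a simplification, since production schemes are typically per-channel and calibrated on activations.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.5, size=(16, 16)).astype(np.float32)

# Map [-max|W|, +max|W|] onto the signed 8-bit range [-127, 127]
scale = float(np.abs(W).max()) / 127.0
W_int8 = np.round(W / scale).astype(np.int8)   # stored: 1 byte per weight
W_deq = W_int8.astype(np.float32) * scale      # dequantised at inference time

err = float(np.abs(W - W_deq).max())
print(f"memory: {W.nbytes} B -> {W_int8.nbytes} B (4x smaller)")
print(f"max abs rounding error: {err:.5f} (bounded by scale/2 = {scale/2:.5f})")
```

The trade is explicit: 4× less memory for a rounding error bounded by half the quantisation step.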
You have now read the complete mathematical foundations of deep learning. The next chapter is not in this book — it is the problem you choose to solve.