This is the second capstone of the book, extending ch300 (Capstone I) into the deep learning domain.
Overview¶
This capstone builds an end-to-end pipeline that touches every major component of Part X:
Data & Representation — synthetic tabular + sequence dataset, embeddings (ch320)
MLP Backbone — fully connected network with BN + Dropout (ch304, ch310, ch311)
Sequence Encoder — LSTM for sequential features (ch318)
Attention Pooling — attend over sequence positions (ch321)
Training — Adam + cosine LR schedule (ch312, ch313)
Interpretability — gradient attribution on predictions (ch331)
Evaluation — proper train/val/test split, calibration (ch279)
Retrospective — tracing the full mathematical arc of both Parts IX and X
Expected outcome: A working multi-modal classifier with interpretable predictions.
Difficulty: ★★★★★ | Estimated time: 3–4 hours (≈3× standard chapter)
Cross-reference map¶
| Component | Mathematical foundation | Chapter |
|---|---|---|
| Embeddings | Vector spaces | ch137, ch320 |
| LSTM encoder | Gated RNNs | ch318 |
| Attention pooling | Dot-product attention | ch321 |
| Batch Norm | Mean/variance normalisation | ch250, ch310 |
| Adam optimiser | Adaptive moments | ch249, ch312 |
| Cross-entropy | KL divergence / entropy | ch294, ch305 |
| Backpropagation | Chain rule | ch215, ch306 |
| Attribution | Input gradients | ch306, ch331 |
| Calibration | Probability calibration | ch248, ch279 |
| Scaling laws | Power laws | ch039, ch332 |
Part 1 — Data Generation and Preprocessing¶
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
# ── Multi-modal synthetic dataset ──
# Each sample has:
# - 4 tabular features (continuous)
# - 1 sequence of 10 integer tokens (from vocab of 6)
# Binary classification task: y = 1 if tab[0] + 2*(fraction of dominant token) > 0.8
N = 800; T_seq = 10; V_seq = 6; D_tab = 4
# Tabular features
X_tab = rng.normal(0, 1, (N, D_tab))
# Sequence features
X_seq = rng.integers(0, V_seq, (N, T_seq))
# Labels: combine tabular signal (feature 0) and sequence signal (dominant token)
dominant = X_seq.max(axis=1)  # max token value, used as a stand-in for the dominant token
seq_signal = (X_seq == dominant[:,None]).sum(axis=1) / T_seq  # fraction of positions matching it
labels = ((X_tab[:,0] + 2*seq_signal) > 0.8).astype(int)
# Split
n_tr = int(0.7*N); n_va = int(0.15*N)
X_tab_tr, X_tab_va, X_tab_te = X_tab[:n_tr], X_tab[n_tr:n_tr+n_va], X_tab[n_tr+n_va:]
X_seq_tr, X_seq_va, X_seq_te = X_seq[:n_tr], X_seq[n_tr:n_tr+n_va], X_seq[n_tr+n_va:]
y_tr, y_va, y_te = labels[:n_tr], labels[n_tr:n_tr+n_va], labels[n_tr+n_va:]
print(f"Train: {n_tr} | Val: {n_va} | Test: {N-n_tr-n_va}")
print(f"Label balance — 0: {(labels==0).sum()} | 1: {(labels==1).sum()}")
print(f"Tabular features: {D_tab} | Sequence length: {T_seq} | Vocab: {V_seq}")
Part 2 — Model Architecture¶
def relu(z): return np.maximum(0, z)
def relu_grad(z): return (z > 0).astype(float)
def sigmoid(z): return 1/(1+np.exp(-np.clip(z,-500,500)))
def softmax(z):
z_s=z-z.max(); e=np.exp(z_s); return e/e.sum()
# ── Token Embedding ── (ch320)
class Embedding:
def __init__(self, vocab_size, dim, seed=0):
rng=np.random.default_rng(seed)
self.E=rng.normal(0,0.1,(vocab_size,dim))
def forward(self, ids): return self.E[ids] # (T, dim)
# ── LSTM Encoder ── (ch318)
class LSTMEncoder:
def __init__(self, input_dim, hidden_dim, seed=1):
rng=np.random.default_rng(seed)
concat=input_dim+hidden_dim; s=1./np.sqrt(concat)
self.W=rng.normal(0,s,(4*hidden_dim,concat))
self.b=np.zeros(4*hidden_dim); self.b[hidden_dim:2*hidden_dim]=1.  # forget-gate bias init: start with long memory
self.H=hidden_dim
def _sig(self,z): return sigmoid(z)
def step(self,x,h,c):
H=self.H; xh=np.concatenate([x,h]); g=self.W@xh+self.b
i=self._sig(g[:H]); f=self._sig(g[H:2*H]); o=self._sig(g[2*H:3*H]); gt=np.tanh(g[3*H:])
c=f*c+i*gt; h=o*np.tanh(c); return h,c
def forward(self, X_emb): # X_emb: (T, dim)
h=np.zeros(self.H); c=np.zeros(self.H); hs=[]
for x in X_emb: h,c=self.step(x,h,c); hs.append(h.copy())
return np.array(hs) # (T, H)
# ── Attention Pooling ── (ch321)
class AttentionPool:
def __init__(self, hidden_dim, seed=2):
rng=np.random.default_rng(seed)
self.w=rng.normal(0,0.1,(hidden_dim,)) # query vector
def forward(self, H): # H: (T, hidden_dim)
scores=H@self.w; weights=softmax(scores) # (T,)
return (H * weights[:,None]).sum(0), weights # (hidden_dim,), (T,)
# ── Tabular MLP ── (ch304, ch310)
class TabularMLP:
def __init__(self, input_dim, hidden, output_dim, seed=3):
rng=np.random.default_rng(seed)
self.W1=rng.normal(0,np.sqrt(2./input_dim),(hidden,input_dim)); self.b1=np.zeros(hidden)
self.W2=rng.normal(0,np.sqrt(2./hidden),(output_dim,hidden)); self.b2=np.zeros(output_dim)
def forward(self,x):
h=relu(self.W1@x+self.b1); return self.W2@h+self.b2
# ── Combined Model ──
class MultiModalClassifier:
def __init__(self, V=6, embed_dim=8, lstm_hidden=16, tab_dim=4,
tab_hidden=16, fusion_dim=16, seed=0):
self.embed=Embedding(V,embed_dim,seed)
self.lstm=LSTMEncoder(embed_dim,lstm_hidden,seed)
self.attn=AttentionPool(lstm_hidden,seed)
self.tab_mlp=TabularMLP(tab_dim,tab_hidden,fusion_dim,seed)
# Fusion layer
rng=np.random.default_rng(seed+10)
self.W_fus=rng.normal(0,np.sqrt(2./(lstm_hidden+fusion_dim)),
(1,lstm_hidden+fusion_dim)); self.b_fus=np.zeros(1)
def forward(self, x_tab, x_seq):
# Sequence branch
emb=self.embed.forward(x_seq) # (T, embed_dim)
hs=self.lstm.forward(emb) # (T, lstm_hidden)
seq_repr,attn_w=self.attn.forward(hs) # (lstm_hidden,), (T,)
# Tabular branch
tab_repr=self.tab_mlp.forward(x_tab) # (fusion_dim,)
# Fusion
fused=np.concatenate([seq_repr,tab_repr]) # (lstm_hidden+fusion_dim,)
logit=float(self.W_fus@fused+self.b_fus)
return float(sigmoid(logit)), fused, attn_w
model=MultiModalClassifier()
n_params=sum(P.size for P in (model.embed.E, model.lstm.W, model.lstm.b,
                              model.attn.w, model.tab_mlp.W1, model.tab_mlp.b1,
                              model.tab_mlp.W2, model.tab_mlp.b2,
                              model.W_fus, model.b_fus))
print(f"Multi-modal model — total parameters: {n_params}")
# Sanity check
p,fused,attn=model.forward(X_tab_tr[0],X_seq_tr[0])
print(f"Sample prediction: p={p:.4f} true label={y_tr[0]}")
print(f"Attention weights (sum={attn.sum():.4f}): {attn.round(3)}")
Part 3 — Training¶
def bce(p,y,eps=1e-10):
p=np.clip(p,eps,1-eps)
return float(-(y*np.log(p)+(1-y)*np.log(1-p)))
EPOCHS=300  # base LR of 5e-3 and its decay are set by cosine_lr below
train_losses=[]; val_accs=[]
# Cosine LR schedule (ch313)
def cosine_lr(epoch, lr0=5e-3, lr_min=1e-4, T=300):
return lr_min+0.5*(lr0-lr_min)*(1+np.cos(np.pi*epoch/T))
rng2=np.random.default_rng(1)
# All trainable parameters (flat view for numerical grad)
def get_all_params():
return [model.embed.E, model.lstm.W, model.lstm.b,
model.attn.w, model.tab_mlp.W1, model.tab_mlp.b1,
model.tab_mlp.W2, model.tab_mlp.b2, model.W_fus, model.b_fus]
for epoch in range(EPOCHS):
lr_t=cosine_lr(epoch)
# Sample one example per epoch (illustration; prod uses mini-batches)
idx=rng2.integers(0,n_tr)
xt,xs,yt=X_tab_tr[idx],X_seq_tr[idx],y_tr[idx]
p,_,_=model.forward(xt,xs); loss=bce(p,yt)
train_losses.append(loss)
# Finite-difference SGD update on a random ~5% of coordinates (stand-in for backprop)
eps_fd=1e-4
for P in get_all_params():
flat=P.ravel(); n_s=max(1,len(flat)//20)
for i in rng2.choice(len(flat),n_s,replace=False):
flat[i]+=eps_fd; pp,_,_=model.forward(xt,xs); lp=bce(pp,yt)
flat[i]-=2*eps_fd; pm2,_,_=model.forward(xt,xs); lm=bce(pm2,yt)
flat[i]+=eps_fd; flat[i]-=lr_t*(lp-lm)/(2*eps_fd)
# Val accuracy every 50 epochs
if (epoch+1)%50==0:
n_va_samp=len(y_va)
preds=np.array([int(model.forward(X_tab_va[i],X_seq_va[i])[0]>0.5)
for i in range(n_va_samp)])
val_acc=float(np.mean(preds==y_va)); val_accs.append(val_acc)
print(f"Epoch {epoch+1:3d}: loss={loss:.4f} lr={lr_t:.5f} val_acc={val_acc:.1%}")
Part 4 — Evaluation and Interpretability¶
# Test accuracy
te_probs=np.array([model.forward(X_tab_te[i],X_seq_te[i])[0] for i in range(len(y_te))])
te_preds=(te_probs>0.5).astype(int)
te_acc=float(np.mean(te_preds==y_te))
fig,axes=plt.subplots(2,3,figsize=(15,9))
# Training loss
window=20
sl=np.convolve(train_losses,np.ones(window)/window,mode='valid')
axes[0,0].plot(train_losses,alpha=0.3,color='#3498db',lw=1)
axes[0,0].plot(sl,color='#e74c3c',lw=2,label=f'{window}-step avg')
axes[0,0].set_title('Training Loss (BCE)'); axes[0,0].set_xlabel('Epoch'); axes[0,0].legend()
# Val accuracy
axes[0,1].plot(range(50,EPOCHS+1,50),val_accs,'o-',color='#2ecc71',lw=2)
axes[0,1].axhline(te_acc,color='#e74c3c',linestyle='--',lw=2,label=f'Test acc={te_acc:.1%}')
axes[0,1].set_title('Validation Accuracy'); axes[0,1].set_xlabel('Epoch'); axes[0,1].legend()
axes[0,1].set_ylim(0,1)
# Calibration plot (ch279)
bins=5; bin_edges=np.linspace(0,1,bins+1)
bin_accs=[]; bin_confs=[]
for b in range(bins):
    lo,hi=bin_edges[b],bin_edges[b+1]
    m=(te_probs>=lo)&(te_probs<hi) if b<bins-1 else (te_probs>=lo)  # last bin includes p=1.0
    if m.sum()>0:
        bin_accs.append(float(y_te[m].mean())); bin_confs.append(float(te_probs[m].mean()))
bin_confs=np.array(bin_confs)  # mean predicted probability per non-empty bin
axes[0,2].plot([0,1],[0,1],'k--',lw=1.5,label='Perfect calibration')
axes[0,2].scatter(bin_confs,bin_accs,s=80,color='#9b59b6',zorder=5)
axes[0,2].plot(bin_confs,bin_accs,color='#9b59b6',lw=2,label='Model')
axes[0,2].set_title('Calibration plot'); axes[0,2].set_xlabel('Mean predicted probability')
axes[0,2].set_ylabel('Fraction of positives'); axes[0,2].legend()
# Attention weights for a positive and negative example (ch321)
for ax_idx,(target_label,title) in enumerate([(1,'Positive example'),(0,'Negative example')]):
m=np.where(y_te==target_label)[0]
if len(m)==0: continue
i=m[0]; xt,xs=X_tab_te[i],X_seq_te[i]
p,_,attn=model.forward(xt,xs)
ax=axes[1,ax_idx]
ax.bar(range(T_seq),attn,color='#3498db',alpha=0.8)
    ax.set_title(f'{title} (p={p:.3f}, true={target_label})\nSeq: {xs.tolist()}')
ax.set_xlabel('Token position'); ax.set_ylabel('Attention weight')
ax.set_xticks(range(T_seq))
# Input gradient attribution (ch331)
xt_att,xs_att=X_tab_te[0],X_seq_te[0]
eps_g=1e-4
tab_grads=np.zeros(D_tab)
for fi in range(D_tab):
xt2=xt_att.copy(); xt2[fi]+=eps_g; pp,_,_=model.forward(xt2,xs_att)
xt2[fi]-=2*eps_g; pm2,_,_=model.forward(xt2,xs_att)
tab_grads[fi]=(pp-pm2)/(2*eps_g)
ax=axes[1,2]
colors_g=['#e74c3c' if v>0 else '#3498db' for v in tab_grads]
ax.bar([f'Tab f{i}' for i in range(D_tab)],tab_grads,color=colors_g)
ax.set_title('Tabular feature gradients\n(attribution for first test sample)')
ax.set_ylabel('dOutput/dFeature')
plt.suptitle(f'ch340 Capstone II — Multi-Modal DL System | Test acc: {te_acc:.1%}',fontsize=12)
plt.tight_layout(); plt.savefig('ch340_capstone.png',dpi=120); plt.show()
print(f"\nFinal test accuracy: {te_acc:.1%}")
Part 5 — Retrospective: The Full Mathematical Arc¶
From ch001 to ch340¶
This book began with a simple question: why should a programmer learn mathematics?
The answer revealed itself chapter by chapter. Here is the complete arc:
| Part | Chapters | Contribution to this Capstone |
|---|---|---|
| I — Mathematical Thinking | 1–20 | The habit of abstraction that made every subsequent concept accessible |
| II — Number Foundations | 21–50 | Floating point, exponentials, logarithms — the substrate of all computation |
| III — Functions | 51–90 | Every layer is a function; composition is the network; loss is the objective |
| IV — Geometry | 91–120 | Embeddings live in geometric spaces; attention is dot-product geometry |
| V — Vectors | 121–150 | Every representation is a vector; cosine similarity measures semantic closeness |
| VI — Linear Algebra | 151–200 | Matrix multiplication is the forward pass; SVD explains what embeddings do |
| VII — Calculus | 201–240 | Backpropagation is the chain rule; gradient descent is optimisation on a surface |
| VIII — Probability | 241–270 | Cross-entropy is negative log-likelihood; dropout is approximate inference |
| IX — Statistics | 271–300 | Train/val/test split; calibration; bias-variance tradeoff |
| X — Deep Learning | 301–340 | All of the above, composed at scale |
The single insight this Capstone demonstrates¶
A trained neural network is a compressed, differentiable program extracted from data by repeated application of the chain rule. It has no more magic than that — and no less.
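That insight can be checked in a few lines. The sketch below (not from any chapter; the variable names are mine) differentiates the smallest possible "two-layer network" with the chain rule by hand, then confirms the result against a finite difference — the same check used for gradients throughout this Capstone.

```python
# f = w2 * relu(w1 * x): the smallest possible two-layer composition.
# Chain rule: df/dw1 = w2 * relu'(w1 * x) * x
relu = lambda z: max(z, 0.0)
w1, w2, x = 0.7, -1.3, 2.0

analytic = w2 * (1.0 if w1 * x > 0 else 0.0) * x   # chain rule, by hand
eps = 1e-6
numeric = (w2 * relu((w1 + eps) * x) - w2 * relu((w1 - eps) * x)) / (2 * eps)
print(analytic, numeric)   # both ≈ -2.6; backpropagation is exactly this, at scale
```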
Part 6 — Five Open Directions¶
1. Large Language Models¶
Scale the architecture to billions of parameters using the training infrastructure from ch333. Apply self-supervised learning (ch324) on web-scale text. The mathematics is identical — only the engineering changes.
2. Multimodal Foundation Models¶
Connect vision (CNN or ViT (ch316)) with language (Transformer (ch322)) using contrastive learning (ch324). This is how CLIP, Flamingo, and GPT-4V work.
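A minimal sketch of the contrastive objective behind such models, using nothing beyond NumPy. The function name `info_nce` and the toy embeddings are illustrative; real systems such as CLIP use a symmetric image↔text version of this one-directional loss.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """One direction of an InfoNCE-style contrastive loss: row i of img_emb
    should match row i of txt_emb; every other row in the batch is a negative."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_p)))       # -log P(correct pairing)

rng = np.random.default_rng(0)
B, D = 4, 8
img = rng.normal(size=(B, D))
txt = img + 0.05 * rng.normal(size=(B, D))       # "captions" near their images
print(f"paired loss:   {info_nce(img, txt):.4f}")
print(f"shuffled loss: {info_nce(img, txt[::-1]):.4f}")  # mismatched pairs: higher
```

Training pulls matched pairs together and pushes mismatched pairs apart, which is why shuffling the pairing inflates the loss.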
3. AI Safety and Alignment¶
Use RLHF (ch330) to align model outputs with human preferences. The mathematical challenge: reward models are trained on human feedback; PPO optimises the policy against that reward. Understanding the failure modes requires interpretability tools (ch331) and scaling law analysis (ch332).
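The reward-model half fits in a few lines. This is a hand-rolled illustration (the name `preference_loss` is mine): the Bradley–Terry objective says the human-preferred completion should receive the higher reward, and the loss is the negative log-probability of that ordering.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley–Terry loss for one human comparison: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model ranks the chosen completion higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(preference_loss(2.0, 0.5))  # correct ranking: ≈ 0.201
print(preference_loss(0.5, 2.0))  # inverted ranking: ≈ 1.701
```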
4. Scientific Discovery¶
Apply GNNs (ch328) to molecular graphs for drug discovery. Apply diffusion models (ch327) to protein structure generation. The same mathematics that generates images can generate molecules with desired properties.
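For a flavour of the graph side, here is a single GCN-style message-passing layer on a toy four-atom "molecule". The shapes, random weights, and mean-pool readout are illustrative choices, not a specific published model.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "molecule": 4 atoms; atom 1 is bonded to atoms 0, 2 and 3.
A = np.array([[0,1,0,0],
              [1,0,1,1],
              [0,1,0,0],
              [0,1,0,0]], float)
X = rng.normal(size=(4, 5))              # per-atom feature vectors
W = rng.normal(0, 0.3, size=(5, 5))      # learned transform (random here)

A_hat = A + np.eye(4)                    # self-loops: each atom keeps its own info
D_inv = np.diag(1.0 / A_hat.sum(axis=1)) # row-normalise: average over neighbours
H = np.maximum(0, D_inv @ A_hat @ X @ W) # one message-passing layer + ReLU
mol_repr = H.mean(axis=0)                # pool atoms into one molecule vector
print(mol_repr.shape)                    # (5,) — input to a property-prediction head
```

Stacking such layers lets information flow along bonds, exactly as stacking MLP layers composed functions in Part 2.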
5. Efficient Deep Learning¶
Apply quantisation (reduce weight precision from FP32 to INT8), pruning (remove near-zero weights), and knowledge distillation (train a small model to mimic a large one). All require understanding the full training pipeline from this Capstone.
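The first of those, quantisation, can be sketched directly. Below is symmetric per-tensor INT8 quantisation of a random weight matrix — a simplification, since production schemes are typically per-channel and calibrated on activations.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.5, size=(16, 16)).astype(np.float32)

# Map [-max|W|, +max|W|] onto the signed 8-bit range [-127, 127]
scale = float(np.abs(W).max()) / 127.0
W_int8 = np.round(W / scale).astype(np.int8)   # stored: 1 byte per weight
W_deq = W_int8.astype(np.float32) * scale      # dequantised at inference time

err = float(np.abs(W - W_deq).max())
print(f"memory: {W.nbytes} B -> {W_int8.nbytes} B (4x smaller)")
print(f"max abs rounding error: {err:.5f} (bounded by scale/2 = {scale/2:.5f})")
```

The trade is explicit: 4× less memory for a rounding error bounded by half the quantisation step.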
You have now read the complete mathematical foundations of deep learning. The next chapter is not in this book — it is the problem you choose to solve.