
ch340 — Capstone II: End-to-End Deep Learning System

This is the second capstone of the book, extending ch300 (Capstone I) into the deep learning domain.

Overview

This capstone builds an end-to-end pipeline that touches every major component of Part X:

  1. Data & Representation — synthetic tabular + sequence dataset, embeddings (ch320)

  2. MLP Backbone — fully connected network with BN + Dropout (ch304, ch310, ch311)

  3. Sequence Encoder — LSTM for sequential features (ch318)

  4. Attention Pooling — attend over sequence positions (ch321)

  5. Training — Adam + cosine LR schedule (ch312, ch313)

  6. Interpretability — gradient attribution on predictions (ch331)

  7. Evaluation — proper train/val/test split, calibration (ch279)

  8. Retrospective — tracing the full mathematical arc of both Parts IX and X

Expected outcome: A working multi-modal classifier with interpretable predictions.

Difficulty: ★★★★★ | Estimated time: 3–4 hours (≈3× standard chapter)


Cross-reference map

| Component | Mathematical foundation | Chapter |
| --- | --- | --- |
| Embeddings | Vector spaces | ch137, ch320 |
| LSTM encoder | Gated RNNs | ch318 |
| Attention pooling | Dot-product attention | ch321 |
| Batch Norm | Mean/variance normalisation | ch250, ch310 |
| Adam optimiser | Adaptive moments | ch249, ch312 |
| Cross-entropy | KL divergence / entropy | ch294, ch305 |
| Backpropagation | Chain rule | ch215, ch306 |
| Attribution | Input gradients | ch306, ch331 |
| Calibration | Probability calibration | ch248, ch279 |
| Scaling laws | Power laws | ch039, ch332 |

Part 1 — Data Generation and Preprocessing

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# ── Multi-modal synthetic dataset ──
# Each sample has:
#   - 4 tabular features (continuous)
#   - 1 sequence of 10 integer tokens (from vocab of 6)
# Binary classification task: y = 1 if tab[0] + sum(seq==dominant_token) > threshold

N = 800; T_seq = 10; V_seq = 6; D_tab = 4

# Tabular features
X_tab = rng.normal(0, 1, (N, D_tab))

# Sequence features
X_seq = rng.integers(0, V_seq, (N, T_seq))

# Labels: combine tabular signal (feature 0) and sequence signal (dominant token)
dominant = X_seq.max(axis=1)  # maximum token value (a cheap proxy for the "dominant" token, not the mode)
seq_signal = (X_seq == dominant[:,None]).sum(axis=1) / T_seq
labels = ((X_tab[:,0] + 2*seq_signal) > 0.8).astype(int)

# Split
n_tr = int(0.7*N); n_va = int(0.15*N)
X_tab_tr, X_tab_va, X_tab_te = X_tab[:n_tr], X_tab[n_tr:n_tr+n_va], X_tab[n_tr+n_va:]
X_seq_tr, X_seq_va, X_seq_te = X_seq[:n_tr], X_seq[n_tr:n_tr+n_va], X_seq[n_tr+n_va:]
y_tr, y_va, y_te = labels[:n_tr], labels[n_tr:n_tr+n_va], labels[n_tr+n_va:]

print(f"Train: {n_tr} | Val: {n_va} | Test: {N-n_tr-n_va}")
print(f"Label balance — 0: {(labels==0).sum()} | 1: {(labels==1).sum()}")
print(f"Tabular features: {D_tab} | Sequence length: {T_seq} | Vocab: {V_seq}")

Part 2 — Model Architecture

def relu(z): return np.maximum(0, z)
def relu_grad(z): return (z > 0).astype(float)
def sigmoid(z): return 1/(1+np.exp(-np.clip(z,-500,500)))
def softmax(z):
    z_s=z-z.max(); e=np.exp(z_s); return e/e.sum()

# ── Token Embedding ──  (ch320)
class Embedding:
    def __init__(self, vocab_size, dim, seed=0):
        rng=np.random.default_rng(seed)
        self.E=rng.normal(0,0.1,(vocab_size,dim))
    def forward(self, ids): return self.E[ids]  # (T, dim)

# ── LSTM Encoder ──  (ch318)
class LSTMEncoder:
    def __init__(self, input_dim, hidden_dim, seed=1):
        rng=np.random.default_rng(seed)
        concat=input_dim+hidden_dim; s=1./np.sqrt(concat)
        self.W=rng.normal(0,s,(4*hidden_dim,concat))
        self.b=np.zeros(4*hidden_dim); self.b[hidden_dim:2*hidden_dim]=1.
        self.H=hidden_dim
    def _sig(self,z): return sigmoid(z)
    def step(self,x,h,c):
        H=self.H; xh=np.concatenate([x,h]); g=self.W@xh+self.b
        i=self._sig(g[:H]); f=self._sig(g[H:2*H]); o=self._sig(g[2*H:3*H]); gt=np.tanh(g[3*H:])
        c=f*c+i*gt; h=o*np.tanh(c); return h,c
    def forward(self, X_emb):  # X_emb: (T, dim)
        h=np.zeros(self.H); c=np.zeros(self.H); hs=[]
        for x in X_emb: h,c=self.step(x,h,c); hs.append(h.copy())
        return np.array(hs)  # (T, H)

# ── Attention Pooling ──  (ch321)
class AttentionPool:
    def __init__(self, hidden_dim, seed=2):
        rng=np.random.default_rng(seed)
        self.w=rng.normal(0,0.1,(hidden_dim,))  # query vector
    def forward(self, H):  # H: (T, hidden_dim)
        scores=H@self.w; weights=softmax(scores)  # (T,)
        return (H * weights[:,None]).sum(0), weights  # (hidden_dim,), (T,)

# ── Tabular MLP ──  (ch304, ch310)
class TabularMLP:
    def __init__(self, input_dim, hidden, output_dim, seed=3):
        rng=np.random.default_rng(seed)
        self.W1=rng.normal(0,np.sqrt(2./input_dim),(hidden,input_dim)); self.b1=np.zeros(hidden)
        self.W2=rng.normal(0,np.sqrt(2./hidden),(output_dim,hidden));   self.b2=np.zeros(output_dim)
    def forward(self,x):
        h=relu(self.W1@x+self.b1); return self.W2@h+self.b2

# ── Combined Model ──
class MultiModalClassifier:
    def __init__(self, V=6, embed_dim=8, lstm_hidden=16, tab_dim=4,
                 tab_hidden=16, fusion_dim=16, seed=0):
        # Offset the seeds so each branch gets an independent initialisation
        self.embed=Embedding(V,embed_dim,seed)
        self.lstm=LSTMEncoder(embed_dim,lstm_hidden,seed+1)
        self.attn=AttentionPool(lstm_hidden,seed+2)
        self.tab_mlp=TabularMLP(tab_dim,tab_hidden,fusion_dim,seed+3)
        # Fusion layer
        rng=np.random.default_rng(seed+10)
        self.W_fus=rng.normal(0,np.sqrt(2./(lstm_hidden+fusion_dim)),
                               (1,lstm_hidden+fusion_dim)); self.b_fus=np.zeros(1)

    def forward(self, x_tab, x_seq):
        # Sequence branch
        emb=self.embed.forward(x_seq)             # (T, embed_dim)
        hs=self.lstm.forward(emb)                  # (T, lstm_hidden)
        seq_repr,attn_w=self.attn.forward(hs)      # (lstm_hidden,), (T,)
        # Tabular branch
        tab_repr=self.tab_mlp.forward(x_tab)       # (fusion_dim,)
        # Fusion
        fused=np.concatenate([seq_repr,tab_repr])  # (lstm_hidden+fusion_dim,)
        logit=float((self.W_fus@fused+self.b_fus)[0])  # extract the scalar explicitly
        return sigmoid(logit), fused, attn_w

model=MultiModalClassifier()
n_params=(model.embed.E.size+model.lstm.W.size+model.lstm.b.size+
          model.attn.w.size+model.tab_mlp.W1.size+model.tab_mlp.W2.size+
          model.W_fus.size)
print(f"Multi-modal model — total parameters: {n_params}")

# Sanity check
p,fused,attn=model.forward(X_tab_tr[0],X_seq_tr[0])
print(f"Sample prediction: p={p:.4f}  true label={y_tr[0]}")
print(f"Attention weights (sum={attn.sum():.4f}): {attn.round(3)}")

Part 3 — Training

def bce(p,y,eps=1e-10):
    p=np.clip(p,eps,1-eps)
    return float(-(y*np.log(p)+(1-y)*np.log(1-p)))

EPOCHS=300
train_losses=[]; val_accs=[]

# Cosine LR schedule  (ch313)
def cosine_lr(epoch, lr0=5e-3, lr_min=1e-4, T=EPOCHS):
    return lr_min+0.5*(lr0-lr_min)*(1+np.cos(np.pi*epoch/T))

rng2=np.random.default_rng(1)

# All trainable parameters (flat view for numerical grad)
def get_all_params():
    return [model.embed.E, model.lstm.W, model.lstm.b,
            model.attn.w, model.tab_mlp.W1, model.tab_mlp.b1,
            model.tab_mlp.W2, model.tab_mlp.b2, model.W_fus, model.b_fus]

for epoch in range(EPOCHS):
    lr_t=cosine_lr(epoch)
    # Sample one example per epoch (illustration; prod uses mini-batches)
    idx=rng2.integers(0,n_tr)
    xt,xs,yt=X_tab_tr[idx],X_seq_tr[idx],y_tr[idx]
    p,_,_=model.forward(xt,xs); loss=bce(p,yt)
    train_losses.append(loss)

    # Numerical gradient update (sparse)
    eps_fd=1e-4
    for P in get_all_params():
        flat=P.ravel(); n_s=max(1,len(flat)//20)
        for i in rng2.choice(len(flat),n_s,replace=False):
            flat[i]+=eps_fd; pp,_,_=model.forward(xt,xs); lp=bce(pp,yt)
            flat[i]-=2*eps_fd; pm2,_,_=model.forward(xt,xs); lm=bce(pm2,yt)
            flat[i]+=eps_fd; flat[i]-=lr_t*(lp-lm)/(2*eps_fd)

    # Val accuracy every 50 epochs
    if (epoch+1)%50==0:
        n_va_samp=len(y_va)
        preds=np.array([int(model.forward(X_tab_va[i],X_seq_va[i])[0]>0.5)
                         for i in range(n_va_samp)])
        val_acc=float(np.mean(preds==y_va)); val_accs.append(val_acc)
        print(f"Epoch {epoch+1:3d}: loss={loss:.4f}  lr={lr_t:.5f}  val_acc={val_acc:.1%}")
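The loop above deliberately updates on one example at a time; as its comment notes, practical training uses mini-batches. A minimal sketch of the difference, on a stand-in linear model rather than the capstone network (all names here — `X`, `y`, `w`, `batch_loss` — are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem standing in for the capstone's training data
X = rng.normal(0, 1, (64, 4))
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ w_true + rng.normal(0, 0.1, 64)

w = np.zeros(4)
lr, batch_size = 0.1, 16

def batch_loss(w, Xb, yb):
    # Mean over the batch, so gradient magnitude is independent of batch size
    return float(np.mean((Xb @ w - yb) ** 2))

for step in range(200):
    idx = rng.choice(len(X), batch_size, replace=False)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size         # analytic MSE gradient
    w -= lr * grad

print(np.round(w, 2))  # close to w_true
```

Averaging the loss (and hence the gradient) over a batch reduces gradient variance at roughly the cost of one forward pass per batch element — the same trade-off applies to the capstone's finite-difference updates.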

Part 4 — Evaluation and Interpretability

# Test accuracy
te_preds=np.array([int(model.forward(X_tab_te[i],X_seq_te[i])[0]>0.5)
                    for i in range(len(y_te))])
te_probs=np.array([model.forward(X_tab_te[i],X_seq_te[i])[0] for i in range(len(y_te))])
te_acc=float(np.mean(te_preds==y_te))

fig,axes=plt.subplots(2,3,figsize=(15,9))

# Training loss
window=20
sl=np.convolve(train_losses,np.ones(window)/window,mode='valid')
axes[0,0].plot(train_losses,alpha=0.3,color='#3498db',lw=1)
axes[0,0].plot(sl,color='#e74c3c',lw=2,label=f'{window}-step avg')
axes[0,0].set_title('Training Loss (BCE)'); axes[0,0].set_xlabel('Epoch'); axes[0,0].legend()

# Val accuracy
axes[0,1].plot(range(50,EPOCHS+1,50),val_accs,'o-',color='#2ecc71',lw=2)
axes[0,1].axhline(te_acc,color='#e74c3c',linestyle='--',lw=2,label=f'Test acc={te_acc:.1%}')
axes[0,1].set_title('Validation Accuracy'); axes[0,1].set_xlabel('Epoch'); axes[0,1].legend()
axes[0,1].set_ylim(0,1)

# Calibration plot  (ch279)
bins=5; bin_edges=np.linspace(0,1,bins+1)
bin_accs=[]; bin_confs=[]
for b in range(bins):
    lo,hi=bin_edges[b],bin_edges[b+1]
    m=(te_probs>=lo)&((te_probs<=hi) if b==bins-1 else (te_probs<hi))  # last bin includes 1.0
    if m.sum()>0:
        bin_accs.append(float(y_te[m].mean())); bin_confs.append(float(te_probs[m].mean()))
bin_confs=np.array(bin_confs)
axes[0,2].plot([0,1],[0,1],'k--',lw=1.5,label='Perfect calibration')
axes[0,2].scatter(bin_confs,bin_accs,s=80,color='#9b59b6',zorder=5)
axes[0,2].plot(bin_confs,bin_accs,color='#9b59b6',lw=2,label='Model')
axes[0,2].set_title('Calibration plot'); axes[0,2].set_xlabel('Mean predicted probability')
axes[0,2].set_ylabel('Fraction of positives'); axes[0,2].legend()

# Attention weights for a positive and negative example  (ch321)
for ax_idx,(target_label,title) in enumerate([(1,'Positive example'),(0,'Negative example')]):
    m=np.where(y_te==target_label)[0]
    if len(m)==0: continue
    i=m[0]; xt,xs=X_tab_te[i],X_seq_te[i]
    p,_,attn=model.forward(xt,xs)
    ax=axes[1,ax_idx]
    ax.bar(range(T_seq),attn,color='#3498db',alpha=0.8)
    ax.set_title(f'{title} (p={p:.3f}, true={target_label})\nSeq: {xs.tolist()}')
    ax.set_xlabel('Token position'); ax.set_ylabel('Attention weight')
    ax.set_xticks(range(T_seq))

# Input gradient attribution  (ch331)
xt_att,xs_att=X_tab_te[0],X_seq_te[0]
eps_g=1e-4
tab_grads=np.zeros(D_tab)
for fi in range(D_tab):
    xt2=xt_att.copy(); xt2[fi]+=eps_g; pp,_,_=model.forward(xt2,xs_att)
    xt2[fi]-=2*eps_g; pm2,_,_=model.forward(xt2,xs_att)
    tab_grads[fi]=(pp-pm2)/(2*eps_g)

ax=axes[1,2]
colors_g=['#e74c3c' if v>0 else '#3498db' for v in tab_grads]
ax.bar([f'Tab f{i}' for i in range(D_tab)],tab_grads,color=colors_g)
ax.set_title('Tabular feature gradients\n(attribution for first test sample)')
ax.set_ylabel('dOutput/dFeature')

plt.suptitle(f'ch340 Capstone II — Multi-Modal DL System  |  Test acc: {te_acc:.1%}',fontsize=12)
plt.tight_layout(); plt.savefig('ch340_capstone.png',dpi=120); plt.show()
print(f"\nFinal test accuracy: {te_acc:.1%}")
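The calibration panel can be condensed into a single score, the expected calibration error (ECE). A minimal sketch with equal-width bins; the `ece` helper below is an illustration, not part of the chapter's code:

```python
import numpy as np

def ece(probs, labels, bins=5):
    """Expected calibration error: per-bin |accuracy - confidence|,
    weighted by the fraction of samples falling in that bin."""
    edges = np.linspace(0, 1, bins + 1)
    total = 0.0
    for b in range(bins):
        # Last bin is closed on the right so probability 1.0 is counted
        m = (probs >= edges[b]) & ((probs <= edges[b+1]) if b == bins-1 else (probs < edges[b+1]))
        if m.sum() == 0:
            continue
        acc = labels[m].mean()    # empirical accuracy in the bin
        conf = probs[m].mean()    # mean predicted probability in the bin
        total += (m.sum() / len(probs)) * abs(acc - conf)
    return float(total)

p = np.array([0.9, 0.9, 0.1, 0.1])
y = np.array([1, 1, 0, 0])
print(ece(p, y))  # ≈ 0.1: each bin's confidence is 0.1 away from its accuracy
```

Applied to `te_probs` and `y_te`, lower ECE means the calibration curve hugs the diagonal more closely.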

Part 5 — Retrospective: The Full Mathematical Arc

From ch001 to ch340

This book began with a simple question: why should a programmer learn mathematics?

The answer revealed itself chapter by chapter. Here is the complete arc:

| Part | Chapters | Contribution to this Capstone |
| --- | --- | --- |
| I — Mathematical Thinking | 1–20 | The habit of abstraction that made every subsequent concept accessible |
| II — Number Foundations | 21–50 | Floating point, exponentials, logarithms — the substrate of all computation |
| III — Functions | 51–90 | Every layer is a function; composition is the network; loss is the objective |
| IV — Geometry | 91–120 | Embeddings live in geometric spaces; attention is dot-product geometry |
| V — Vectors | 121–150 | Every representation is a vector; cosine similarity measures semantic closeness |
| VI — Linear Algebra | 151–200 | Matrix multiplication is the forward pass; SVD explains what embeddings do |
| VII — Calculus | 201–240 | Backpropagation is the chain rule; gradient descent is optimisation on a surface |
| VIII — Probability | 241–270 | Cross-entropy is negative log-likelihood; dropout is approximate inference |
| IX — Statistics | 271–300 | Train/val/test split; calibration; bias-variance tradeoff |
| X — Deep Learning | 301–340 | All of the above, composed at scale |

The single insight this Capstone demonstrates

A trained neural network is a compressed, differentiable program extracted from data by repeated application of the chain rule. It has no more magic than that — and no less.
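That insight is directly checkable: for a tiny two-layer composition, the hand-applied chain rule matches a finite-difference estimate. A sketch with illustrative scalar weights:

```python
import numpy as np

# f(x) = sigmoid(w2 * tanh(w1 * x)): a two-layer "network" of scalars
w1, w2, x = 0.7, -1.3, 0.5

def f(x):
    return 1 / (1 + np.exp(-w2 * np.tanh(w1 * x)))

# Chain rule applied layer by layer: backpropagation in miniature
a = np.tanh(w1 * x)                        # hidden activation
p = f(x)                                   # output
df_dx = p * (1 - p) * w2 * (1 - a**2) * w1 # sigmoid' * w2 * tanh' * w1

# Finite-difference check, as used throughout this capstone
eps = 1e-6
fd = (f(x + eps) - f(x - eps)) / (2 * eps)
print(abs(df_dx - fd) < 1e-8)  # True: the two gradients agree
```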


Part 6 — Five Open Directions

1. Large Language Models

Scale the architecture to billions of parameters using the training infrastructure from ch333. Apply self-supervised learning (ch324) on web-scale text. The mathematics is identical — only the engineering changes.

2. Multimodal Foundation Models

Connect vision (a CNN or ViT, ch316) with language (a Transformer, ch322) using contrastive learning (ch324). This is how CLIP, Flamingo, and GPT-4V work.
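The mathematical core of that contrastive step is a symmetric cross-entropy over a similarity matrix of paired embeddings. A minimal NumPy sketch of a CLIP-style loss (batch size, dimensions, and temperature are arbitrary illustrative choices, not CLIP's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (image, text) pairs sit on the diagonal
    of the similarity matrix and should score highest in their row and column."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)  # cosine geometry
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                              # (B, B)

    def log_softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

    n = len(img)
    diag = np.arange(n)
    loss_i = -log_softmax(logits, 1)[diag, diag].mean()  # image -> text
    loss_t = -log_softmax(logits, 0)[diag, diag].mean()  # text -> image
    return (loss_i + loss_t) / 2

B, D = 8, 16
img = rng.normal(0, 1, (B, D))
loss_aligned = clip_loss(img, img)                       # perfectly matched pairs
loss_random = clip_loss(img, rng.normal(0, 1, (B, D)))   # unrelated pairs
print(loss_aligned, loss_random)  # aligned pairs give a much lower loss
```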

3. AI Safety and Alignment

Use RLHF (ch330) to align model outputs with human preferences. The mathematical challenge: reward models are trained on human feedback; PPO optimises the policy against that reward. Understanding the failure modes requires interpretability tools (ch331) and scaling law analysis (ch332).
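The reward-model objective mentioned above can be written in one line: a Bradley–Terry pairwise loss over (chosen, rejected) response pairs. A sketch with scalar rewards (the function name and values are illustrative):

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss used to train RLHF reward models:
    -log sigmoid(r_chosen - r_rejected), minimised when the chosen
    response scores well above the rejected one."""
    z = r_chosen - r_rejected
    return float(np.log1p(np.exp(-z)))  # numerically stable -log(sigmoid(z))

print(reward_model_loss(2.0, 0.0))  # ≈ 0.127: correct ranking, low loss
print(reward_model_loss(0.0, 2.0))  # ≈ 2.127: wrong ranking, high loss
```

PPO then maximises this learned reward (minus a KL penalty keeping the policy near its starting point), which is where the failure modes analysed by interpretability and scaling-law tools arise.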

4. Scientific Discovery

Apply GNNs (ch328) to molecular graphs for drug discovery. Apply diffusion models (ch327) to protein structure generation. The same mathematics that generates images can generate molecules with desired properties.
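A single GNN message-passing round is compact enough to sketch: average each atom's neighbour features, then apply a learned transform. A minimal GCN-style layer on a toy 4-atom graph (the adjacency matrix and weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy molecular graph: 4 atoms, bonds as an undirected adjacency matrix
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = rng.normal(0, 1, (4, 8))    # node features, e.g. atom embeddings
W = rng.normal(0, 0.3, (8, 8))  # message weights (random here; learned in practice)

# One message-passing round: ReLU(D^-1 (A + I) H W)
A_hat = A + np.eye(4)                 # self-loops so each atom keeps its own features
D_inv = np.diag(1.0 / A_hat.sum(1))   # degree normalisation (neighbour averaging)
H_next = np.maximum(0, D_inv @ A_hat @ H @ W)
print(H_next.shape)  # (4, 8): same shape, neighbourhood-aware features
```

Stacking such rounds lets information flow across the molecular graph, just as stacked LSTM steps let it flow along a sequence.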

5. Efficient Deep Learning

Apply quantisation (reduce weight precision from FP32 to INT8), pruning (remove near-zero weights), and knowledge distillation (train a small model to mimic a large one). All require understanding the full training pipeline from this Capstone.
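The quantisation step is small enough to sketch directly. Below is symmetric per-tensor INT8 quantisation, one of several common schemes (the weight matrix is a random stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (64, 64)).astype(np.float32)  # stand-in for a trained weight matrix

# Symmetric per-tensor INT8: map [-max|W|, max|W|] onto integers [-127, 127]
scale = np.abs(W).max() / 127.0
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)

# Dequantise to measure the rounding error this introduces
W_deq = W_int8.astype(np.float32) * scale
err = np.abs(W - W_deq).max()
print(f"max abs error {err:.5f} vs scale/2 {scale/2:.5f}")  # error bounded by scale/2
```

The storage drops 4x (one byte per weight plus one float scale per tensor), and the worst-case rounding error is half the quantisation step — the accuracy question is how the network tolerates that perturbation.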


You have now read the complete mathematical foundations of deep learning. The next chapter is not in this book — it is the problem you choose to solve.