1. Learning without labels¶
Labelling data is expensive. Self-supervised learning (SSL) creates pretext tasks from unlabelled data: the label comes from the data itself.
The model learns rich representations by solving these pretext tasks. Fine-tuning on a small labelled dataset then achieves strong performance on the actual task.
2. Masked Language Modelling (BERT)¶
BERT (Devlin et al., 2018): randomly select 15% of tokens as prediction targets (of these, 80% are replaced with [MASK], 10% with a random token, 10% left unchanged); the model predicts the original tokens.
BERT uses a bidirectional Transformer encoder — the model sees the full context (both left and right of the mask). This makes it excellent for classification and extraction, but not for generation.
(Cross-entropy: ch305. Transformers: ch322. Embeddings: ch320.)
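A minimal sketch of the BERT-style masking step described above, assuming the 80/10/10 corruption split from the paper (the function name and the `-100` ignore-label convention are illustrative, not from the original implementation):

```python
import numpy as np

def mlm_mask(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    """BERT-style masking: ~15% of positions become prediction targets;
    of those, 80% -> [MASK], 10% -> random token, 10% left unchanged."""
    if rng is None:
        rng = np.random.default_rng(0)
    tokens = np.array(token_ids)
    labels = np.full_like(tokens, -100)      # -100 = position ignored by the loss
    targets = rng.random(tokens.shape) < mask_prob
    labels[targets] = tokens[targets]        # the model must predict the original token
    roll = rng.random(tokens.shape)
    tokens[targets & (roll < 0.8)] = mask_id                 # 80%: replace with [MASK]
    sel_rand = targets & (roll >= 0.8) & (roll < 0.9)        # 10%: random token
    tokens[sel_rand] = rng.integers(0, vocab_size, sel_rand.sum())
    return tokens, labels                    # remaining 10% of targets stay unchanged

corrupted, labels = mlm_mask(list(range(100, 120)), mask_id=0, vocab_size=30000)
```

The cross-entropy loss is then computed only at the target positions (`labels != -100`); all other positions contribute nothing.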
3. Causal Language Modelling (GPT)¶
GPT (Radford et al., 2018): predict the next token given all previous tokens.
Uses a decoder-only Transformer with causal masking. Can generate text autoregressively.
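The two ingredients of causal language modelling — the lower-triangular attention mask and the shift-by-one next-token targets — can be sketched in a few lines (names are illustrative):

```python
import numpy as np

def causal_mask(T: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend only to positions j <= i."""
    return np.tril(np.ones((T, T), dtype=bool))

def next_token_targets(token_ids):
    """Inputs are tokens[:-1]; targets are tokens[1:] (shifted by one)."""
    tokens = np.asarray(token_ids)
    return tokens[:-1], tokens[1:]

m = causal_mask(4)                       # m[i, j] is True iff j <= i
x, y = next_token_targets([5, 7, 11, 13])
# x = [5, 7, 11]  (what the model sees)
# y = [7, 11, 13] (what it must predict at each position)
```

At generation time the same model is applied autoregressively: sample a token, append it, and predict again.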
4. Contrastive Learning (SimCLR, CLIP)¶
Create two augmented views of the same sample; train representations so that views of the same sample are close and views of different samples are far apart.
CLIP (Radford et al., 2021) contrastively trains image and text encoders on 400M image-caption pairs. Zero-shot classification emerges naturally.
(KL divergence: ch295. Cosine similarity: ch132.)
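How zero-shot classification "emerges naturally" from CLIP's shared embedding space can be shown with toy vectors (hand-crafted here for determinism; real CLIP embeddings come from the trained image and text encoders):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Pick the caption whose embedding has the highest cosine
    similarity with the image embedding (CLIP-style zero-shot)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarities, shape (n_texts,)
    return int(np.argmax(sims)), sims

# Toy embeddings standing in for "a photo of a {cat, dog, car}"
texts = np.array([[1., 0., 0., 0.],
                  [0., 1., 0., 0.],
                  [0., 0., 1., 0.]])
image = np.array([0.1, 0.9, 0.1, 0.0])   # an image close to caption 1 ("dog")
pred, sims = zero_shot_classify(image, texts)   # pred == 1
```

No classifier head is trained: the class labels are just captions, and prediction is nearest-caption retrieval in the shared space.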
import numpy as np
import matplotlib.pyplot as plt
def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))
def nt_xent_loss(z: np.ndarray, temperature: float = 0.5) -> float:
    """NT-Xent contrastive loss for a batch of (anchor, positive) pairs.

    z: (2N, d) — first N rows are anchors, next N rows are positives.
    Loss encourages z[i] and z[i+N] to be similar.
    """
    N = z.shape[0] // 2
    eps = 1e-8
    # Normalise
    z_norm = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    # Similarity matrix (2N x 2N)
    sim = z_norm @ z_norm.T / temperature
    # For each anchor i, the positive is i+N (and vice versa)
    total_loss = 0.0
    for i in range(2 * N):
        positive_idx = (i + N) % (2 * N)
        # Mask out self-similarity from the denominator
        denom_mask = np.ones(2 * N, dtype=bool)
        denom_mask[i] = False
        numerator = np.exp(sim[i, positive_idx])
        denominator = np.exp(sim[i][denom_mask]).sum()
        total_loss -= np.log(numerator / (denominator + eps) + eps)
    return total_loss / (2 * N)
# Simulate contrastive learning: embed items from 3 "classes"
# Augmentations = small perturbations within the same class
rng = np.random.default_rng(42)
n_classes, n_per_class, d = 3, 8, 32
# Initialise random embeddings
embeddings = rng.normal(0, 1, (n_classes * n_per_class, d))
# "Train": simulate gradient descent pulling same-class pairs together
lr = 0.05
losses = []
for step in range(200):
    # Pick a random anchor and positive (same class), negative (different class)
    cls = rng.integers(0, n_classes)
    anchor_idx = cls * n_per_class + rng.integers(0, n_per_class)
    pos_idx = cls * n_per_class + rng.integers(0, n_per_class)
    neg_cls = rng.choice([c for c in range(n_classes) if c != cls])
    neg_idx = neg_cls * n_per_class + rng.integers(0, n_per_class)
    za, zp, zn = embeddings[anchor_idx], embeddings[pos_idx], embeddings[neg_idx]
    # Triplet loss gradient (simplified)
    pos_sim = cosine_sim(za, zp)
    neg_sim = cosine_sim(za, zn)
    loss = max(0.0, 0.5 - pos_sim + neg_sim)
    losses.append(loss)
    if loss > 0:
        embeddings[anchor_idx] += lr * (zp - zn) / (np.linalg.norm(za) + 1e-8)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(losses, color='#3498db', lw=1.5, alpha=0.7)
axes[0].set_title('Contrastive triplet loss over steps')
axes[0].set_xlabel('Step'); axes[0].set_ylabel('Loss')
# PCA of final embeddings
E = embeddings
E_c = E - E.mean(0)
U, S, Vt = np.linalg.svd(E_c, full_matrices=False)
E2d = U[:, :2] * S[:2]
colors = ['#e74c3c', '#3498db', '#2ecc71']
for cls in range(n_classes):
    idxs = slice(cls * n_per_class, (cls + 1) * n_per_class)
    axes[1].scatter(E2d[idxs, 0], E2d[idxs, 1], color=colors[cls],
                    s=60, label=f'Class {cls}', zorder=5)
axes[1].set_title('Embedding space after contrastive training\n(PCA to 2D)')
axes[1].legend()
plt.tight_layout()
plt.savefig('ch324_ssl.png', dpi=120)
plt.show()
5. Masked Image Modelling (MAE)¶
MAE (He et al., 2022) masks 75% of image patches and trains a Transformer to reconstruct the pixel values of masked patches. This forces learning of semantic structure (the model must infer what an eye looks like from the surrounding face patches).
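The 75% random patch masking at the heart of MAE can be sketched as an index split (function name illustrative; the real implementation also shuffles and unshuffles patch embeddings inside the Transformer):

```python
import numpy as np

def random_patch_mask(n_patches: int, mask_ratio: float = 0.75, rng=None):
    """Return sorted indices of visible and masked patches (MAE-style)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_mask = int(n_patches * mask_ratio)
    perm = rng.permutation(n_patches)
    return np.sort(perm[n_mask:]), np.sort(perm[:n_mask])   # visible, masked

# A 224px image with 16px patches gives a 14x14 grid = 196 patches
visible, masked = random_patch_mask(196)
# The encoder sees only the 49 visible patches; the lightweight decoder
# reconstructs the pixel values of the 147 masked ones.
```

Because the encoder processes only a quarter of the patches, MAE pretraining is also substantially cheaper per image than training on the full sequence.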
6. Summary¶
SSL learns from unlabelled data via self-generated labels (masking, next-token, contrastive).
MLM (BERT): bidirectional, excellent for understanding tasks.
CLM (GPT): autoregressive, excellent for generation.
Contrastive (SimCLR, CLIP): learns invariant representations across augmentations.
SSL pretrained models are the starting point for nearly all modern NLP/vision pipelines.
7. Forward and backward references¶
Used here: cross-entropy (ch305), Transformers (ch322), cosine similarity (ch132), KL divergence (ch295).
This will reappear in ch332 — Scaling Laws, where SSL on web-scale data is the driving force behind emergent capabilities in large language models.