1. The empirical observation¶
Kaplan et al. (2020) and Hoffmann et al. (2022) showed that neural language model test loss follows power laws in model size N, dataset size D, and compute C:

L(N) = (N_c / N)^α_N,  L(D) = (D_c / D)^α_D,  L(C) = (C_c / C)^α_C

where α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050.
These exponents are remarkably consistent across architectures and modalities.
(Power laws and logarithms: ch039–ch040. Model evaluation: ch279.)
import numpy as np
import matplotlib.pyplot as plt
# Simulate scaling law curves (matching Kaplan et al. 2020 form)
def kaplan_loss(N: np.ndarray, alpha: float = 0.076, N_c: float = 8.8e13) -> np.ndarray:
    """Kaplan et al. 2020 form: loss as a power law in scale."""
    return (N_c / N) ** alpha

def chinchilla_optimal_n(C: float) -> float:
    """Hoffmann et al. 2022: compute-optimal N given a budget of C FLOPs."""
    # C ≈ 6*N*D and D* ≈ 20*N* (Chinchilla)  =>  C ≈ 120*N^2  =>  N* = (C/120)^0.5
    return (C / 120) ** 0.5
# Model size range: 10M to 100B parameters
N = np.logspace(7, 11, 200)
C = np.logspace(18, 24, 200) # FLOPs
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
# Loss vs parameters
axes[0].loglog(N, kaplan_loss(N, 0.076), color='#e74c3c', lw=2)
axes[0].set_title('Loss vs Parameters\n(power law)')
axes[0].set_xlabel('Parameters N'); axes[0].set_ylabel('Test Loss')
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim(1.5, 5)
# Loss vs dataset tokens (similar power law)
D = np.logspace(9, 13, 200)
axes[1].loglog(D, kaplan_loss(D, 0.095, 5.4e13), color='#3498db', lw=2)
axes[1].set_title('Loss vs Dataset Size\n(power law)')
axes[1].set_xlabel('Dataset tokens D'); axes[1].set_ylabel('Test Loss')
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim(1.5, 5)
# Chinchilla: optimal allocation of compute between N and D
opt_N = np.array([chinchilla_optimal_n(c) for c in C])
opt_D = C / (6 * opt_N) # D ≈ C / (6N)
axes[2].loglog(C, opt_N, label='Optimal N (parameters)', color='#e74c3c', lw=2)
axes[2].loglog(C, opt_D, label='Optimal D (tokens)', color='#3498db', lw=2)
axes[2].set_title('Chinchilla scaling:\nOptimal N and D vs Compute')
axes[2].set_xlabel('Compute budget C (FLOPs)')
axes[2].legend(fontsize=9)
axes[2].grid(True, alpha=0.3)
plt.suptitle('Neural scaling laws: performance is predictable from compute budget', fontsize=11)
plt.tight_layout()
plt.savefig('ch332_scaling_laws.png', dpi=120)
plt.show()
# Illustrate: double the model size, expected improvement
loss_1B = kaplan_loss(1e9)
loss_2B = kaplan_loss(2e9)
loss_10B = kaplan_loss(10e9)
print("Scaling law predictions (loss ∝ N^-0.076):")
print(f" 1B params: loss ≈ {loss_1B:.4f}")
print(f" 2B params: loss ≈ {loss_2B:.4f} ({(loss_1B/loss_2B - 1)*100:.1f}% improvement)")
print(f" 10B params: loss ≈ {loss_10B:.4f} ({(loss_1B/loss_10B - 1)*100:.1f}% improvement)")
print()
print("Diminishing returns: 10x params → only", round((loss_1B/loss_10B - 1)*100, 1), "% improvement.")
print("Chinchilla insight: models are often undertrained — more tokens matter as much as more params.")

2. The Chinchilla result¶
Kaplan et al. (2020) recommended scaling model size faster than dataset size. Hoffmann et al. (2022) corrected this: for a fixed compute budget C, model size N and training tokens D should be scaled equally:

N* ∝ C^0.5,  D* ∝ C^0.5,  with D/N ≈ 20 tokens per parameter.
GPT-3 (175B params, 300B tokens) was dramatically undertrained by Chinchilla standards. Chinchilla (70B, 1.4T tokens) matched GPT-3 quality with 2.5× fewer parameters.
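The GPT-3 comparison can be sanity-checked with the standard approximations C ≈ 6·N·D for training FLOPs and D* ≈ 20·N* for the Chinchilla-optimal token budget (a rough sketch; the constants are the usual rules of thumb, not exact fits):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard approximation: training cost ≈ 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def chinchilla_allocation(C: float) -> tuple[float, float]:
    """Compute-optimal (N, D) under C ≈ 6*N*D with D ≈ 20*N  =>  C ≈ 120*N^2."""
    n_opt = (C / 120) ** 0.5
    return n_opt, 20 * n_opt

# GPT-3's training budget: 175B params on 300B tokens
C_gpt3 = train_flops(175e9, 300e9)  # ≈ 3.15e23 FLOPs
n_opt, d_opt = chinchilla_allocation(C_gpt3)
print(f"GPT-3 budget:     {C_gpt3:.2e} FLOPs")
print(f"Compute-optimal:  N ≈ {n_opt / 1e9:.0f}B params, D ≈ {d_opt / 1e12:.2f}T tokens")
```

With GPT-3's compute, the same rules of thumb would train a ~51B-parameter model on roughly 1T tokens — which is the quantitative sense in which GPT-3 was undertrained.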
3. Emergent capabilities¶
Some capabilities do not appear gradually — they emerge sharply at specific scale thresholds. Examples:
In-context learning (few-shot): emerges around 10B+ parameters.
Chain-of-thought reasoning: emerges around 100B+ parameters.
The mechanism is debated: it may be an artefact of measurement (capabilities exist earlier but are hard to detect) or a true phase transition in the learned computation.
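The measurement-artefact side of this debate can be sketched with a toy model (the loss curve and the 5-token answer length here are illustrative assumptions, not fitted to any real benchmark): per-token accuracy improves smoothly with scale, but an all-or-nothing metric like exact match on a multi-token answer can look like a sharp emergence.

```python
import numpy as np

# Toy model: smooth per-token improvement vs. a discontinuous metric.
N = np.logspace(8, 12, 5)            # 100M to 1T parameters
loss = 3.0 * (1e8 / N) ** 0.3        # assumed smooth power-law loss (illustrative)
p_token = np.exp(-loss)              # per-token probability of the correct token
exact_match = p_token ** 5           # all 5 answer tokens must be correct

for n, p, em in zip(N, p_token, exact_match):
    print(f"N={n:.0e}  per-token acc={p:.3f}  5-token exact match={em:.2e}")
```

Per-token accuracy rises steadily across the whole range, while exact match stays near zero and then climbs steeply over the last decade of scale — the same smooth underlying improvement, viewed through a discontinuous metric.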
4. Summary¶
Performance scales as a power law in N, D, and C.
Kaplan scaling: scale N faster. Chinchilla scaling: scale N and D equally.
Compute-optimal training: given budget C, allocate equally between params and tokens.
Emergent capabilities appear at specific scale thresholds — not gradual.
5. Forward and backward references¶
Used here: power laws (ch039), logarithmic scales (ch040), model evaluation (ch279).
This will reappear in ch333 — Training Infrastructure, where the practical implications of scaling (distributed training, memory bottlenecks) are discussed.