1. The empirical observation¶
Kaplan et al. (2020) and Hoffmann et al. (2022) showed that neural language model test loss follows power laws in model size N, dataset size D, and compute C:

L(N) = (N_c / N)^α_N,  L(D) = (D_c / D)^α_D,  L(C) = (C_c / C)^α_C

where α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050.
These exponents are remarkably consistent across architectures and modalities.
(Power laws and logarithms: ch039–ch040. Model evaluation: ch279.)
import numpy as np
import matplotlib.pyplot as plt
# Simulate scaling law curves (matching Kaplan et al. 2020 form)
def kaplan_loss(N: np.ndarray, alpha: float = 0.076, N_c: float = 8.8e13) -> np.ndarray:
    """Kaplan et al. 2020 form: loss as a power law in scale."""
    return (N_c / N) ** alpha

def chinchilla_optimal_n(C: float) -> float:
    """Hoffmann et al. 2022: compute-optimal N given a budget of C FLOPs."""
    # C ≈ 6*N*D and D* ≈ 20*N* (Chinchilla)  =>  C ≈ 120*N^2  =>  N* = (C/120)^0.5
    return (C / 120) ** 0.5
# Model size range: 10M to 100B parameters
N = np.logspace(7, 11, 200)
C = np.logspace(18, 24, 200) # FLOPs
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
# Loss vs parameters
axes[0].loglog(N, kaplan_loss(N, 0.076), color='#e74c3c', lw=2)
axes[0].set_title('Loss vs Parameters\n(power law)')
axes[0].set_xlabel('Parameters N'); axes[0].set_ylabel('Test Loss')
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim(1.5, 5)
# Loss vs dataset tokens (similar power law)
D = np.logspace(9, 13, 200)
axes[1].loglog(D, kaplan_loss(D, 0.095, 5.4e13), color='#3498db', lw=2)
axes[1].set_title('Loss vs Dataset Size\n(power law)')
axes[1].set_xlabel('Dataset tokens D'); axes[1].set_ylabel('Test Loss')
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim(1.5, 5)
# Chinchilla: optimal allocation of compute between N and D
opt_N = np.array([chinchilla_optimal_n(c) for c in C])
opt_D = C / (6 * opt_N) # D ≈ C / (6N)
axes[2].loglog(C, opt_N, label='Optimal N (parameters)', color='#e74c3c', lw=2)
axes[2].loglog(C, opt_D, label='Optimal D (tokens)', color='#3498db', lw=2)
axes[2].set_title('Chinchilla scaling:\nOptimal N and D vs Compute')
axes[2].set_xlabel('Compute budget C (FLOPs)')
axes[2].legend(fontsize=9)
axes[2].grid(True, alpha=0.3)
plt.suptitle('Neural scaling laws: performance is predictable from compute budget', fontsize=11)
plt.tight_layout()
plt.savefig('ch332_scaling_laws.png', dpi=120)
plt.show()
# Illustrate: double the model size, expected improvement
loss_1B = kaplan_loss(1e9)
loss_2B = kaplan_loss(2e9)
loss_10B = kaplan_loss(10e9)
print("Scaling law predictions (loss ∝ N^-0.076):")
print(f" 1B params: loss ≈ {loss_1B:.4f}")
print(f" 2B params: loss ≈ {loss_2B:.4f} ({(loss_1B/loss_2B - 1)*100:.1f}% improvement)")
print(f" 10B params: loss ≈ {loss_10B:.4f} ({(loss_1B/loss_10B - 1)*100:.1f}% improvement)")
print()
print("Diminishing returns: 10x params → only", round((loss_1B/loss_10B - 1)*100, 1), "% improvement.")
print("Chinchilla insight: models are often undertrained — more tokens matter as much as more params.")

2. The Chinchilla result¶
Kaplan et al. (2020) recommended scaling model size faster than dataset size. Hoffmann et al. (2022) corrected this: for a fixed compute budget C, model size N and training tokens D should be scaled equally:

N* ∝ C^0.5,  D* ∝ C^0.5,  with D/N ≈ 20 tokens per parameter.
GPT-3 (175B params, 300B tokens) was dramatically undertrained by Chinchilla standards. Chinchilla (70B, 1.4T tokens) matched GPT-3 quality with 2.5× fewer parameters.
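The GPT-3 comparison can be sanity-checked with the standard approximations C ≈ 6·N·D for training FLOPs and D* ≈ 20·N* for the Chinchilla-optimal token budget (a rough sketch; the constants are the usual rules of thumb, not exact fits):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard approximation: training cost ≈ 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def chinchilla_allocation(C: float) -> tuple[float, float]:
    """Compute-optimal (N, D) under C ≈ 6*N*D with D ≈ 20*N  =>  C ≈ 120*N^2."""
    n_opt = (C / 120) ** 0.5
    return n_opt, 20 * n_opt

# GPT-3's training budget: 175B params on 300B tokens
C_gpt3 = train_flops(175e9, 300e9)  # ≈ 3.15e23 FLOPs
n_opt, d_opt = chinchilla_allocation(C_gpt3)
print(f"GPT-3 budget:     {C_gpt3:.2e} FLOPs")
print(f"Compute-optimal:  N ≈ {n_opt / 1e9:.0f}B params, D ≈ {d_opt / 1e12:.2f}T tokens")
```

With GPT-3's compute, the same rules of thumb would train a ~51B-parameter model on roughly 1T tokens — which is the quantitative sense in which GPT-3 was undertrained.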
3. Emergent capabilities¶
Some capabilities do not appear gradually — they emerge sharply at specific scale thresholds. Examples:
In-context learning (few-shot): emerges around 10B+ parameters.
Chain-of-thought reasoning: emerges around 100B+ parameters.
The mechanism is debated: it may be an artefact of measurement (capabilities exist earlier but are hard to detect) or a true phase transition in the learned computation.
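The measurement-artefact side of this debate can be sketched with a toy model (the loss curve and the 5-token answer length here are illustrative assumptions, not fitted to any real benchmark): per-token accuracy improves smoothly with scale, but an all-or-nothing metric like exact match on a multi-token answer can look like a sharp emergence.

```python
import numpy as np

# Toy model: smooth per-token improvement vs. a discontinuous metric.
N = np.logspace(8, 12, 5)            # 100M to 1T parameters
loss = 3.0 * (1e8 / N) ** 0.3        # assumed smooth power-law loss (illustrative)
p_token = np.exp(-loss)              # per-token probability of the correct token
exact_match = p_token ** 5           # all 5 answer tokens must be correct

for n, p, em in zip(N, p_token, exact_match):
    print(f"N={n:.0e}  per-token acc={p:.3f}  5-token exact match={em:.2e}")
```

Per-token accuracy rises steadily across the whole range, while exact match stays near zero and then climbs steeply over the last decade of scale — the same smooth underlying improvement, viewed through a discontinuous metric.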
4. Summary¶
Performance scales as a power law in N, D, and C.
Kaplan scaling: scale N faster. Chinchilla scaling: scale N and D equally.
Compute-optimal training: given budget C, allocate equally between params and tokens.
Emergent capabilities appear at specific scale thresholds — not gradual.
5. Forward and backward references¶
Used here: power laws (ch039), logarithmic scales (ch040), model evaluation (ch279).
This will reappear in ch333 — Training Infrastructure, where the practical implications of scaling (distributed training, memory bottlenecks) are discussed.