
1. The empirical observation

Kaplan et al. (2020) and Hoffmann et al. (2022) showed that neural language model performance follows power laws in model size $N$, dataset size $D$, and compute $C$:

$$\mathcal{L}(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \mathcal{L}(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \mathcal{L}(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, and $\alpha_C \approx 0.050$.

These exponents are remarkably consistent across architectures and modalities.

(Power laws and logarithms: ch039–ch040. Model evaluation: ch279.)

import numpy as np
import matplotlib.pyplot as plt


# Simulate scaling law curves (matching Kaplan et al. 2020 form)
def kaplan_loss(N: np.ndarray, alpha: float = 0.076, N_c: float = 8.8e13) -> np.ndarray:
    return (N_c / N) ** alpha

def chinchilla_optimal_n(C: float) -> float:
    """Hoffmann et al. 2022: optimal N given compute budget C FLOPs."""
    # Simplified: N* = (C/6)^0.5, dropping the empirical constants
    return (C / 6) ** 0.5

# Model size range: 10M to 100B parameters
N = np.logspace(7, 11, 200)
C = np.logspace(18, 24, 200)  # FLOPs

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Loss vs parameters
axes[0].loglog(N, kaplan_loss(N, 0.076), color='#e74c3c', lw=2)
axes[0].set_title('Loss vs Parameters\n(power law)')
axes[0].set_xlabel('Parameters N'); axes[0].set_ylabel('Test Loss')
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim(1.5, 5)

# Loss vs dataset tokens (similar power law)
D = np.logspace(9, 13, 200)
axes[1].loglog(D, kaplan_loss(D, 0.095, 5.4e13), color='#3498db', lw=2)
axes[1].set_title('Loss vs Dataset Size\n(power law)')
axes[1].set_xlabel('Dataset tokens D'); axes[1].set_ylabel('Test Loss')
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim(1.5, 5)

# Chinchilla: optimal allocation of compute between N and D
opt_N = np.array([chinchilla_optimal_n(c) for c in C])
opt_D = C / (6 * opt_N)  # D ≈ C / (6N)
axes[2].loglog(C, opt_N, label='Optimal N (parameters)', color='#e74c3c', lw=2)
axes[2].loglog(C, opt_D, label='Optimal D (tokens)', color='#3498db', lw=2)
axes[2].set_title('Chinchilla scaling:\nOptimal N and D vs Compute')
axes[2].set_xlabel('Compute budget C (FLOPs)')
axes[2].legend(fontsize=9)
axes[2].grid(True, alpha=0.3)

plt.suptitle('Neural scaling laws: performance is predictable from compute budget', fontsize=11)
plt.tight_layout()
plt.savefig('ch332_scaling_laws.png', dpi=120)
plt.show()

# Illustrate: double the model size, expected improvement
loss_1B = kaplan_loss(1e9)
loss_2B = kaplan_loss(2e9)
loss_10B = kaplan_loss(10e9)
print("Scaling law predictions (loss ∝ N^-0.076):")
print(f"  1B params:  loss ≈ {loss_1B:.4f}")
print(f"  2B params:  loss ≈ {loss_2B:.4f}  ({(loss_1B/loss_2B - 1)*100:.1f}% improvement)")
print(f"  10B params: loss ≈ {loss_10B:.4f}  ({(loss_1B/loss_10B - 1)*100:.1f}% improvement)")
print()
print(f"Diminishing returns: 10x params → only {(loss_1B/loss_10B - 1)*100:.1f}% improvement.")
print("Chinchilla insight: models are often undertrained — more tokens matter as much as more params.")

2. The Chinchilla result

Kaplan et al. (2020) recommended scaling model size faster than dataset size. Hoffmann et al. (2022) corrected this: for a fixed compute budget $C$, model size $N$ and training tokens $D$ should be scaled in equal proportion:

$$N^* \approx \sqrt{C/6}, \qquad D^* = \frac{C}{6N^*} \approx \sqrt{C/6}$$

(constants dropped; Hoffmann et al.'s fitted coefficients work out to roughly 20 training tokens per parameter, $D \approx 20N$).

GPT-3 (175B parameters, 300B tokens) was dramatically undertrained by Chinchilla standards. Chinchilla (70B parameters, 1.4T tokens) matched GPT-3 quality with 2.5× fewer parameters.
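The correction can be sketched numerically. A minimal example, assuming the common approximation $C \approx 6ND$ and the rule-of-thumb ratio of about 20 tokens per parameter (the function name and constants here are illustrative, not the paper's exact fit):

```python
def chinchilla_allocation(C: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Split a compute budget C (FLOPs) between parameters and tokens.

    Assumes C ~= 6*N*D and the rule of thumb D ~= 20*N;
    both N* and D* then scale as sqrt(C).
    """
    n_opt = (C / (6 * tokens_per_param)) ** 0.5
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# GPT-3's actual training budget: 175B parameters on 300B tokens
C_gpt3 = 6 * 175e9 * 300e9  # roughly 3.15e23 FLOPs
n_opt, d_opt = chinchilla_allocation(C_gpt3)
print(f"Budget:          C  ~ {C_gpt3:.2e} FLOPs")
print(f"Compute-optimal: N* ~ {n_opt / 1e9:.0f}B params, D* ~ {d_opt / 1e12:.2f}T tokens")
print("GPT-3 actual:    N  = 175B params, D  = 0.30T tokens (undertrained)")
```

At ~20 tokens per parameter, GPT-3's budget would favour a model several times smaller trained on several times more tokens, which is exactly the direction of the Chinchilla correction.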


3. Emergent capabilities

Some capabilities do not appear gradually — they emerge sharply at specific scale thresholds. Examples:

  • In-context learning (few-shot): emerges around 10B+ parameters.

  • Chain-of-thought reasoning: emerges around 100B+ parameters.

The mechanism is debated: it may be an artefact of measurement (capabilities exist earlier but are hard to detect) or a true phase transition in the learned computation.
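The measurement-artefact hypothesis can be illustrated with a toy simulation (all numbers below are invented for the sketch, not fitted to any model family): a per-token accuracy that improves smoothly with scale still yields an exact-match accuracy that looks like a sharp jump, because every token of a multi-token answer must be correct at once.

```python
import numpy as np

# Toy illustration: smooth per-token improvement vs. "emergent" exact match.
# The mapping from loss to per-token accuracy is invented for this sketch.
N = np.logspace(7, 11, 9)           # model sizes: 10M to 100B params
loss = (8.8e13 / N) ** 0.076        # Kaplan-style power-law loss
p_token = np.exp(-loss / 4)         # per-token accuracy (toy mapping)
answer_len = 10                     # tokens in the target answer
p_exact = p_token ** answer_len     # exact match: all tokens right at once

for n, pt, pe in zip(N, p_token, p_exact):
    print(f"N = {n:8.0e}   per-token = {pt:.3f}   exact-match = {pe:.4f}")
```

Per-token accuracy improves by a modest factor across four orders of magnitude of scale, while exact-match accuracy improves by that factor raised to the answer length, so a threshold-like metric can make a smooth underlying trend look like a phase transition.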


4. Summary

  • Performance scales as a power law in N, D, and C.

  • Kaplan scaling: scale N faster. Chinchilla scaling: scale N and D equally.

  • Compute-optimal training: given budget C, grow parameters and tokens in equal proportion (both ∝ √C).

  • Emergent capabilities appear at specific scale thresholds — not gradual.


5. Forward and backward references

Used here: power laws (ch039), logarithmic scales (ch040), model evaluation (ch279).

This will reappear in ch333 — Training Infrastructure, where the practical implications of scaling (distributed training, memory bottlenecks) are discussed.