
ch327 — Diffusion Models Intuition

1. The core idea

Diffusion models (Ho et al., 2020) learn to reverse a gradual noising process.

Forward process: add small Gaussian noise at each of $T$ steps until $x_T \sim \mathcal{N}(0, I)$. Reverse process: learn a neural network that removes the noise step by step.

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

A key property: $x_t$ can be sampled directly from $x_0$ in closed form:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)$$

where $\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)$.
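This identity is easy to check numerically: iterating the one-step forward update $T$ times should land on the same Gaussian that the closed form predicts in a single jump. A standalone sketch (the linear schedule values match the `DDPM1D` example below):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)  # same linear schedule as the code below
alpha_bars = np.cumprod(1.0 - betas)

# Diffuse a point mass at x0 = 2.0 one step at a time
n = 100_000
x = np.full(n, 2.0)
for t in range(T):
    x = np.sqrt(1 - betas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(n)

# The closed form predicts x_T ~ N(sqrt(ab_T) * 2, 1 - ab_T)
print(f"iterated:    mean={x.mean():.3f}  var={x.var():.3f}")
print(f"closed form: mean={np.sqrt(alpha_bars[-1]) * 2:.3f}  var={1 - alpha_bars[-1]:.3f}")
```

The two lines agree up to Monte Carlo error, which is why training never needs to simulate the chain: any $x_t$ is one reparameterized sample away from $x_0$.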

The network is trained to predict the noise $\varepsilon$ added at each step, using a simple MSE loss:

$$\mathcal{L} = \mathbb{E}_{x_0, \varepsilon, t}\left[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2\right]$$

(Gaussian distribution: ch253. ELBO connection: ch325. Markov chains: ch258.)

import numpy as np
import matplotlib.pyplot as plt


class DDPM1D:
    """
    Denoising Diffusion Probabilistic Model for 1D data.
    Demonstrates the forward process and (analytically) the reverse.
    """

    def __init__(self, T: int = 100, beta_start: float = 1e-4, beta_end: float = 0.02):
        self.T = T
        self.betas = np.linspace(beta_start, beta_end, T)
        self.alphas = 1.0 - self.betas
        self.alpha_bars = np.cumprod(self.alphas)

    def q_sample(self, x0: np.ndarray, t: int, rng: np.random.Generator) -> tuple:
        """Sample x_t from x_0 directly (closed form)."""
        sqrt_ab = np.sqrt(self.alpha_bars[t])
        sqrt_1mab = np.sqrt(1 - self.alpha_bars[t])
        eps = rng.standard_normal(x0.shape)
        x_t = sqrt_ab * x0 + sqrt_1mab * eps
        return x_t, eps

    def noise_level_at(self, t: int) -> float:
        return float(np.sqrt(1 - self.alpha_bars[t]))


rng = np.random.default_rng(0)
ddpm = DDPM1D(T=200)

# 1D toy data: bimodal Gaussian
x0 = np.concatenate([rng.normal(-2, 0.3, 200), rng.normal(2, 0.3, 200)])

# Show forward process: data → noise
timesteps_to_show = [0, 20, 50, 100, 150, 199]

fig, axes = plt.subplots(2, 3, figsize=(13, 7))
for ax, t in zip(axes.ravel(), timesteps_to_show):
    x_t, _ = ddpm.q_sample(x0, t, rng)
    ax.hist(x_t, bins=50, density=True, color='#3498db', alpha=0.75, edgecolor='white')
    noise_level = ddpm.noise_level_at(t)
    ax.set_title(f't={t}   noise_level={noise_level:.3f}')
    ax.set_xlim(-5, 5)
    ax.set_ylim(0, 1.2)
    if t == 0:
        ax.set_ylabel('Density')

plt.suptitle('Forward diffusion process: data gradually becomes pure noise', fontsize=11)
plt.tight_layout()
plt.savefig('ch327_diffusion_forward.png', dpi=120)
plt.show()

# Show noise schedule
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
t_axis = np.arange(ddpm.T)
axes[0].plot(t_axis, ddpm.betas, color='#e74c3c', lw=2)
axes[0].set_title('Beta schedule (noise added per step)')
axes[0].set_xlabel('Timestep t'); axes[0].set_ylabel('β_t')

axes[1].plot(t_axis, ddpm.alpha_bars, color='#3498db', lw=2, label='ᾱ_t (signal fraction)')
axes[1].plot(t_axis, 1 - ddpm.alpha_bars, color='#e74c3c', lw=2, label='1-ᾱ_t (noise fraction)')
axes[1].set_title('Signal vs noise fraction across timesteps')
axes[1].set_xlabel('Timestep t'); axes[1].legend()

plt.tight_layout()
plt.savefig('ch327_diffusion_schedule.png', dpi=120)
plt.show()
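The training objective above can also be estimated by Monte Carlo without any model. The standalone sketch below (re-deriving the schedule inline) evaluates the loss for a hypothetical baseline predictor that always outputs zero; since $\varepsilon \sim \mathcal{N}(0, I)$, its expected per-element loss is exactly 1, a useful floor check when debugging a real denoiser:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

# Same bimodal toy data as above
x0 = np.concatenate([rng.normal(-2, 0.3, 200), rng.normal(2, 0.3, 200)])

def training_loss(eps_theta, batch_size):
    """One Monte Carlo estimate of the DDPM objective E[||eps - eps_theta||^2]."""
    t = rng.integers(0, T, batch_size)        # uniform random timestep per sample
    x = rng.choice(x0, batch_size)            # random data points
    eps = rng.standard_normal(batch_size)
    x_t = np.sqrt(alpha_bars[t]) * x + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

# Hypothetical baseline that always predicts zero noise:
# its expected loss is E[eps^2] = 1.
loss_zero = training_loss(lambda x_t, t: np.zeros_like(x_t), batch_size=100_000)
print(f"zero-predictor loss = {loss_zero:.3f}")  # ≈ 1.0
```

A trained $\varepsilon_\theta$ must score below this baseline to be learning anything about the data.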

2. The denoising network

The network $\varepsilon_\theta(x_t, t)$ must:

  1. Accept the noisy sample $x_t$ and timestep $t$ as inputs.

  2. Predict the noise $\varepsilon$ that was added.

For images: U-Net architecture with time embedding injected at each layer. For text (Diffusion-LM): Transformer with time conditioning.

Time embedding: $t$ is embedded as a sinusoidal positional encoding (ch323) and projected into the network via scale-and-shift in each layer.
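A minimal sketch of such a sinusoidal time embedding (the dimension split and base frequency 10000 follow the Transformer convention from ch323; the per-layer projection and scale-and-shift are omitted):

```python
import numpy as np

def time_embedding(t: int, dim: int = 16) -> np.ndarray:
    """Sinusoidal embedding of timestep t, as in Transformer positional encoding."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)  # geometric frequency ladder
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

print(time_embedding(0, 8))   # sin half all zeros, cos half all ones
print(time_embedding(50, 8))  # a distinct vector for each timestep
```

Because nearby timesteps get similar embeddings, a single network can smoothly share denoising behavior across noise levels.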


3. Sampling

To generate new data from a trained diffusion model:

  1. Start from $x_T \sim \mathcal{N}(0, I)$.

  2. Repeatedly apply the learned denoising step:

    $$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta(x_t, t)\right) + \sigma_t \varepsilon$$
  3. After $T$ steps, $x_0$ is a new sample.
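For the 1D bimodal toy data from Section 1, this loop can be run end to end without any training, because the optimal noise predictor $\mathbb{E}[\varepsilon \mid x_t]$ has a closed form for a Gaussian mixture. In the standalone sketch below, that closed form stands in for a learned $\varepsilon_\theta$, and $\sigma_t = \sqrt{\beta_t}$ is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Data distribution: 0.5*N(-2, 0.3^2) + 0.5*N(2, 0.3^2), as in Section 1
mus, sig = np.array([-2.0, 2.0]), 0.3

def eps_optimal(x_t, t):
    """Closed-form optimal noise predictor E[eps | x_t] for the mixture.
    Stands in for a trained eps_theta(x_t, t)."""
    s = np.sqrt(alpha_bars[t])                          # signal scale sqrt(ab_t)
    v = alpha_bars[t] * sig**2 + (1 - alpha_bars[t])    # marginal variance per mode
    logw = -(x_t[:, None] - s * mus) ** 2 / (2 * v)     # log posterior over modes
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # E[x0 | x_t, mode], averaged over the posterior mode weights
    x0_hat = (w * (mus + (s * sig**2 / v) * (x_t[:, None] - s * mus))).sum(axis=1)
    return (x_t - s * x0_hat) / np.sqrt(1 - alpha_bars[t])

# Ancestral sampling: start from pure noise, denoise step by step
x = rng.standard_normal(500)
for t in range(T - 1, -1, -1):
    eps_hat = eps_optimal(x, t)
    x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:                                           # no noise on the final step
        x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)

print(f"mean |x| = {np.abs(x).mean():.2f}")  # both modes at ±2 are recovered
```

Starting from pure noise, the samples split back into the two modes at $\pm 2$: exactly what a trained network has to learn to do from data alone.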

DDIM (Song et al., 2020) enables sampling in 10–50 steps instead of 1000 by making the reverse process deterministic.
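A sketch of DDIM's deterministic update ($\eta = 0$): each step first estimates $x_0$ from the predicted noise, then re-noises directly to a much earlier timestep, so most timesteps can be skipped. To keep it runnable without training, the toy data here is a single Gaussian $\mathcal{N}(0, \sigma_0^2)$, for which the optimal noise predictor is linear in $x_t$ (again a stand-in for a learned $\varepsilon_\theta$):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

sigma0 = 1.5  # toy data distribution: x0 ~ N(0, sigma0^2)

def eps_optimal(x_t, t):
    """Closed-form optimal noise predictor for Gaussian data (stands in for eps_theta)."""
    ab = alpha_bars[t]
    return np.sqrt(1 - ab) * x_t / (ab * sigma0**2 + (1 - ab))

# DDIM, eta = 0: deterministic jumps over a strided subset of timesteps
steps = np.linspace(T - 1, 0, 20).round().astype(int)  # 20 steps instead of 1000
x = rng.standard_normal(5000)                          # start from pure noise
for i, t in enumerate(steps):
    eps = eps_optimal(x, t)
    x0_hat = (x - np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])  # predicted x0
    ab_prev = alpha_bars[steps[i + 1]] if i + 1 < len(steps) else 1.0
    x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1 - ab_prev) * eps               # deterministic jump

print(f"sample std = {x.std():.2f}")  # approaches sigma0 = 1.5, up to discretization error
```

With only 20 network evaluations, the sample standard deviation comes out close to the data's $\sigma_0$; the residual gap is the discretization error that more (or better-placed) steps shrink.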


4. Why diffusion beats GANs

| Property | GANs | Diffusion |
| --- | --- | --- |
| Training stability | Adversarial; unstable | MSE loss; stable |
| Sample diversity | Mode collapse risk | Full distribution coverage |
| Sample quality | Very high (StyleGAN) | Higher (DALL-E 2, Stable Diffusion) |
| Inference speed | Fast (one forward pass) | Slow (hundreds of steps) |

5. Summary

  • Diffusion models: learn to reverse a Markov chain that gradually adds Gaussian noise.

  • Closed-form forward sampling enables efficient training (directly compute $x_t$ from $x_0$).

  • Training objective: predict the noise added, with MSE loss.

  • Reverse process is iterative (slow but high quality); DDIM accelerates it.


6. Forward and backward references

Used here: Markov chains (ch258), Gaussian distribution (ch253), ELBO (ch325), MSE loss (ch305), positional encoding for time (ch323).

This will reappear in ch340 — Capstone II, where a simple 1D diffusion model is trained as part of the end-to-end deep learning system.