
ch327 — Diffusion Models Intuition

1. The core idea

Diffusion models (Ho et al., 2020) learn to reverse a gradual noising process.

Forward process: add small Gaussian noise at each of $T$ steps until $x_T \sim \mathcal{N}(0, I)$. Reverse process: learn a neural network that removes the noise step by step.

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

A key property: $x_t$ can be sampled directly from $x_0$ in closed form:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)$$

where $\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)$.
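This identity is easy to check numerically: iterating the one-step forward update $T$ times should land on the same Gaussian that the closed form predicts in a single jump. A standalone sketch (the linear schedule values match the `DDPM1D` example below):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)  # same linear schedule as the code below
alpha_bars = np.cumprod(1.0 - betas)

# Diffuse a point mass at x0 = 2.0 one step at a time
n = 100_000
x = np.full(n, 2.0)
for t in range(T):
    x = np.sqrt(1 - betas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(n)

# The closed form predicts x_T ~ N(sqrt(ab_T) * 2, 1 - ab_T)
print(f"iterated:    mean={x.mean():.3f}  var={x.var():.3f}")
print(f"closed form: mean={np.sqrt(alpha_bars[-1]) * 2:.3f}  var={1 - alpha_bars[-1]:.3f}")
```

The two lines agree up to Monte Carlo error, which is why training never needs to simulate the chain: any $x_t$ is one reparameterized sample away from $x_0$.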

The network is trained to predict the noise $\varepsilon$ added at each step, using a simple MSE loss:

$$\mathcal{L} = \mathbb{E}_{x_0, \varepsilon, t}\left[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2\right]$$

(Gaussian distribution: ch253. ELBO connection: ch325. Markov chains: ch258.)

import numpy as np
import matplotlib.pyplot as plt


class DDPM1D:
    """
    Denoising Diffusion Probabilistic Model for 1D data.
    Demonstrates the forward process and (analytically) the reverse.
    """

    def __init__(self, T: int = 100, beta_start: float = 1e-4, beta_end: float = 0.02):
        self.T = T
        self.betas = np.linspace(beta_start, beta_end, T)
        self.alphas = 1.0 - self.betas
        self.alpha_bars = np.cumprod(self.alphas)

    def q_sample(self, x0: np.ndarray, t: int, rng: np.random.Generator) -> tuple:
        """Sample x_t from x_0 directly (closed form)."""
        sqrt_ab = np.sqrt(self.alpha_bars[t])
        sqrt_1mab = np.sqrt(1 - self.alpha_bars[t])
        eps = rng.standard_normal(x0.shape)
        x_t = sqrt_ab * x0 + sqrt_1mab * eps
        return x_t, eps

    def noise_level_at(self, t: int) -> float:
        return float(np.sqrt(1 - self.alpha_bars[t]))


rng = np.random.default_rng(0)
ddpm = DDPM1D(T=200)

# 1D toy data: bimodal Gaussian
x0 = np.concatenate([rng.normal(-2, 0.3, 200), rng.normal(2, 0.3, 200)])

# Show forward process: data → noise
timesteps_to_show = [0, 20, 50, 100, 150, 199]

fig, axes = plt.subplots(2, 3, figsize=(13, 7))
for ax, t in zip(axes.ravel(), timesteps_to_show):
    x_t, _ = ddpm.q_sample(x0, t, rng)
    ax.hist(x_t, bins=50, density=True, color='#3498db', alpha=0.75, edgecolor='white')
    noise_level = ddpm.noise_level_at(t)
    ax.set_title(f't={t}   noise_level={noise_level:.3f}')
    ax.set_xlim(-5, 5)
    ax.set_ylim(0, 1.2)
    if t == 0:
        ax.set_ylabel('Density')

plt.suptitle('Forward diffusion process: data gradually becomes pure noise', fontsize=11)
plt.tight_layout()
plt.savefig('ch327_diffusion_forward.png', dpi=120)
plt.show()

# Show noise schedule
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
t_axis = np.arange(ddpm.T)
axes[0].plot(t_axis, ddpm.betas, color='#e74c3c', lw=2)
axes[0].set_title('Beta schedule (noise added per step)')
axes[0].set_xlabel('Timestep t'); axes[0].set_ylabel('β_t')

axes[1].plot(t_axis, ddpm.alpha_bars, color='#3498db', lw=2, label='ᾱ_t (signal fraction)')
axes[1].plot(t_axis, 1 - ddpm.alpha_bars, color='#e74c3c', lw=2, label='1-ᾱ_t (noise fraction)')
axes[1].set_title('Signal vs noise fraction across timesteps')
axes[1].set_xlabel('Timestep t'); axes[1].legend()

plt.tight_layout()
plt.savefig('ch327_diffusion_schedule.png', dpi=120)
plt.show()
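The training objective above can also be estimated by Monte Carlo without any model. The standalone sketch below (re-deriving the schedule inline) evaluates the loss for a hypothetical baseline predictor that always outputs zero; since $\varepsilon \sim \mathcal{N}(0, I)$, its expected per-element loss is exactly 1, a useful floor check when debugging a real denoiser:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

# Same bimodal toy data as above
x0 = np.concatenate([rng.normal(-2, 0.3, 200), rng.normal(2, 0.3, 200)])

def training_loss(eps_theta, batch_size):
    """One Monte Carlo estimate of the DDPM objective E[||eps - eps_theta||^2]."""
    t = rng.integers(0, T, batch_size)        # uniform random timestep per sample
    x = rng.choice(x0, batch_size)            # random data points
    eps = rng.standard_normal(batch_size)
    x_t = np.sqrt(alpha_bars[t]) * x + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

# Hypothetical baseline that always predicts zero noise:
# its expected loss is E[eps^2] = 1.
loss_zero = training_loss(lambda x_t, t: np.zeros_like(x_t), batch_size=100_000)
print(f"zero-predictor loss = {loss_zero:.3f}")  # ≈ 1.0
```

A trained $\varepsilon_\theta$ must score below this baseline to be learning anything about the data.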

2. The denoising network

The network $\varepsilon_\theta(x_t, t)$ must:

  1. Accept the noisy sample $x_t$ and timestep $t$ as inputs.

  2. Predict the noise $\varepsilon$ that was added.

For images: U-Net architecture with time embedding injected at each layer. For text (Diffusion-LM): Transformer with time conditioning.

Time embedding: $t$ is embedded as a sinusoidal positional encoding (ch323) and projected into the network via scale-and-shift in each layer.
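A minimal sketch of such a sinusoidal time embedding (the dimension split and base frequency 10000 follow the Transformer convention from ch323; the per-layer projection and scale-and-shift are omitted):

```python
import numpy as np

def time_embedding(t: int, dim: int = 16) -> np.ndarray:
    """Sinusoidal embedding of timestep t, as in Transformer positional encoding."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)  # geometric frequency ladder
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

print(time_embedding(0, 8))   # sin half all zeros, cos half all ones
print(time_embedding(50, 8))  # a distinct vector for each timestep
```

Because nearby timesteps get similar embeddings, a single network can smoothly share denoising behavior across noise levels.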


3. Sampling

To generate new data from a trained diffusion model:

  1. Start from $x_T \sim \mathcal{N}(0, I)$.

  2. Repeatedly apply the learned denoising step:

    $$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta(x_t, t)\right) + \sigma_t \varepsilon$$
  3. After $T$ steps, $x_0$ is a new sample.
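For the 1D bimodal toy data from Section 1, this loop can be run end to end without any training, because the optimal noise predictor $\mathbb{E}[\varepsilon \mid x_t]$ has a closed form for a Gaussian mixture. In the standalone sketch below, that closed form stands in for a learned $\varepsilon_\theta$, and $\sigma_t = \sqrt{\beta_t}$ is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Data distribution: 0.5*N(-2, 0.3^2) + 0.5*N(2, 0.3^2), as in Section 1
mus, sig = np.array([-2.0, 2.0]), 0.3

def eps_optimal(x_t, t):
    """Closed-form optimal noise predictor E[eps | x_t] for the mixture.
    Stands in for a trained eps_theta(x_t, t)."""
    s = np.sqrt(alpha_bars[t])                          # signal scale sqrt(ab_t)
    v = alpha_bars[t] * sig**2 + (1 - alpha_bars[t])    # marginal variance per mode
    logw = -(x_t[:, None] - s * mus) ** 2 / (2 * v)     # log posterior over modes
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # E[x0 | x_t, mode], averaged over the posterior mode weights
    x0_hat = (w * (mus + (s * sig**2 / v) * (x_t[:, None] - s * mus))).sum(axis=1)
    return (x_t - s * x0_hat) / np.sqrt(1 - alpha_bars[t])

# Ancestral sampling: start from pure noise, denoise step by step
x = rng.standard_normal(500)
for t in range(T - 1, -1, -1):
    eps_hat = eps_optimal(x, t)
    x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:                                           # no noise on the final step
        x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)

print(f"mean |x| = {np.abs(x).mean():.2f}")  # both modes at ±2 are recovered
```

Starting from pure noise, the samples split back into the two modes at $\pm 2$: exactly what a trained network has to learn to do from data alone.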

DDIM (Song et al., 2020) enables sampling in 10–50 steps instead of 1000 by making the reverse process deterministic.
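A sketch of DDIM's deterministic update ($\eta = 0$): each step first estimates $x_0$ from the predicted noise, then re-noises directly to a much earlier timestep, so most timesteps can be skipped. To keep it runnable without training, the toy data here is a single Gaussian $\mathcal{N}(0, \sigma_0^2)$, for which the optimal noise predictor is linear in $x_t$ (again a stand-in for a learned $\varepsilon_\theta$):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

sigma0 = 1.5  # toy data distribution: x0 ~ N(0, sigma0^2)

def eps_optimal(x_t, t):
    """Closed-form optimal noise predictor for Gaussian data (stands in for eps_theta)."""
    ab = alpha_bars[t]
    return np.sqrt(1 - ab) * x_t / (ab * sigma0**2 + (1 - ab))

# DDIM, eta = 0: deterministic jumps over a strided subset of timesteps
steps = np.linspace(T - 1, 0, 20).round().astype(int)  # 20 steps instead of 1000
x = rng.standard_normal(5000)                          # start from pure noise
for i, t in enumerate(steps):
    eps = eps_optimal(x, t)
    x0_hat = (x - np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])  # predicted x0
    ab_prev = alpha_bars[steps[i + 1]] if i + 1 < len(steps) else 1.0
    x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1 - ab_prev) * eps               # deterministic jump

print(f"sample std = {x.std():.2f}")  # approaches sigma0 = 1.5, up to discretization error
```

With only 20 network evaluations, the sample standard deviation comes out close to the data's $\sigma_0$; the residual gap is the discretization error that more (or better-placed) steps shrink.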


4. Why diffusion beats GANs

| Property | GANs | Diffusion |
| --- | --- | --- |
| Training stability | Adversarial; unstable | MSE loss; stable |
| Sample diversity | Mode collapse risk | Full distribution coverage |
| Sample quality | Very high (StyleGAN) | Higher (DALL-E 2, Stable Diffusion) |
| Inference speed | Fast (one forward pass) | Slow (hundreds of steps) |

5. Summary

  • Diffusion models: learn to reverse a Markov chain that gradually adds Gaussian noise.

  • Closed-form forward sampling enables efficient training (directly compute $x_t$ from $x_0$).

  • Training objective: predict the noise added, with MSE loss.

  • Reverse process is iterative (slow but high quality); DDIM accelerates it.


6. Forward and backward references

Used here: Markov chains (ch258), Gaussian distribution (ch253), ELBO (ch325), MSE loss (ch305), positional encoding for time (ch323).

This will reappear in ch340 — Capstone II, where a simple 1D diffusion model is trained as part of the end-to-end deep learning system.