
ch325 — Generative Models: Variational Autoencoder

1. What a generative model does

A generative model learns $p(x)$ — the data distribution — well enough to sample new data and to perform density estimation. Unlike discriminative models, which learn $p(y|x)$, generative models capture the entire data-generating process.

Autoencoders (non-generative baseline):

  • Encoder: $z = f_\phi(x)$ compresses $x$ to a latent code $z$.

  • Decoder: $\hat{x} = g_\theta(z)$ reconstructs $x$ from $z$.

  • Trained with reconstruction loss only.

Limitation: the latent space has no imposed structure, so there is no principled way to sample new data from it.
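To make this limitation concrete, here is a minimal sketch (illustrative only — the toy data, the linear architecture, and all parameter choices are assumptions, not the chapter's reference code): a linear autoencoder with a 1D bottleneck trained purely on reconstruction. Nothing ties the learned codes to any fixed prior, so the codes of real data can land far from where a draw from $\mathcal{N}(0, 1)$ would fall.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a 2D blob deliberately placed far from the origin
X = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(500, 2))

# Linear autoencoder with a 1D bottleneck: z = W_e x + b_e, x_hat = W_d z + b_d
W_e = rng.normal(0, 0.1, (1, 2)); b_e = np.zeros(1)
W_d = rng.normal(0, 0.1, (2, 1)); b_d = np.zeros(2)

lr = 0.01
for _ in range(2000):
    Z = X @ W_e.T + b_e            # encode
    X_hat = Z @ W_d.T + b_d        # decode
    err = X_hat - X                # gradient of MSE w.r.t. X_hat (up to a constant)
    # Backprop through the two linear layers (reconstruction loss only)
    gW_d = err.T @ Z / len(X); gb_d = err.mean(0)
    gZ = err @ W_d
    gW_e = gZ.T @ X / len(X); gb_e = gZ.mean(0)
    W_d -= lr * gW_d; b_d -= lr * gb_d
    W_e -= lr * gW_e; b_e -= lr * gb_e

recon_mse = np.mean((X - ((X @ W_e.T + b_e) @ W_d.T + b_d))**2)
Z_data = X @ W_e.T + b_e           # codes of real data
z_prior = rng.standard_normal((1, 1))  # a "prior" draw — nothing aligns it with Z_data

print(f"reconstruction MSE: {recon_mse:.4f}")
print(f"data codes span [{Z_data.min():.2f}, {Z_data.max():.2f}]; prior draw: {z_prior[0, 0]:.2f}")
```

Reconstruction works, but decoding the prior draw gives whatever the affine map sends $z \approx 0$ to — the loss never asked that region of latent space to mean anything.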


2. Variational Autoencoder (Kingma & Welling, 2014)

The VAE treats $z$ as a random variable, not a deterministic code. The encoder outputs the parameters of a distribution over $z$:

$$q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x) I)$$

The decoder defines the likelihood:

$$p_\theta(x|z) = \mathcal{N}(g_\theta(z), I) \quad \text{(or Bernoulli for binary data)}$$

Training maximises the Evidence Lower Bound (ELBO):

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z))$$
  • Reconstruction term: how well does the decoder reconstruct $x$ from sampled $z$?

  • KL regularisation: keep the posterior $q_\phi(z|x)$ close to the prior $p(z) = \mathcal{N}(0, I)$.

(KL divergence: ch295. Gaussian distribution: ch253. Expected value: ch249.)

import numpy as np
import matplotlib.pyplot as plt


def sigmoid(z): return 1/(1+np.exp(-np.clip(z,-500,500)))
def relu(z): return np.maximum(0, z)


class VAE:
    """
    Minimal VAE for 2D toy data.
    Encoder: 2 → 64 → 64 → (mu, log_var)  latent_dim
    Decoder: latent_dim → 64 → 64 → 2
    """

    def __init__(self, input_dim: int = 2, latent_dim: int = 2,
                 hidden_dim: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        s = lambda fi, fo: np.sqrt(2.0 / fi)  # He initialisation scale (fan-in only; fo unused)

        # Encoder weights
        self.We1 = rng.normal(0, s(input_dim, hidden_dim),  (hidden_dim, input_dim))
        self.be1 = np.zeros(hidden_dim)
        self.We2 = rng.normal(0, s(hidden_dim, hidden_dim), (hidden_dim, hidden_dim))
        self.be2 = np.zeros(hidden_dim)
        self.W_mu     = rng.normal(0, 0.01, (latent_dim, hidden_dim))
        self.b_mu     = np.zeros(latent_dim)
        self.W_logvar = rng.normal(0, 0.01, (latent_dim, hidden_dim))
        self.b_logvar = np.zeros(latent_dim)

        # Decoder weights
        self.Wd1 = rng.normal(0, s(latent_dim, hidden_dim),  (hidden_dim, latent_dim))
        self.bd1 = np.zeros(hidden_dim)
        self.Wd2 = rng.normal(0, s(hidden_dim, hidden_dim), (hidden_dim, hidden_dim))
        self.bd2 = np.zeros(hidden_dim)
        self.Wd3 = rng.normal(0, s(hidden_dim, input_dim),  (input_dim,  hidden_dim))
        self.bd3 = np.zeros(input_dim)

        self.latent_dim = latent_dim

    def encode(self, x: np.ndarray) -> tuple:
        """x: (batch, input_dim). Returns (mu, log_var)."""
        h = relu(x @ self.We1.T + self.be1)
        h = relu(h @ self.We2.T + self.be2)
        mu      = h @ self.W_mu.T + self.b_mu
        log_var = h @ self.W_logvar.T + self.b_logvar
        return mu, log_var

    def reparameterise(self, mu: np.ndarray, log_var: np.ndarray,
                       rng: np.random.Generator) -> np.ndarray:
        """Sample z = mu + eps * std using the reparameterisation trick."""
        std = np.exp(0.5 * log_var)
        eps = rng.standard_normal(mu.shape)
        return mu + eps * std

    def decode(self, z: np.ndarray) -> np.ndarray:
        """z: (batch, latent_dim). Returns reconstructed x."""
        h = relu(z @ self.Wd1.T + self.bd1)
        h = relu(h @ self.Wd2.T + self.bd2)
        return h @ self.Wd3.T + self.bd3

    def elbo_loss(self, x: np.ndarray, x_hat: np.ndarray,
                  mu: np.ndarray, log_var: np.ndarray) -> tuple:
        """ELBO = -reconstruction - KL. Returns (total_loss, recon, kl)."""
        # Reconstruction: MSE (Gaussian likelihood)
        recon = np.mean((x - x_hat)**2)
        # KL divergence: closed form for a diagonal Gaussian vs N(0, I),
        # KL = -0.5 * sum(1 + log_var - mu^2 - exp(log_var)); here averaged
        # over batch and latent dimensions to match the scale of the MSE term
        kl = -0.5 * np.mean(1 + log_var - mu**2 - np.exp(log_var))
        return recon + kl, recon, kl


# ── Train VAE on a 2D mixture of Gaussians ──
rng = np.random.default_rng(42)
n_train = 800

# Data: 4 Gaussian clusters arranged in a ring
centres = [(np.cos(a), np.sin(a)) for a in np.linspace(0, 2*np.pi, 5)[:-1]]
X_train = np.vstack([rng.multivariate_normal(c, 0.04*np.eye(2), n_train//4) for c in centres])
rng.shuffle(X_train)

vae = VAE(input_dim=2, latent_dim=2, hidden_dim=32)
lr = 3e-3

# Simple SGD training (full batch for clarity)
losses = []
for epoch in range(300):
    mu, log_var = vae.encode(X_train)
    z = vae.reparameterise(mu, log_var, rng)
    x_hat = vae.decode(z)
    loss, recon, kl = vae.elbo_loss(X_train, x_hat, mu, log_var)
    losses.append(float(loss))

    # Numerical gradient (simplified — in practice use autograd; note that the
    # noise eps is resampled at each evaluation, so these finite differences are noisy)
    eps_grad = 1e-4
    for W_name in ['We1','We2','W_mu','W_logvar','Wd1','Wd2','Wd3']:
        W = getattr(vae, W_name)
        for idx in np.ndindex(*W.shape):
            if rng.random() > 0.05: continue  # sparse sampling for speed
            W[idx] += eps_grad
            mu2, lv2 = vae.encode(X_train)
            z2 = vae.reparameterise(mu2, lv2, rng)
            x2 = vae.decode(z2)
            l2, _, _ = vae.elbo_loss(X_train, x2, mu2, lv2)
            W[idx] -= 2*eps_grad
            mu3, lv3 = vae.encode(X_train)
            z3 = vae.reparameterise(mu3, lv3, rng)
            x3 = vae.decode(z3)
            l3, _, _ = vae.elbo_loss(X_train, x3, mu3, lv3)
            W[idx] += eps_grad
            grad = (l2 - l3) / (2*eps_grad)
            W[idx] -= lr * grad

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

axes[0].scatter(X_train[:,0], X_train[:,1], s=8, alpha=0.5, color='#3498db')
axes[0].set_title('Training data (4-cluster ring)'); axes[0].set_aspect('equal')

mu_enc, lv_enc = vae.encode(X_train)
axes[1].scatter(mu_enc[:,0], mu_enc[:,1], s=8, alpha=0.5, color='#e74c3c')
axes[1].set_title('Latent space (encoded means)'); axes[1].set_aspect('equal')

# Generate new samples from prior
z_sample = rng.standard_normal((200, 2))
x_gen = vae.decode(z_sample)
axes[2].scatter(x_gen[:,0], x_gen[:,1], s=8, alpha=0.5, color='#2ecc71')
axes[2].set_title('Generated samples (decoded from prior)')
axes[2].set_aspect('equal')

plt.suptitle('VAE: encode → latent space → decode → generate', fontsize=11)
plt.tight_layout()
plt.savefig('ch325_vae.png', dpi=120)
plt.show()
print(f"Final ELBO loss: {losses[-1]:.4f}")

3. The reparameterisation trick

To backpropagate through the sampling step $z \sim q_\phi(z|x)$, we write:

$$z = \mu_\phi(x) + \varepsilon \cdot \sigma_\phi(x), \quad \varepsilon \sim \mathcal{N}(0, I)$$

The randomness is confined to $\varepsilon$, which does not depend on $\phi$. Gradients flow through $\mu_\phi$ and $\sigma_\phi$ normally.
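The trick can be checked numerically with a small sketch (the choice $f(z) = z^2$ and the values of $\mu$ and $\sigma$ below are arbitrary): for $z \sim \mathcal{N}(\mu, \sigma^2)$ we have $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so the exact gradients are $2\mu$ and $2\sigma$, and the reparameterised Monte Carlo estimates should recover them.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8          # arbitrary illustrative values
n = 200_000

# Reparameterise: z = mu + eps * sigma, with eps independent of (mu, sigma)
eps = rng.standard_normal(n)
z = mu + eps * sigma

# For f(z) = z^2: df/dz = 2z, and dz/dmu = 1, dz/dsigma = eps (chain rule)
grad_mu    = np.mean(2 * z)          # estimates d/dmu    E[z^2] = 2*mu
grad_sigma = np.mean(2 * z * eps)    # estimates d/dsigma E[z^2] = 2*sigma

print(f"d/dmu:    MC {grad_mu:.3f} vs exact {2*mu:.3f}")
print(f"d/dsigma: MC {grad_sigma:.3f} vs exact {2*sigma:.3f}")
```

Without the reparameterisation, $z$ would be a draw from a distribution whose parameters we want to differentiate, and the chain rule above would not apply.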


4. KL term interpretation

The KL term penalises the encoder for producing posteriors that differ from $\mathcal{N}(0, I)$. This forces latent codes to be approximately standard normal — enabling interpolation and sampling.

The $\beta$-VAE uses $\beta \cdot D_{\text{KL}}$ with $\beta > 1$, increasing the pressure toward a disentangled latent space where each dimension encodes an independent factor.
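The closed-form KL used in `elbo_loss` can be sanity-checked against a Monte Carlo estimate of $\mathbb{E}_{q}[\log q(z) - \log p(z)]$; a sketch follows (the specific $\mu$ and $\log \sigma^2$ values are arbitrary). The $\beta$-VAE change is then just a scalar reweighting of this term.

```python
import numpy as np

rng = np.random.default_rng(0)
mu      = np.array([0.5, -1.0])     # arbitrary posterior mean
log_var = np.array([0.2, -0.5])     # arbitrary posterior log-variance
std     = np.exp(0.5 * log_var)

# Closed form: KL( N(mu, diag(var)) || N(0, I) )
kl_closed = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

# Monte Carlo: E_q[ log q(z) - log p(z) ] with z ~ q
n = 500_000
z = mu + rng.standard_normal((n, 2)) * std
log_q = np.sum(-0.5 * (np.log(2 * np.pi) + log_var + (z - mu)**2 / np.exp(log_var)), axis=1)
log_p = np.sum(-0.5 * (np.log(2 * np.pi) + z**2), axis=1)
kl_mc = np.mean(log_q - log_p)

print(f"closed form: {kl_closed:.4f}, Monte Carlo: {kl_mc:.4f}")

# beta-VAE: the loss becomes recon + beta * KL, nothing else changes
beta = 4.0
print(f"KL term with beta={beta}: {beta * kl_closed:.4f}")
```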


5. Summary

  • Autoencoders compress to latent codes; VAEs make latent space a structured probability distribution.

  • ELBO = reconstruction term − KL regularisation; maximising the ELBO maximises a lower bound on the log marginal likelihood.

  • Reparameterisation trick enables gradient flow through sampling.

  • At generation time: sample $z \sim \mathcal{N}(0, I)$ and decode to get new data.


6. Forward and backward references

Used here: KL divergence (ch295), Gaussian distribution (ch253), expectation (ch249), backpropagation (ch306), encoder-decoder architecture (ch319).

This will reappear in ch338 — Project: Variational Autoencoder, where a VAE is trained on MNIST digits and the latent space is visualised and sampled.