1. What a generative model does¶
A generative model learns $p(x)$ — the data distribution — well enough to sample new data and to perform density estimation. Unlike discriminative models (which learn $p(y \mid x)$), generative models model the entire data-generating process.
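A one-dimensional sketch (my own toy example, not the chapter's code) makes both capabilities concrete: fit a Gaussian to data by maximum likelihood, then evaluate densities at arbitrary points and draw new samples from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.5, size=10_000)  # "real" data

# "Training": maximum-likelihood estimates of the Gaussian parameters
mu_hat = data.mean()
sigma_hat = data.std()

# Density estimation: evaluate the learned p(x) at any point
def density(x):
    return np.exp(-0.5 * ((x - mu_hat) / sigma_hat) ** 2) / (sigma_hat * np.sqrt(2 * np.pi))

# Sampling: generate new data from the learned distribution
new_samples = rng.normal(mu_hat, sigma_hat, size=5)
```

A VAE does the same two things, but with a neural network in place of the closed-form Gaussian fit.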
Autoencoders (non-generative baseline):
Encoder: compresses $x$ to a latent code $z$.
Decoder: reconstructs $\hat{x}$ from $z$.
Trained with reconstruction loss only.
Limitation: the latent space has no structure — you cannot sample from it.
2. Variational Autoencoder (Kingma & Welling, 2014)¶
The VAE treats $z$ as a random variable, not a deterministic code. The encoder outputs the parameters of a distribution over $z$:

$$q_\phi(z \mid x) = \mathcal{N}\!\left(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\right)$$

The decoder defines the likelihood $p_\theta(x \mid z)$; with a Gaussian likelihood, the reconstruction term below reduces to a mean-squared error.
Training maximises the Evidence Lower Bound (ELBO):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
Reconstruction term: how well does the decoder reconstruct $x$ from a sampled $z$?
KL regularisation: keep the posterior $q_\phi(z \mid x)$ close to the prior $p(z) = \mathcal{N}(0, I)$.
(KL divergence: ch295. Gaussian distribution: ch253. Expected value: ch249.)
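The closed-form KL term used in the code below can be checked numerically. This sketch (my addition, with arbitrary posterior parameters) compares the analytic KL between $\mathcal{N}(\mu, \sigma^2)$ and $\mathcal{N}(0, 1)$ against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, log_var = 0.7, np.log(0.5)   # arbitrary posterior parameters
std = np.exp(0.5 * log_var)

# Analytic: KL(N(mu, sigma^2) || N(0, 1)) in closed form
kl_closed = -0.5 * (1 + log_var - mu**2 - np.exp(log_var))

# Monte Carlo: average log q(z) - log p(z) over samples z ~ q
z = rng.normal(mu, std, size=1_000_000)
log_q = -0.5 * (np.log(2 * np.pi) + log_var + ((z - mu) / std) ** 2)
log_p = -0.5 * (np.log(2 * np.pi) + z ** 2)
kl_mc = np.mean(log_q - log_p)
```

The two estimates agree to a few decimal places, which is a useful sanity check when implementing the loss by hand.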
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def relu(z):
    return np.maximum(0, z)

class VAE:
    """
    Minimal VAE for 2D toy data.
    Encoder: 2 → 64 → 64 → (mu, log_var) latent_dim
    Decoder: latent_dim → 64 → 64 → 2
    """
    def __init__(self, input_dim: int = 2, latent_dim: int = 2,
                 hidden_dim: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        s = lambda fan_in, fan_out: np.sqrt(2.0 / fan_in)  # He initialisation
        # Encoder weights
        self.We1 = rng.normal(0, s(input_dim, hidden_dim), (hidden_dim, input_dim))
        self.be1 = np.zeros(hidden_dim)
        self.We2 = rng.normal(0, s(hidden_dim, hidden_dim), (hidden_dim, hidden_dim))
        self.be2 = np.zeros(hidden_dim)
        self.W_mu = rng.normal(0, 0.01, (latent_dim, hidden_dim))
        self.b_mu = np.zeros(latent_dim)
        self.W_logvar = rng.normal(0, 0.01, (latent_dim, hidden_dim))
        self.b_logvar = np.zeros(latent_dim)
        # Decoder weights
        self.Wd1 = rng.normal(0, s(latent_dim, hidden_dim), (hidden_dim, latent_dim))
        self.bd1 = np.zeros(hidden_dim)
        self.Wd2 = rng.normal(0, s(hidden_dim, hidden_dim), (hidden_dim, hidden_dim))
        self.bd2 = np.zeros(hidden_dim)
        self.Wd3 = rng.normal(0, s(hidden_dim, input_dim), (input_dim, hidden_dim))
        self.bd3 = np.zeros(input_dim)
        self.latent_dim = latent_dim

    def encode(self, x: np.ndarray) -> tuple:
        """x: (batch, input_dim). Returns (mu, log_var)."""
        h = relu(x @ self.We1.T + self.be1)
        h = relu(h @ self.We2.T + self.be2)
        mu = h @ self.W_mu.T + self.b_mu
        log_var = h @ self.W_logvar.T + self.b_logvar
        return mu, log_var

    def reparameterise(self, mu: np.ndarray, log_var: np.ndarray,
                       rng: np.random.Generator) -> np.ndarray:
        """Sample z = mu + eps * std using the reparameterisation trick."""
        std = np.exp(0.5 * log_var)
        eps = rng.standard_normal(mu.shape)
        return mu + eps * std

    def decode(self, z: np.ndarray) -> np.ndarray:
        """z: (batch, latent_dim). Returns reconstructed x."""
        h = relu(z @ self.Wd1.T + self.bd1)
        h = relu(h @ self.Wd2.T + self.bd2)
        return h @ self.Wd3.T + self.bd3

    def elbo_loss(self, x: np.ndarray, x_hat: np.ndarray,
                  mu: np.ndarray, log_var: np.ndarray) -> tuple:
        """Negative ELBO = reconstruction + KL. Returns (total_loss, recon, kl)."""
        # Reconstruction: MSE (Gaussian likelihood with fixed variance)
        recon = np.mean((x - x_hat)**2)
        # KL divergence, closed form for a diagonal Gaussian vs N(0, I):
        # KL = -0.5 * sum(1 + log_var - mu^2 - exp(log_var)),
        # summed over latent dimensions, averaged over the batch
        kl = -0.5 * np.mean(np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))
        return recon + kl, recon, kl

# ── Train VAE on a 2D mixture of Gaussians ──
rng = np.random.default_rng(42)
n_train = 800
# Data: 4 Gaussian clusters arranged in a ring
centres = [(np.cos(a), np.sin(a)) for a in np.linspace(0, 2*np.pi, 5)[:-1]]
X_train = np.vstack([rng.multivariate_normal(c, 0.04*np.eye(2), n_train//4) for c in centres])
rng.shuffle(X_train)

vae = VAE(input_dim=2, latent_dim=2, hidden_dim=32)
lr = 3e-3

# Simple SGD training (full batch for clarity)
losses = []
for epoch in range(300):
    mu, log_var = vae.encode(X_train)
    z = vae.reparameterise(mu, log_var, rng)
    x_hat = vae.decode(z)
    loss, recon, kl = vae.elbo_loss(X_train, x_hat, mu, log_var)
    losses.append(float(loss))
    # Numerical gradient (simplified — in practice use autograd).
    # Freeze the noise eps for this epoch so both finite-difference
    # evaluations see the same randomness; otherwise sampling noise
    # swamps the gradient signal.
    eps_grad = 1e-4
    eps_fixed = rng.standard_normal(mu.shape)
    for W_name in ['We1', 'We2', 'W_mu', 'W_logvar', 'Wd1', 'Wd2', 'Wd3']:
        W = getattr(vae, W_name)
        for idx in np.ndindex(*W.shape):
            if rng.random() > 0.05:
                continue  # sparse sampling for speed
            W[idx] += eps_grad
            mu2, lv2 = vae.encode(X_train)
            x2 = vae.decode(mu2 + eps_fixed * np.exp(0.5 * lv2))
            l2, _, _ = vae.elbo_loss(X_train, x2, mu2, lv2)
            W[idx] -= 2*eps_grad
            mu3, lv3 = vae.encode(X_train)
            x3 = vae.decode(mu3 + eps_fixed * np.exp(0.5 * lv3))
            l3, _, _ = vae.elbo_loss(X_train, x3, mu3, lv3)
            W[idx] += eps_grad  # restore the original weight
            grad = (l2 - l3) / (2*eps_grad)
            W[idx] -= lr * grad

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
axes[0].scatter(X_train[:,0], X_train[:,1], s=8, alpha=0.5, color='#3498db')
axes[0].set_title('Training data (4-cluster ring)'); axes[0].set_aspect('equal')
mu_enc, lv_enc = vae.encode(X_train)
axes[1].scatter(mu_enc[:,0], mu_enc[:,1], s=8, alpha=0.5, color='#e74c3c')
axes[1].set_title('Latent space (encoded means)'); axes[1].set_aspect('equal')
# Generate new samples from the prior
z_sample = rng.standard_normal((200, 2))
x_gen = vae.decode(z_sample)
axes[2].scatter(x_gen[:,0], x_gen[:,1], s=8, alpha=0.5, color='#2ecc71')
axes[2].set_title('Generated samples (decoded from prior)')
axes[2].set_aspect('equal')
plt.suptitle('VAE: encode → latent space → decode → generate', fontsize=11)
plt.tight_layout()
plt.savefig('ch325_vae.png', dpi=120)
plt.show()
print(f"Final loss (negative ELBO): {losses[-1]:.4f}")

3. The reparameterisation trick¶
To backpropagate through sampling $z \sim q_\phi(z \mid x)$, we write:

$$z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

The randomness is isolated in $\varepsilon$, which does not depend on the encoder parameters $\phi$. Gradients flow through $\mu$ and $\sigma$ normally.
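The trick can be verified numerically. In this sketch (my own toy example, not the chapter's network), we estimate $\nabla_\mu\, \mathbb{E}_\varepsilon[f(\mu + \sigma\varepsilon)]$ and $\nabla_\sigma$ for $f(z) = z^2$, where the exact answers are $2\mu$ and $2\sigma$ since $\mathbb{E}[z^2] = \mu^2 + \sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
eps = rng.standard_normal(1_000_000)  # randomness independent of mu, sigma

# Reparameterise: z = mu + sigma * eps
z = mu + sigma * eps

# f(z) = z^2, so df/dz = 2z; chain rule through the reparameterisation
grad_mu = np.mean(2 * z * 1.0)     # dz/dmu = 1
grad_sigma = np.mean(2 * z * eps)  # dz/dsigma = eps

# Exact gradients: d/dmu E[z^2] = 2*mu, d/dsigma E[z^2] = 2*sigma
```

Both Monte Carlo estimates match the exact gradients to two decimal places, which is exactly what backpropagation through `reparameterise` exploits.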
4. KL term interpretation¶
The KL term penalises the encoder for producing posteriors $q_\phi(z \mid x)$ that differ from the prior $p(z) = \mathcal{N}(0, I)$. This forces latent codes to be approximately standard normal — enabling interpolation and sampling.
The $\beta$-VAE uses $\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)] - \beta\, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$ with $\beta > 1$, increasing the pressure toward a disentangled latent space where each dimension encodes an independent factor.
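As a sketch of how this changes the training objective (a hypothetical helper mirroring the style of `elbo_loss` above, not part of the chapter's class), the only difference is a scalar weight on the KL term:

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    """Negative beta-ELBO: reconstruction + beta * KL."""
    recon = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.mean(np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))
    return recon + beta * kl

# With beta > 1 the same posterior pays a larger KL penalty:
x = np.zeros((4, 2)); x_hat = np.zeros((4, 2))       # perfect reconstruction
mu = np.ones((4, 2)); log_var = np.zeros((4, 2))     # posterior shifted off the prior
loss_1 = beta_vae_loss(x, x_hat, mu, log_var, beta=1.0)
loss_4 = beta_vae_loss(x, x_hat, mu, log_var, beta=4.0)
```

With reconstruction held fixed, raising $\beta$ scales the penalty for deviating from the prior, which is what pushes the encoder toward factorised, prior-like posteriors.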
5. Summary¶
Autoencoders compress to latent codes; VAEs make latent space a structured probability distribution.
ELBO = reconstruction term + KL regularisation; maximising the ELBO maximises a lower bound on the log marginal likelihood $\log p_\theta(x)$.
Reparameterisation trick enables gradient flow through sampling.
At generation time: sample $z \sim \mathcal{N}(0, I)$ from the prior, then decode to get new data.
6. Forward and backward references¶
Used here: KL divergence (ch295), Gaussian distribution (ch253), expectation (ch249), backpropagation (ch306), encoder-decoder architecture (ch319).
This will reappear in ch338 — Project: Variational Autoencoder, where a VAE is trained on MNIST digits and the latent space is visualised and sampled.