1. What a generative model does¶
A generative model learns $p(x)$ — the data distribution — well enough to sample new data and to perform density estimation. Unlike discriminative models (which learn $p(y \mid x)$), generative models model the entire data-generating process.
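A one-dimensional sketch (my own toy example, not the chapter's code) makes both capabilities concrete: fit a Gaussian to data by maximum likelihood, then evaluate densities at arbitrary points and draw new samples from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.5, size=10_000)  # "real" data

# "Training": maximum-likelihood estimates of the Gaussian parameters
mu_hat = data.mean()
sigma_hat = data.std()

# Density estimation: evaluate the learned p(x) at any point
def density(x):
    return np.exp(-0.5 * ((x - mu_hat) / sigma_hat) ** 2) / (sigma_hat * np.sqrt(2 * np.pi))

# Sampling: generate new data from the learned distribution
new_samples = rng.normal(mu_hat, sigma_hat, size=5)
```

A VAE does the same two things, but with a neural network in place of the closed-form Gaussian fit.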
Autoencoders (non-generative baseline):
Encoder: compresses $x$ to a latent code $z$.
Decoder: reconstructs $\hat{x}$ from $z$.
Trained with reconstruction loss only.
Limitation: the latent space has no structure — you cannot sample from it.
2. Variational Autoencoder (Kingma & Welling, 2014)¶
The VAE treats $z$ as a random variable, not a deterministic code. The encoder outputs the parameters of a distribution over $z$:

$$q_\phi(z \mid x) = \mathcal{N}\!\left(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\right)$$

The decoder defines the likelihood $p_\theta(x \mid z)$; with a Gaussian likelihood, the reconstruction term below reduces to a mean-squared error.
Training maximises the Evidence Lower Bound (ELBO):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
Reconstruction term: how well does the decoder reconstruct $x$ from a sampled $z$?
KL regularisation: keep the posterior $q_\phi(z \mid x)$ close to the prior $p(z) = \mathcal{N}(0, I)$.
(KL divergence: ch295. Gaussian distribution: ch253. Expected value: ch249.)
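The closed-form KL term used in the code below can be checked numerically. This sketch (my addition, with arbitrary posterior parameters) compares the analytic KL between $\mathcal{N}(\mu, \sigma^2)$ and $\mathcal{N}(0, 1)$ against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, log_var = 0.7, np.log(0.5)   # arbitrary posterior parameters
std = np.exp(0.5 * log_var)

# Analytic: KL(N(mu, sigma^2) || N(0, 1)) in closed form
kl_closed = -0.5 * (1 + log_var - mu**2 - np.exp(log_var))

# Monte Carlo: average log q(z) - log p(z) over samples z ~ q
z = rng.normal(mu, std, size=1_000_000)
log_q = -0.5 * (np.log(2 * np.pi) + log_var + ((z - mu) / std) ** 2)
log_p = -0.5 * (np.log(2 * np.pi) + z ** 2)
kl_mc = np.mean(log_q - log_p)
```

The two estimates agree to a few decimal places, which is a useful sanity check when implementing the loss by hand.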
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def relu(z):
    return np.maximum(0, z)

class VAE:
    """
    Minimal VAE for 2D toy data.
    Encoder: 2 → 64 → 64 → (mu, log_var) latent_dim
    Decoder: latent_dim → 64 → 64 → 2
    """
    def __init__(self, input_dim: int = 2, latent_dim: int = 2,
                 hidden_dim: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        s = lambda fan_in, fan_out: np.sqrt(2.0 / fan_in)  # He initialisation
        # Encoder weights
        self.We1 = rng.normal(0, s(input_dim, hidden_dim), (hidden_dim, input_dim))
        self.be1 = np.zeros(hidden_dim)
        self.We2 = rng.normal(0, s(hidden_dim, hidden_dim), (hidden_dim, hidden_dim))
        self.be2 = np.zeros(hidden_dim)
        self.W_mu = rng.normal(0, 0.01, (latent_dim, hidden_dim))
        self.b_mu = np.zeros(latent_dim)
        self.W_logvar = rng.normal(0, 0.01, (latent_dim, hidden_dim))
        self.b_logvar = np.zeros(latent_dim)
        # Decoder weights
        self.Wd1 = rng.normal(0, s(latent_dim, hidden_dim), (hidden_dim, latent_dim))
        self.bd1 = np.zeros(hidden_dim)
        self.Wd2 = rng.normal(0, s(hidden_dim, hidden_dim), (hidden_dim, hidden_dim))
        self.bd2 = np.zeros(hidden_dim)
        self.Wd3 = rng.normal(0, s(hidden_dim, input_dim), (input_dim, hidden_dim))
        self.bd3 = np.zeros(input_dim)
        self.latent_dim = latent_dim

    def encode(self, x: np.ndarray) -> tuple:
        """x: (batch, input_dim). Returns (mu, log_var)."""
        h = relu(x @ self.We1.T + self.be1)
        h = relu(h @ self.We2.T + self.be2)
        mu = h @ self.W_mu.T + self.b_mu
        log_var = h @ self.W_logvar.T + self.b_logvar
        return mu, log_var

    def reparameterise(self, mu: np.ndarray, log_var: np.ndarray,
                       rng: np.random.Generator) -> np.ndarray:
        """Sample z = mu + eps * std using the reparameterisation trick."""
        std = np.exp(0.5 * log_var)
        eps = rng.standard_normal(mu.shape)
        return mu + eps * std

    def decode(self, z: np.ndarray) -> np.ndarray:
        """z: (batch, latent_dim). Returns reconstructed x."""
        h = relu(z @ self.Wd1.T + self.bd1)
        h = relu(h @ self.Wd2.T + self.bd2)
        return h @ self.Wd3.T + self.bd3

    def elbo_loss(self, x: np.ndarray, x_hat: np.ndarray,
                  mu: np.ndarray, log_var: np.ndarray) -> tuple:
        """Negative ELBO = reconstruction + KL. Returns (total_loss, recon, kl)."""
        # Reconstruction: MSE (Gaussian likelihood with fixed variance)
        recon = np.mean((x - x_hat)**2)
        # KL divergence, closed form for a diagonal Gaussian vs N(0, I):
        # KL = -0.5 * sum(1 + log_var - mu^2 - exp(log_var)),
        # summed over latent dimensions, averaged over the batch
        kl = -0.5 * np.mean(np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))
        return recon + kl, recon, kl

# ── Train VAE on a 2D mixture of Gaussians ──
rng = np.random.default_rng(42)
n_train = 800
# Data: 4 Gaussian clusters arranged in a ring
centres = [(np.cos(a), np.sin(a)) for a in np.linspace(0, 2*np.pi, 5)[:-1]]
X_train = np.vstack([rng.multivariate_normal(c, 0.04*np.eye(2), n_train//4) for c in centres])
rng.shuffle(X_train)

vae = VAE(input_dim=2, latent_dim=2, hidden_dim=32)
lr = 3e-3

# Simple SGD training (full batch for clarity)
losses = []
for epoch in range(300):
    mu, log_var = vae.encode(X_train)
    z = vae.reparameterise(mu, log_var, rng)
    x_hat = vae.decode(z)
    loss, recon, kl = vae.elbo_loss(X_train, x_hat, mu, log_var)
    losses.append(float(loss))
    # Numerical gradient (simplified — in practice use autograd).
    # Freeze the noise eps for this epoch so both finite-difference
    # evaluations see the same randomness; otherwise sampling noise
    # swamps the gradient signal.
    eps_grad = 1e-4
    eps_fixed = rng.standard_normal(mu.shape)
    for W_name in ['We1', 'We2', 'W_mu', 'W_logvar', 'Wd1', 'Wd2', 'Wd3']:
        W = getattr(vae, W_name)
        for idx in np.ndindex(*W.shape):
            if rng.random() > 0.05:
                continue  # sparse sampling for speed
            W[idx] += eps_grad
            mu2, lv2 = vae.encode(X_train)
            x2 = vae.decode(mu2 + eps_fixed * np.exp(0.5 * lv2))
            l2, _, _ = vae.elbo_loss(X_train, x2, mu2, lv2)
            W[idx] -= 2*eps_grad
            mu3, lv3 = vae.encode(X_train)
            x3 = vae.decode(mu3 + eps_fixed * np.exp(0.5 * lv3))
            l3, _, _ = vae.elbo_loss(X_train, x3, mu3, lv3)
            W[idx] += eps_grad  # restore the original weight
            grad = (l2 - l3) / (2*eps_grad)
            W[idx] -= lr * grad

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
axes[0].scatter(X_train[:,0], X_train[:,1], s=8, alpha=0.5, color='#3498db')
axes[0].set_title('Training data (4-cluster ring)'); axes[0].set_aspect('equal')
mu_enc, lv_enc = vae.encode(X_train)
axes[1].scatter(mu_enc[:,0], mu_enc[:,1], s=8, alpha=0.5, color='#e74c3c')
axes[1].set_title('Latent space (encoded means)'); axes[1].set_aspect('equal')
# Generate new samples from the prior
z_sample = rng.standard_normal((200, 2))
x_gen = vae.decode(z_sample)
axes[2].scatter(x_gen[:,0], x_gen[:,1], s=8, alpha=0.5, color='#2ecc71')
axes[2].set_title('Generated samples (decoded from prior)')
axes[2].set_aspect('equal')
plt.suptitle('VAE: encode → latent space → decode → generate', fontsize=11)
plt.tight_layout()
plt.savefig('ch325_vae.png', dpi=120)
plt.show()
print(f"Final loss (negative ELBO): {losses[-1]:.4f}")

3. The reparameterisation trick¶
To backpropagate through sampling $z \sim q_\phi(z \mid x)$, we write:

$$z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

The randomness is isolated in $\varepsilon$, which does not depend on the encoder parameters $\phi$. Gradients flow through $\mu$ and $\sigma$ normally.
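The trick can be verified numerically. In this sketch (my own toy example, not the chapter's network), we estimate $\nabla_\mu\, \mathbb{E}_\varepsilon[f(\mu + \sigma\varepsilon)]$ and $\nabla_\sigma$ for $f(z) = z^2$, where the exact answers are $2\mu$ and $2\sigma$ since $\mathbb{E}[z^2] = \mu^2 + \sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
eps = rng.standard_normal(1_000_000)  # randomness independent of mu, sigma

# Reparameterise: z = mu + sigma * eps
z = mu + sigma * eps

# f(z) = z^2, so df/dz = 2z; chain rule through the reparameterisation
grad_mu = np.mean(2 * z * 1.0)     # dz/dmu = 1
grad_sigma = np.mean(2 * z * eps)  # dz/dsigma = eps

# Exact gradients: d/dmu E[z^2] = 2*mu, d/dsigma E[z^2] = 2*sigma
```

Both Monte Carlo estimates match the exact gradients to two decimal places, which is exactly what backpropagation through `reparameterise` exploits.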
4. KL term interpretation¶
The KL term penalises the encoder for producing posteriors $q_\phi(z \mid x)$ that differ from the prior $p(z) = \mathcal{N}(0, I)$. This forces latent codes to be approximately standard normal — enabling interpolation and sampling.
The $\beta$-VAE uses $\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)] - \beta\, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$ with $\beta > 1$, increasing the pressure toward a disentangled latent space where each dimension encodes an independent factor.
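As a sketch of how this changes the training objective (a hypothetical helper mirroring the style of `elbo_loss` above, not part of the chapter's class), the only difference is a scalar weight on the KL term:

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    """Negative beta-ELBO: reconstruction + beta * KL."""
    recon = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.mean(np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))
    return recon + beta * kl

# With beta > 1 the same posterior pays a larger KL penalty:
x = np.zeros((4, 2)); x_hat = np.zeros((4, 2))       # perfect reconstruction
mu = np.ones((4, 2)); log_var = np.zeros((4, 2))     # posterior shifted off the prior
loss_1 = beta_vae_loss(x, x_hat, mu, log_var, beta=1.0)
loss_4 = beta_vae_loss(x, x_hat, mu, log_var, beta=4.0)
```

With reconstruction held fixed, raising $\beta$ scales the penalty for deviating from the prior, which is what pushes the encoder toward factorised, prior-like posteriors.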
5. Summary¶
Autoencoders compress to latent codes; VAEs make latent space a structured probability distribution.
ELBO = reconstruction term + KL regularisation; maximising the ELBO maximises a lower bound on the log marginal likelihood $\log p_\theta(x)$.
Reparameterisation trick enables gradient flow through sampling.
At generation time: sample $z \sim \mathcal{N}(0, I)$ from the prior, then decode to get new data.
6. Forward and backward references¶
Used here: KL divergence (ch295), Gaussian distribution (ch253), expectation (ch249), backpropagation (ch306), encoder-decoder architecture (ch319).
This will reappear in ch338 — Project: Variational Autoencoder, where a VAE is trained on MNIST digits and the latent space is visualised and sampled.