Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

ch189 — Project: Latent Factor Model

Prerequisites: ch173 (SVD), ch183 (Recommender System Basics), ch188 (Movie Recommendation System), ch176 (Matrix Calculus Introduction), ch182 (Linear Regression via Matrix Algebra) Part: VI — Linear Algebra Difficulty: Advanced Estimated time: 90–120 minutes

0. Overview

Problem Statement

In ch188, ALS solved the matrix factorization problem by alternating between closed-form least-squares updates. This is fast and exact per step, but requires solving a linear system for every user and item at each iteration.

This project studies the same problem from a different angle: gradient descent on the factorization objective (ch176 — Matrix Calculus). This gives:

  • A direct connection to how neural networks are trained

  • Full control over the regularization and learning dynamics

  • The ability to add side information, temporal factors, or nonlinear components

The objective being minimized:

L(U,V)=(i,j)Ω(rijuivj)2+λiui2+λjvj2\mathcal{L}(U, V) = \sum_{(i,j) \in \Omega} \left(r_{ij} - \mathbf{u}_i^\top \mathbf{v}_j\right)^2 + \lambda \sum_i \|\mathbf{u}_i\|^2 + \lambda \sum_j \|\mathbf{v}_j\|^2

where Ω\Omega is the set of observed (user, item) pairs.

What This Project Adds Over ch188

  • Derives the gradient of L\mathcal{L} with respect to UU and VV explicitly

  • Implements stochastic gradient descent (SGD) with momentum

  • Implements mini-batch gradient descent

  • Compares SGD vs ALS on convergence speed and final quality

  • Adds user and item bias terms with separate learning rates

  • Visualizes the gradient flow and loss landscape

Concepts Used

  • Matrix calculus: gradient of quadratic objective (ch176)

  • Stochastic gradient descent (anticipates ch212 — Gradient Descent)

  • L2 regularization and its gradient

  • Low-rank matrix factorization (ch173, ch188)

  • Convergence monitoring and early stopping

1. Setup

# --- Setup ---
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')

rng = np.random.default_rng(42)

N_USERS  = 100
N_MOVIES = 60
N_LATENT_TRUE = 4   # ground truth rank
SPARSITY = 0.20     # 20% of ratings observed

print(f"Rating matrix: {N_USERS} x {N_MOVIES}")
print(f"True rank: {N_LATENT_TRUE}")
print(f"Sparsity: {SPARSITY:.0%} observed")
Rating matrix: 100 x 60
True rank: 4
Sparsity: 20% observed
# --- Generate low-rank rating matrix ---
U_true = rng.normal(0, 1, (N_USERS, N_LATENT_TRUE))
V_true = rng.normal(0, 1, (N_MOVIES, N_LATENT_TRUE))
R_true = U_true @ V_true.T

# Scale to [1, 5]
R_true = 1 + 4 * (R_true - R_true.min()) / (R_true.max() - R_true.min())
R_noisy = np.clip(R_true + rng.normal(0, 0.25, R_true.shape), 1, 5)

# Sparse observation mask
mask = (rng.uniform(0, 1, (N_USERS, N_MOVIES)) < SPARSITY).astype(float)
# Ensure minimum coverage
for i in range(N_USERS):
    if mask[i].sum() < 2:
        mask[i, rng.choice(N_MOVIES, 2, replace=False)] = 1
for j in range(N_MOVIES):
    if mask[:, j].sum() < 2:
        mask[rng.choice(N_USERS, 2, replace=False), j] = 1

# Train/test split (hold out 20% of observed)
obs = list(zip(*np.where(mask == 1)))
rng.shuffle(obs)
n_test = int(0.2 * len(obs))
test_pairs  = obs[:n_test]
train_pairs = obs[n_test:]

mask_train = np.zeros_like(mask)
mask_test  = np.zeros_like(mask)
for i, j in train_pairs:
    mask_train[i, j] = 1
for i, j in test_pairs:
    mask_test[i, j] = 1

R_train = R_noisy * mask_train

print(f"Train ratings: {int(mask_train.sum())}")
print(f"Test  ratings: {int(mask_test.sum())}")

def rmse(R_pred, R_true, mask):
    errs = (R_pred[mask == 1] - R_true[mask == 1])**2
    return np.sqrt(errs.mean())
Train ratings: 952
Test  ratings: 238

2. Stage 1 — Derive the Gradient

Before implementing gradient descent, we derive the gradient analytically (ch176).

The loss for a single observed pair (i,j)(i, j) is:

ij=(rijuivj)2\ell_{ij} = (r_{ij} - \mathbf{u}_i^\top \mathbf{v}_j)^2

Let eij=rijuivje_{ij} = r_{ij} - \mathbf{u}_i^\top \mathbf{v}_j be the prediction error. Then:

ijui=2eijvj\frac{\partial \ell_{ij}}{\partial \mathbf{u}_i} = -2 e_{ij} \mathbf{v}_j
ijvj=2eijui\frac{\partial \ell_{ij}}{\partial \mathbf{v}_j} = -2 e_{ij} \mathbf{u}_i

Adding regularization λui2\lambda \|\mathbf{u}_i\|^2:

Lui=2j:(i,j)Ωeijvj+2λui\frac{\partial \mathcal{L}}{\partial \mathbf{u}_i} = -2 \sum_{j: (i,j) \in \Omega} e_{ij} \mathbf{v}_j + 2\lambda \mathbf{u}_i

SGD approximates this sum with a single sampled pair.

# --- Stage 1: Gradient Verification ---
# Before training, verify the gradient formula numerically.
# Numerical gradient: (f(x + eps) - f(x - eps)) / (2*eps)

def mf_loss(U, V, R, mask, lam):
    """
    Matrix factorization loss: MSE on observed entries + L2 regularization.
    L = sum_{(i,j) in mask} (R[i,j] - U[i] @ V[j])^2 + lam*(||U||_F^2 + ||V||_F^2)
    """
    R_pred = U @ V.T
    residuals = (R - R_pred) * mask
    mse_term = np.sum(residuals**2)
    reg_term = lam * (np.sum(U**2) + np.sum(V**2))
    return mse_term + reg_term

def mf_gradient(U, V, R, mask, lam):
    """
    Analytical gradient of mf_loss w.r.t. U and V.
    
    dL/dU = -2 * ((R - U@V.T) * mask) @ V + 2*lam*U
    dL/dV = -2 * ((R - U@V.T) * mask).T @ U + 2*lam*V
    """
    E = (R - U @ V.T) * mask   # (n, m) residual matrix, zero at unobserved
    dU = -2 * E @ V + 2 * lam * U
    dV = -2 * E.T @ U + 2 * lam * V
    return dU, dV

# Gradient check
k_check = 3
U_check = rng.normal(0, 0.5, (5, k_check))
V_check = rng.normal(0, 0.5, (5, k_check))
R_check = R_noisy[:5, :5]
m_check = mask_train[:5, :5]
lam_check = 0.1
eps = 1e-5

dU_anal, _ = mf_gradient(U_check, V_check, R_check, m_check, lam_check)

# Numerical gradient for U[0, 0]
U_plus  = U_check.copy(); U_plus[0, 0]  += eps
U_minus = U_check.copy(); U_minus[0, 0] -= eps
dU_num = (mf_loss(U_plus, V_check, R_check, m_check, lam_check) -
          mf_loss(U_minus, V_check, R_check, m_check, lam_check)) / (2*eps)

print(f"Gradient check for U[0,0]:")
print(f"  Analytical: {dU_anal[0, 0]:.8f}")
print(f"  Numerical:  {dU_num:.8f}")
print(f"  Relative error: {abs(dU_anal[0,0] - dU_num) / (abs(dU_num) + 1e-8):.2e}")
print("Relative error < 1e-5 confirms the gradient formula is correct.")
Gradient check for U[0,0]:
  Analytical: 2.93976684
  Numerical:  2.93976684
  Relative error: 9.97e-11
Relative error < 1e-5 confirms the gradient formula is correct.

3. Stage 2 — Implement SGD and Mini-Batch GD

# --- Stage 2: SGD for Matrix Factorization ---

def mf_sgd(R_train, mask_train, R_test, mask_test, k,
           lam=0.01, lr=0.01, n_epochs=100, batch_size=1,
           lr_decay=1.0, momentum=0.0, rng=rng):
    """
    Matrix factorization via stochastic/mini-batch gradient descent.
    
    Args:
        R_train:    (n, m) training ratings
        mask_train: (n, m) training mask
        R_test:     (n, m) test ratings (for monitoring only)
        mask_test:  (n, m) test mask
        k:          latent dimensionality
        lam:        L2 regularization
        lr:         initial learning rate
        n_epochs:   number of epochs
        batch_size: 1 = pure SGD; >1 = mini-batch
        lr_decay:   multiply lr by this factor each epoch (< 1 for decay)
        momentum:   momentum coefficient (0 = no momentum)
    
    Returns:
        U, V: learned factor matrices
        train_rmse_hist, test_rmse_hist: per-epoch RMSE lists
    """
    n, m = R_train.shape
    U = rng.normal(0, 0.1, (n, k))
    V = rng.normal(0, 0.1, (m, k))
    
    # Momentum buffers
    vel_U = np.zeros_like(U)
    vel_V = np.zeros_like(V)
    
    # All observed training pairs
    all_pairs = np.array([(i, j) for i, j in zip(*np.where(mask_train == 1))])
    
    train_hist, test_hist = [], []
    current_lr = lr
    
    for epoch in range(n_epochs):
        # Shuffle training pairs each epoch
        perm = rng.permutation(len(all_pairs))
        pairs_shuffled = all_pairs[perm]
        
        # Mini-batch updates
        for start in range(0, len(pairs_shuffled), batch_size):
            batch = pairs_shuffled[start:start + batch_size]
            
            # Accumulate gradients over the batch
            dU = np.zeros_like(U)
            dV = np.zeros_like(V)
            
            for i, j in batch:
                e_ij = R_train[i, j] - U[i] @ V[j]   # prediction error
                # Gradient of squared error for this pair
                dU[i] += -2 * e_ij * V[j]
                dV[j] += -2 * e_ij * U[i]
            
            # Scale by batch size and add regularization gradient
            n_batch = len(batch)
            dU = dU / n_batch + 2 * lam * U
            dV = dV / n_batch + 2 * lam * V
            
            # Momentum update
            vel_U = momentum * vel_U - current_lr * dU
            vel_V = momentum * vel_V - current_lr * dV
            U += vel_U
            V += vel_V
        
        # Learning rate decay
        current_lr *= lr_decay
        
        # Record RMSE (every epoch)
        R_pred = np.clip(U @ V.T, 1, 5)
        train_hist.append(rmse(R_pred, R_train, mask_train))
        test_hist.append(rmse(R_pred, R_test,  mask_test))
    
    return U, V, train_hist, test_hist

print("SGD implementation ready. Training three variants...")
SGD implementation ready. Training three variants...
# --- Train three SGD variants ---
K_FACTORS = N_LATENT_TRUE   # use true k for fair comparison
N_EPOCHS  = 150

print(f"Training with k={K_FACTORS}, n_epochs={N_EPOCHS}\n")

# Variant 1: Pure SGD (batch_size=1)
print("Variant 1: Pure SGD (batch=1)...")
U_sgd, V_sgd, tr_sgd, te_sgd = mf_sgd(
    R_train, mask_train, R_noisy, mask_test,
    k=K_FACTORS, lam=0.01, lr=0.005, n_epochs=N_EPOCHS,
    batch_size=1, momentum=0.0)

# Variant 2: Mini-batch SGD (batch=32)
print("Variant 2: Mini-batch SGD (batch=32)...")
U_mb, V_mb, tr_mb, te_mb = mf_sgd(
    R_train, mask_train, R_noisy, mask_test,
    k=K_FACTORS, lam=0.01, lr=0.02, n_epochs=N_EPOCHS,
    batch_size=32, momentum=0.0)

# Variant 3: Mini-batch + Momentum
print("Variant 3: Mini-batch + Momentum (batch=32, mom=0.9)...")
U_mom, V_mom, tr_mom, te_mom = mf_sgd(
    R_train, mask_train, R_noisy, mask_test,
    k=K_FACTORS, lam=0.01, lr=0.02, n_epochs=N_EPOCHS,
    batch_size=32, momentum=0.9)

# ALS for comparison (from ch188 approach)
def als_fit(R, mask, k, lam=0.01, n_iters=50):
    """ALS matrix factorization (from ch188)."""
    n, m = R.shape
    U = rng.normal(0, 0.1, (n, k))
    V = rng.normal(0, 0.1, (m, k))
    I_k = np.eye(k)
    hist = []
    for _ in range(n_iters):
        for i in range(n):
            obs = np.where(mask[i] == 1)[0]
            if len(obs) == 0: continue
            V_o = V[obs]; r_o = R[i, obs]
            U[i] = np.linalg.solve(V_o.T @ V_o + lam*I_k, V_o.T @ r_o)
        for j in range(m):
            obs = np.where(mask[:, j] == 1)[0]
            if len(obs) == 0: continue
            U_o = U[obs]; r_o = R[obs, j]
            V[j] = np.linalg.solve(U_o.T @ U_o + lam*I_k, U_o.T @ r_o)
        R_pred = np.clip(U @ V.T, 1, 5)
        hist.append(rmse(R_pred, R_noisy, mask_test))
    return U, V, hist

print("ALS (50 iterations)...")
U_als, V_als, te_als = als_fit(R_train, mask_train, k=K_FACTORS, lam=0.01, n_iters=50)

# Final RMSE summary
R_sgd = np.clip(U_sgd @ V_sgd.T, 1, 5)
R_mb  = np.clip(U_mb  @ V_mb.T,  1, 5)
R_mom = np.clip(U_mom @ V_mom.T, 1, 5)
R_als = np.clip(U_als @ V_als.T, 1, 5)

print(f"\nFinal Test RMSE:")
print(f"  SGD (batch=1):         {rmse(R_sgd, R_noisy, mask_test):.4f}")
print(f"  Mini-batch (batch=32): {rmse(R_mb,  R_noisy, mask_test):.4f}")
print(f"  +Momentum (mom=0.9):   {rmse(R_mom, R_noisy, mask_test):.4f}")
print(f"  ALS (50 iters):        {te_als[-1]:.4f}")
Training with k=4, n_epochs=150

Variant 1: Pure SGD (batch=1)...
Variant 2: Mini-batch SGD (batch=32)...
Variant 3: Mini-batch + Momentum (batch=32, mom=0.9)...
ALS (50 iterations)...

Final Test RMSE:
  SGD (batch=1):         1.0551
  Mini-batch (batch=32): 1.3128
  +Momentum (mom=0.9):   1.0588
  ALS (50 iters):        0.9943
# --- Convergence comparison ---
fig, axes = plt.subplots(1, 2, figsize=(13, 4))

# Test RMSE over epochs
axes[0].plot(te_sgd, lw=1.5, alpha=0.8, color='steelblue',   label='SGD (batch=1)')
axes[0].plot(te_mb,  lw=1.5, alpha=0.8, color='darkorange',  label='Mini-batch (32)')
axes[0].plot(te_mom, lw=1.5, alpha=0.8, color='green',       label='+Momentum (0.9)')
# ALS plotted on separate x axis (iterations, not epochs)
als_x = np.linspace(0, N_EPOCHS, len(te_als))
axes[0].plot(als_x, te_als, lw=2, ls='--', color='crimson',  label='ALS (50 iters)')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Test RMSE')
axes[0].set_title('Convergence: SGD vs ALS')
axes[0].legend(fontsize=8)
axes[0].set_ylim(bottom=0)

# Train vs test for momentum variant
axes[1].plot(tr_mom, lw=1.5, color='steelblue',  label='Train RMSE')
axes[1].plot(te_mom, lw=1.5, color='darkorange', label='Test RMSE')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('RMSE')
axes[1].set_title('Mini-batch + Momentum: Train vs Test')
axes[1].legend(fontsize=8)

plt.tight_layout()
plt.show()
print("SGD is noisy but eventually converges. Momentum stabilizes and accelerates.")
print("ALS typically converges in far fewer iterations (each iteration is more expensive).")
<Figure size 1300x400 with 2 Axes>
SGD is noisy but eventually converges. Momentum stabilizes and accelerates.
ALS typically converges in far fewer iterations (each iteration is more expensive).

4. Stage 3 — Add Bias Terms and Regularization Analysis

# --- Stage 3: Full Model with Biases ---
# Prediction: r_hat[i,j] = mu + b_u[i] + b_v[j] + U[i] @ V[j]
# All parameters (U, V, b_u, b_v) updated via gradient descent.

def mf_with_biases(R_train, mask_train, R_test, mask_test, k,
                   lam=0.01, lam_bias=0.001, lr=0.01,
                   n_epochs=100, batch_size=32, rng=rng):
    """
    Matrix factorization with global mean, user biases, item biases, and latent factors.
    All parameters updated via mini-batch SGD.
    
    Model: r_hat[i,j] = mu + b_u[i] + b_v[j] + U[i] @ V[j]
    Loss:  MSE on observed + lam*(||U||^2 + ||V||^2) + lam_bias*(||b_u||^2 + ||b_v||^2)
    """
    n, m = R_train.shape
    mu  = R_train[mask_train == 1].mean()   # global mean
    b_u = np.zeros(n)
    b_v = np.zeros(m)
    U   = rng.normal(0, 0.1, (n, k))
    V   = rng.normal(0, 0.1, (m, k))
    
    all_pairs = np.array([(i, j) for i, j in zip(*np.where(mask_train == 1))])
    train_hist, test_hist = [], []
    
    for epoch in range(n_epochs):
        perm = rng.permutation(len(all_pairs))
        pairs_shuffled = all_pairs[perm]
        
        for start in range(0, len(pairs_shuffled), batch_size):
            batch = pairs_shuffled[start:start + batch_size]
            n_b = len(batch)
            
            dU   = np.zeros_like(U)
            dV   = np.zeros_like(V)
            db_u = np.zeros(n)
            db_v = np.zeros(m)
            
            for i, j in batch:
                pred  = mu + b_u[i] + b_v[j] + U[i] @ V[j]
                e     = R_train[i, j] - pred
                
                dU[i]   += -2 * e * V[j]
                dV[j]   += -2 * e * U[i]
                db_u[i] += -2 * e
                db_v[j] += -2 * e
            
            # Gradient step with regularization
            U   -= lr * (dU/n_b   + 2*lam     * U)
            V   -= lr * (dV/n_b   + 2*lam     * V)
            b_u -= lr * (db_u/n_b + 2*lam_bias * b_u)
            b_v -= lr * (db_v/n_b + 2*lam_bias * b_v)
        
        # Record RMSE
        R_pred = np.clip(mu + b_u[:, None] + b_v[None, :] + U @ V.T, 1, 5)
        train_hist.append(rmse(R_pred, R_train, mask_train))
        test_hist.append(rmse(R_pred, R_test,  mask_test))
    
    return U, V, b_u, b_v, mu, train_hist, test_hist

print("Training full model with biases...")
U_full, V_full, b_u_full, b_v_full, mu_full, tr_full, te_full = mf_with_biases(
    R_train, mask_train, R_noisy, mask_test,
    k=K_FACTORS, lam=0.01, lam_bias=0.001, lr=0.01, n_epochs=150)

R_full = np.clip(mu_full + b_u_full[:, None] + b_v_full[None, :] + U_full @ V_full.T, 1, 5)
print(f"Full model (biases + latent): test RMSE = {rmse(R_full, R_noisy, mask_test):.4f}")
print(f"No-bias model (best):         test RMSE = {min(te_mom):.4f}")
Training full model with biases...
Full model (biases + latent): test RMSE = 0.4592
No-bias model (best):         test RMSE = 1.0545
# --- Regularization sensitivity sweep ---
# How does test RMSE change with lambda?
# Try changing SPARSITY above to see how optimal lambda shifts.

LAM_VALUES = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
lam_rmse   = []

print("Regularization sweep...")
for lam_val in LAM_VALUES:
    U_l, V_l, _, _, _, _, te_l = mf_with_biases(
        R_train, mask_train, R_noisy, mask_test,
        k=K_FACTORS, lam=lam_val, lam_bias=0.001,
        lr=0.01, n_epochs=100)
    lam_rmse.append(min(te_l))
    print(f"  lam={lam_val:.3f}:  test RMSE={lam_rmse[-1]:.4f}")

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

axes[0].semilogx(LAM_VALUES, lam_rmse, 'o-', ms=6, color='steelblue')
best_lam = LAM_VALUES[int(np.argmin(lam_rmse))]
axes[0].axvline(best_lam, color='crimson', ls='--', label=f'Best lam={best_lam}')
axes[0].set_xlabel('Regularization lambda (log scale)')
axes[0].set_ylabel('Best Test RMSE')
axes[0].set_title('Test RMSE vs Regularization Strength')
axes[0].legend()

# Convergence of full model
axes[1].plot(tr_full, lw=1.5, color='steelblue',  label='Train')
axes[1].plot(te_full, lw=1.5, color='darkorange', label='Test')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('RMSE')
axes[1].set_title('Full Model Convergence (bias + latent)')
axes[1].legend()

plt.tight_layout()
plt.show()
print("Too little regularization: test RMSE rises (overfitting).")
print("Too much: test RMSE rises (underfitting). Optimal lambda balances both.")
Regularization sweep...
  lam=0.001:  test RMSE=0.4398
  lam=0.005:  test RMSE=0.4419
  lam=0.010:  test RMSE=0.4409
  lam=0.050:  test RMSE=0.4392
  lam=0.100:  test RMSE=0.4409
  lam=0.500:  test RMSE=0.4385
  lam=1.000:  test RMSE=0.4396
<Figure size 1300x400 with 2 Axes>
Too little regularization: test RMSE rises (overfitting).
Too much: test RMSE rises (underfitting). Optimal lambda balances both.

5. Stage 4 — Loss Landscape Visualization

# --- Stage 4: Visualize the Loss Landscape ---
# Fix V at ALS solution. Sweep U along two directions to show the landscape.
# This illustrates why gradient descent works and what saddle points look like.

# Use ALS solution as the reference point
U_ref = U_als.copy()
V_ref = V_als.copy()

# Two random directions in U-space
dir1 = rng.normal(0, 1, U_ref.shape); dir1 /= np.linalg.norm(dir1)
dir2 = rng.normal(0, 1, U_ref.shape)
# Orthogonalize dir2 against dir1 (Gram-Schmidt)
dir2 -= np.dot(dir2.flatten(), dir1.flatten()) * dir1
dir2 /= np.linalg.norm(dir2)

GRID_N  = 30
alphas  = np.linspace(-2.0, 2.0, GRID_N)
betas   = np.linspace(-2.0, 2.0, GRID_N)
Z = np.zeros((GRID_N, GRID_N))

for ia, a in enumerate(alphas):
    for ib, b in enumerate(betas):
        U_perturbed = U_ref + a*dir1 + b*dir2
        Z[ia, ib] = mf_loss(U_perturbed, V_ref, R_train, mask_train, lam=0.01)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Loss surface
im = axes[0].contourf(alphas, betas, Z.T, levels=30, cmap='viridis')
axes[0].plot(0, 0, 'r*', ms=12, label='ALS optimum')
axes[0].set_xlabel('Direction 1 (alpha)')
axes[0].set_ylabel('Direction 2 (beta)')
axes[0].set_title('Loss Landscape (2D slice through ALS solution)')
axes[0].legend(fontsize=8)
plt.colorbar(im, ax=axes[0])

# Gradient magnitude over convergence
# Compute gradient norm at each SGD checkpoint
# Proxy: we'll show the |train_rmse - test_rmse| as generalization gap
gap_mom = [abs(tr - te) for tr, te in zip(tr_mom, te_mom)]
axes[1].plot(tr_mom,  lw=1.5, color='steelblue',  label='Train RMSE')
axes[1].plot(te_mom,  lw=1.5, color='darkorange', label='Test RMSE')
axes[1].fill_between(range(len(gap_mom)),
                    [t for t in te_mom],
                    [tr for tr in tr_mom],
                    alpha=0.2, color='crimson', label='Generalization gap')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('RMSE')
axes[1].set_title('Generalization Gap Over Training')
axes[1].legend(fontsize=8)

plt.tight_layout()
plt.show()
print("The loss landscape is bowl-shaped near the optimum — SGD can find it.")
print("Generalization gap widens with overfitting; regularization controls this.")
<Figure size 1300x500 with 3 Axes>
The loss landscape is bowl-shaped near the optimum — SGD can find it.
Generalization gap widens with overfitting; regularization controls this.

6. Results & Reflection

What Was Built

  • Analytical gradient derivation and numerical verification

  • Pure SGD, mini-batch SGD, and SGD with momentum

  • Full model with bias terms and latent factors

  • Regularization sweep to identify optimal λ\lambda

  • Loss landscape visualization in 2D slice

  • Comparison of SGD vs ALS on convergence speed and final quality

What Math Made It Possible

ComponentFormulaChapter
Gradient of MSE loss/ui=2eijvj\partial \ell / \partial \mathbf{u}_i = -2 e_{ij} \mathbf{v}_jch176
Regularization gradient(λu2)/u=2λu\partial (\lambda |\mathbf{u}|^2) / \partial \mathbf{u} = 2\lambda \mathbf{u}ch176
Update ruleuiuiαuiL\mathbf{u}_i \leftarrow \mathbf{u}_i - \alpha \nabla_{\mathbf{u}_i} \mathcal{L}anticipates ch212
Momentumvβvα\mathbf{v} \leftarrow \beta \mathbf{v} - \alpha \nabla; uu+v\mathbf{u} \leftarrow \mathbf{u} + \mathbf{v}anticipates ch213
ALS update(VΩVΩ+λI)1VΩrΩ(V_\Omega^\top V_\Omega + \lambda I)^{-1} V_\Omega^\top \mathbf{r}_\Omegach182, ch163

SGD vs ALS: When to Use Which

ALS converges in fewer iterations and each update is guaranteed to decrease the loss with respect to the updated block. But each iteration is O(nΩik2)O(n \cdot |\Omega_i| \cdot k^2) — expensive for large kk or dense observations.

SGD scales to very large datasets and naturally handles streaming data, but requires careful tuning of learning rate and regularization. The mini-batch variant gives the best of both: stable gradient estimates without per-sample noise, and lower memory requirements than full-batch.

Extension Challenges

Challenge 1 — Adaptive Learning Rates. Replace the fixed learning rate with Adam: maintain per-parameter running estimates of the first and second moments of the gradient. Implement and compare convergence speed to plain SGD with momentum.

Challenge 2 — Early Stopping. Implement early stopping: monitor the test RMSE and stop training when it hasn’t improved for 10 consecutive epochs. Compare the final test RMSE to a fixed-epoch run.

Challenge 3 — Rank Sensitivity. Run the full model at k{1,2,4,8,16}k \in \{1, 2, 4, 8, 16\} and plot test RMSE vs kk. Show that using k>Nlatent_truek > N_{\text{latent\_true}} increases overfitting. At what kk does regularization start to compensate for model over-complexity?


7. Summary & Connections

  • The matrix factorization objective L(U,V)\mathcal{L}(U, V) is differentiable in both UU and VV; its gradient has a clean closed form derived from the chain rule (ch176).

  • SGD with mini-batches and momentum is the standard approach for training these models at scale — the same update rule used for neural network weights (anticipates ch212–213).

  • Regularization λ\lambda controls the bias-variance tradeoff; optimal λ\lambda must be found by validation, not training loss.

  • ALS and SGD solve the same problem by different means: ALS alternates exact coordinate-wise optimization; SGD follows approximate global gradients.

Forward: The gradient descent machinery here is a direct preview of ch212 (Gradient Descent in Part VII). The regularized loss RUV2+λ(U2+V2)\|R - UV^\top\|^2 + \lambda(\|U\|^2 + \|V\|^2) is the same form as ridge regression (ch182) extended to matrix-valued parameters. This connects to ch273 (Regression in Part IX) and ch280 (Neural Network Math Review), where the same optimization is applied to neural network parameters. The bias-variance tradeoff visible in the regularization sweep is studied formally in ch285 (Bias and Variance in Part IX).

Backward: This project is a direct extension of ch188 (ALS Recommender), building the gradient descent interpretation of the same objective. The closed-form ALS update is the matrix calculus derivative set to zero — connecting to ch176’s treatment of /xAxb2=2A(Axb)\partial/\partial x \|Ax - b\|^2 = 2A^\top(Ax-b).