Chapter 176 — Matrix Calculus Introduction - Mathematics for Programmers

Prerequisites: ch155 (Matrix Transpose), ch154 (Matrix Multiplication), ch168 (Projection Matrices) Note: Full calculus (derivatives, gradients) is covered in Part VII (ch201–240). This chapter introduces only the matrix-specific notation and identities needed for neural network math in ch177–178. You will learn:
Derivatives of scalar functions with respect to vectors and matrices
The gradient of the squared loss ||Ax - b||² with respect to x
Jacobian and its role in backpropagation
Key matrix derivative identities used in deep learning Environment: Python 3.x, numpy, matplotlib

1. Concept¶

Calculus on scalars: df/dx is the rate of change of f as x changes. Matrix calculus extends this to functions whose inputs or outputs are vectors and matrices.

Four types of derivative:

Input	Output	Result	Name
scalar	scalar	scalar	ordinary derivative
vector	scalar	vector	gradient
vector	vector	matrix	Jacobian
matrix	scalar	matrix	matrix gradient

For neural networks, the critical case is: scalar loss function → gradient with respect to weight matrices. This gradient tells the optimizer which direction to move the weights.

Convention: This chapter uses the denominator layout convention (gradient is a column vector, same shape as input). Numerator layout (gradient is a row vector) is also common — be alert when reading papers.

2. Intuition & Mental Models¶

Gradient as steepest ascent: The gradient of f(x) at a point x is a vector pointing in the direction of steepest increase of f. To minimize f, move opposite to the gradient. This is gradient descent (introduced in ch212 — Gradient Descent).

Shape rule: The gradient ∂f/∂x always has the same shape as x. If x is an n-vector, ∂f/∂x is an n-vector. If W is an m×n matrix, ∂f/∂W is an m×n matrix. This “shape matches input” rule is the fastest sanity check.

Jacobian as stacked gradients: If f: ℝⁿ → ℝᵐ, the Jacobian J is m×n — row i is the gradient of the i-th output with respect to all inputs.

Recall from ch168 that the projection Ax lives in the column space of A. The gradient of ||Ax - b||² pulls x toward b along that column space.

3. Visualization¶

# --- Visualization: Gradient of ||Ax - b||^2 on a 2D grid ---
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')

# Simple 2x2 case: f(x) = ||Ax - b||^2
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([4.0, 5.0])
x_star = np.linalg.solve(A.T @ A, A.T @ b)  # optimal x

# Evaluate f on a grid
x1 = np.linspace(-1, 4, 80)
x2 = np.linspace(-1, 4, 80)
X1, X2 = np.meshgrid(x1, x2)
F = np.zeros_like(X1)
for i in range(X1.shape[0]):
    for j in range(X1.shape[1]):
        x = np.array([X1[i,j], X2[i,j]])
        r = A @ x - b
        F[i,j] = r @ r

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Contour plot
ax = axes[0]
cs = ax.contourf(X1, X2, F, levels=30, cmap='Blues')
plt.colorbar(cs, ax=ax)
ax.plot(*x_star, 'r*', markersize=15, label='Minimum x*')
ax.set_xlabel('x₁'); ax.set_ylabel('x₂')
ax.set_title('f(x) = ||Ax - b||² — loss landscape')
ax.legend()

# Gradient field
ax2 = axes[1]
x1g = np.linspace(-1, 4, 12)
x2g = np.linspace(-1, 4, 12)
X1g, X2g = np.meshgrid(x1g, x2g)
# Gradient = 2 A^T (Ax - b)
ATA2 = 2 * A.T @ A
ATb2 = 2 * A.T @ b
G1 = ATA2[0,0]*X1g + ATA2[0,1]*X2g - ATb2[0]
G2 = ATA2[1,0]*X1g + ATA2[1,1]*X2g - ATb2[1]
ax2.quiver(X1g, X2g, -G1, -G2, alpha=0.5)  # negative gradient = descent direction
cs2 = ax2.contour(X1, X2, F, levels=15, cmap='Blues', alpha=0.5)
ax2.plot(*x_star, 'r*', markersize=15)
ax2.set_xlabel('x₁'); ax2.set_ylabel('x₂')
ax2.set_title('Negative gradient field (descent directions)')

plt.tight_layout()
plt.show()

C:\Users\user\AppData\Local\Temp\ipykernel_20416\3976537204.py:49: UserWarning: Glyph 8321 (\N{SUBSCRIPT ONE}) missing from font(s) Arial.
  plt.tight_layout()
C:\Users\user\AppData\Local\Temp\ipykernel_20416\3976537204.py:49: UserWarning: Glyph 8322 (\N{SUBSCRIPT TWO}) missing from font(s) Arial.
  plt.tight_layout()
c:\Users\user\OneDrive\Documents\book\.venv\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 8322 (\N{SUBSCRIPT TWO}) missing from font(s) Arial.
  fig.canvas.print_figure(bytes_io, **kw)
c:\Users\user\OneDrive\Documents\book\.venv\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 8321 (\N{SUBSCRIPT ONE}) missing from font(s) Arial.
  fig.canvas.print_figure(bytes_io, **kw)

4. Mathematical Formulation¶

Key identities (denominator layout):

∂(aᵀx)/∂x = a
∂(xᵀAx)/∂x = (A + Aᵀ)x  [= 2Ax if A is symmetric]
∂(||Ax - b||²)/∂x = 2Aᵀ(Ax - b)
∂(Wx)/∂x = Wᵀ  [Jacobian]
∂(||x||²)/∂x = 2x

Matrix gradient:

∂(||Wx - b||²)/∂W = 2(Wx - b)xᵀ

Where W is m×n, x is n×1, b is m×1. The result is m×n (same shape as W).

Jacobian of linear layer:

y = Wx  →  ∂y/∂x = W  (m×n Jacobian)
y = Wx  →  ∂y/∂W = ? (see implementation)

Chain rule in matrix form:

∂L/∂x = (∂y/∂x)ᵀ · ∂L/∂y = Wᵀ · ∂L/∂y

# Worked numeric example: verify gradient analytically vs numerically
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """
    Compute gradient of scalar f at x using central finite differences.

    Args:
        f: function R^n -> R
        x: (n,) array

    Returns:
        grad: (n,) numerical gradient
    """
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        x_plus = x.copy(); x_plus[i] += eps
        x_minus = x.copy(); x_minus[i] -= eps
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad


A = np.array([[2.0, 1.0], [1.0, 3.0], [0.5, 2.0]])
b = np.array([4.0, 5.0, 2.0])
x0 = np.array([1.0, 2.0])

# f(x) = ||Ax - b||^2
f = lambda x: np.sum((A @ x - b)**2)

# Analytic gradient: 2 A^T (Ax - b)
analytic_grad = 2 * A.T @ (A @ x0 - b)
numerical_grad = numerical_gradient(f, x0)

print("Analytic gradient: ", analytic_grad)
print("Numerical gradient:", numerical_grad)
print(f"Max error: {np.max(np.abs(analytic_grad - numerical_grad)):.2e}")

Analytic gradient:  [ 6.5 22. ]
Numerical gradient: [ 6.5 22. ]
Max error: 2.46e-10

5. Python Implementation¶

# --- Implementation: Gradient checker (validate any analytic gradient) ---
import numpy as np

def gradient_check(f, grad_f, x, eps=1e-5, atol=1e-6):
    """
    Check analytic gradient against numerical gradient.

    Args:
        f: scalar function R^n -> R
        grad_f: gradient function R^n -> R^n
        x: evaluation point (flattened)
        eps: finite difference step
        atol: absolute tolerance

    Returns:
        passed: bool
        max_error: float
    """
    x = x.flatten()
    analytic = grad_f(x).flatten()
    numeric = numerical_gradient(f, x, eps)
    max_error = np.max(np.abs(analytic - numeric))
    return max_error < atol, max_error


# Test 1: f(x) = x^T A x (quadratic form)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f1 = lambda x: x @ A @ x
grad_f1 = lambda x: (A + A.T) @ x  # 2Ax for symmetric A

x_test = np.array([2.0, -1.0])
passed, err = gradient_check(f1, grad_f1, x_test)
print(f"Quadratic form gradient check: {'PASS' if passed else 'FAIL'} (max error {err:.2e})")

# Test 2: f(x) = ||Wx - b||^2, gradient w.r.t. W
W = np.random.randn(3, 4)
x_vec = np.random.randn(4)
b_vec = np.random.randn(3)

f2 = lambda w_flat: np.sum((w_flat.reshape(3,4) @ x_vec - b_vec)**2)
grad_f2 = lambda w_flat: (2 * np.outer(W.reshape(3,4) @ x_vec - b_vec, x_vec)).flatten()

passed2, err2 = gradient_check(f2, grad_f2, W)
print(f"Matrix gradient check:        {'PASS' if passed2 else 'FAIL'} (max error {err2:.2e})")

Quadratic form gradient check: PASS (max error 1.54e-10)
Matrix gradient check:        PASS (max error 2.10e-10)

6. Experiments¶

# --- Experiment 1: Gradient descent on ||Ax - b||^2 ---
# Hypothesis: Following the negative gradient converges to the least-squares solution
# Try changing: LEARNING_RATE

import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')

LEARNING_RATE = 0.05  # try: 0.001, 0.01, 0.05, 0.2 (instability!)
N_STEPS = 200

np.random.seed(0)
A = np.random.randn(10, 3)
b = np.random.randn(10)
x = np.zeros(3)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]

losses = []
for _ in range(N_STEPS):
    residual = A @ x - b
    loss = residual @ residual
    grad = 2 * A.T @ residual
    x = x - LEARNING_RATE * grad
    losses.append(loss)

plt.figure(figsize=(8, 4))
plt.semilogy(losses)
plt.axhline(np.sum((A @ x_star - b)**2), color='red', ls='--', label='Optimal loss')
plt.xlabel('Iteration')
plt.ylabel('Loss (log scale)')
plt.title(f'Gradient descent on ||Ax-b||² (lr={LEARNING_RATE})')
plt.legend()
plt.tight_layout()
plt.show()
print(f"Final loss: {losses[-1]:.6f}")
print(f"Optimal loss: {np.sum((A @ x_star - b)**2):.6f}")

Final loss: 3.285170
Optimal loss: 3.285170

7. Exercises¶

Easy 1. Verify analytically and numerically: ∂(||x||²)/∂x = 2x. Use gradient_check on a 5-dimensional example.

Easy 2. What is the gradient of f(x) = cᵀx with respect to x (where c is a constant vector)? Verify numerically.

Medium 1. Derive the gradient of f(W) = ||WX - Y||_F² with respect to W, where X is n×p, Y is m×p, and W is m×n. Verify your result with gradient_check on a concrete 3×4 example.

Medium 2. Implement gradient descent to minimize the regularized loss f(x) = ||Ax - b||² + λ||x||². The analytic gradient is 2Aᵀ(Ax-b) + 2λx. Compare the solution to the closed form: x* = (AᵀA + λI)⁻¹Aᵀb. Try λ = 0, 0.1, 1.0, 10.

Hard. The cross-entropy loss is L = -Σᵢ yᵢ log(softmax(Wx)ᵢ). Derive ∂L/∂(Wx) (the gradient of cross-entropy w.r.t. the pre-softmax scores). Verify numerically using finite differences.

8. Mini Project: Least Squares via Gradient Descent¶

# --- Mini Project: Matrix calculus in action — linear regression by gradient descent ---
# Problem: Fit a linear model y = Xw to data using gradient descent,
#          and compare to the closed-form solution from ch182.

import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')

np.random.seed(42)

# Generate synthetic regression data
n, p = 100, 5
X = np.random.randn(n, p)
w_true = np.array([2.0, -1.5, 0.8, 3.0, -0.5])
y = X @ w_true + 0.5 * np.random.randn(n)

# Closed-form solution: w* = (X^T X)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: grad_w = 2/n X^T (Xw - y)
LR = 0.1
N_ITER = 500
w_gd = np.zeros(p)
losses_gd = []

for _ in range(N_ITER):
    residual = X @ w_gd - y
    loss = np.mean(residual**2)
    grad = (2/n) * X.T @ residual
    w_gd = w_gd - LR * grad
    losses_gd.append(loss)

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

axes[0].semilogy(losses_gd, color='steelblue')
axes[0].axhline(np.mean((X @ w_closed - y)**2), color='red', ls='--', label='Closed-form MSE')
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('MSE (log)')
axes[0].set_title('Gradient descent on MSE loss')
axes[0].legend()

axes[1].bar(range(p), w_true, alpha=0.7, label='True w', color='steelblue')
axes[1].bar(range(p), w_closed, alpha=0.5, label='Closed-form', color='green')
axes[1].bar(range(p), w_gd, alpha=0.4, label='GD', color='tomato')
axes[1].set_xlabel('Feature')
axes[1].set_ylabel('Weight value')
axes[1].set_title('Weight comparison')
axes[1].legend()

plt.tight_layout()
plt.show()
print(f"||w_closed - w_true||: {np.linalg.norm(w_closed - w_true):.4f}")
print(f"||w_gd - w_true||:     {np.linalg.norm(w_gd - w_true):.4f}")

||w_closed - w_true||: 0.0983
||w_gd - w_true||:     0.0983

9. Chapter Summary & Connections¶

Matrix calculus extends derivatives to vector and matrix inputs/outputs. Shape of gradient = shape of input.
Key identity: ∂(||Ax-b||²)/∂x = 2Aᵀ(Ax-b) — the gradient of the squared loss.
∂(Wx)/∂x = Wᵀ — the Jacobian of a linear layer is its transpose.
Numerical gradient checking via finite differences is the standard validation tool for analytic gradients.
Gradient descent on the least-squares loss converges to the same solution as the closed-form normal equations.

Backward: Uses ch155 (Transpose) and ch168 (Projection) throughout. The least-squares problem in the mini project connects to ch182.

Forward:

ch177 (Linear Algebra for Neural Networks): the chain rule in matrix form — backpropagation
ch178 (Linear Layers in Deep Learning): implementing a neural layer from scratch using these gradients
ch212 (Gradient Descent): full treatment of optimization via gradients in Part VII
ch216 (Backpropagation): the chain rule applied through arbitrary computational graphs

Going deeper: The Matrix Cookbook (Petersen & Pedersen) is the reference for matrix derivative identities. For autodiff, see ch208 (Automatic Differentiation) in Part VII.