ch214 — Saddle Points - Mathematics for Programmers

Part VII: Calculus

1. What a Saddle Point Is

A saddle point is a stationary point (the gradient is zero there) that is neither a local minimum nor a local maximum. The function curves upward in some directions and downward in others.

The name comes from the shape of a horse saddle: at its center the surface is level, curving upward along the horse’s spine (a minimum in that direction) and downward across it (a maximum in that direction).

In high dimensions, saddle points are far more common than local minima. A heuristic argument: at a stationary point of a random function in n dimensions, suppose each of the n principal curvatures (the eigenvalues of the Hessian) is positive with probability p < 1, independently of the others. The probability that all n are positive (a true minimum) is then pⁿ, which decays exponentially: pⁿ → 0 as n → ∞. The independence assumption is a simplification, not an exact model.
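The decay can be checked numerically. The sketch below is illustrative only: it uses random symmetric Gaussian matrices as a stand-in for Hessians at stationary points (an assumption, not something the chapter derives), and the helper name fraction_all_positive is hypothetical. The simulated fraction need not equal pⁿ, because the eigenvalues of a random matrix are not independent.

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_all_positive(n, trials=2000):
    """Fraction of random symmetric n x n matrices whose eigenvalues are all positive."""
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((n, n))
        H = (A + A.T) / 2                      # symmetrize: stand-in for a Hessian
        if np.all(np.linalg.eigvalsh(H) > 0):  # all curvatures positive -> "minimum"
            count += 1
    return count / trials

for n in [1, 2, 3, 4, 6, 8]:
    print(f'n={n}: independent-sign model 0.5^n = {0.5**n:.4f}, '
          f'random symmetric matrices = {fraction_all_positive(n):.4f}')
```

Under both models, the share of stationary points that qualify as minima shrinks quickly as n grows.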

This has profound implications for neural network optimization.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Classic saddle: f(x,y) = x^2 - y^2
f_saddle = lambda x, y: x**2 - y**2

x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)
Z = f_saddle(X, Y)

fig = plt.figure(figsize=(14, 5))

ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X, Y, Z, cmap='RdBu', alpha=0.85)
ax1.scatter([0], [0], [0], color='red', s=100, zorder=10)
ax1.set_title('f(x,y) = x² - y²\nSaddle at origin')
ax1.set_xlabel('x'); ax1.set_ylabel('y'); ax1.set_zlabel('f')

ax2 = fig.add_subplot(132)
cs = ax2.contourf(X, Y, Z, levels=20, cmap='RdBu')
ax2.contour(X, Y, Z, levels=[0], colors='black', linewidths=2)
ax2.scatter([0], [0], color='red', s=100, zorder=8, label='saddle point')
ax2.set_title('Contour view\nRed=low, Blue=high')  # the RdBu colormap runs from red (low) to blue (high)
ax2.set_aspect('equal')
ax2.legend()
plt.colorbar(cs, ax=ax2)

# Slices through saddle
ax3 = fig.add_subplot(133)
t = np.linspace(-2, 2, 200)
ax3.plot(t, t**2,  color='steelblue', linewidth=2.5, label='f(t, 0) = t² (minimum along x)')
ax3.plot(t, -t**2, color='red', linewidth=2.5, label='f(0, t) = -t² (maximum along y)')
ax3.scatter([0], [0], color='black', s=100, zorder=8)
ax3.axhline(0, color='gray', linewidth=0.8)
ax3.set_title('Slices through origin\nMin in x-dir, max in y-dir')
ax3.set_xlabel('t')
ax3.legend(fontsize=9)

plt.suptitle('Saddle Points: Minimum in Some Directions, Maximum in Others', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

# Gradient at the saddle point is zero — gradient descent stalls there
grad_saddle = lambda x, y: np.array([2*x, -2*y])

print('Gradient at origin (the saddle point):')
print(f'  ∇f(0, 0) = {grad_saddle(0, 0)}')
print()
print('Gradient away from origin:')
for x0, y0 in [(0.1, 0.0), (0.0, 0.1), (0.05, 0.05)]:
    g = grad_saddle(x0, y0)
    print(f'  ∇f({x0:.2f}, {y0:.2f}) = {g}  (magnitude={np.linalg.norm(g):.4f})')

print()
print('Near the saddle, the gradient is tiny — GD is very slow.')
print('This is why saddle points can severely slow training.')

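# Simulate GD near saddle (starts just slightly off-center)
w = np.array([0.001, 0.001])  # tiny perturbation
lr = 0.1
path = [w.copy()]
# f is unbounded below, so y grows by a factor (1 + 2*lr) = 1.2 per step.
# 40 steps keep the path inside the plotted [-2, 2] window (200 steps would reach roughly 7e12).
for _ in range(40):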
    g = grad_saddle(*w)
    w = w - lr * g
    path.append(w.copy())
path = np.array(path)

fig, axes = plt.subplots(1, 2, figsize=(13, 4))
axes[0].semilogy(np.linalg.norm(np.array([grad_saddle(*p) for p in path]), axis=1) + 1e-15, color='steelblue', linewidth=2)
axes[0].set_xlabel('Step'); axes[0].set_ylabel('||gradient||')
axes[0].set_title('Gradient magnitude near saddle\nStarts tiny, grows geometrically as GD escapes')

axes[1].plot(path[:, 0], path[:, 1], '-', color='steelblue', linewidth=2)
axes[1].scatter(*path[0], color='orange', s=100, zorder=8, label='start')
axes[1].scatter(*path[-1], color='green', s=100, zorder=8, label='end')
cs = axes[1].contourf(X, Y, Z, levels=15, cmap='RdBu', alpha=0.5)
axes[1].set_title('GD path starting near saddle')
axes[1].legend(); axes[1].set_aspect('equal')

plt.tight_layout()
plt.show()

Gradient at origin (the saddle point):
  ∇f(0, 0) = [0 0]

Gradient away from origin:
  ∇f(0.10, 0.00) = [ 0.2 -0. ]  (magnitude=0.2000)
  ∇f(0.00, 0.10) = [ 0.  -0.2]  (magnitude=0.2000)
  ∇f(0.05, 0.05) = [ 0.1 -0.1]  (magnitude=0.1414)

Near the saddle, the gradient is tiny — GD is very slow.
This is why saddle points can severely slow training.
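For this particular f, the slow start can be made exact. Each gradient-descent step multiplies x by (1 − 2·lr) and y by (1 + 2·lr), so after k steps the iterate is ((1 − 2·lr)ᵏ·x₀, (1 + 2·lr)ᵏ·y₀). The sketch below checks that closed form against an explicit loop and estimates how many steps a tiny offset y₀ needs to grow to size 1. The variable names are illustrative, not taken from the chapter's code.

```python
import numpy as np

lr = 0.1
x0, y0 = 0.001, 0.001
grad = lambda x, y: np.array([2 * x, -2 * y])   # gradient of f(x, y) = x^2 - y^2

# Run k plain gradient-descent steps
k = 20
w = np.array([x0, y0])
for _ in range(k):
    w = w - lr * grad(*w)

# Closed form: x shrinks geometrically, y grows geometrically
closed_form = np.array([(1 - 2 * lr)**k * x0, (1 + 2 * lr)**k * y0])
print('iterated   :', w)
print('closed form:', closed_form)

# Steps needed for |y| to grow from y0 to 1
steps_to_escape = np.log(1.0 / y0) / np.log(1 + 2 * lr)
print(f'steps for |y| to reach 1: about {steps_to_escape:.1f}')
```

The escape time grows only logarithmically as y₀ shrinks, but it is unbounded as y₀ → 0: a start exactly on the x-axis never leaves it.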

2. Identifying Saddle Points via the Hessian

At a stationary point where ∇f = 0, the Hessian matrix H (the matrix of second partial derivatives) tells us the type:

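All eigenvalues of H > 0: local minimum
All eigenvalues of H < 0: local maximum
At least one positive and one negative eigenvalue: saddle point
Some eigenvalues zero and the rest of one sign: the test is inconclusive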

(The Hessian is formalized in ch217 — Second Derivatives.)

# For f(x,y) = x^2 - y^2:
# H = [[d^2f/dx^2, d^2f/dxdy], [d^2f/dydx, d^2f/dy^2]]
#   = [[2, 0], [0, -2]]

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
eigenvalues, eigenvectors = np.linalg.eig(H)

print('Hessian at origin for f(x,y) = x² - y²:')
print(f'  H = {H}')
print(f'  Eigenvalues: {eigenvalues}')
print(f'  → Mixed sign eigenvalues → SADDLE POINT')
print()

# For f(x,y) = x^2 + y^2 (minimum):
H_min = np.array([[2.0, 0.0], [0.0, 2.0]])
ev_min = np.linalg.eigvals(H_min)
print(f'Hessian for f(x,y) = x² + y²:')
print(f'  Eigenvalues: {ev_min}  → all positive → LOCAL MINIMUM')
print()

# For f(x,y) = -x^2 - y^2 (maximum):
H_max = np.array([[-2.0, 0.0], [0.0, -2.0]])
ev_max = np.linalg.eigvals(H_max)
print(f'Hessian for f(x,y) = -x² - y²:')
print(f'  Eigenvalues: {ev_max}  → all negative → LOCAL MAXIMUM')

Hessian at origin for f(x,y) = x² - y²:
  H = [[ 2.  0.]
 [ 0. -2.]]
  Eigenvalues: [ 2. -2.]
  → Mixed sign eigenvalues → SADDLE POINT

Hessian for f(x,y) = x² + y²:
  Eigenvalues: [2. 2.]  → all positive → LOCAL MINIMUM

Hessian for f(x,y) = -x² - y²:
  Eigenvalues: [-2. -2.]  → all negative → LOCAL MAXIMUM
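The three cases above can be wrapped in a small helper. This is a minimal sketch: the function name classify_stationary_point and the tolerance tol are illustrative choices, not part of the chapter. It uses np.linalg.eigvalsh, which assumes a symmetric matrix and returns real eigenvalues in ascending order.

```python
import numpy as np

def classify_stationary_point(H, tol=1e-8):
    """Classify a stationary point from its symmetric Hessian H."""
    eig = np.linalg.eigvalsh(H)            # real eigenvalues, sorted ascending
    if np.all(eig > tol):
        return 'local minimum'
    if np.all(eig < -tol):
        return 'local maximum'
    if eig[0] < -tol and eig[-1] > tol:    # smallest negative, largest positive
        return 'saddle point'
    return 'inconclusive'                  # a zero eigenvalue, no sign change

print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, -2.0]])))   # x² - y²
print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, 2.0]])))    # x² + y²
print(classify_stationary_point(np.array([[-2.0, 0.0], [0.0, -2.0]])))  # -x² - y²
print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, 0.0]])))    # zero eigenvalue
```

The tolerance guards against treating a numerically tiny eigenvalue as a definite sign; its value here is an arbitrary assumption.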

3. Escaping Saddle Points

Stochastic gradient descent (SGD) estimates the gradient from mini-batches, and the resulting noise perturbs the trajectory, which helps it leave saddle points. Dedicated methods include:

Adding Gaussian noise to gradients
Momentum-based methods (accumulate gradient history)
Second-order methods (use Hessian curvature directly)
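The first idea in the list, adding Gaussian noise to gradients, can be sketched as follows. The test function f(x, y) = x² + y⁴/4 − y²/2 is a hypothetical example chosen for this sketch (it is not from the chapter): it has a saddle at the origin and minima at (0, ±1), so an escape has somewhere to land. The helper descend and the values of lr, steps and noise_std are likewise illustrative assumptions.

```python
import numpy as np

# Hypothetical test function: f(x, y) = x^2 + y^4/4 - y^2/2
# Saddle at the origin, minima at (0, +1) and (0, -1).
def grad(w):
    x, y = w
    return np.array([2.0 * x, y**3 - y])

def descend(w0, lr=0.1, steps=300, noise_std=0.0, seed=0):
    """Gradient descent; if noise_std > 0, Gaussian noise is added to each gradient."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        g = grad(w) + noise_std * rng.standard_normal(2)
        w = w - lr * g
    return w

w_plain = descend([0.0, 0.0])                  # starts exactly on the saddle
w_noisy = descend([0.0, 0.0], noise_std=0.01)  # same start, noisy gradients

print('plain GD ends at:', w_plain)
print('noisy GD ends at:', w_noisy)
```

Started exactly on the saddle, plain gradient descent never moves because the gradient is zero, while the noisy run is pushed off the saddle and settles near one of the two minima.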

4. Summary

Saddle points: ∇f = 0 but not a minimum or maximum
Identified by the Hessian: mixed-sign eigenvalues → saddle
Gradient is zero at a saddle and small nearby → gradient descent stalls at it and crawls near it
In high dimensions, saddle points are common — more common than local minima
SGD noise and momentum help escape saddle points

5. Forward References

The Hessian matrix — used to classify critical points — is the subject of ch217 — Second Derivatives. Momentum-based methods that escape saddle points more efficiently are discussed in ch291 — Optimization Methods (Part IX). The probability of being near a saddle vs a minimum in high-dimensional random functions connects to random matrix theory, relevant in Part VIII — Probability.