
1. From single samples to batches

In ch303 the forward pass operated on a single input vector $x \in \mathbb{R}^{n_0}$. In practice, we process $B$ samples simultaneously — a batch — to exploit vectorised hardware (GPU SIMD units). The computation becomes:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$$

where $A^{[l-1]} \in \mathbb{R}^{n_{l-1} \times B}$ holds $B$ column vectors, $Z^{[l]} \in \mathbb{R}^{n_l \times B}$, and $b^{[l]}$ broadcasts over the batch dimension.

The entire batch is processed in one matrix multiplication per layer — substantially faster than a Python loop over samples (the timing below shows a ~6× speedup even for this small network).

(Broadcasting mechanics from NumPy: ch147. Matrix multiplication: ch153.)
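Before building the full network, the bias broadcast can be checked in isolation. A minimal sketch with toy shapes (the names and sizes here are illustrative, not part of this chapter's code):

```python
import numpy as np

# b has shape (n_l,), so b[:, None] gives (n_l, 1), which NumPy
# broadcasts across all B columns of W @ A (shape (n_l, B)).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # n_l = 3, n_{l-1} = 4
A = rng.normal(size=(4, 5))   # B = 5 samples as columns
b = np.array([1.0, 2.0, 3.0])

Z = W @ A + b[:, None]
print(Z.shape)  # (3, 5)

# Equivalent column-by-column computation — same result, B times slower:
Z_cols = np.stack([W @ A[:, j] + b for j in range(5)], axis=1)
print(np.allclose(Z, Z_cols))  # True
```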

import numpy as np
import time
import matplotlib.pyplot as plt


def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


def forward_pass_batched(X: np.ndarray, params: list, activations: list) -> tuple:
    """
    Batched forward pass.

    Parameters
    ----------
    X : ndarray, shape (n_features, batch_size)
    params : list of (W, b) tuples per layer
    activations : list of activation functions (one per layer)

    Returns
    -------
    output : final layer activations, shape (n_out, batch_size)
    cache  : list of (A_prev, Z, A) per layer — needed for backprop
    """
    cache = []
    A = X
    for (W, b), act_fn in zip(params, activations):
        A_prev = A
        # b[:, None] broadcasts bias over all B samples in the batch
        Z = W @ A + b[:, None]
        A = act_fn(Z)
        cache.append((A_prev, Z, A))
    return A, cache


# --- Build a small network and time batched vs loop ---
rng = np.random.default_rng(42)
layer_sizes = [784, 256, 128, 10]   # MNIST-like architecture

params = []
for i in range(len(layer_sizes) - 1):
    fan_in, fan_out = layer_sizes[i], layer_sizes[i + 1]
    scale = np.sqrt(2.0 / fan_in)  # He initialisation — see ch308
    W = rng.normal(0, scale, (fan_out, fan_in))
    b = np.zeros(fan_out)
    params.append((W, b))

act_fns = [relu, relu, sigmoid]

# Batched: (784, 256) input
B = 256
X_batch = rng.normal(0, 1, (784, B))

t0 = time.perf_counter()
out_batch, cache = forward_pass_batched(X_batch, params, act_fns)
t_batch = time.perf_counter() - t0

# Loop: one sample at a time
t0 = time.perf_counter()
outs = []
for i in range(B):
    out_i, _ = forward_pass_batched(X_batch[:, i:i+1], params, act_fns)
    outs.append(out_i)
t_loop = time.perf_counter() - t0

print(f"Input shape:  {X_batch.shape}")
print(f"Output shape: {out_batch.shape}")
print(f"\nBatched forward pass:  {t_batch*1000:.2f} ms")
print(f"Loop forward pass:     {t_loop*1000:.2f} ms")
print(f"Speedup:               {t_loop/t_batch:.1f}x")

# Verify equivalence
out_loop_stacked = np.hstack(outs)
print(f"\nMax absolute difference (batched vs loop): {np.max(np.abs(out_batch - out_loop_stacked)):.2e}")
Input shape:  (784, 256)
Output shape: (10, 256)

Batched forward pass:  6.28 ms
Loop forward pass:     35.85 ms
Speedup:               5.7x

Max absolute difference (batched vs loop): 7.77e-16
# Inspect the cache — what gets stored for backprop
print("Cache contents (needed by backpropagation in ch306):")
for l, (A_prev, Z, A) in enumerate(cache):
    print(f"  Layer {l+1}: A_prev {A_prev.shape}, Z {Z.shape}, A {A.shape}")

# Activation statistics (healthy forward pass diagnostics)
fig, axes = plt.subplots(1, len(cache), figsize=(14, 3))
for l, (_, Z, A) in enumerate(cache):
    ax = axes[l]
    ax.hist(A.ravel(), bins=50, color='#3498db', alpha=0.75, edgecolor='white')
    ax.set_title(f'Layer {l+1} activations\nmean={A.mean():.3f}, std={A.std():.3f}')
    ax.set_xlabel('Activation value')
    if l == 0:
        ax.set_ylabel('Count')
plt.suptitle('Activation distributions across layers (He init + ReLU)', y=1.02)
plt.tight_layout()
plt.savefig('ch304_activations.png', dpi=120)
plt.show()

print("\nHealthy signs: roughly stable std across layers.")
print("Unhealthy: std collapsing to 0 (vanishing) or exploding to >10 (exploding gradients).")
Cache contents (needed by backpropagation in ch306):
  Layer 1: A_prev (784, 256), Z (256, 256), A (256, 256)
  Layer 2: A_prev (256, 256), Z (128, 256), A (128, 256)
  Layer 3: A_prev (128, 256), Z (10, 256), A (10, 256)
<Figure size 1400x300 with 3 Axes>

Healthy signs: roughly stable std across layers.
Unhealthy: std collapsing to 0 (vanishing) or exploding to >10 (exploding gradients).

2. Shape accounting — the discipline of forward passes

The single most common source of errors in implementing networks is shape mismatches. Every tensor in the forward pass has a precise shape. Track them explicitly:

| Layer | Operation | Input shape | Weight shape | Output shape |
|-------|-----------|-------------|--------------|--------------|
| 1     | Linear    | (784, B)    | (256, 784)   | (256, B)     |
| 1     | ReLU      | (256, B)    | –            | (256, B)     |
| 2     | Linear    | (256, B)    | (128, 256)   | (128, B)     |
| 2     | ReLU      | (128, B)    | –            | (128, B)     |
| 3     | Linear    | (128, B)    | (10, 128)    | (10, B)      |
| 3     | Sigmoid   | (10, B)     | –            | (10, B)      |

Rule: $W^{[l]}$ has shape $(n_l, n_{l-1})$. Multiplying $W^{[l]} A^{[l-1]}$ contracts the $n_{l-1}$ dimension. The batch dimension $B$ is never touched by the weights.
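This rule can be verified mechanically; a minimal sketch that walks the table's shapes through the network (weights are zero placeholders — only shapes matter here):

```python
import numpy as np

# Each layer's weight matrix is (n_l, n_{l-1}): W @ A contracts the
# fan-in dimension and leaves the batch dimension B untouched.
B = 32
shapes = [(256, 784), (128, 256), (10, 128)]  # (n_l, n_{l-1}) per layer

A = np.zeros((784, B))
for n_out, n_in in shapes:
    assert A.shape[0] == n_in, "fan-in must match previous layer's output"
    W = np.zeros((n_out, n_in))
    A = W @ A
    assert A.shape == (n_out, B)  # batch dimension preserved
print(A.shape)  # (10, 32)
```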


3. The cache: why we store intermediate values

The cache stores $(A^{[l-1]}, Z^{[l]}, A^{[l]})$ for every layer. During the backward pass (ch306):

  • $A^{[l-1]}$ is needed to compute $\partial \mathcal{L} / \partial W^{[l]}$

  • $Z^{[l]}$ is needed to compute the derivative of the activation, $\sigma'(Z^{[l]})$

Memory usage scales as $O(L \cdot B \cdot \max_l n_l)$. For large models and large batches, this becomes the primary memory constraint — not the parameter count.
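A back-of-the-envelope estimate for the network above makes this concrete. A rough sketch (float64 assumed; note that `A_prev` duplicates the previous layer's `A`, so a careful implementation would store it by reference rather than copy):

```python
# Rough cache-memory estimate for layer_sizes = [784, 256, 128, 10], B = 256.
layer_sizes = [784, 256, 128, 10]
B = 256
bytes_per_value = 8  # float64

# Per layer the cache holds A_prev, Z and A (counted naively here).
cache_values = sum(
    layer_sizes[l] * B          # A_prev
    + layer_sizes[l + 1] * B    # Z
    + layer_sizes[l + 1] * B    # A
    for l in range(len(layer_sizes) - 1)
)
# Parameters: one (n_l, n_{l-1}) weight matrix and one (n_l,) bias per layer.
param_values = sum(
    layer_sizes[l + 1] * layer_sizes[l] + layer_sizes[l + 1]
    for l in range(len(layer_sizes) - 1)
)
print(f"cache:  {cache_values * bytes_per_value / 1e6:.1f} MB")
print(f"params: {param_values * bytes_per_value / 1e6:.1f} MB")
```

Even at this small scale and batch size, the cache already outweighs the parameters — and the cache grows linearly with $B$ while the parameter count does not.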


4. Numerical considerations

The forward pass is where numerical problems originate:

  • Overflow in sigmoid: exp(-z) overflows for very negative z (the equivalent form exp(z)/(1+exp(z)) overflows for very positive z instead). Fix: clip z before exponentiation, or use log-sum-exp-style tricks that keep the exponent non-positive.

  • Dead ReLU: if z is always negative, relu(z) = 0 permanently. The gradient of ReLU is zero for negative inputs, so the neuron cannot recover. Fix: He initialisation (ch308), careful learning rate.

  • Activation saturation: sigmoid/tanh squash large inputs to flat regions where gradient ≈ 0. Fix: use ReLU variants in hidden layers.
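The clip inside `sigmoid()` above can be motivated with a quick experiment. A minimal sketch (this assumes NumPy's default error state, where overflow emits a RuntimeWarning and underflow is silent):

```python
import numpy as np
import warnings

# For z = -1000, exp(-z) = exp(1000) overflows float64 to inf and raises a
# RuntimeWarning. The final value 1/(1 + inf) = 0 happens to be right here,
# but the inf intermediate can propagate NaNs in longer expressions.
z = np.array([-1000.0, 0.0, 1000.0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    naive = 1.0 / (1.0 + np.exp(-z))
print(f"naive:   {naive}, warnings raised: {len(caught)}")

# Clipping keeps the exponent in range: exp(500) is finite in float64,
# and sigmoid is already 0 or 1 to double precision well before |z| = 500.
clipped = 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
print(f"clipped: {clipped}")
```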


5. Summary

  • The forward pass processes a full batch in one matrix operation per layer.

  • Shape tracking is mandatory: $W^{[l]}$ maps $(n_{l-1}, B)$ to $(n_l, B)$.

  • The cache stores intermediate activations for use during backpropagation.

  • Numerical stability requires careful handling of activations and initialisations.


6. Forward and backward references

Used here: matrix multiplication shapes (ch153–ch155), broadcasting (ch147), activation functions (ch064–065), multilayer structure (ch303).

This will reappear in ch306 — Backpropagation from Scratch, where the cache built here is consumed in reverse to compute every gradient needed for parameter updates.