
1. When data is a graph

Not all data lives in a grid (images) or sequence (text). Molecules, social networks, knowledge bases, and citation graphs are naturally represented as graphs.

A graph $G = (V, E)$ has nodes $v_i$ with features $h_i$ and edges $(v_i, v_j)$ with optional features $e_{ij}$. The key challenge: define neural operations that respect graph structure.
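As a concrete illustration, a small undirected graph can be held as an adjacency matrix plus a node-feature matrix (a minimal sketch; the 3-node path graph and 2-dim features are made up for this example):

```python
import numpy as np

# A toy 3-node graph: undirected edges (0,1) and (1,2).
# Nodes carry 2-dim feature vectors; edge features are optional and omitted here.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # adjacency matrix (symmetric)
H = np.array([[1.0, 0.0],                # h_0
              [0.0, 1.0],                # h_1
              [1.0, 1.0]])               # h_2

# The neighbourhood N(i) reads straight off row i of the adjacency matrix
neighbours_of_1 = np.nonzero(A[1])[0]
print(neighbours_of_1)  # [0 2]
```

Everything that follows (GCN, GAT) operates on exactly these two matrices.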

(Graph embedding: ch195. Matrix operations: ch151.)


2. Message Passing Framework

All major GNN variants fit the message passing paradigm:

$$h_i^{(l+1)} = \text{UPDATE}\left(h_i^{(l)},\; \text{AGGREGATE}\left(\{\text{MSG}(h_i^{(l)}, h_j^{(l)}, e_{ij}) : j \in \mathcal{N}(i)\}\right)\right)$$
  1. MSG: each neighbour sends a message to $v_i$.

  2. AGGREGATE: combine all incoming messages (sum, mean, max).

  3. UPDATE: update $v_i$’s representation given its old state and aggregated messages.

After $L$ rounds of message passing, each node’s representation incorporates information from its $L$-hop neighbourhood.
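The three steps above can be sketched in a few lines of NumPy. This is a minimal, untrained example: MSG is the identity, AGGREGATE is a sum (computed as `A @ H`), and UPDATE is a linear map of the old state and the aggregated messages; the weight shapes and the star graph are assumptions for illustration.

```python
import numpy as np

def message_passing_round(H, A, W_self, W_neigh):
    """One generic round: MSG = identity, AGGREGATE = sum over
    neighbours, UPDATE = nonlinearity of (own state, aggregated messages)."""
    M = A @ H                                   # row i sums features of N(i)
    return np.tanh(H @ W_self + M @ W_neigh)    # UPDATE

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)          # star graph centred on node 0
H = rng.normal(size=(3, 4))                     # 4-dim node features
W_self = rng.normal(size=(4, 4))
W_neigh = rng.normal(size=(4, 4))

H1 = message_passing_round(H, A, W_self, W_neigh)   # 1-hop information
H2 = message_passing_round(H1, A, W_self, W_neigh)  # 2 rounds -> 2-hop information
```

After the second round, nodes 1 and 2 have (indirectly, via node 0) seen each other's features, even though no edge connects them.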


3. Graph Convolutional Network (GCN)

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

where $\tilde{A} = A + I$ (adjacency plus self-loops) and $\tilde{D}$ is the degree matrix of $\tilde{A}$.

The symmetric normalisation $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is a degree-normalised averaging of neighbour features. Each layer is a linear transformation of the averaged neighbourhood.

(Eigendecomposition motivation: ch175. Adjacency matrix: ch195.)

import numpy as np
import matplotlib.pyplot as plt


def relu(z): return np.maximum(0, z)
def sigmoid(z): return 1/(1+np.exp(-np.clip(z,-500,500)))


class GCNLayer:
    """Single Graph Convolutional Network layer."""

    def __init__(self, in_features: int, out_features: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        scale = np.sqrt(2.0 / in_features)
        self.W = rng.normal(0, scale, (in_features, out_features))
        self.b = np.zeros(out_features)

    def _normalise_adjacency(self, A: np.ndarray) -> np.ndarray:
        """Compute D^{-1/2} (A+I) D^{-1/2}."""
        A_hat = A + np.eye(A.shape[0])
        D = np.diag(A_hat.sum(axis=1))
        D_inv_sqrt = np.diag(1.0 / (np.sqrt(D.diagonal()) + 1e-8))
        return D_inv_sqrt @ A_hat @ D_inv_sqrt

    def forward(self, H: np.ndarray, A: np.ndarray,
                act: bool = True) -> np.ndarray:
        """H: (N, in_features). A: (N,N). Returns (N, out_features)."""
        A_norm = self._normalise_adjacency(A)
        out = A_norm @ H @ self.W + self.b
        return relu(out) if act else out


# Build a small graph (Karate Club-style structure)
rng = np.random.default_rng(42)
N = 20
# Two communities (0-9, 10-19) with sparse cross-community edges
A = np.zeros((N, N))
for i in range(10):
    for j in range(i+1, 10):
        if rng.random() < 0.4:
            A[i,j] = A[j,i] = 1.0
for i in range(10, N):
    for j in range(i+1, N):
        if rng.random() < 0.4:
            A[i,j] = A[j,i] = 1.0
# Few cross-community edges
for _ in range(3):
    i, j = rng.integers(0,10), rng.integers(10,N)
    A[i,j] = A[j,i] = 1.0

# Node features: random 8-dim
H0 = rng.normal(0, 1, (N, 8))
layer_label = np.array([0]*10 + [1]*10)  # community label

# Two GCN layers
gcn1 = GCNLayer(8, 16, seed=0)
gcn2 = GCNLayer(16, 2, seed=1)

H1 = gcn1.forward(H0, A, act=True)
H2 = gcn2.forward(H1, A, act=False)  # 2D output for visualisation

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Draw graph
def draw_graph(ax, A, colours, title):
    pos = {}
    for i in range(N):
        angle = 2 * np.pi * i / N
        r = 1.5 if i < 10 else 1.0
        pos[i] = (r * np.cos(angle + np.pi*int(i>=10)), r * np.sin(angle + np.pi*int(i>=10)))
    for i in range(N):
        for j in range(i+1, N):
            if A[i,j] > 0:
                ax.plot([pos[i][0], pos[j][0]], [pos[i][1], pos[j][1]], 'gray', lw=0.7, alpha=0.5)
    for i in range(N):
        ax.scatter(*pos[i], color=colours[i], s=120, zorder=5, edgecolors='black', lw=1)
    ax.set_aspect('equal'); ax.set_title(title); ax.axis('off')

node_colors = ['#e74c3c' if l==0 else '#3498db' for l in layer_label]
draw_graph(axes[0], A, node_colors, 'Graph (2 communities)')

axes[1].scatter(H1[:10, 0], H1[:10, 1], color='#e74c3c', s=60, label='Community 0')
axes[1].scatter(H1[10:, 0], H1[10:, 1], color='#3498db', s=60, label='Community 1')
axes[1].set_title('After GCN layer 1 (16-dim → 2 shown)'); axes[1].legend()

axes[2].scatter(H2[:10, 0], H2[:10, 1], color='#e74c3c', s=60, label='Community 0')
axes[2].scatter(H2[10:, 0], H2[10:, 1], color='#3498db', s=60, label='Community 1')
axes[2].set_title('After GCN layer 2 (2-dim output)'); axes[2].legend()

plt.suptitle('GCN: message passing separates community structure', fontsize=11)
plt.tight_layout()
plt.savefig('ch328_gnn.png', dpi=120)
plt.show()

4. Graph Attention Networks (GAT)

Instead of fixed normalisation weights, GAT learns attention coefficients:

$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(\text{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_k]\right)\right)}$$

This allows the model to attend differently to different neighbours, capturing heterogeneous relationship strengths.
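The coefficients above can be computed directly for a single attention head. A minimal NumPy sketch, assuming random (untrained) parameters $W$ and $a$ and self-loops in the neighbourhood, as in the original GAT formulation; the split of $a$ into source and target halves is an equivalent way of evaluating $a^\top [W h_i \| W h_j]$ for all pairs at once:

```python
import numpy as np

def gat_attention(H, A, W, a, leaky_slope=0.2):
    """Attention coefficients alpha_ij for one GAT head (toy parameters)."""
    Z = H @ W                                   # (N, F') transformed features
    N, Fp = Z.shape
    # e_ij = LeakyReLU(a^T [z_i || z_j]) for all pairs, via a = [a_src ; a_dst]
    e = (Z @ a[:Fp])[:, None] + (Z @ a[Fp:])[None, :]
    e = np.where(e > 0, e, leaky_slope * e)     # LeakyReLU
    # restrict the softmax to N(i) plus a self-loop
    mask = A + np.eye(N)
    e = np.where(mask > 0, e, -1e9)
    e = e - e.max(axis=1, keepdims=True)        # numerical stability
    alpha = np.exp(e) * (mask > 0)
    return alpha / alpha.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))
alpha = gat_attention(H, A, rng.normal(size=(4, 8)), rng.normal(size=(16,)))
print(alpha.sum(axis=1))  # each row sums to 1 over N(i) plus the self-loop
```

Each row of `alpha` is a probability distribution over that node's neighbourhood, replacing GCN's fixed $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ weights with learned ones.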


5. Applications

| Domain | Task | GNN variant |
| --- | --- | --- |
| Drug discovery | Molecular property prediction | MPNN, GCN |
| Social networks | Community detection, link prediction | GraphSAGE, GAT |
| Knowledge graphs | Entity classification, relation prediction | R-GCN |
| Recommendation | Item–user interaction graphs | LightGCN |
| Physics simulation | Particle/rigid-body dynamics | GNS |

6. Summary

  • GNNs process graph-structured data via message passing: aggregate neighbour features, update node state.

  • GCN: symmetric normalised adjacency times feature matrix — simple, effective.

  • GAT: learned attention weights over neighbours — flexible, more expressive.

  • $L$ layers → $L$-hop neighbourhood aggregation.


7. Forward and backward references

Used here: graph embeddings (ch195), adjacency matrix (ch195), attention (ch321), layer normalisation (ch310).

This will reappear in ch340 — Capstone II, where graph structure is used to represent relationships between data features.