
1. The position problem

Self-attention is permutation equivariant: if you shuffle the input tokens, the output shuffles identically. This means the Transformer has no built-in notion of order.

For sequences, position matters: “dog bites man” ≠ “man bites dog”. Without positional information, the model cannot distinguish them.

Positional encoding injects position information into the token representations before they enter the Transformer.
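The permutation equivariance claim can be checked numerically. Below is a minimal single-head attention sketch (random projection matrices standing in for learned weights, a hypothetical setup for illustration): permuting the input rows permutes the output rows identically.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention (no masking)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)     # row-wise softmax
    return A @ V

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(T)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling inputs shuffles outputs the same way: attention sees no order.
print(np.allclose(out_perm, out[perm]))  # True
```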


2. Sinusoidal positional encoding (Vaswani et al., 2017)

$$
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

Each dimension uses a different frequency: low-index dimensions have high frequency (change rapidly with position), high-index dimensions have low frequency (change slowly). This creates a unique “fingerprint” for each position.

Key property: $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, so the model can learn to attend to relative positions.

(Sine and cosine: ch109. Fourier series intuition: ch115.)

import numpy as np
import matplotlib.pyplot as plt


def sinusoidal_positional_encoding(T: int, d_model: int) -> np.ndarray:
    """Returns PE matrix of shape (T, d_model)."""
    PE = np.zeros((T, d_model))
    pos = np.arange(T)[:, None]           # (T, 1)
    i   = np.arange(0, d_model, 2)[None, :]  # (1, d_model//2)
    div_term = 10000 ** (i / d_model)    # (1, d_model//2)

    PE[:, 0::2] = np.sin(pos / div_term)  # even indices
    PE[:, 1::2] = np.cos(pos / div_term)  # odd indices
    return PE


T, d_model = 64, 128
PE = sinusoidal_positional_encoding(T, d_model)

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Full heatmap
im = axes[0].imshow(PE, cmap='RdBu_r', aspect='auto', vmin=-1, vmax=1)
axes[0].set_title(f'Sinusoidal PE ({T}×{d_model})')
axes[0].set_xlabel('Dimension'); axes[0].set_ylabel('Position')
plt.colorbar(im, ax=axes[0])

# First few dimensions
for dim in [0, 2, 10, 30]:
    axes[1].plot(PE[:, dim], label=f'dim {dim}', lw=1.5)
axes[1].set_title('Individual PE dimensions\n(different frequencies)')
axes[1].set_xlabel('Position'); axes[1].set_ylabel('Value')
axes[1].legend(fontsize=8)

# Cosine similarity between positions (shows relative structure)
norms = np.linalg.norm(PE, axis=1, keepdims=True)
PE_norm = PE / (norms + 1e-10)
sim = PE_norm @ PE_norm.T
im2 = axes[2].imshow(sim, cmap='viridis', aspect='auto')
axes[2].set_title('Cosine similarity between PE vectors\n(nearby positions are more similar)')
axes[2].set_xlabel('Position'); axes[2].set_ylabel('Position')
plt.colorbar(im2, ax=axes[2])

plt.tight_layout()
plt.savefig('ch323_positional_encoding.png', dpi=120)
plt.show()

print("PE shape:", PE.shape)
print("Min:", PE.min(), "  Max:", PE.max())
print("Nearby positions are more similar (as expected from sinusoidal structure):")
for offset in [1, 4, 16, 32]:
    sims = [sim[p, p+offset] for p in range(T-offset)]
    print(f"  Average cos sim at offset {offset:2d}: {np.mean(sims):.3f}")
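The linear-function property mentioned above can be verified directly: for each frequency, the (sin, cos) pair at position $pos+k$ is a fixed 2×2 rotation of the pair at position $pos$, so a single block-diagonal matrix $M_k$ maps $PE_{pos}$ to $PE_{pos+k}$ for every position. A self-contained check:

```python
import numpy as np

T, d, k = 32, 16, 5
pos = np.arange(T)[:, None]
freq = 1.0 / 10000 ** (np.arange(0, d, 2) / d)   # (d//2,) per-pair frequencies
PE = np.zeros((T, d))
PE[:, 0::2] = np.sin(pos * freq)
PE[:, 1::2] = np.cos(pos * freq)

# Block-diagonal linear map M_k: one 2x2 rotation per frequency, angle k*w
M = np.zeros((d, d))
for j, w in enumerate(freq):
    c, s = np.cos(k * w), np.sin(k * w)
    M[2*j:2*j+2, 2*j:2*j+2] = [[c, s], [-s, c]]

# PE[pos + k] == M_k @ PE[pos] at every position, for this fixed offset k
print(np.allclose(PE[k:], PE[:-k] @ M.T))  # True
```

Crucially, $M_k$ depends only on the offset $k$, not on the position, which is what makes relative attention patterns learnable.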
# Learned positional embeddings (GPT-style)
class LearnedPositionalEmbedding:
    """Position embeddings as a simple lookup table — learned during training."""

    def __init__(self, max_len: int, d_model: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Initialise as small random vectors; will be learned
        self.PE = rng.normal(0, 0.02, (max_len, d_model))
        self.max_len = max_len

    def forward(self, T: int) -> np.ndarray:
        assert T <= self.max_len, f"Sequence length {T} exceeds max_len {self.max_len}"
        return self.PE[:T]


# Add positional encoding to token embeddings
rng = np.random.default_rng(0)
T_demo, d = 10, 32
token_embeds = rng.normal(0, 1, (T_demo, d))  # from embedding table (ch320)

pe_fixed   = sinusoidal_positional_encoding(T_demo, d)
lpe        = LearnedPositionalEmbedding(512, d)
pe_learned = lpe.forward(T_demo)

x_fixed   = token_embeds + pe_fixed
x_learned = token_embeds + pe_learned

print("Token embeddings + positional encoding:")
print(f"  Token embeds std:   {token_embeds.std():.3f}")
print(f"  Sinusoidal PE std:  {pe_fixed.std():.3f}")
print(f"  Learned PE std:     {pe_learned.std():.3f}")
print(f"  Combined std:       {x_fixed.std():.3f}")
print()
print("Sinusoidal PE: fixed, no parameters, generalises to longer sequences.")
print("Learned PE:    flexible, adds T_max * d_model parameters, cannot generalise beyond max_len.")
print("Modern LLMs use Rotary PE (RoPE) or ALiBi — see extension notes below.")

3. Rotary Positional Encoding (RoPE)

RoPE (Su et al., 2021) encodes position by rotating query and key vectors before the attention dot product. The rotation angle depends on position and dimension frequency.

Key advantage: relative positions emerge naturally from the dot-product geometry: $q_m \cdot k_n$ depends only on $m - n$, not on $m$ and $n$ separately.

Used in LLaMA, Mistral, GPT-NeoX, and most modern open-source LLMs.


4. ALiBi

ALiBi (Press et al., 2022) adds a bias to attention logits proportional to key-query distance:

$$
\text{score}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} - m\,|i - j|
$$

No additional parameters; generalises well beyond training length.
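A sketch of the bias term using the paper's geometric slope schedule $m_h = 2^{-8h/n_{\text{heads}}}$ (symmetric $|i-j|$ form as in the formula above; causal implementations additionally mask $j > i$):

```python
import numpy as np

def alibi_bias(T: int, n_heads: int) -> np.ndarray:
    """Per-head linear distance penalties, shape (n_heads, T, T).
    Slopes: geometric sequence m_h = 2^(-8h / n_heads), h = 1..n_heads."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    pos = np.arange(T)
    dist = np.abs(pos[:, None] - pos[None, :])   # |i - j|
    return -slopes[:, None, None] * dist[None, :, :]

bias = alibi_bias(T=6, n_heads=4)
print(bias.shape)   # (4, 6, 6)
print(bias[0])      # head with the steepest slope
```

The bias is simply added to the attention logits in every layer; nothing is learned, which is why ALiBi extrapolates gracefully to sequences longer than those seen in training.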


5. Summary

  • Self-attention is position-agnostic; positional encoding injects order information.

  • Sinusoidal PE: unique, theoretically motivated, generalises to unseen lengths.

  • Learned PE: flexible, limited to trained max length.

  • RoPE: rotates Q/K vectors; best relative position generalisation; current standard.

  • Sinusoidal and learned PE are added to the token embeddings before the first Transformer layer; RoPE and ALiBi instead act inside attention at every layer (rotating Q/K, or biasing the logits).


6. Forward and backward references

Used here: sine/cosine (ch109), embeddings (ch320), self-attention (ch321), Transformer block (ch322).

This will reappear in ch337 — Project: Transformer Block from Scratch, where sinusoidal PE is implemented and added to token embeddings before the attention layers.