Part VIII: Probability | Computational Mathematics for Programmers


1. The Kolmogorov Axioms

All of probability theory rests on three axioms, stated by Andrei Kolmogorov in 1933. Everything else is derived from them.

Given a sample space Ω and a collection of events:

Axiom 1 (Non-negativity): For any event E, P(E) ≥ 0

Axiom 2 (Normalization): P(Ω) = 1

Axiom 3 (Countable additivity): For mutually exclusive events E₁, E₂, ... (where Eᵢ ∩ Eⱼ = ∅ for i ≠ j):

$$P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$$

Every probability rule in this chapter is a theorem derivable from these three axioms. You don’t memorize rules — you derive them.

import numpy as np
from itertools import product

# Verify axioms hold for our two-dice space
Omega = list(product(range(1, 7), range(1, 7)))
n = len(Omega)

# Assign uniform probability
prob = {outcome: 1/n for outcome in Omega}

def P(event):
    """Probability of a set of outcomes."""
    return sum(prob[o] for o in event)

E_sum7 = frozenset(o for o in Omega if sum(o) == 7)
E_sum8 = frozenset(o for o in Omega if sum(o) == 8)

print("Axiom 1: P(E) >= 0")
print(f"  P(sum=7) = {P(E_sum7):.4f}  >= 0: {P(E_sum7) >= 0}")

print("\nAxiom 2: P(Ω) = 1")
print(f"  P(Ω) = {P(Omega):.4f}")

print("\nAxiom 3: P(E ∪ F) = P(E) + P(F) when E ∩ F = ∅")
print(f"  E_sum7 ∩ E_sum8 = {len(E_sum7 & E_sum8)} outcomes (mutually exclusive)")
print(f"  P(sum=7) + P(sum=8) = {P(E_sum7) + P(E_sum8):.4f}")
print(f"  P(sum=7 or sum=8) = {P(E_sum7 | E_sum8):.4f}")
Axiom 1: P(E) >= 0
  P(sum=7) = 0.1667  >= 0: True

Axiom 2: P(Ω) = 1
  P(Ω) = 1.0000

Axiom 3: P(E ∪ F) = P(E) + P(F) when E ∩ F = ∅
  E_sum7 ∩ E_sum8 = 0 outcomes (mutually exclusive)
  P(sum=7) + P(sum=8) = 0.3056
  P(sum=7 or sum=8) = 0.3056
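Two immediate consequences can already be checked with the same machinery: P(∅) = 0 (the empty sum) and monotonicity (E ⊆ F implies P(E) ≤ P(F)). A minimal sketch, rebuilding the uniform two-dice model so it runs standalone:

```python
from itertools import product

# Uniform model over two six-sided dice (same setup as above)
Omega = list(product(range(1, 7), range(1, 7)))
prob = {o: 1 / len(Omega) for o in Omega}

def P(event):
    return sum(prob[o] for o in event)

# P(empty set) = 0: summing over no outcomes gives 0
print(f"P(∅) = {P(frozenset()):.4f}")

# Monotonicity: {(6,6)} is a subset of {sum >= 11}, so its probability is no larger
E = frozenset({(6, 6)})
F = frozenset(o for o in Omega if sum(o) >= 11)
print(f"E ⊆ F: {E <= F},  P(E) = {P(E):.4f} <= P(F) = {P(F):.4f}")
```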

2. Derived Rules

Complement Rule

$$P(E^c) = 1 - P(E)$$

Proof: E and Eᶜ are mutually exclusive and E ∪ Eᶜ = Ω. By Axiom 3: P(E) + P(Eᶜ) = P(Ω) = 1.
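The proof can be checked numerically. A quick sketch on the two-dice space (rebuilt here so the snippet is self-contained):

```python
from itertools import product

Omega = list(product(range(1, 7), range(1, 7)))
prob = {o: 1 / len(Omega) for o in Omega}

def P(event):
    return sum(prob[o] for o in event)

E_sum7 = frozenset(o for o in Omega if sum(o) == 7)
E_comp = frozenset(Omega) - E_sum7  # the complement within Ω

print(f"P(sum=7)       = {P(E_sum7):.4f}")
print(f"1 - P(sum=7)   = {1 - P(E_sum7):.4f}")
print(f"P(complement)  = {P(E_comp):.4f}")
```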

Inclusion-Exclusion

For any two events (not necessarily mutually exclusive):

$$P(E \cup F) = P(E) + P(F) - P(E \cap F)$$

Motivation: Naively adding P(E) and P(F) double-counts outcomes in E ∩ F.

# Single die for clarity
Omega_d6 = set(range(1, 7))
p_d6 = {x: 1/6 for x in Omega_d6}

def P_d6(event):
    return sum(p_d6[o] for o in event if o in p_d6)

E_even = {2, 4, 6}
E_prime = {2, 3, 5}

# Inclusion-exclusion
lhs = P_d6(E_even | E_prime)
rhs = P_d6(E_even) + P_d6(E_prime) - P_d6(E_even & E_prime)

print("Inclusion-Exclusion: P(E ∪ F) = P(E) + P(F) - P(E ∩ F)")
print(f"  P(even) = {P_d6(E_even):.4f}")
print(f"  P(prime) = {P_d6(E_prime):.4f}")
print(f"  P(even ∩ prime) = P({{2}}) = {P_d6(E_even & E_prime):.4f}")
print(f"  LHS P(even ∪ prime) = {lhs:.4f}")
print(f"  RHS P(even) + P(prime) - P(even ∩ prime) = {rhs:.4f}")
print(f"  Match: {abs(lhs - rhs) < 1e-10}")
Inclusion-Exclusion: P(E ∪ F) = P(E) + P(F) - P(E ∩ F)
  P(even) = 0.5000
  P(prime) = 0.5000
  P(even ∩ prime) = P({2}) = 0.1667
  LHS P(even ∪ prime) = 0.8333
  RHS P(even) + P(prime) - P(even ∩ prime) = 0.8333
  Match: True

3. The Complement Rule in Code: A Decision Pattern

A common programming pattern: when an event like “at least one …” is hard to enumerate directly, compute 1 minus the probability of its complement (“none at all”). This is the complement rule applied strategically.

# Q: Roll 4 dice. What is the probability of getting at least one 6?
# Direct: enumerate all outcomes with at least one 6 — complex
# Complement: P(at least one 6) = 1 - P(no sixes)

p_no_six_one_die = 5/6
n_dice = 4

# Because the dice are independent, P(no six on all 4) = (5/6)^4
p_no_six_all = p_no_six_one_die ** n_dice
p_at_least_one_six = 1 - p_no_six_all

print(f"P(no 6 on 4 dice) = (5/6)^4 = {p_no_six_all:.6f}")
print(f"P(at least one 6) = 1 - (5/6)^4 = {p_at_least_one_six:.6f}")

# Verify by simulation
rng = np.random.default_rng(seed=0)
N = 1_000_000
rolls = rng.integers(1, 7, size=(N, 4))
at_least_one_six = np.any(rolls == 6, axis=1)
print(f"\nSimulation ({N:,} trials): P(at least one 6) = {at_least_one_six.mean():.6f}")
P(no 6 on 4 dice) = (5/6)^4 = 0.482253
P(at least one 6) = 1 - (5/6)^4 = 0.517747

Simulation (1,000,000 trials): P(at least one 6) = 0.517987

4. Independence

Two events E and F are independent if:

$$P(E \cap F) = P(E) \cdot P(F)$$

Intuitively: knowing E occurred gives no information about whether F occurred. Independence is a property of the probability assignment — not a logical or causal claim.

# Two dice: are 'die1 is even' and 'die2 is even' independent?
Omega2 = list(product(range(1, 7), range(1, 7)))
prob2 = {o: 1/36 for o in Omega2}

def P2(event):
    return sum(prob2[o] for o in event)

E_d1_even = frozenset(o for o in Omega2 if o[0] % 2 == 0)
E_d2_even = frozenset(o for o in Omega2 if o[1] % 2 == 0)

p_e = P2(E_d1_even)
p_f = P2(E_d2_even)
p_ef = P2(E_d1_even & E_d2_even)

print(f"P(d1 even) = {p_e:.4f}")
print(f"P(d2 even) = {p_f:.4f}")
print(f"P(d1 even AND d2 even) = {p_ef:.4f}")
print(f"P(d1 even) × P(d2 even) = {p_e * p_f:.4f}")
print(f"Independent: {abs(p_ef - p_e * p_f) < 1e-10}")
P(d1 even) = 0.5000
P(d2 even) = 0.5000
P(d1 even AND d2 even) = 0.2500
P(d1 even) × P(d2 even) = 0.2500
Independent: True
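For contrast, a pair of events that fails the product test: “die 1 shows 6” and “the sum is 12”. Knowing the sum is 12 forces die 1 to be 6, so the events are strongly dependent. A sketch using the same 36-outcome model (rebuilt so it runs standalone):

```python
from itertools import product

Omega2 = list(product(range(1, 7), range(1, 7)))
prob2 = {o: 1/36 for o in Omega2}

def P2(event):
    return sum(prob2[o] for o in event)

E = frozenset(o for o in Omega2 if o[0] == 6)      # die 1 is 6
F = frozenset(o for o in Omega2 if sum(o) == 12)   # sum is 12

print(f"P(E)P(F) = {P2(E) * P2(F):.4f}")   # (1/6)(1/36) = 1/216 ≈ 0.0046
print(f"P(E ∩ F) = {P2(E & F):.4f}")       # 1/36 ≈ 0.0278, not equal
print(f"Independent: {abs(P2(E & F) - P2(E) * P2(F)) < 1e-10}")
```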

5. Partition Rule (Law of Total Probability — Preview)

If {B₁, B₂, ..., Bₙ} partition Ω (mutually exclusive, collectively exhaustive), then:

$$P(A) = \sum_{i=1}^{n} P(A \cap B_i)$$

This decomposes a complex event into simpler pieces. The full form, using conditional probability, appears in ch245.

# Partition Ω_d6 into odds and evens, compute P(prime) via partition
B_odd = {1, 3, 5}   # partition piece 1
B_even = {2, 4, 6}  # partition piece 2
A_prime = {2, 3, 5}

# Verify partition
assert B_odd | B_even == Omega_d6
assert B_odd & B_even == set()

p_A = P_d6(A_prime)
p_A_via_partition = P_d6(A_prime & B_odd) + P_d6(A_prime & B_even)

print(f"P(prime) directly = {p_A:.4f}")
print(f"P(prime ∩ odd) + P(prime ∩ even) = {p_A_via_partition:.4f}")
print(f"Match: {abs(p_A - p_A_via_partition) < 1e-10}")
P(prime) directly = 0.5000
P(prime ∩ odd) + P(prime ∩ even) = 0.5000
Match: True

6. Summary

  • All probability theory derives from three Kolmogorov axioms: non-negativity, normalization, countable additivity.

  • Complement rule: P(Eᶜ) = 1 − P(E). Use it to simplify “at least one” problems.

  • Inclusion-exclusion: P(E ∪ F) = P(E) + P(F) − P(E ∩ F). Subtract the double-counted intersection.

  • Independence: P(E ∩ F) = P(E)·P(F). A property of the probability model, not of physical causation.


7. Forward References

Conditional probability in ch245 extends these rules to the case where partial information changes the probability assignment. Bayes’ theorem in ch246 combines the conditional probability rule with the partition rule to derive posterior probabilities — the engine of Bayesian reasoning in ML.