Part IX — Statistics and Data Science

Chapters 271–300


What This Part Covers

Parts I–VIII gave you the mathematical infrastructure: logic, numbers, functions, geometry, vectors, linear algebra, calculus, and probability. Every one of those tools was built for a reason. Part IX is where you find out what that reason is.

Statistics and data science are the disciplines that connect mathematics to reality — specifically, to reality that is noisy, incomplete, and uncertain. You will learn how to:

  • Describe and summarize data rigorously

  • Reason about populations from samples

  • Test hypotheses and quantify uncertainty

  • Build, evaluate, and critique predictive models

  • Understand the information content of data

  • Implement core ML algorithms from mathematical first principles

  • Scale computations to data that does not fit in memory

The final chapter (300) assembles everything into a complete, working end-to-end AI system.


The Mental Shift

In Parts VII and VIII, the world was clean. Probability distributions were exact. Integrals converged. Derivatives existed.

In Part IX, the world is not clean. You have a CSV with missing values, duplicates, and inconsistent units. You have 10,000 samples from a process you cannot fully observe. You have a model that performs perfectly on training data and fails in production.

The shift: stop asking “what is the true answer?” and start asking “what is the best answer I can infer, and how confident should I be?”

This requires combining:

  • Probability (Part VIII) — to model uncertainty formally

  • Linear algebra (Part VI) — to represent and transform data efficiently

  • Calculus (Part VII) — to optimize model parameters

  • Functions (Part III) — to express hypotheses about relationships

Statistics is not a separate discipline from what came before. It is what all those tools look like when applied to real data.
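
The question "what is the best answer I can infer, and how confident should I be?" can be made concrete with a small sketch. The example below assumes a hypothetical process with a known mean (chosen here only so the result can be checked), draws a sample, and reports the sample mean together with its standard error and an approximate 95% interval based on the normal critical value 1.96. It is an illustration of the mindset, not the treatment developed in the later chapters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical process: the "true" mean is 5.0, but in practice you never see it.
true_mean = 5.0
sample = rng.normal(loc=true_mean, scale=2.0, size=200)

# Best available answer: the sample mean.
estimate = sample.mean()

# How confident: the standard error of the mean shrinks like 1 / sqrt(n).
std_error = sample.std(ddof=1) / np.sqrt(sample.size)

# Approximate 95% interval using the normal critical value 1.96.
low = estimate - 1.96 * std_error
high = estimate + 1.96 * std_error

print(f"estimate = {estimate:.3f} +/- {1.96 * std_error:.3f}")
print(f"approx. 95% interval: [{low:.3f}, {high:.3f}]")
```

The point estimate alone answers the first half of the question; the interval width answers the second half.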


Map of Part IX

Data & Measurement (271)
        │
        ▼
Data Cleaning (272) ──────────────────────────────────────┐
        │                                                  │
        ▼                                                  ▼
Descriptive Statistics (273)                    Feature Engineering (291)
Visualization (274)                             Dimensionality Reduction (292)
Sampling (275)                                          │
        │                                               ▼
        ▼                                     Clustering (293)
Bias & Variance (276)                         Classification (294)
Hypothesis Testing (277)                               │
p-values (278)                                         ▼
Confidence Intervals (279)                   Neural Network Math Review (295)
Correlation (280)                            Optimization Methods (296)
        │                                    Large Scale Data (297)
        ▼                                             │
Regression (281)                                      ▼
Model Evaluation (282)                     Project: Full Pipeline (298)
Overfitting (283)                          Project: Mini ML Library (299)
Cross Validation (284)                              │
A/B Testing (285)                                   ▼
        │                              CAPSTONE: End-to-End AI (300)
        ▼
Bayesian Statistics (286)
Information Theory (287)
Entropy (288)
KL Divergence (289)
Mutual Information (290)

Prerequisites from Prior Parts

| Concept | Where it appears in Part IX | Introduced |
|---|---|---|
| Probability distributions | ch277–279, ch286 | Part VIII |
| Expected value & variance | ch273, ch276 | ch249–ch250 |
| Bayes’ theorem | ch286 | ch246 |
| Dot product, matrix multiply | ch280–281, ch292, ch294 | Parts V–VI |
| Gradient descent | ch281, ch295, ch296 | ch213 |
| SVD / PCA | ch292 | ch173–174 |
| Logarithms | ch287–290 | ch043 |
| Monte Carlo integration | ch285 | ch224 |
| Eigenvalues | ch292, ch295 | ch172 |

If any row in that table feels vague, return to the cited chapter before proceeding.


Libraries Used in Part IX

Per the project rules: implement from scratch first, validate with the library second.

| Library | Purpose |
|---|---|
| numpy | Numerical arrays throughout |
| matplotlib | Visualization |
| pandas | Data loading and cleaning |
| scipy.stats | Validation of statistical tests |
| sklearn | Validation of ML models |

Every core algorithm in chapters 271–299 is implemented in NumPy first. The library call appears only to verify correctness.
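
As a minimal sketch of the "from scratch first, validate second" pattern, the example below computes the Pearson correlation directly from its definition and then checks it against a library result. The function name `pearson_r` and the synthetic data are assumptions made for illustration. To stay dependency-light, the check uses `np.corrcoef` as a stand-in; `scipy.stats.pearsonr` would be the analogous call from the table above.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation from its definition: cov(x, y) / (std(x) * std(y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # centered x
    dy = y - y.mean()  # centered y
    # The 1/n factors in the covariance and the standard deviations cancel.
    return (dx @ dy) / np.sqrt((dx @ dx) * (dy @ dy))

# Synthetic data with a known positive linear relationship (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

r_scratch = pearson_r(x, y)          # step 1: from scratch
r_library = np.corrcoef(x, y)[0, 1]  # step 2: library call, used only as a check

assert np.isclose(r_scratch, r_library)
print(f"from scratch: {r_scratch:.6f}, library: {r_library:.6f}")
```

The library value never feeds back into the implementation; it only confirms that the hand-written version agrees.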


Motivating Problem: You Cannot Solve This Yet

The following cell generates a synthetic dataset and asks three questions. By the end of Part IX, you will be able to answer all three rigorously and implement the answers from scratch.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Dataset: 500 users, each with features and a label
n = 500
age          = rng.normal(35, 10, n).clip(18, 70)
session_time = 2.0 + 0.08 * age + rng.normal(0, 3, n)  # correlated with age
pages_viewed = rng.poisson(lam=4, size=n)
purchased    = ((0.3 * session_time + 0.1 * pages_viewed
                 - 0.02 * age + rng.normal(0, 1, n)) > 1.5).astype(int)

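# Inject 30 missing values (15 per feature) and 5 outliers.
# The labels above were computed from the clean values, so this only corrupts what you observe.
missing_idx = rng.choice(n, 30, replace=False)
session_time[missing_idx[:15]] = np.nan
pages_viewed = pages_viewed.astype(float)  # integer arrays cannot hold NaN
pages_viewed[missing_idx[15:]] = np.nan
session_time[rng.choice(n, 5, replace=False)] = rng.uniform(50, 100, 5)  # implausibly long sessions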

print(f"Dataset: {n} samples, {purchased.sum()} purchases ({purchased.mean():.1%} conversion)")
print(f"Missing session_time: {np.isnan(session_time).sum()}")
print(f"Missing pages_viewed: {np.isnan(pages_viewed).sum()}")
print()
print("Three questions you cannot yet answer rigorously:")
print()
print("Q1: Is session_time a statistically significant predictor of purchase?")
print("    (ch277–280: hypothesis testing, correlation)")
print()
print("Q2: Build a classifier. What is its true out-of-sample accuracy?")
print("    (ch281–284: regression, model evaluation, cross-validation)")
print()
print("Q3: Which feature carries the most information about the purchase label?")
print("    (ch287–290: information theory, mutual information)")
Dataset: 500 samples, 184 purchases (36.8% conversion)
Missing session_time: 15
Missing pages_viewed: 15

Three questions you cannot yet answer rigorously:

Q1: Is session_time a statistically significant predictor of purchase?
    (ch277–280: hypothesis testing, correlation)

Q2: Build a classifier. What is its true out-of-sample accuracy?
    (ch281–284: regression, model evaluation, cross-validation)

Q3: Which feature carries the most information about the purchase label?
    (ch287–290: information theory, mutual information)
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

for ax, (feat, label) in zip(axes, [
    (session_time, 'Session Time (min)'),
    (pages_viewed,  'Pages Viewed'),
    (age,           'Age'),
]):
    valid = ~np.isnan(feat)
    for val, color, lbl in [(0, '#4C72B0', 'No purchase'), (1, '#DD8452', 'Purchased')]:
        mask = valid & (purchased == val)
        ax.hist(feat[mask], bins=25, alpha=0.6, color=color, label=lbl, density=True)
    ax.set_xlabel(label)
    ax.set_ylabel('Density')
    ax.legend(fontsize=8)

fig.suptitle('Part IX Motivating Dataset — Can you predict purchase?', fontsize=13)
plt.tight_layout()
plt.show()

print("These distributions overlap. The signal is real but not obvious.")
print("Part IX gives you the tools to extract it rigorously.")
[Figure: three density histograms (Session Time (min), Pages Viewed, Age), each comparing No purchase against Purchased.]
These distributions overlap. The signal is real but not obvious.
Part IX gives you the tools to extract it rigorously.
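
To see why Q1 cannot be answered yet, consider the most naive approach: compare the mean of a feature between purchasers and non-purchasers, skipping missing values. The sketch below does exactly that on a tiny hand-made example. The function name `naive_group_gap` and the toy values are hypothetical, and the result is only a point estimate: it says nothing about statistical significance and would be distorted by outliers like the ones injected above.

```python
import numpy as np

def naive_group_gap(feature, label):
    """Mean of `feature` where label == 1 minus mean where label == 0, skipping NaNs."""
    feature = np.asarray(feature, dtype=float)
    label = np.asarray(label)
    observed = ~np.isnan(feature)  # keep only rows where the feature was recorded
    mean_pos = feature[observed & (label == 1)].mean()
    mean_neg = feature[observed & (label == 0)].mean()
    return mean_pos - mean_neg

# Tiny hand-made example (hypothetical values, not the dataset above).
toy_feature = np.array([4.0, 6.0, np.nan, 1.0, 3.0, np.nan])
toy_label = np.array([1, 1, 1, 0, 0, 0])

gap = naive_group_gap(toy_feature, toy_label)
print(f"mean gap = {gap:.2f}")  # (4 + 6) / 2 - (1 + 3) / 2 = 3.00
```

Calling `naive_group_gap(session_time, purchased)` on the arrays generated above would give a comparable number for the motivating dataset; deciding whether such a gap is more than noise is the subject of ch277–280.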

A Note on Rigor

Statistics is a field where sloppy thinking causes measurable real-world harm: bad drugs get approved, bad policies get enacted, bad models get deployed. The chapters ahead are not a survey. They are a systematic construction of the tools you need to think clearly under uncertainty.

Every formula has a derivation. Every algorithm has an implementation. Every claim can be verified computationally.

Proceed to ch271 — Data and Measurement.