A module-by-module concept outline. Open the course to learn each topic with animated explanations, in-browser code, practice challenges, and a knowledge check.

Module 1. Tokens — The Words a Model Sees

Topics

BPE, Vocabulary, Glitch tokens, Tokenizer comparison

Sections

1. Why Tokens Exist
2. Byte-Pair Encoding (BPE) — The Algorithm
3. Vocabulary, Special Tokens, and the Pre-Tokenizer
4. Glitch Tokens and Tokenizer Fragility
5. Tokenizer Choice in 2026 — Tiktoken vs SentencePiece vs Llama

Module 2. Embeddings & Positional Encoding

Topics

Word embeddings, Sinusoidal positions, RoPE, ALiBi, NoPE

Sections

1. Tokens Become Vectors
2. The Embedding Table — A Lookup, Not a Computation
3. Why Order Matters — The Need for Positional Encoding
4. Sinusoidal Positions — The Original Trick
5. Rotary Position Embedding (RoPE) — The Modern Default

Module 3. Attention is Information Routing

Topics

Q/K/V, Scaled dot-product, Multi-head, Causal mask, Residual stream

Sections

1. The Intuition — Tokens Reading From Each Other
2. Q, K, V — Three Lenses on the Same Vector
3. The Scaled Dot-Product Attention Formula
4. Multi-Head Attention — Subspace Specialization
5. Causal Masking and the Residual Stream

Module 4. The Transformer Block

Topics

LayerNorm vs RMSNorm, SwiGLU FFN, Pre-norm, GQA / MLA, Mixture of Experts

Sections

1. Anatomy of One Block
2. Normalization — LayerNorm, RMSNorm, and Why Order Matters
3. The Feed-Forward Network — GELU, SwiGLU, and Width
4. Inference-Efficient Attention — MQA, GQA, and MLA
5. Mixture of Experts — Routing for Scale

Module 5. Training at Scale

Topics

Next-token loss, AdamW, Scaling laws, FSDP / ZeRO, Mixed precision

Sections

1. Next-Token Prediction and Cross-Entropy Loss
2. Optimization — AdamW, LR Schedules, Gradient Clipping
3. Scaling Laws — Kaplan to Chinchilla and Beyond
4. 2026 Reality — Deliberate Over-Training for Inference Economics
5. Distributed Training — FSDP / ZeRO and Mixed Precision

Module 6. Inference & Decoding

Topics

Greedy vs sampling, Temperature, Top-k / top-p / min-p, KV cache, Test-time compute

Sections

1. From Logits to Tokens — The Inference Loop
2. Sampling Strategies — Temperature, Top-k, Top-p, Min-p
3. The KV Cache — Why Inference is Memory-Bound
4. Speculative Decoding and Other Production Tricks
5. Test-Time Compute — Sequential CoT and Best-of-N

Module 7. Post-Training — SFT, DPO, GRPO

Topics

SFT, DPO, GRPO / RLVR, Reasoning models, o1 / R1

Sections

1. Why Pre-Training Is Not Enough
2. Supervised Fine-Tuning (SFT) — Instruction Following
3. Direct Preference Optimization (DPO) — RLHF Without the RL
4. Group Relative Policy Optimization (GRPO) and RLVR
5. Reasoning Models — o1, R1, and Emergent Chain-of-Thought

Module 8. Looking Inside & The Frontier

Topics

Mechanistic interpretability, Induction heads, Sparse autoencoders, Evaluation, Frontier

Sections

1. The Residual Stream as Shared Memory
2. Induction Heads — The First Real Circuit
3. Sparse Autoencoders and Monosemantic Features
4. Evaluation — MMLU, HumanEval, lm-eval-harness, Holistic Eval
5. The Frontier — Mamba, World Models, and What Comes Next

LLM Foundations
