LLM Foundations

How large language models actually work — tokenization, embeddings and positional encoding, attention, the transformer block, training, inference, and interpretability.

8 modules · Free with a free account.

What this course covers

A module-by-module concept outline. Open the course to learn each topic with animated explanations, in-browser code, practice challenges, and a knowledge check.

Module 1. Tokens — The Words a Model Sees

Topics
BPE · Vocabulary · Glitch tokens · Tokenizer comparison
Sections
  1. Why Tokens Exist
  2. Byte-Pair Encoding (BPE) — The Algorithm
  3. Vocabulary, Special Tokens, and the Pre-Tokenizer
  4. Glitch Tokens and Tokenizer Fragility
  5. Tokenizer Choice in 2026 — Tiktoken vs SentencePiece vs Llama

Module 2. Embeddings & Positional Encoding

Topics
Word embeddings · Sinusoidal positions · RoPE · ALiBi · NoPE
Sections
  1. Tokens Become Vectors
  2. The Embedding Table — A Lookup, Not a Computation
  3. Why Order Matters — The Need for Positional Encoding
  4. Sinusoidal Positions — The Original Trick
  5. Rotary Position Embedding (RoPE) — The Modern Default

Module 3. Attention is Information Routing

Topics
Q/K/V · Scaled dot-product · Multi-head · Causal mask · Residual stream
Sections
  1. The Intuition — Tokens Reading From Each Other
  2. Q, K, V — Three Lenses on the Same Vector
  3. The Scaled Dot-Product Attention Formula
  4. Multi-Head Attention — Subspace Specialization
  5. Causal Masking and the Residual Stream

Module 4. The Transformer Block

Topics
LayerNorm vs RMSNorm · SwiGLU FFN · Pre-norm · GQA / MLA · Mixture of Experts
Sections
  1. Anatomy of One Block
  2. Normalization — LayerNorm, RMSNorm, and Why Order Matters
  3. The Feed-Forward Network — GELU, SwiGLU, and Width
  4. Inference-Efficient Attention — MQA, GQA, and MLA
  5. Mixture of Experts — Routing for Scale

Module 5. Training at Scale

Topics
Next-token loss · AdamW · Scaling laws · FSDP / ZeRO · Mixed precision
Sections
  1. Next-Token Prediction and Cross-Entropy Loss
  2. Optimization — AdamW, LR Schedules, Gradient Clipping
  3. Scaling Laws — Kaplan to Chinchilla and Beyond
  4. 2026 Reality — Deliberate Over-Training for Inference Economics
  5. Distributed Training — FSDP / ZeRO and Mixed Precision

Module 6. Inference & Decoding

Topics
Greedy vs sampling · Temperature · Top-k / top-p / min-p · KV cache · Test-time compute
Sections
  1. From Logits to Tokens — The Inference Loop
  2. Sampling Strategies — Temperature, Top-k, Top-p, Min-p
  3. The KV Cache — Why Inference is Memory-Bound
  4. Speculative Decoding and Other Production Tricks
  5. Test-Time Compute — Sequential CoT and Best-of-N

Module 7. Post-Training — SFT, DPO, GRPO

Topics
SFT · DPO · GRPO / RLVR · Reasoning models · o1 / R1
Sections
  1. Why Pre-Training Is Not Enough
  2. Supervised Fine-Tuning (SFT) — Instruction Following
  3. Direct Preference Optimization (DPO) — RLHF Without the RL
  4. Group Relative Policy Optimization (GRPO) and RLVR
  5. Reasoning Models — o1, R1, and Emergent Chain-of-Thought

Module 8. Looking Inside & The Frontier

Topics
Mechanistic interpretability · Induction heads · Sparse autoencoders · Evaluation · Frontier
Sections
  1. The Residual Stream as Shared Memory
  2. Induction Heads — The First Real Circuit
  3. Sparse Autoencoders and Monosemantic Features
  4. Evaluation — MMLU, HumanEval, lm-eval-harness, Holistic Eval
  5. The Frontier — Mamba, World Models, and What Comes Next
