AI systems · Pro

Replace “it feels better” with evidence.

AI Evals: Test, Measure & Ship LLM Apps

Build an evaluation system with failure analysis, deterministic and semantic scorers, aligned LLM judges, uncertainty, regression tests, and release gates.

Course map

One sequence, from foundation to finish.

8 connected modules

What you will be able to do

Leave with capability, not just vocabulary.

Build a failure taxonomy and representative eval set

Write deterministic and semantic scorers

Measure and align an LLM-as-judge

Add uncertainty-aware regression tests and CI gates

Running example

The Helpwell assistant, evaluated across LLM, RAG, and agent-style failures using pre-generated (“baked”) outputs and live scoring code.

Prerequisites

Basic Python and familiarity with an LLM, RAG, or agent workflow.

Curriculum

Every module earns the next one.

Open any module to see its sections. Your progress and completed sections are tracked as you move through the course.

8 modules · ~9 hours

Module 1

Beyond Vibes: What an Eval Actually Is

Beginner · Free preview

Topics include Why vibes & benchmarks fail, Input set + good + scorer, The eval flywheel, and more.

1. Why Vibes Don’t Scale (and Benchmarks Lie for Your Product)
2. What an Eval Actually Is: Inputs, a Definition of Good, a Scorer
3. The Eval Flywheel: Analyze, Measure, Improve, Re-measure
4. Offline vs Online: a Closed Loop, Not a Choice
5. The Scorer Ladder: a Map of the Course

60 min · 5 sections

Module 2

Cheap & Trustworthy: Deterministic Scorers as Classifiers

Beginner · Pro

Topics include Exact / regex / normalize, Schema & value validity, Scorer = classifier, and more.

1. Deterministic Scorers: Exact, Regex, and Normalize First
2. Structured-Output Validity: Parse, Schema, Constraints
3. Your Scorer IS a Classifier: Confusion Matrix, Precision, Recall, F1
4. Why Accuracy Lies on Imbalanced Eval Sets
5. Picking the Operating Point: Which Error Is Expensive

70 min · 5 sections

Module 3

Words vs Meaning vs Running It: Reference, Semantic & Functional Scorers

Intermediate · Pro

Topics include ROUGE / BLEU / METEOR, Cosine & BERTScore, Threshold calibration, and more.

1. Reference-Overlap Metrics: ROUGE, BLEU, and Where They Mislead
2. Reference-Free vs Reference-Based: Which Regime per Task
3. Semantic Similarity: Cosine and BERTScore’s Token Match
4. Calibrating a Similarity Threshold: No Universal 0.8
5. Functional Checks and pass@k: Run It, Don’t Compare Text

75 min · 5 sections

Module 4

Look At Your Data: Error Analysis & Failure Taxonomies

Intermediate · Pro

Topics include Read your data, Open coding, Axial coding & counts, and more.

1. Error Analysis: Reading Your Data Is the Highest-ROI Activity
2. Open Coding: Free-Text Notes on What Went Wrong
3. Axial Coding: Group, Name, Count, Prioritize
4. Theoretical Saturation: When to Stop Reading
5. The Three Gulfs: Comprehension, Specification, Generalization

70 min · 5 sections

Module 5

Don’t Trust the Judge: Align It: LLM-as-Judge

Intermediate · Pro

Topics include When you need a judge, Binary beats 1–5, Align with kappa, and more.

1. When You Finally Need a Judge: the Last Rung
2. Binary Pass/Fail Beats 1–5
3. Align the Judge to Humans: Raw Agreement Lies, Use Kappa
4. Judge Biases: Verbosity, Position, Self-Enhancement
5. Iterate the Judge: Edit the Rubric, Re-measure Alignment

75 min · 5 sections

Module 6

Trustworthy Numbers: Noise, Confidence Intervals & Power

Advanced · Pro

Topics include Eval noise & small-n, Wilson interval, Bootstrap CIs, and more.

1. One Number Is a Coin Flip: Eval Noise & Small-n
2. The Wilson Interval for a Pass Rate
3. Bootstrap Confidence Intervals for Any Metric
4. What the Bootstrap Can and Can’t Do
5. Sample Size & Power: How Many Examples Is Enough

70 min · 5 sections

Module 7

Comparing Versions & Gating Regressions

Advanced · Pro

Topics include Paired comparison, McNemar’s test, Per-slice + corrections, and more.

1. Paired vs Unpaired: Same Eval Set Means Paired
2. McNemar’s Test: Only the Disagreements Decide
3. Permutation & Paired-Bootstrap: the Assumption-Free Comparison
4. Per-Slice Analysis & Multiple Comparisons
5. The CI Gate: Block a Deploy Only on a Real Regression

75 min · 5 sections

Module 8

The Whole System: RAG, Agent, Production & Benchmark Evals

Advanced · Pro

Topics include Retrieval metrics, RAG generation evals, Agent & pass^k, and more.

1. Retrieval Evals: Recall@k, Precision@k, MRR, nDCG
2. RAG Generation Evals: Faithfulness & Context Precision/Recall
3. Agent Evals: Outcome vs Trajectory, and pass^k Reliability
4. Online vs Offline: Monitoring, Drift, Guardrails, Cost
5. Benchmark Literacy & Assembling Your Own Harness

80 min · 5 sections

Who this course is for

Built for people who need to use the skill.

AI engineers shipping LLM applications

Quality and platform teams

Technical leaders defining release standards

Continue learning

Build on this course with a clear next step.

These are independent LDS courses that extend the closest skills. Choose the direction that matches what you want to build next.

8 modules

AI systems

Building RAG Systems & Vector Search

Build retrieval systems you can measure.

Intermediate to Advanced · ~9 hours

8 modules

AI systems

Building AI Agents

Build the machinery around an agent loop.

Intermediate to Advanced · ~9 hours

8 modules

AI systems

Production LLM Systems

Turn one LLM call into a dependable service.

Intermediate to Advanced · 8h 30m

Start the course

Begin with Beyond Vibes: What an Eval Actually Is.

The first module establishes the language and example used throughout the rest of the course.

