Models & Researchbenchmarkingagent evaluationrdi berkeleyopen source

Agents' Last Exam launches economically focused agent benchmark

|June 15, 2026|By LDS Team

7.7

Relevance Score

Agents' Last Exam launches economically focused agent benchmark

Agents' Last Exam (ALE) matters less as a new leaderboard than as evidence that today's agent benchmarks may be measuring the wrong thing: frontier models pass only about 2.6 percent of tasks on ALE's hardest tier and roughly a quarter overall, and the ranking flips leaders from other benchmarks, with GPT-5.5 beating Claude Fable 5 on ALE despite Fable 5 leading SWE-Bench Pro and Humanity's Last Exam. Led by Berkeley RDI with 300+ industry experts and partners including Goldman Sachs, JPMorgan, and Adobe, ALE maps tasks to the U.S. O*NET/SOC occupational taxonomy across 55 sub-industries, grading real software workflows (3D modeling in Siemens NX, VFX in After Effects, neuroimaging in FSLeyes) with deterministic, artifact-based scoring rather than human judgment. The design responds directly to a companion Berkeley RDI audit showing eight prominent agent benchmarks, including SWE-bench and WebArena, could be scored near-100 percent by an agent that solved zero tasks. For practitioners, the message is that benchmark choice materially changes which model looks best, so evaluation for economically valuable deployments should use outcome-verified, task-specific tests rather than generic leaderboard rank.

Agents' Last Exam (ALE) is best understood as a rebuttal to benchmark inflation, not just another leaderboard. The same UC Berkeley RDI group behind ALE published a companion study in April 2026 showing that eight of the most cited agent benchmarks, including SWE-bench Verified, SWE-bench Pro, WebArena, and GAIA, could be scored at 73 to 100 percent by an automated exploit agent that solved zero underlying tasks, by trojanizing test binaries, reading leaked gold answers, or exploiting weak string matching. ALE's design choices, real OS sandboxes, deterministic artifact-based grading, and hidden reference answers, are a direct response to that failure mode, and its initial results are correspondingly sobering: across mainstream harness and model configurations, frontier agents average just 2.6 percent on ALE's hardest tier and roughly a quarter across all tiers combined.

What ALE tests

ALE, led by Berkeley RDI in partnership with the RDI Foundation, Snorkel AI, and more than 300 industry experts, maps evaluation tasks to the U.S. O*NET/SOC occupational taxonomy across 55 professional sub-industries grouped into 13 clusters, with a catalog of more than 1,500 tasks (147 released publicly on GitHub) toward a longer-term goal of 5,000. Rather than abstract question-answering, agents complete real professional software workflows: 3D modeling in Siemens NX, motion and VFX work in Adobe After Effects, scene setup in Unreal Engine, mold-flow simulation in Moldex3D, architectural modeling in Rhino, and brain-imaging segmentation in FSLeyes, graded against verifiable, artifact-based outcomes rather than human judgment. Academic partners span MIT, Harvard, Stanford, and Oxford, alongside industry contributors including Goldman Sachs, JPMorgan, Morgan Stanley, Adobe, and Oracle, per the project's own site.

The result that matters

VentureBeat reports that OpenAI's GPT-5.5, run through the Codex harness, took the top spot on ALE's leaderboard with a 24 percent pass rate, edging out Claude Fable 5 in third place at 22 percent, an upset given that Fable 5 leads GPT-5.5 on benchmarks like SWE-Bench Pro and Humanity's Last Exam. That reversal is itself a finding: it shows that relative model rankings are not stable across benchmark design, and a model optimized for one class of tasks, such as terminal-based coding, does not automatically transfer its edge to long-horizon, GUI-heavy professional workflows.

For practitioners

Teams selecting agents for economically valuable deployments, not coding-assistant use cases alone, should treat generic leaderboard position as a weak proxy and instead test candidate models directly against the workflow category they plan to automate. ALE's open ale_run toolkit and reference harnesses (the official Claude Code CLI and an in-tree OpenClaw harness) offer a template for that kind of outcome-verified, sandboxed evaluation, and its rolling six-month task refresh is designed to limit the answer-leakage and benchmark-memorization problems documented in the companion audit.

What to watch

Track whether additional frontier labs publish their own ALE runs, whether the public task pool grows toward the 5,000-task goal without leaking into training data, and whether ALE's deterministic graders hold up against the same categories of exploit, shared agent/evaluator environments, weak string matching, unsanitized LLM judges, that the RDI team used to break eight other benchmarks.

Key Points

1Berkeley RDI's Agents' Last Exam benchmark found frontier AI agents pass only 2.6 percent of tasks on its hardest tier, about 24 percent overall.
2ALE was built after RDI researchers showed eight major benchmarks, including SWE-bench, could score near 100 percent without solving any task.
3Benchmark choice changes rankings: GPT-5.5 beat Claude Fable 5 on ALE despite trailing it elsewhere, so practitioners should test on outcome-verified, real workflows.

Scoring Rationale

Verification confirms Berkeley RDI's legitimacy (fetched the official project site, GitHub repo, and companion benchmark-exploit research, all consistent), corrected the public task count to 147 (matching GitHub and Snorkel AI, versus a previously stated 150), and confirmed the GPT-5.5-over-Fable-5 leaderboard result via VentureBeat. The benchmark's credibility (300+ experts, named academic and industry partners, a rigorous anti-gaming design responding to documented benchmark exploits) and its sobering 2.6% hardest-tier pass rate support keeping the score at 7.7; broader field adoption, not yet observed, would be needed to justify moving into the industry-shaking tier.

MoreOpen-Source AI news

Sources

Public references used for this report.

11 sources

agents-last-exam.orgAI Agent Benchmark for Real-World Professional Workflows

arxiv.orgAgents' Last Exam

github.comGitHub - rdi-berkeley/agents-last-exam

View 8 more sources

Practice interview problems based on real data

1,625 SQL & Python problems across 15 industry datasets — the exact type of data you work with.

Try 250 free problems

Models & Researchbenchmarkingagent evaluationrdi berkeleyopen source