Agents' Last Exam launches economically focused agent benchmark

Agents' Last Exam (ALE), led by Berkeley RDI with contributions from 300+ industry experts, is a living benchmark that measures AI agents on long-horizon, economically valuable tasks rather than abstract proxy problems, the project website and arXiv paper state. ALE currently covers 55 sub-industries and a public corpus of 1,500+ tasks toward a 5,000-task goal, according to the project site and coverage on GitHub. The open-source ale_run toolkit on GitHub exposes a sandboxed execution environment and 150 public reference tasks, per the repository. The arXiv paper reports that frontier agent configurations pass only 2.6% on the hardest "last-exam" tier. Project materials and external audits motivate deterministic graders, real OS sandboxes, and artifact-based scoring as defenses against score gaming.
What happened
Agents' Last Exam (ALE) is a new, open benchmark and evaluation framework for AI agents that prioritizes real, economically valuable professional workflows over narrow academic proxies, the project website and the arXiv paper (arXiv:2606.05405) report. The project is co-led by Berkeley RDI and a network of 300+ industry experts, and the public materials state coverage across 55 sub-industries with a current catalog of 1,500+ tasks and a longer-term target of 5,000 tasks. The GitHub repository provides the ale_run toolkit, a sandbox provisioning and grader harness, and 150 public reference tasks as the current export of the larger corpus. The arXiv paper reports that, across mainstream harness and backbone configurations, frontier agents average 2.6% pass rate on the hardest "last-exam" tier (26% overall across all task tiers).
Technical details
Per the project documentation on GitHub and the project website, ALE evaluates "generalist CUA-agents" that can interact with both terminal and graphical user interfaces. The framework runs agents in real OS sandboxes, lets agents execute unconstrained workflows from a single task description, and scores the final artifacts with deterministic, reproducible code evaluators rather than human adjudication. The repo describes a cross-OS CUA MCP bridge that lifts CLI-native harnesses to full desktop surfaces and lists example harnesses such as Claude Code, Codex, and Openclaw as supported inputs to the evaluation pipeline.
Technical context
Benchmarks that reward short-lived proxies or instrument-level behaviors tend to be exploitable, producing inflated scores that do not translate to real-world utility. Public audits and an RDI blog summarizing audits document concrete failure modes in eight prominent agent benchmarks, including simple techniques that extract gold answers or monkey-patch graders to score highly without performing the underlying work. ALE's emphasis on sandboxed execution, artifact-based grading, and verifiable tasks reflects a defensive design pattern intended to reduce those exploit vectors. For practitioners, this implies evaluation should prioritize end-to-end, outcome-focused testing when the deployment target is economically meaningful work rather than model- or API-level competence.
Context and significance
The arXiv results showing a 2.6% pass rate on ALE's hardest tier are a sobering empirical indicator that top-line benchmark wins have not yet yielded broadly deployable agent competence on complex professional workflows. Coverage by VentureBeat notes that GPT-5.5 outperformed Claude Fable 5 on ALE, highlighting that results differ from those on other benchmarks. Because ALE maps tasks to the U.S. occupational taxonomy (O*NET / SOC), it explicitly links evaluation outcomes to GDP-relevant activities, which could change how researchers and vendors claim progress on utility. Observers who follow benchmark design note that moving from synthetic question-answer tasks to multi-step, artifact-scored workflows raises the bar for both model capability and system engineering around tool integration, memory, and robust stateful action loops.
What to watch
For practitioners: monitor these adoption and validity signals over the next 6-12 months:
- •whether major foundation-model vendors publish runs on ALE or integrate ale_run into internal validation pipelines;
- •evidence that ALE's deterministic graders resist the documented exploit techniques that broke prior benchmarks, measured by third-party audits;
- •expansion of the public task pool beyond the 150 reference tasks and updates to the corpus toward the 5,000-task ambition;
- •reproducibility reports and competitive results that correlate ALE performance with real deployment outcomes in pilot projects.
Overall, ALE represents a deliberate shift in benchmark design toward outcome-oriented, economically grounded evaluation. The project is open-source and accompanied by an arXiv paper documenting initial results and rationale, and it arrives amid public audits showing widespread exploitability in earlier agent benchmarks. Practitioners should treat ALE as an experimentally stricter standard for end-to-end agent utility rather than a simple new leaderboard.
Scoring Rationale
ALE introduces a broader, outcome-focused benchmark mapping to real occupations and exposes a large capability gap (2.6% pass on hardest tier). This could materially shift evaluation practices for agent research and deployment, but it is an early, community-driven effort whose influence depends on adoption.
Practice interview problems based on real data
1,500+ SQL & Python problems across 15 industry datasets — the exact type of data you work with.
Try 250 free problems


