Vals AI Launches Excel Modeling Benchmark For Finance Agents
Vals AI released the Excel Modeling Benchmark (EMB) on July 1, 2026, testing 17 AI agents on building the complex LBO, DCF, and M&A models used in investment banking and private equity, graded against expert-authored gold-standard spreadsheets. Claude Opus 4.8 topped the leaderboard at 69.4% accuracy, ahead of Claude Sonnet 5 (66.3%) and GPT-5.5 (64.5%), but the benchmark's more telling finding is where models fail: the top model passes 87% of formula-structure checks yet only 61% of number checks, meaning agents build structurally sound spreadsheets whose computed values still drift from the reference answer. Vals AI also tracked cost efficiency: Opus 4.8 hits its top score at roughly $12 per task, Claude Sonnet 5 costs more ($15.44) for a lower score, and MiMo V2.5 Pro stays above 50% accuracy at just $0.22 per task.
Most LLM benchmarks reward fluent, plausible-looking output, which is close to the wrong incentive for financial modeling, where a single mis-referenced cell can silently misprice a deal without ever looking wrong. The real value in Vals AI's new Excel Modeling Benchmark (EMB) is not the leaderboard order but the split between formula checks and number checks. That split isolates exactly this failure mode for the first time in a widely referenced benchmark: a spreadsheet can be structurally sound and still compute the wrong answer.
What happened
According to Vals AI, EMB, released July 1, 2026, grades AI agents on constructing full LBO, DCF, and M&A models against expert-authored gold-standard spreadsheets, rather than checking whether output merely looks reasonable. Across 17 models tested, Claude Opus 4.8 led at 69.4% accuracy, followed by Claude Sonnet 5 at 66.3% and GPT-5.5 at 64.5%. Vals AI also published per-task cost figures alongside accuracy: Opus 4.8 reaches its top score at roughly $12 per task, Claude Sonnet 5 costs more at $15.44 despite scoring lower, and lower-cost models like MiMo V2.5 Pro stay above 50% accuracy at just $0.22 per task.
Technical context
The more diagnostic number in the release is the gap between formula-structure accuracy and number accuracy: the top-scoring model passed 87% of formula-structure checks but only 61% of number checks. That 26-point gap suggests the model reproduces the shape of a correct model (the right cell references, the right structure) without reliably arriving at the right computed values. The two are not the same skill, and current agents are stronger at the former than the latter.
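To make the distinction concrete, here is a minimal Python sketch of how a formula check and a number check can disagree on the same cells. It is not Vals AI's grader, whose method is not public; the `Cell` type, the normalization rule, the tolerance, and the toy cell values are all assumptions made for illustration. In the example, the candidate uses the same formulas as the gold file but was built on a 5% growth input where the gold file uses 4%.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    formula: str  # formula text as written in the sheet
    value: float  # value the spreadsheet engine computed for that cell


def normalize(formula: str) -> str:
    # Assumed simplification: ignore spacing, case, and $ absolute markers.
    return formula.replace(" ", "").replace("$", "").upper()


def formula_check(candidate: Cell, gold: Cell) -> bool:
    # Structural check: same formula text after normalization.
    return normalize(candidate.formula) == normalize(gold.formula)


def number_check(candidate: Cell, gold: Cell, rel_tol: float = 1e-4) -> bool:
    # Value check: computed numbers agree within a relative tolerance.
    return math.isclose(candidate.value, gold.value, rel_tol=rel_tol)


def pass_rates(candidate: dict[str, Cell], gold: dict[str, Cell]) -> tuple[float, float]:
    formula_hits = 0
    number_hits = 0
    for ref, gold_cell in gold.items():
        cand = candidate.get(ref)
        if cand is None:
            continue  # a missing cell fails both checks
        formula_hits += formula_check(cand, gold_cell)
        number_hits += number_check(cand, gold_cell)
    return formula_hits / len(gold), number_hits / len(gold)


# Hypothetical gold model: base revenue B5 = 100, growth input B3 = 4%.
gold = {
    "D2": Cell("=B2-B1", 20.0),
    "C5": Cell("=B5*(1+$B$3)", 104.0),
    "C6": Cell("=C5*(1+$B$3)", 108.16),
    "C7": Cell("=SUM(C5:C6)", 212.16),
}

# Hypothetical candidate: identical formulas, but B3 was entered as 5%.
candidate = {
    "D2": Cell("=B2-B1", 20.0),
    "C5": Cell("=b5 * (1 + B3)", 105.0),
    "C6": Cell("=C5*(1+B3)", 110.25),
    "C7": Cell("=sum(C5:C6)", 215.25),
}

formula_rate, number_rate = pass_rates(candidate, gold)
print(f"formula checks passed: {formula_rate:.0%}")
print(f"number checks passed:  {number_rate:.0%}")
```

Every formula matches, yet three of the four values are wrong because one upstream input differs. A structure-only grader would score this sheet as perfect, which is the blind spot a separate number check is meant to expose.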
For practitioners
For teams evaluating agents for finance workflows, EMB is a useful corrective to benchmarks that reward surface fluency: a model can pass most structural checks and still be unsafe to use unsupervised on a live deal model, since a wrong number carries the same downstream consequence whether the formula around it looks correct or not. The cost-versus-accuracy spread is also notable for procurement: Claude Sonnet 5 costs more per task than the higher-scoring Opus 4.8, so teams that pick a model on list price alone, without measuring cost per completed task, could end up paying more for a lower-scoring model. As with any single-source benchmark release, these figures are Vals AI's own reported results; independent replication has not yet been published.
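One way to read the cost figures is to ask which models are dominated, meaning another model is both cheaper per task and more accurate. The Python sketch below applies that test to the three models whose cost and accuracy are both quoted in this article. The `Result` type and the `dominated` helper are illustrative names, the Opus 4.8 cost is the rounded "roughly $12", and the MiMo V2.5 Pro accuracy is entered as a 50.0 floor because only "above 50%" was reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    accuracy_pct: float
    cost_usd: float  # reported cost per task


results = [
    Result("Claude Opus 4.8", 69.4, 12.00),  # cost reported as "roughly $12"
    Result("Claude Sonnet 5", 66.3, 15.44),
    Result("MiMo V2.5 Pro", 50.0, 0.22),     # "above 50%": 50.0 is a floor, not the reported score
]


def dominated(r: Result, others: list[Result]) -> bool:
    # True if another model is no less accurate and no more expensive,
    # and strictly better on at least one of the two.
    return any(
        o.accuracy_pct >= r.accuracy_pct
        and o.cost_usd <= r.cost_usd
        and (o.accuracy_pct > r.accuracy_pct or o.cost_usd < r.cost_usd)
        for o in others
        if o is not r
    )


# Models that no other listed model beats on both axes.
frontier = [r.model for r in results if not dominated(r, results)]
print(frontier)
```

On these figures Claude Sonnet 5 drops off the frontier because Opus 4.8 is both cheaper per task and higher scoring, while Opus 4.8 and MiMo V2.5 Pro remain as the high-accuracy and low-cost options. The comparison covers only the three models quoted here, not the full set of 17.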
What to watch
EMB is a private benchmark, meaning the full question set is not public, which limits independent verification of the specific scores even as the formula-versus-number distinction itself is a useful framework other evaluators could adopt. Worth tracking: whether Vals AI or others publish a public subset for reproducibility, whether next-generation models close the number-check gap faster than the formula-check gap, and whether investment banks and PE firms piloting AI-assisted modeling report similar failure patterns in production rather than benchmark conditions.
Key Points
- Vals AI launched the Excel Modeling Benchmark on July 1, 2026, grading 17 AI agents on building LBO, DCF, and M&A models against expert answers.
- The top model, Claude Opus 4.8, passed 87% of formula-structure checks but only 61% of number checks, exposing a computed-value reliability gap.
- Claude Opus 4.8 led at 69.4% accuracy at roughly $12 per task, while Claude Sonnet 5 scored lower (66.3%) yet cost more ($15.44 per task).
Scoring Rationale
A methodologically useful, decision-relevant benchmark for practitioners choosing agents for numerically precise finance work, with a genuinely novel formula-vs-number diagnostic. Score trimmed slightly from 6.0 to reflect that this remains single-sourced (Vals AI's own private benchmark, with no independent press coverage or replication found despite multiple searches) and that the underlying question set is not public.