
PACE Estimates Agent Scores From Proxy Benchmarks

July 5, 2026 | By LDS Team

Relevance Score: 6.6


The PACE paper, submitted to arXiv on July 2, 2026, proposes using compact proxy benchmarks to estimate expensive agentic benchmark scores before teams run full evaluations. The authors report tests across 14 models, four agentic benchmarks, and 19 non-agentic benchmark pools. In those tests, PACE-Bench predicts agentic scores with under 4% leave-one-out mean absolute error, above 0.80 Spearman correlation, and around 85% pairwise ranking accuracy, at a reported cost of less than 1% of a full agentic evaluation. For practitioners, the useful signal is triage: proxy evals can help narrow model, routing, or tool-policy candidates, but the paper does not prove that compact subsets can replace production-grade agent tests on reliability, tool failures, or long-horizon behavior.

PACE is useful to evaluation teams because it treats agent benchmarking as a budgeted engineering problem, not just a leaderboard exercise. The practical takeaway is that proxy evaluations may help teams screen model and routing candidates before spending days and thousands of dollars on full agentic benchmarks, while still reserving full runs for final validation.

What happened

The arXiv paper, submitted on July 2, 2026, introduces PACE, short for Proxy for Agentic Capability Evaluation. The method selects compact subsets of cheaper non-agentic benchmark instances, then fits a regression that maps a model's scores on those instances to its score on a target agentic benchmark. The paper evaluates four target agentic benchmarks and reports experiments across 14 models and 19 non-agentic benchmark pools. According to the authors, PACE-Bench reaches leave-one-out mean absolute error under 4%, Spearman correlation above 0.80, and pairwise model-ranking accuracy around 85%, at less than 1% of full agent-evaluation cost.
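To make the three reported metrics concrete, here is a minimal sketch of how leave-one-out error, Spearman correlation, and pairwise ranking accuracy could be computed for a proxy-to-agentic regression. It is not the authors' pipeline: it assumes a single proxy feature fitted with ordinary least squares, the function names are hypothetical, and the scores are invented for illustration.

```python
from itertools import combinations
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with one feature."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return 0.0, my  # no spread in x: fall back to predicting the mean
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


def leave_one_out_predictions(proxy, target):
    """Predict each model's target score from a fit on all other models."""
    preds = []
    for i in range(len(proxy)):
        xs = proxy[:i] + proxy[i + 1:]
        ys = target[:i] + target[i + 1:]
        a, b = fit_line(xs, ys)
        preds.append(a * proxy[i] + b)
    return preds


def mean_absolute_error(preds, target):
    return mean(abs(p - t) for p, t in zip(preds, target))


def ranks(values):
    """Ranks from 1 to n; assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def spearman(preds, target):
    """Spearman's rho using the no-ties formula."""
    n = len(preds)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(preds), ranks(target)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def pairwise_accuracy(preds, target):
    """Fraction of model pairs ordered the same way by prediction and truth."""
    pairs = list(combinations(range(len(preds)), 2))
    agree = sum(
        (preds[i] - preds[j]) * (target[i] - target[j]) > 0 for i, j in pairs
    )
    return agree / len(pairs)


# Made-up scores (percent) for six hypothetical models; not from the paper.
proxy_scores = [42.0, 55.0, 61.0, 48.0, 70.0, 66.0]
agentic_scores = [18.0, 27.0, 33.0, 21.0, 41.0, 36.0]

loo = leave_one_out_predictions(proxy_scores, agentic_scores)
print("LOO MAE:", round(mean_absolute_error(loo, agentic_scores), 2))
print("Spearman:", round(spearman(loo, agentic_scores), 2))
print("Pairwise accuracy:", round(pairwise_accuracy(loo, agentic_scores), 2))
```

Holding out each model in turn matters here because, with only 14 models in the paper's experiments, an in-sample fit would overstate how well the proxy generalizes to a model it has not seen.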

Technical context

The accompanying GitHub repository and Hugging Face dataset make the work more useful than a paper-only proposal. The repo ships the prediction pipeline and reproduction commands, while the dataset page describes PACE-Bench as selected source instances for GAIA, SWE-Bench Verified, SWE-Bench Multimodal, and SWT-Bench. That matters because proxy evaluation is only credible if teams can inspect which instances were selected, how weights are used, and how a new model would be scored.

For practitioners

PACE should be read as a triage tool, not a replacement for full agent tests. A proxy can help narrow which models, tool policies, or agent configurations deserve expensive end-to-end evaluation. It cannot by itself prove robustness on real repositories, tool failures, multimodal edge cases, or long-horizon tasks where small environment differences change outcomes.
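One way to picture the triage step is a shortlist rule over proxy-predicted scores. The sketch below is an illustration only: the rule (keep every candidate whose predicted score is within an error margin of the best prediction), the `shortlist` helper, the configuration names, and the numbers are all assumptions, not something the paper or repository prescribes.

```python
def shortlist(predicted, error_margin):
    """Keep candidates whose predicted score is within the margin of the best.

    `predicted` maps a candidate name to its proxy-predicted agentic score.
    `error_margin` is a tolerance in the same units, for example a value
    chosen from the proxy's observed prediction error (an assumption here).
    """
    best = max(predicted.values())
    keep = [name for name, s in predicted.items() if best - s <= error_margin]
    return sorted(keep, key=lambda n: predicted[n], reverse=True)


# Hypothetical predictions (percent) for four agent configurations.
predicted = {"config-a": 31.0, "config-b": 38.5, "config-c": 24.0, "config-d": 36.0}
print(shortlist(predicted, error_margin=4.0))  # ['config-b', 'config-d']
```

Only the shortlisted candidates would then go on to the full end-to-end agentic run, which remains the final check.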

What to watch

The next signal is independent replication on newer models and production-like agent stacks. Teams should also watch coverage: the GitHub instructions note that faithful prediction requires scoring the selected instances, and missing coverage can pull estimates toward the training-set mean. If PACE-style proxies spread, the operational question will be whether they shorten evaluation loops without hiding the failures that full agent benchmarks were designed to expose.
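The coverage caveat can be illustrated with a toy linear predictor. The sketch below assumes that unscored instances are filled in with their training-set mean, which is one plausible mechanism and not a confirmed description of the repository's behavior. The weights, means, and the `predict_with_coverage` helper are hypothetical.

```python
def predict_with_coverage(weights, intercept, train_means, observed):
    """Linear prediction where unscored instances fall back to training means.

    `observed` maps instance id -> score in [0, 1] for instances actually run.
    Mean imputation is an illustrative assumption, not the repo's documented
    behavior. Returns (prediction, fraction of selected instances covered).
    """
    total = intercept
    covered = 0
    for inst, w in weights.items():
        if inst in observed:
            total += w * observed[inst]
            covered += 1
        else:
            total += w * train_means[inst]
    return total, covered / len(weights)


# Hypothetical weights and training means for four selected instances.
weights = {"a": 10.0, "b": 8.0, "c": 12.0, "d": 6.0}
train_means = {"a": 0.5, "b": 0.5, "c": 0.25, "d": 0.5}
intercept = 5.0

full = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
partial = {"a": 1.0}

print(predict_with_coverage(weights, intercept, train_means, full))     # (41.0, 1.0)
print(predict_with_coverage(weights, intercept, train_means, partial))  # (25.0, 0.25)
print(predict_with_coverage(weights, intercept, train_means, {}))       # (20.0, 0.0)
```

In this toy setup a strong model scored on only one of four instances is predicted at 25.0 instead of 41.0, much closer to the 20.0 an average training model would get, so reporting coverage alongside any proxy estimate is a sensible habit.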

Key Points

1. PACE maps small atomic-evaluation subsets to costly agent benchmarks, aiming to estimate agent performance without full benchmark runs.
2. The paper reports tests across 14 models, four agentic benchmarks, and 19 non-agentic benchmark pools.
3. For model selection teams, cheaper proxy evals could support earlier routing decisions before expensive end-to-end agent tests.

Scoring Rationale

PACE targets a real evaluation-cost bottleneck for agent teams, and its public paper, code, and dataset make the result inspectable. The score stays at notable rather than major because this is a new research proposal with limited independent adoption, and proxy evals should be treated as triage rather than as replacements for full agent benchmarks.
