Mercor Finds AI Agents Fail Consulting Tasks
Mercor published the APEX-Agents benchmark, which showed that leading AI agents completed under 25% of real-world consulting, banking, and legal tasks on the first try and only about 40% after eight attempts. OpenAI's GPT-5.2 initially completed roughly 23% of tasks, while Anthropic's Opus 4.6 has since reached nearly 33%. The study found that agents perform well at research and single-tool data analysis but fail at long-horizon, multi-step planning and at coordinating work across multiple files. Mercor CEO Brendan Foody says rapid model improvement could soon displace some consulting roles.



