
MathVista Shows Models Falling Behind Human Math

March 18, 2026 | By LDS Team

Relevance Score: 9.1


Researchers from Microsoft Research, Sahara AI, and Emory University this week released results from MathVista, a multimodal math-reasoning benchmark launched in October 2023 and built with more than 6,000 Sahara AI-annotated examples. Of the 12 foundation models tested, GPT-4V scored highest at 49.9%, against a human average of 60.3%, a gap of 10.4 percentage points in visual math reasoning. The authors say progress toward AGI depends more on better training and evaluation data than on model scale.

Key Points

1. Reports show GPT-4V tops 12 models at 49.9%, humans average 60.3%
2. Highlights multimodal visual math remains challenging; prior benchmarks often allowed text-only shortcuts
3. Suggests focusing on higher-quality training and evaluation data, simulated environments, and human annotators

Scoring Rationale

High-quality, widely available benchmark and credible collaborators raise impact, but scope is task-specific and not paradigm-shifting.

