MathVista Shows Models Falling Behind Human Math

Researchers from Microsoft Research, Sahara AI, and Emory University this week released results from MathVista, a multimodal math-reasoning benchmark launched in October 2023 and built with more than 6,000 Sahara AI-annotated examples. Of the 12 foundation models tested, GPT-4V scored highest at 49.9%, against a human average of 60.3%, a gap of 10.4 percentage points in visual math reasoning. The authors say progress toward AGI depends more on better training and evaluation data than on model scale.
Key Points
- Reports show GPT-4V tops 12 models at 49.9%, while humans average 60.3%
- Highlights that multimodal visual math remains challenging; prior benchmarks often allowed text-only shortcuts
- Suggests focusing on higher-quality training and evaluation data, simulated environments, and human annotators
Scoring Rationale
High-quality, widely available benchmark and credible collaborators raise impact, but scope is task-specific and not paradigm-shifting.
