Researchers Introduce Humanity's Last Exam Benchmark

A study published this week in Nature introduces Humanity’s Last Exam, a 2,500-question benchmark designed to probe tasks that current AI systems cannot solve. The global collaboration of nearly 1,000 experts found that leading models initially scored below 9%, highlighting large capability gaps and prompting discussion about the limits of benchmarks and the need for task-specific, real-world evaluation metrics.
Scoring Rationale
The Nature benchmark offers high novelty and broad scope, but the coverage is constrained by opinionated analysis and limited practical evaluation guidance.


