Analysis · Benchmarks · LLM · Evaluation Metrics
Researchers Introduce Humanity's Last Exam Benchmark
Relevance Score: 8.3
A study published this week in Nature introduces Humanity’s Last Exam, a 2,500-question benchmark designed to probe tasks that current AI systems cannot solve. The benchmark, assembled by a global collaboration of nearly 1,000 experts, found that leading models initially scored below 9% accuracy, highlighting large capability gaps and prompting discussion of benchmarks' limits and the need for task-specific, real-world evaluation metrics.
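For context on what a headline figure like "below 9%" means in practice, here is a minimal, illustrative sketch of scoring a model on a question-answer benchmark with exact-match accuracy. The file layout, field names, and `ask_model()` function are hypothetical placeholders, not the study's official evaluation harness.

```python
import json


def ask_model(question: str) -> str:
    """Placeholder for a call to whatever model is being evaluated."""
    raise NotImplementedError


def benchmark_accuracy(path: str) -> float:
    """Exact-match accuracy over a JSONL file of {"question", "answer"} items."""
    correct = 0
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)  # one question per line
            prediction = ask_model(item["question"])
            correct += prediction.strip() == item["answer"].strip()
            total += 1
    return correct / total if total else 0.0


# On a 2,500-question set, an accuracy below 0.09 (9%) corresponds to
# fewer than roughly 225 correct answers.
```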



