Research · Benchmarks · LLM · Multimodal · AI Safety
Researchers Introduce Humanity's Last Exam Benchmark
Relevance Score: 10.0
A Nature study published Jan. 28, co-led by Phan Nguyen Hoang Long, introduces Humanity's Last Exam (HLE), a 2,500-question multimodal benchmark that assesses expert-level reasoning in LLMs such as Gemini, GPT-5.2, and Grok. Developed with contributions from more than 1,000 professors across 500+ institutions, HLE already informs model leaderboards and industry evaluations. Current AI scores remain well below the ~90% achieved by top-tier human experts.


