Researchers Introduce Humanity's Last Exam Benchmark

A Nature study published Jan. 28 and co-led by Phan Nguyen Hoang Long introduces Humanity’s Last Exam (HLE), a 2,500-question multimodal benchmark that assesses the expert-level reasoning of LLMs such as Gemini, GPT-5.2, and Grok. Developed with contributions from more than 1,000 professors across 500+ institutions, HLE already informs model leaderboards and industry evaluations, and current AI scores remain well below those of top-tier human experts (~90%).
Scoring Rationale
A peer-reviewed Nature benchmark with extensive expert and industry adoption merits the highest impact rating, even though the benchmark may still undergo calibration and model scores may continue to improve.

