Researchers Release Humanity's Last Exam Benchmark

February 3, 2026 | By LDS Team

Relevance Score: 8.9

An international consortium released Humanity's Last Exam (HLE) in early 2025. The benchmark consists of 2,500 expert-vetted questions spanning math, the humanities, and the natural sciences, and is built to assess large language models. Its short-answer and multiple-choice items were written by experts to be unambiguous and difficult for models. Leading systems initially scored in the single digits, and GPT-5 has since reached about 25 percent. HLE aims to track AI expertise, though it measures task performance rather than general intelligence.
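As a rough illustration of what a score such as "about 25 percent" means, the sketch below computes exact-match accuracy over a set of short-answer and multiple-choice items. The function names, the normalization rule, and the toy data are all assumptions made for illustration; the source does not describe how HLE answers are actually graded.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences are not counted as wrong answers (an assumed rule)."""
    return " ".join(answer.lower().split())


def accuracy_percent(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions that exactly match
    their reference answer after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one item is required")
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Toy data invented for this example; these are not HLE questions.
references = ["B", "42", "Paris", "x^2"]
predictions = ["b", "41", " paris ", "2x"]

score = accuracy_percent(predictions, references)
print(f"{score:.1f}%")  # prints "50.0%"
```

Exact match is the simplest possible grader. It undercounts answers that are correct but phrased differently, so a real benchmark may rely on a more forgiving comparison.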

Key Points

1. Release 2,500 expert-vetted questions across disciplines to benchmark LLM problem-solving
2. Expose model limitations: leading models initially scored in the single digits, showing persistent failure on graduate-level problems
3. Encourage developers to improve generalization and avoid overfitting by withholding many items from public release

Scoring Rationale

High novelty and industry-wide relevance, but limited by curated short-answer focus and potential test overfitting.

Sources

Public references used for this report.

1 source

01 singularityhub.com: Humanity’s Last Exam Stumps Top AI Models—and That’s a Good Thing
