Research · model evaluation · llm · anthropic
Researcher Releases Benchmark Revealing Model Nonsense
Relevance Score: 8.1

Peter Gostev, AI capability lead at Arena, released BullshitBench, a GitHub-hosted suite of deliberately nonsensical prompts, in late February; it has since attracted more than 1,200 stars. The benchmark tests whether large language models reject flawed premises: Google’s Gemini 3.0 failed to push back more than half the time, while Anthropic’s models rejected nonsense most often. The results highlight a gap between capability and judgment in current LLMs.
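
The evaluation pattern described, prompting a model with a false premise and checking whether the reply pushes back, can be illustrated with a minimal sketch. This is not the actual BullshitBench harness or its scoring method; the example prompts, the `ask_model()` stub, and the keyword heuristic below are illustrative assumptions to show the general idea.

```python
# Sketch of a premise-rejection check: send a nonsensical prompt and look for
# pushback in the reply. All prompts, markers, and the model stub are
# illustrative assumptions, not the benchmark's real contents.

NONSENSE_PROMPTS = [
    "Explain why the Great Wall of China was built to keep out penguins.",
    "List three benefits of charging a phone by leaving it in the freezer overnight.",
]

# Phrases loosely indicating the model questioned the premise (crude heuristic).
PUSHBACK_MARKERS = ("actually", "no evidence", "that premise", "isn't true", "not the case")


def ask_model(prompt: str) -> str:
    """Stand-in for a real chat-completion call; swap in your own client."""
    return "Actually, that premise isn't true: the wall was not built because of penguins."


def rejects_premise(reply: str) -> bool:
    """Return True if the reply contains any pushback marker."""
    lowered = reply.lower()
    return any(marker in lowered for marker in PUSHBACK_MARKERS)


if __name__ == "__main__":
    rejected = sum(rejects_premise(ask_model(p)) for p in NONSENSE_PROMPTS)
    print(f"Rejected flawed premises on {rejected}/{len(NONSENSE_PROMPTS)} prompts")
```

In practice a benchmark of this kind would likely grade replies with a human rubric or a judge model rather than keyword matching, which is used here only to keep the sketch self-contained.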
Scoring Rationale
Practical, industry-relevant benchmark with clear findings and broad applicability; limited by single-source reporting and informal, non-peer-reviewed validation.
Sources
- This researcher has a new way to measure AI performance. It's BS, literally. (businessinsider.com)

