Research · model evaluation · llm · anthropic
Researcher Releases Benchmark Revealing Model Nonsense
Relevance Score: 8.1

Peter Gostev, AI capability lead at Arena, released BullshitBench, a GitHub-hosted suite of deliberately nonsensical prompts, in late February; it has since attracted more than 1,200 stars. The benchmark tests whether large language models reject flawed premises: Google’s Gemini 3.0 failed to push back more than half the time, while Anthropic’s models rejected nonsense most often. The results highlight a gap between capability and judgment in current LLMs.
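
The evaluation pattern described, prompting a model with a false premise and checking whether the reply pushes back, can be illustrated with a minimal sketch. This is not the actual BullshitBench harness or its scoring method; the example prompts, the `ask_model()` stub, and the keyword heuristic below are illustrative assumptions to show the general idea.

```python
# Sketch of a premise-rejection check: send a nonsensical prompt and look for
# pushback in the reply. All prompts, markers, and the model stub are
# illustrative assumptions, not the benchmark's real contents.

NONSENSE_PROMPTS = [
    "Explain why the Great Wall of China was built to keep out penguins.",
    "List three benefits of charging a phone by leaving it in the freezer overnight.",
]

# Phrases loosely indicating the model questioned the premise (crude heuristic).
PUSHBACK_MARKERS = ("actually", "no evidence", "that premise", "isn't true", "not the case")


def ask_model(prompt: str) -> str:
    """Stand-in for a real chat-completion call; swap in your own client."""
    return "Actually, that premise isn't true: the wall was not built because of penguins."


def rejects_premise(reply: str) -> bool:
    """Return True if the reply contains any pushback marker."""
    lowered = reply.lower()
    return any(marker in lowered for marker in PUSHBACK_MARKERS)


if __name__ == "__main__":
    rejected = sum(rejects_premise(ask_model(p)) for p in NONSENSE_PROMPTS)
    print(f"Rejected flawed premises on {rejected}/{len(NONSENSE_PROMPTS)} prompts")
```

In practice a benchmark of this kind would likely grade replies with a human rubric or a judge model rather than keyword matching, which is used here only to keep the sketch self-contained.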
Scoring Rationale
Practical, industry-relevant benchmark with clear findings and broad applicability; limited by single-source reporting and informal, non-peer-reviewed validation.
Sources
- This researcher has a new way to measure AI performance. It's BS, literally. (businessinsider.com)

