Researcher Releases Benchmark Revealing Model Nonsense
In late February, Peter Gostev, AI capability lead at Arena, released BullshitBench, a GitHub-hosted suite of deliberately nonsensical prompts that has since attracted more than 1,200 stars. The benchmark tests whether large language models reject flawed premises instead of answering as if they made sense. Google’s Gemini 3.0 failed to push back more than half the time, while Anthropic’s models rejected nonsense most often. The results highlight a gap between capability and judgment in current LLMs.
Scoring Rationale
Practical, industry-relevant benchmark with clear findings and broad applicability; limited by single-source reporting and informal, non-peer-reviewed validation.

