The Anti-Defamation League published a study Wednesday evaluating six large language models (Anthropic Claude, OpenAI ChatGPT, Meta Llama, Google Gemini, DeepSeek, and xAI Grok) on how they handle anti-Jewish, anti-Zionist, and extremist prompts. The evaluation covered 4,181 chats per model (over 25,000 chats in total) between August and October 2025. Claude scored highest (80) and Grok scored lowest (21). The results point to substantial moderation gaps and multimodal weaknesses, especially in image and document analysis.
Key Points
- Ranked six LLMs; Claude scored 80 and Grok scored 21, a 59-point performance gap
- Showed consistent weaknesses in multi-turn dialogue and image/document analysis that reduce moderation effectiveness
- Indicates developers and vendors must improve multimodal safety, context retention, and bias detection before deployment
Scoring Rationale
Robust empirical evaluation across six major LLMs provides strong evidence, limited by lack of novel mitigation guidance.


