The Anti-Defamation League published a study Wednesday evaluating six large language models (Anthropic Claude, OpenAI ChatGPT, Meta Llama, Google Gemini, DeepSeek, and xAI Grok) on how they handle anti-Jewish, anti-Zionist, and extremist prompts. The evaluation covered 4,181 chats per model (over 25,000 chats in total) between August and October 2025. Claude scored highest (80) and Grok scored lowest (21); the results reveal substantial moderation gaps and multimodal weaknesses, especially in image and document analysis.

Key Points

1. Ranked six LLMs; Claude scored 80 and Grok scored 21, a 59-point performance gap
2. Showed consistent weaknesses in multi-turn dialogue and image/document analysis that reduce moderation effectiveness
3. Indicates that developers and vendors must improve multimodal safety, context retention, and bias detection before deployment

Scoring Rationale

The robust empirical evaluation across six major LLMs provides strong evidence, though the study is limited by a lack of novel mitigation guidance.

Sources

Public references used for this report.

01. theverge.com: Grok is the most antisemitic chatbot according to the ADL

02. upi.com: xAI's Grok worst performing platform on countering anti-Semitism
