
Chatbots exhibit alleged political bias in major platforms

By LDS Team
Relevance Score: 5.8

The Washington Post's June 24 empirical analysis found that ChatGPT gave exclusively left-leaning political answers 80% of the time, Claude gave left-leaning answers 43% of the time and balanced answers 57% of the time, and Gemini was the most balanced, presenting both sides 93% of the time. The methodology used questions from a 2025 Stanford-Dartmouth framework, responses capped at 30 words, and human scoring. A June 29 New York Post editorial amplified the findings and added that Claude reportedly declined to flag extreme statements from a House candidate's social-media feed while complying with an equivalent request about President Trump, per the NYPost. For practitioners, the WaPo results treat political-axis testing as a measurable evaluation dimension, and the asymmetric compliance example points to a second audit vector beyond directional bias scoring.

For AI practitioners, a quantitative study finding systematic left-skew in the most widely deployed commercial chatbots - especially at the 80% rate measured for ChatGPT - is an evaluation design signal: political-axis testing belongs in standard model QA alongside toxicity and hallucination metrics.

What the Study Found - Reported facts: The Washington Post published an interactive analysis on June 24, 2026, testing AI models behind ChatGPT, Gemini, Claude, DeepSeek, Grok, and Gab using over two dozen political questions drawn from a 2025 Stanford-Dartmouth research framework, per the Post. Responses were capped at 30 words and scored by a reporter for left-leaning, right-leaning, or balanced content. According to the Post as relayed by Mediaite:

  • OpenAI's ChatGPT provided exclusively left-leaning arguments in 80% of queries, both sides in 17%, and exclusively right-leaning in 3%.
  • Anthropic's Claude gave left-leaning responses 43% of the time and balanced answers the remaining 57%.
  • Google's Gemini was most balanced, offering both sides 93% of the time.
  • DeepSeek came in at 70% left-leaning.
  • Grok (SpaceX) provided left-leaning responses 40% of the time - despite its "free speech" branding.

The Post disclosed a content partnership with OpenAI.
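Per-model percentages of this kind come from labeling each response and counting the labels. A minimal sketch of that tally in Python follows. The label names (`left`, `both`, `right`), the `label_shares` helper, and the example counts are all assumptions made for illustration; they are not the Post's code or data.

```python
from collections import Counter

# Assumed label vocabulary: one label per scored response.
LABELS = ("left", "both", "right")


def label_shares(labels):
    """Return the share of each label as a whole-number percentage."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    if total == 0:
        raise ValueError("no labels to score")
    # Counter returns 0 for labels that never occur, so every key is present.
    return {name: round(100 * counts[name] / total) for name in LABELS}


# Invented labels for one hypothetical model (20 scored responses).
example_labels = ["left"] * 12 + ["both"] * 6 + ["right"] * 2
shares = label_shares(example_labels)
print(shares)
```

Because each share is rounded independently, the three figures may not always sum to exactly 100.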

Company Responses - Reported facts: Google spokesperson Lauren Fine said Gemini is designed to provide balanced responses that do not favor any political ideology, per the Post. Anthropic spokesperson Michael Aciman said Claude is trained to treat different political viewpoints equally and is tested for bias before each model launch, per the Post. OpenAI did not immediately respond, per the Post's account.

Editorial Reaction - The New York Post editorial board published an opinion piece on June 29, 2026, citing the WaPo study and adding a further example: Daily Wire reporter Ryan Saavedra tested Anthropic's Claude by asking it to identify extreme statements from House candidate Darializa Avila Chevalier's social-media feed. According to the NYPost, Claude declined, citing concerns about decontextualized quotes, but reportedly complied when asked to do the same for President Donald Trump - a case of asymmetric engagement that is separate from the WaPo directional-bias figures.

Technical context

Pretraining corpus composition and instruction-tuning or RLHF choices are generally regarded as the main levers that shape directional bias in outputs. The WaPo methodology - short forced responses, human scoring, repeated prompts - reduces equivocation and makes results more comparable across models than open-ended qualitative testing. The study is not peer-reviewed but uses a published academic framework as its question source. The asymmetric compliance behavior reported by the NYPost adds a second evaluation dimension: a model's willingness to engage with politically charged requests may differ by subject even when aggregate directional-bias scores appear moderate.
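One way to probe the asymmetric-compliance dimension is a paired-prompt audit: the same request is issued for two different subjects and the outcomes are compared pair by pair. The sketch below assumes a hypothetical `PairedResult` record and hand-entered outcomes. It is not the procedure used by the NYPost or the Daily Wire reporter, and it leaves open how a response is judged as compliant or refused.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    task: str         # request template shared by both subjects
    complied_a: bool  # model carried out the task for subject A
    complied_b: bool  # model carried out the task for subject B


def compliance_gap(results):
    """Return (rate_a, rate_b, tasks where the two outcomes differ)."""
    if not results:
        raise ValueError("no paired results to compare")
    n = len(results)
    rate_a = sum(r.complied_a for r in results) / n
    rate_b = sum(r.complied_b for r in results) / n
    # Discordant pairs are the cases worth reading by hand.
    discordant = [r.task for r in results if r.complied_a != r.complied_b]
    return rate_a, rate_b, discordant


# Invented outcomes for four hypothetical request templates.
audit = [
    PairedResult("summarize public statements", True, True),
    PairedResult("list extreme statements", False, True),
    PairedResult("draft a neutral biography", True, True),
    PairedResult("write an attack ad", False, False),
]
rate_a, rate_b, discordant = compliance_gap(audit)
print(rate_a, rate_b, discordant)
```

Reporting the discordant tasks alongside the two rates keeps the audit from hiding offsetting asymmetries, where a model refuses subject A on one task and subject B on another.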

What to watch

Whether WaPo releases its full prompt dataset for independent replication; how model providers update internal bias benchmarks following the coverage; and whether RLHF or fine-tuning dataset disclosures follow public pressure.

Key Points

  1. WaPo testing found ChatGPT left-leaning 80% of the time while Gemini gave both-sides answers 93% of the time - concrete benchmark data for political-bias evaluation.
  2. The methodology capped responses at 30 words and used a 2025 Stanford-Dartmouth question set, making results more comparable across models than open-ended qualitative tests.
  3. Asymmetric compliance - models declining some prompts but not equivalent ones - adds a second evaluation dimension beyond directional bias scoring.

Scoring Rationale

The Washington Post's June 24 empirical study provides quantitative political-bias benchmarks (ChatGPT 80% left-leaning, Gemini 93% both-sides) using a published academic framework, making it substantively more useful to practitioners than anecdotal reports. The NYPost editorial's asymmetric-compliance example adds a second evaluation dimension. Score reflects that this is a real measurement study with a documented methodology, offset by its single-outlet origin and lack of peer review.
