LLMs Exhibit Framing Bias In Evaluations

January 21, 2026 | By LDS Team

Relevance Score: 8.0

A January 20, 2026 arXiv preprint by Yerin Hwang et al. investigates framing bias in LLM-based evaluation, testing symmetric predicate-positive and predicate-negative prompts across four high-stakes tasks. The study measures responses from 14 LLM judges and finds significant, systematic discrepancies across framings, with model families showing distinct tendencies toward agreement or rejection. The authors conclude that framing is a structural bias and recommend framing-aware evaluation protocols.
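To make the symmetric-prompt idea concrete, here is a minimal sketch of how a predicate-positive and a predicate-negative prompt could be paired and checked for consistency. The prompt template, the function names, and the inconsistency rate are illustrative assumptions, not the preprint's actual prompts or metric; this summary does not describe those.

```python
from typing import List, Tuple


def make_framed_prompts(item: str, predicate: str, negated_predicate: str) -> Tuple[str, str]:
    """Build one predicate-positive and one predicate-negative prompt for the same item.

    The template is a hypothetical example, not the wording used in the preprint.
    """
    template = "Response:\n{item}\n\nIs this response {pred}? Answer yes or no."
    return (
        template.format(item=item, pred=predicate),
        template.format(item=item, pred=negated_predicate),
    )


def is_consistent(positive_answer: bool, negative_answer: bool) -> bool:
    # A logically consistent judge answers the two framings oppositely:
    # "yes, it is accurate" should pair with "no, it is not inaccurate".
    return positive_answer != negative_answer


def inconsistency_rate(pairs: List[Tuple[bool, bool]]) -> float:
    """Fraction of (positive, negative) answer pairs that contradict each other."""
    if not pairs:
        raise ValueError("need at least one answer pair")
    contradictions = sum(1 for pos, neg in pairs if not is_consistent(pos, neg))
    return contradictions / len(pairs)
```

Each pair holds the judge's yes/no answer (as a boolean) under the positive framing and under the negative framing for the same item. A rate near zero would suggest the judge is insensitive to framing on that sample; how answers are parsed from raw model output is left out of this sketch.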

Key Points

1. Demonstrates that framing manipulates LLM judgments across four high-stakes evaluation tasks using symmetric prompts.
2. Finds that 14 LLM judges show systematic susceptibility, with model families tending toward agreement or rejection.
3. Impacts evaluation reliability, suggesting a need for framing-aware protocols and standardized prompt designs.
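As a rough illustration of the last two points, the sketch below labels a judge's tendency from its contradictory answer pairs and applies one possible framing-aware rule: accept a verdict only when both framings agree on it. Both the tendency labels and the abstain-on-conflict rule are assumptions made for illustration; the summary does not specify the protocol the authors recommend.

```python
from collections import Counter
from typing import List, Optional, Tuple


def tendency(pairs: List[Tuple[bool, bool]]) -> str:
    """Label a judge's bias direction from (positive, negative) answer pairs.

    "yes" to both framings counts toward agreement; "no" to both counts
    toward rejection. Consistent pairs are ignored.
    """
    counts = Counter()
    for pos, neg in pairs:
        if pos and neg:
            counts["agree_both"] += 1
        elif not pos and not neg:
            counts["reject_both"] += 1
    if counts["agree_both"] > counts["reject_both"]:
        return "agreement"
    if counts["reject_both"] > counts["agree_both"]:
        return "rejection"
    return "none"


def framing_aware_verdict(positive_answer: bool, negative_answer: bool) -> Optional[bool]:
    """Return the verdict only if both framings imply it; otherwise abstain (None)."""
    if positive_answer != negative_answer:
        # The two framings agree: the positive-framing answer is the verdict.
        return positive_answer
    # The judge contradicted itself, so this sketch declines to score the item.
    return None
```

Abstaining on conflict trades coverage for reliability; items returned as None could be routed to a second judge or a human reviewer.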

Scoring Rationale

High novelty and broad scope drive the score, which is limited by the work's preprint status and its modest direct mitigations.

Sources

Public references used for this report.

1. arxiv.org, [2601.13537] When Wording Steers the Evaluation: Framing Bias in LLM judges
