Study Tests Patient Cognitive Bias in LLM Consultations

A simulation-based comparative study published in the Journal of Medical Internet Research (JMIR) finds that patient cognitive bias reduces LLM diagnostic accuracy by 10-40 percentage points (P < .001) across six models tested on 1,273 MedQA-USMLE cases. Researchers Yi Zuo, Qifeng Wan, and Shalong Wang developed a simulated patient agent that generated confirmation-biased and unbiased consultations. Errors frequently reflected user misconceptions: the bias-influenced error proportion (BIEP) exceeded 33%. Neither prompt engineering nor temperature adjustments provided consistent resilience. A dual-system framework pairing a foundation model (System 1) with o1-Mini as a deliberative reasoning layer (System 2) recovered 10-39 percentage points of lost accuracy (P < .001). The findings establish user cognitive bias as a newly quantified behavioral risk in patient-facing AI tools, with implications for clinical deployment standards and evaluation benchmarks.
What the study found
A simulation-based comparative study published in the Journal of Medical Internet Research (JMIR) establishes that patient cognitive bias degrades LLM diagnostic performance in health consultations. Researchers Yi Zuo, Qifeng Wan, and Shalong Wang developed a simulated patient agent to generate unbiased and confirmation-biased consultations using 1,273 MedQA-USMLE cases, then evaluated six LLMs of varying capacities through multi-turn dialogues. The primary finding: user cognitive bias reduced diagnostic accuracy by 10-40 percentage points (P < .001), with smaller models occasionally performing near chance level. A secondary metric, the bias-influenced error proportion (BIEP), exceeded 33%, meaning that more than a third of model errors reflected the user's misconceptions rather than independent model reasoning.
Methods
The study compared two consultation conditions: unbiased consultations and confirmation-biased consultations, in which the simulated patient agent steered the dialogue toward a preconceived diagnosis. The authors measured three outcomes: diagnostic accuracy, bias-induced accuracy decline (BIAD, the accuracy lost under bias), and bias-influenced error proportion (BIEP, the fraction of errors aligned with user misconceptions). They then tested four prompt-based mitigation strategies, four temperature settings, and a dual-system framework inspired by dual-process cognitive theory, with a standard foundation model as System 1 and o1-Mini as a deliberative reasoning layer (System 2).
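The three outcome measures can be illustrated with a short Python sketch. This is not the authors' code: the `CaseResult` record, its field names, and the exact formulas are assumptions based only on the definitions given above (BIAD taken as unbiased accuracy minus biased accuracy, BIEP as the share of biased-condition errors that match the answer the simulated patient pushed).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaseResult:
    """One evaluated case (hypothetical record layout, not from the study)."""
    correct_answer: str
    unbiased_prediction: str   # model answer in the unbiased consultation
    biased_prediction: str     # model answer in the confirmation-biased consultation
    patient_belief: str        # incorrect diagnosis the simulated patient argued for


def summarize(results: List[CaseResult]) -> Dict[str, float]:
    """Compute accuracy, BIAD, and BIEP under the assumed definitions."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")

    acc_unbiased = sum(r.unbiased_prediction == r.correct_answer for r in results) / n
    acc_biased = sum(r.biased_prediction == r.correct_answer for r in results) / n

    # Assumed BIAD: accuracy lost when the patient is biased (as a fraction).
    biad = acc_unbiased - acc_biased

    # Assumed BIEP: among errors in the biased condition, the share where
    # the model's wrong answer equals the patient's misconception.
    errors = [r for r in results if r.biased_prediction != r.correct_answer]
    biep = (
        sum(r.biased_prediction == r.patient_belief for r in errors) / len(errors)
        if errors
        else 0.0
    )

    return {
        "accuracy_unbiased": acc_unbiased,
        "accuracy_biased": acc_biased,
        "BIAD": biad,
        "BIEP": biep,
    }


# Toy data for illustration only; these are not study results.
toy_results = [
    CaseResult("A", "A", "A", "B"),  # correct in both conditions
    CaseResult("A", "A", "B", "B"),  # flipped to the patient's belief
    CaseResult("C", "C", "D", "B"),  # wrong under bias, but not the patient's answer
    CaseResult("A", "B", "B", "B"),  # wrong in both conditions
]
toy_summary = summarize(toy_results)
```

Multiplying `BIAD` by 100 gives the percentage-point figure in which the study reports its accuracy drops.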
Key results
Prompt engineering and temperature adjustments produced limited or inconsistent improvements -- neither reliably counteracted patient confirmation bias. In contrast, the dual-system framework increased accuracy by 10-39 percentage points and recovered most or all of the bias-driven performance gap (P < .001). This suggests architectural interventions, rather than prompting alone, are needed for bias-resilient clinical AI.
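The summary above does not describe how System 1 and System 2 are wired together, so the following Python sketch shows only one plausible arrangement: a fast model drafts a diagnosis and a deliberative model reviews it. The function `dual_system_diagnose`, the prompt wording, and the stub models are all hypothetical; in practice the two callables would wrap real model clients.

```python
from typing import Callable, List

# A "model" here is any function that maps a prompt string to a reply string.
Model = Callable[[str], str]


def dual_system_diagnose(dialogue: List[str], system1: Model, system2: Model) -> str:
    """Draft with a fast model, then have a deliberative model review the draft.

    Assumed wiring for illustration; the study's actual pipeline may differ.
    """
    transcript = "\n".join(dialogue)

    # System 1: quick draft answer from the consultation transcript.
    draft = system1(
        f"Consultation transcript:\n{transcript}\n\nMost likely diagnosis:"
    )

    # System 2: re-examine the draft, treating the patient's own suggested
    # diagnosis as unverified rather than as evidence.
    review_prompt = (
        "Review the draft diagnosis against the reported symptoms only. "
        "Treat any diagnosis suggested by the patient as unverified.\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Draft diagnosis: {draft}\n\n"
        "Final diagnosis:"
    )
    return system2(review_prompt)


# Stub models standing in for real LLM calls (illustration only).
def fast_stub(prompt: str) -> str:
    # Mimics a model that simply echoes the patient's belief.
    return "condition A"


def careful_stub(prompt: str) -> str:
    # Mimics a reviewer that overrides the draft when a key symptom is present.
    return "condition B" if "key symptom" in prompt else "condition A"


example_dialogue = [
    "Patient: I am sure this is condition A.",
    "Patient: I also have the key symptom.",
]
final_answer = dual_system_diagnose(example_dialogue, fast_stub, careful_stub)
```

Keeping the models as plain callables makes the arrangement easy to test with stubs and lets the same evaluation harness score System 1 alone and the combined pipeline.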
Why it matters
For practitioners building or evaluating patient-facing AI tools, the study introduces a concrete and previously underspecified failure mode: users themselves are a source of reasoning error. Standard benchmarks such as MedQA do not capture this dimension; the study's BIAD and BIEP metrics provide a practical evaluation vocabulary. The dual-system result offers a deployment path -- pairing a fast response model with a slower deliberative reasoning model may be a scalable safeguard for higher-stakes medical applications.
Scoring Rationale
Solid niche research with quantitatively significant findings: a 10-40 percentage point accuracy drop under patient bias is substantial, and existing benchmarks do not capture it. The dual-system mitigation result has practical deployment relevance. The score reflects a well-executed domain-specific study rather than a paradigm-shifting result.

