AI Chatbots Fail Early Diagnostic Reasoning at Scale

A Mass General Brigham study evaluated 21 large language models across 29 standardized clinical vignettes and found a stark gap: models produced an appropriate initial differential diagnosis in fewer than 20% of cases, despite reaching correct final diagnoses more than 90% of the time when given complete data. The researchers used a new staged evaluation, PrIME-LLM, to score performance across initial assessment, test ordering, final diagnosis, and treatment planning. Models improved as more data was provided, but early-stage reasoning and prioritization of likely causes remained weak. The finding cautions against unsupervised clinical deployment, highlights the limits of current LLM reasoning, and reframes evaluation toward workflow-stage metrics rather than endpoint accuracy alone.
What happened
The Mass General Brigham team published a JAMA Network Open study showing publicly available large language models struggle with clinical reasoning. Evaluating 21 LLMs on 29 standardized clinical vignettes and 16,254 responses, researchers found models failed to produce an appropriate initial differential diagnosis in more than 80% of cases, even though final diagnostic accuracy exceeded 90% when full data was available. The study introduces the PrIME-LLM scoring framework to assess staged clinical workflows, and highlights performance spread, with scores ranging from about 64% for Gemini 1.5 Flash to roughly 78% for newer systems such as Grok 4 and GPT-5 on composite measures.
Technical details
The experiment walked models through simulated patient encounters by incrementally revealing information: demographics and symptoms, then exam findings, then labs and imaging. Medical students scored outputs against established answer keys from the MSD Manual. The evaluation broke each task into stages: initial differential diagnosis, appropriate test ordering, final diagnosis, and treatment planning, using PrIME-LLM to quantify stage-wise competence. A key technical constraint: web search and external plugins were disabled, so the models operated on their base weights and internal knowledge alone. Performance characteristics:
- Initial differential generation, the weakest stage, failed >80% of the time across models.
- With full clinical data, final diagnosis failure rates dropped below 40%, and top models exceeded 90% accuracy on endpoint diagnosis.
- Models tended to list possibilities but poorly prioritized likelihoods and next-step testing, indicating a gap in uncertainty handling and causal reasoning.
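The staged evaluation described above can be sketched as a simple scoring loop. The stage names follow the article, but the overlap-based scoring and equal weighting here are hypothetical illustrations, not the published PrIME-LLM rubric.

```python
# Hypothetical sketch of a staged clinical-vignette scorer.
# Stage names follow the article; the scoring function and weights
# are illustrative assumptions, not the published PrIME-LLM rubric.

STAGES = ["initial_differential", "test_ordering", "final_diagnosis", "treatment_plan"]

def score_stage(model_answer: set, answer_key: set) -> float:
    """Fraction of answer-key items the model's response covered."""
    if not answer_key:
        return 1.0
    return len(model_answer & answer_key) / len(answer_key)

def composite_score(responses: dict, keys: dict) -> float:
    """Equal-weight average of per-stage scores."""
    return sum(score_stage(responses[s], keys[s]) for s in STAGES) / len(STAGES)

# Example: a model that nails the final diagnosis but misses most of the
# early differential -- the failure pattern the study reports.
key = {
    "initial_differential": {"PE", "ACS", "pneumonia"},
    "test_ordering": {"d-dimer", "CT angiogram"},
    "final_diagnosis": {"PE"},
    "treatment_plan": {"anticoagulation"},
}
resp = {
    "initial_differential": {"ACS"},  # weak early-stage reasoning
    "test_ordering": {"d-dimer", "CT angiogram"},
    "final_diagnosis": {"PE"},
    "treatment_plan": {"anticoagulation"},
}
print(round(composite_score(resp, key), 3))  # → 0.833
```

A stage-wise report like this surfaces exactly what an endpoint-only metric hides: the model scores 1/3 on the differential stage while posting perfect marks downstream.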
Context and significance
Differential diagnosis is the procedural core of clinical reasoning, and the study reframes LLM evaluation away from single-endpoint metrics to workflow-stage fidelity. The finding matters because real-world clinical encounters are information-sparse and noisy; models that only excel when fed complete data will perform unpredictably at triage or early symptom checking. This undercuts claims that off-the-shelf chatbots are ready for unsupervised frontline use. The study also exposes a recurring technical limitation: LLMs are good at pattern completion and recall, but they lack robust internal mechanisms for hypothesis generation, probabilistic prioritization, and cost-aware test selection. That separation aligns with broader observations about chain-of-thought training and the difference between surface fluency and structured causal reasoning.
What to watch
Model improvement paths that could close this gap include targeted fine-tuning on staged clinical tasks, supervised calibration for probabilistic ranking, integration of explicit probabilistic modules, and multimodal inputs for early tests and imaging. Regulators and health systems should require stage-wise validation and prospective clinical trials before authorizing unsupervised deployments. Short term, expect vendors to emphasize data-complete use cases, tool-assisted workflows inside EHRs, and guardrails that route uncertain cases to clinicians.
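The article does not specify a calibration method; temperature scaling is one standard approach to the "supervised calibration for probabilistic ranking" mentioned above. A minimal sketch, with made-up candidate scores:

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution; T > 1 flattens it."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw confidence scores for three differential candidates.
scores = [4.0, 2.0, 1.0]

overconfident = softmax(scores, temperature=1.0)
calibrated = softmax(scores, temperature=2.0)  # tempers overconfidence

print([round(p, 2) for p in overconfident])  # top candidate dominates
print([round(p, 2) for p in calibrated])     # more honest spread
```

The temperature would be fit on held-out, clinician-labeled cases; the point is that ranking plus calibrated probabilities, not just a list of possibilities, is what triage-stage use would require.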
"These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information," said Arya Rao, lead author. "Despite continued improvements, off-the-shelf large language models are not ready for unsupervised clinical-grade deployment," added Marc Succi, co-author.
Bottom line
The study provides a practical diagnostic for the field: endpoint accuracy is necessary but not sufficient. For safe clinical use, LLMs must demonstrate reliable staged reasoning, calibrated uncertainty, and appropriate test-selection behavior, not only the ability to echo textbook answers when full data is present.
Scoring Rationale
The paper is an important, practice-oriented evaluation that shifts how researchers and vendors must validate LLMs for healthcare. It is not a frontier model release, but it meaningfully changes evaluation expectations and deployment guardrails for clinical AI.