Harvard AI Outperforms Doctors in ER Triage Study

Gizmodo reports that researchers at Harvard and Beth Israel Deaconess Medical Center tested the LLM o1-preview against two attending physicians on emergency department triage cases in a study published in Science. The model made the correct call in 67.1% of 76 real ED cases, versus 55.3% and 50.0% for the two physicians, and blinded physician reviewers were reportedly unable to distinguish AI-generated diagnoses from human-made ones. The study also evaluated o1-preview on 143 New England Journal of Medicine clinical vignettes; lead author Thomas Buckley is quoted by Gizmodo as saying the model included the correct diagnosis in its differential in 78.3% of those cases and suggested a helpful diagnosis in 97.9% when the differential was expanded. Coauthor Arjun Manrai is quoted saying, "I don't think our findings mean that AI replaces doctors."
What happened
According to Gizmodo, the study, published in Science by researchers at Harvard and Beth Israel Deaconess Medical Center, pitted o1-preview against two attending physicians on 76 real emergency department triage cases. The model made the correct call in 67.1% of cases, while the two physicians scored 55.3% and 50.0%, respectively, and blinded physician reviewers could not reliably tell AI-produced diagnoses from human-made ones. On a separate set of 143 New England Journal of Medicine clinical vignettes, lead author Thomas Buckley is quoted saying the model included the correct diagnosis in its differential in 78.3% of cases and suggested a helpful diagnosis in 97.9% when the differential was expanded.
Technical details
Editorial analysis - technical context: Reasoning-capable LLMs, often implemented with stepwise or chain-of-thought prompting, tend to perform better on multi-step clinical tasks such as building differential diagnoses. Observers note that test format matters: performance on curated vignettes or triage summaries can substantially overstate real-world utility, because those inputs are compressed, deidentified, and free of downstream workflow constraints. The Gizmodo piece does not provide a full technical audit of o1-preview's architecture or training data, so the reported percentages should be read as outcomes tied to the study's specific evaluation setup.
Context and significance
Industry context
Single-center or small-sample studies demonstrating LLM superiority on narrow tasks are important signals for research progress but not definitive proof of safe clinical deployment. Prior benchmarks and human baselines vary widely by dataset, as the article notes by comparing the study to a Nature-reported physician baseline of 44.5% on a larger set of vignettes. For practitioners and hospital IT teams, the key takeaway is that improved diagnostic accuracy in a study is necessary but not sufficient for operational adoption; prospective trials, fail-safes, and integration testing remain required steps.
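The small-sample caveat above can be made concrete with a standard confidence-interval calculation on the reported triage figures (67.1% of 76 cases is 51/76; 55.3% is 42/76; 50.0% is 38/76). This is editorial arithmetic, not an analysis from the study itself, using the Wilson score interval for a binomial proportion:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Reported triage figures: model 51/76 (67.1%);
# physicians 42/76 (55.3%) and 38/76 (50.0%).
print(wilson_interval(51, 76))  # model: roughly (0.56, 0.77)
print(wilson_interval(42, 76))  # physician A
print(wilson_interval(38, 76))  # physician B: roughly (0.39, 0.61)
```

The intervals overlap substantially, which is why a 17-point gap on 76 cases is a promising signal rather than definitive proof, and why replication on larger samples matters.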
What to watch
Watch for follow-up work that replicates the results across diverse hospitals and patient populations, independent audits of model outputs for calibration and failure modes, and prospective trials measuring patient-level outcomes rather than vignette or triage accuracy. Also watch for peer-review details in the Science publication and any public release of evaluation data or model prompts that would allow external validation. Finally, monitor regulatory and institutional guidance on clinical use of generative AI tools.
Scoring Rationale
The study reports a notable performance gap favoring a reasoning LLM on triage tasks, which is important to practitioners following clinical AI. However, the sample sizes and vignette vs real-world distinctions reduce near-term deployment implications, keeping the story in the "notable" range.