AI Outperforms Doctors in Emergency Diagnosis Study

A Harvard Medical School team published a study in the journal Science showing that an OpenAI reasoning model, o1-preview, outperformed two attending physicians on emergency-department diagnostic tasks, according to reporting by NPR and Fortune. The researchers evaluated the model on real-world electronic health records at three timepoints, from triage to admission, and also tested it on published case reports and clinical vignettes. Study co-authors told Fortune and NPR that the model often matched or exceeded physician baselines but also suggested unnecessary testing in some cases. Editorial analysis: the result narrows the gap between model performance on curated exams and performance in messy clinical practice, but clinical validation, safety checks, and workflow integration remain open questions.
What happened
A study led by researchers at Harvard Medical School and Beth Israel Deaconess Medical Center was published in the journal Science, and reporters at NPR and Fortune covered the results. Per Fortune, the team compared emergency-room diagnoses produced by an OpenAI reasoning model, o1-preview, against diagnoses from two internal-medicine attending physicians; two additional attending physicians, blinded to the source of each diagnosis, then adjudicated the competing outputs. Fortune and NPR report that the model matched, and often outperformed, the physician baseline on real-world cases drawn from electronic health records as well as on published case reports and clinical vignettes. Adam Rodman, a Beth Israel author, told Fortune, "I thought it was going to be a fun experiment but it wouldn't work that well. That was not at all what happened." Peter Brodeur, a co-author, told Fortune that prior multiple-choice benchmarks are effectively at ceiling for current models.
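To make that comparison concrete, here is a minimal sketch of a blinded pairwise adjudication of the kind the reporting describes. The data fields, the judge callback, and the scoring rule are illustrative assumptions, not the study's actual protocol.

```python
# Hypothetical sketch of blinded pairwise adjudication (names and fields
# are assumptions for illustration, not the study's actual protocol).
import random
from collections import Counter
from dataclasses import dataclass

@dataclass
class Case:
    case_id: str
    model_dx: str      # differential produced by the model
    physician_dx: str  # differential produced by an attending physician

def adjudicate(cases, judge):
    """Present each pair of differentials in random order so the
    adjudicator cannot tell which source produced which (blinding)."""
    tally = Counter()
    for case in cases:
        candidates = [("model", case.model_dx), ("physician", case.physician_dx)]
        random.shuffle(candidates)  # hide authorship from the judge
        # judge(case_id, [text_a, text_b]) returns 0 or 1 for the better one
        winner_source, _ = candidates[judge(case.case_id, [c[1] for c in candidates])]
        tally[winner_source] += 1
    return tally  # e.g. Counter({"model": 38, "physician": 12})
```

The shuffle is the design point worth noticing: because authorship is hidden, the adjudicators' preferences cannot be biased toward or against the model.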
Editorial analysis - technical context
Industry-pattern observations: evaluations that feed models raw, uncleaned electronic health record data are a stricter and more realistic test than multiple-choice questions or sanitized vignettes. Models that reason robustly over such data typically depend on long-context understanding, implicit temporal reasoning, and the ability to weigh noisy signals. Even so, strong performance on retrospective chart tasks does not automatically translate into safe prospective use: covariate shift, incomplete records, label noise, and documentation artifacts can inflate retrospective scores without reflecting deployment risk.
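As a rough illustration of what such an evaluation looks like, the sketch below scores a model's differential against an adjudicated final diagnosis at the three chart timepoints the study reportedly used. The record schema and the top-k scoring rule are assumptions made for the example.

```python
# Illustrative retrospective chart evaluation at three timepoints.
# The record schema ("notes", "final_dx") and top-k rule are assumptions.
TIMEPOINTS = ("triage", "initial_eval", "admission")

def topk_accuracy(records, predict, k=5):
    """Check whether the adjudicated final diagnosis appears in the model's
    top-k differential at each timepoint. Note the retrospective caveat:
    charts here are complete and labels are known, which is never true at
    the bedside, so these scores can overstate deployment performance."""
    hits = {t: 0 for t in TIMEPOINTS}
    for rec in records:
        for t in TIMEPOINTS:
            differential = predict(rec["notes"][t])  # raw note text at timepoint t
            if rec["final_dx"] in differential[:k]:
                hits[t] += 1
    n = len(records)
    return {t: hits[t] / n for t in TIMEPOINTS}
```

One would expect accuracy to rise as the chart fills in from triage to admission; a flat curve could hint that a model is leaning on documentation artifacts rather than accumulating clinical signal.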
Context and significance
Industry context
The study adds to a stream of recent work showing large language models and other AI systems reshaping medical work, from protein-structure prediction to documentation and triage. Fortune cites examples such as systems used to generate medical records, as well as a separate December study finding that clinician decision-making can be swayed by model outputs, with one reported figure of 67% influence in a specific context. The study authors told reporters the results do not mean AI is ready to replace physicians; Fortune reports they cautioned that the model sometimes recommends unnecessary testing that could harm patients.
What to watch
For practitioners, the key indicators are: prospective randomized trials that measure patient-level outcomes; external validation across hospital systems and EHR vendors (see the sketch below); evaluation of testing cascades driven by false positives; integration studies that measure how models alter clinician workflow; and regulatory review or guidance from agencies that govern clinical decision-support tools.
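For one of those indicators, external validation, a minimal sketch of the per-site breakdown practitioners should look for follows; the site field and exact-match metric are hypothetical.

```python
# Hypothetical external-validation check: per-site accuracy breakdown.
# The "site" field and exact-match metric are illustrative assumptions.
from collections import defaultdict

def per_site_accuracy(records, predict):
    """Score held-out cases separately by hospital system; a wide spread
    across sites suggests the model has fit one institution's documentation
    style or EHR vendor quirks rather than generalizable clinical signal."""
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        site = rec["site"]  # e.g. hospital system or EHR vendor
        total[site] += 1
        if predict(rec["notes"]) == rec["final_dx"]:
            correct[site] += 1
    return {site: correct[site] / total[site] for site in total}
```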
Scoring Rationale
A peer-reviewed Science study showing an LLM outperforming physicians on real-world ER diagnoses is a major signal for clinical AI research and deployment. The score reflects high technical and practical relevance, tempered by the need for prospective validation and safety evaluation.