Study Finds AI Beats Doctors at Emergency Triage

According to The Conversation, a new study using written emergency department records from a hospital in Boston found a large language model identified the correct diagnosis, or a closely related one, at triage in 67% of cases; two doctors scored 50% and 55% respectively. The Conversation reports the model was tested at multiple points in patient care using real clinical notes, and that the AI worked only from text and did not examine or see patients. Editorial analysis: The result is notable because it uses real-world notes rather than exams, but the study's text-only design and the gap between experimental evaluation and clinical deployment limit immediate operational implications.
What happened
According to The Conversation, a new study evaluated a large language model on real emergency department notes from a hospital in Boston and compared its performance to two emergency physicians. The Conversation reports the model identified the correct diagnosis, or a closely related diagnosis, at triage in 67% of cases, while the two doctors achieved 50% and 55% respectively. The Conversation adds that the model was assessed at several points during care but never saw patients or performed physical examinations.
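The reported figures can be read either as an absolute gap in percentage points or as a relative improvement, and the two framings give quite different impressions. The short sketch below works through that arithmetic using only the three percentages reported above; the function names are illustrative and not drawn from the study. Because the number of cases is not stated here, it does not attempt any significance test.

```python
# Figures as reported by The Conversation: share of cases where the correct
# (or a closely related) diagnosis was identified at triage.
model_acc = 0.67
doctor_accs = [0.50, 0.55]


def gap_points(model: float, clinician: float) -> float:
    """Absolute gap in percentage points (illustrative helper)."""
    return round((model - clinician) * 100, 1)


def relative_gain(model: float, clinician: float) -> float:
    """Relative improvement over the clinician, in percent (illustrative helper)."""
    return round((model / clinician - 1) * 100, 1)


for acc in doctor_accs:
    print(
        f"vs {acc:.0%}: +{gap_points(model_acc, acc)} points, "
        f"{relative_gain(model_acc, acc)}% relative"
    )
```

The absolute gaps are 17 and 12 percentage points. The relative figures are larger and should be read with the study's small clinician comparison group (two doctors) in mind.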
Technical details
The Conversation states the evaluation used written clinical text only, drawn from genuine emergency department records, and compared AI outputs against clinician judgments across multiple tasks. The article notes prior benchmark work showing large language models could pass medical exams but frames this study as more directly relevant because it uses real clinical documentation rather than exam-style questions.
Editorial analysis - technical context
Industry-pattern observations: Text-only evaluations can overestimate real-world clinical performance because they omit bedside cues, vital-sign trends, and ad hoc team communication. Retrospective studies of this kind often report stronger relative performance for models than later prospective or live-deployment studies do, because integration, data quality, and workflow factors change in practice.
Context and significance
Editorial analysis: For practitioners, the study adds to a growing body of evidence that large language models can generate useful differential diagnoses from notes. However, The Conversation emphasizes the study's limitations, especially the lack of visual and interactional data, and cautions against assuming parity with clinicians in live settings. The article also raises governance and safety questions about whether current oversight is ready for such tools in high-stakes triage.
What to watch
Editorial analysis: Observers should look for prospective trials, external validation at other hospitals, and evaluations that include multimodal inputs and workflow integration metrics. Regulatory guidance and institution-level governance policies will be key to assessing whether similar models can move from retrospective promise to safe clinical tools.
Scoring Rationale
The story is notable because it evaluates a large language model on real emergency department notes and reports a substantial accuracy gap versus clinicians. Limitations such as text-only inputs and lack of prospective validation reduce immediate operational impact for practitioners.


