Study Finds AI Beats Doctors at Emergency Triage

Relevance Score: 7.0

According to The Conversation, a new study using written emergency department records from a hospital in Boston found that a large language model identified the correct diagnosis, or a closely related one, at triage in 67% of cases; two doctors scored 50% and 55% respectively. The Conversation reports that the model was tested at multiple points in patient care using real clinical notes, and that the AI worked only from text and did not examine or see patients. Editorial analysis: The result is notable because it uses real-world notes rather than exam-style questions, but the study's text-only design and the gap between experimental evaluation and clinical deployment limit its immediate operational implications.

What happened

According to The Conversation, a new study evaluated a large language model on real emergency department notes from a hospital in Boston and compared its performance to two emergency physicians. The Conversation reports the model identified the correct diagnosis, or a closely related diagnosis, at triage in 67% of cases, while the two doctors achieved 50% and 55% respectively. The Conversation adds that the model was assessed at several points during care but never saw patients or performed physical examinations.

Technical details

The Conversation states the evaluation used written clinical text only, drawn from genuine emergency department records, and compared AI outputs against clinician judgments across multiple tasks. The article notes prior benchmark work showing large language models could pass medical exams but frames this study as more directly relevant because it uses real clinical documentation rather than exam-style questions.
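
The figures as summarised here do not include the number of cases evaluated, so how much weight the gap deserves depends on sample size. The following is a minimal sketch of how a reader might put rough uncertainty bounds on the reported percentages. It assumes a hypothetical case count (`n_cases` is illustrative only and is not taken from the study), uses the labels "doctor A" and "doctor B" purely for convenience, and applies a standard Wilson score interval rather than any method the study itself is known to have used.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - margin, centre + margin


# HYPOTHETICAL case count, chosen only for illustration; not from the study.
n_cases = 100

# Accuracy figures as reported by The Conversation; the labels are assumptions.
reported = {"model": 0.67, "doctor A": 0.50, "doctor B": 0.55}

for name, acc in reported.items():
    lo, hi = wilson_interval(round(acc * n_cases), n_cases)
    print(f"{name}: {acc:.0%} (approx. 95% interval {lo:.0%} to {hi:.0%})")
```

With a small case count the intervals are wide and may overlap; with a larger one they tighten. If the model and the doctors were scored on the same cases, a paired comparison such as McNemar's test would likely be more appropriate than comparing separate intervals.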

Editorial analysis - technical context

Industry-pattern observations: Text-only evaluations commonly overestimate real-world clinical performance because bedside cues, vital-sign trends, and ad hoc team communications are absent. Comparable retrospective studies often report higher relative performance for models than later prospective or live-deployment studies do, because integration, data quality, and workflow factors change in practice.

Context and significance

Editorial analysis: For practitioners, the study adds to a growing body of evidence that large language models can generate useful differential diagnoses from notes. However, The Conversation emphasizes the study's limitations, especially the lack of visual and interactional data, and cautions against assuming parity with clinicians in live settings. The article also raises governance and safety questions about whether current oversight is ready for such tools in high-stakes triage.

What to watch

Editorial analysis: Observers should look for prospective trials, external validation at other hospitals, and evaluations that include multimodal inputs and workflow integration metrics. Regulatory guidance and institution-level governance policies will be key to assessing whether similar models can move from retrospective promise to safe clinical tools.

Key Points

  1. Retrospective evaluation on real emergency notes shows a model achieving **67%** correct or related diagnoses at triage versus **50%** and **55%** for two physicians.
  2. Industry-pattern observations: text-only LLM evaluations often overestimate performance compared with prospective, multimodal clinical deployments.
  3. Practitioners should prioritise external validation, prospective trials, and governance frameworks before operational use in emergency triage.

Scoring Rationale

The story is notable because it evaluates a large language model on real emergency department notes and reports a substantial accuracy gap versus clinicians. Limitations such as text-only inputs and lack of prospective validation reduce immediate operational impact for practitioners.
