Study Evaluates VE-LLMs for ECG Image Interpretation

An exploratory study in the Journal of Medical Internet Research tested whether vision-enabled large language models can read electrocardiograms from images, and found them not yet reliable enough for clinical use. Researchers ran six general-purpose VE-LLMs, including versions of ChatGPT, Gemini, Microsoft Copilot, and Anthropic's Claude, against 70 deidentified 12-lead ECG images from routine care at University Medical Center Göttingen in Germany, using expert consensus as the reference. The models showed moderate overall accuracy but mostly low sensitivity and limited agreement with expert readings, with performance varying widely across models and diagnostic categories. Response times also differed sharply: one ChatGPT version took a median of about 4.6 minutes per ECG while Gemini answered in roughly 36 seconds. The authors conclude that current general-purpose models are insufficient to support clinical deployment for ECG interpretation.
What was tested
An exploratory evaluation published in the Journal of Medical Internet Research examined whether generalist vision-enabled large language models (VE-LLMs) can interpret electrocardiograms presented as images. The team evaluated six widely used models, including versions of OpenAI's ChatGPT, Google's Gemini, Microsoft Copilot, and Anthropic's Claude, on a retrospective set of 70 deidentified 12-lead ECG images collected during routine care in a cardiology ward at University Medical Center Göttingen, Germany. Model inference was run in July and August 2025, and expert consensus served as the reference standard. The models were prompted to perform a structured, lead-by-lead visual analysis rather than rely on pattern-matching shortcuts.
What they found
Across diagnostic categories, the models showed moderate overall accuracy but mostly low sensitivity, meaning they frequently missed true findings even when their specificity was high. For several conditions, individual models recorded sensitivity at or near zero while still posting high accuracy, a pattern that can occur when a finding is uncommon and a model mostly predicts "normal." In detecting first-degree atrioventricular block, for instance, several models registered zero sensitivity despite high specificity. Agreement with expert ECG interpretation was limited overall, and results were inconsistent from one model and condition to the next.
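The gap between high accuracy and zero sensitivity is easy to reproduce with a small calculation. The Python sketch below is illustrative only: the prevalence (5 positive cases out of 70) is a hypothetical figure, not one reported by the study, and the helper names are invented for this example. It also computes Cohen's kappa, a common chance-corrected agreement statistic, purely as an assumption about how limited agreement might be quantified; the article does not say which agreement measure the authors used.

```python
import math


def confusion_counts(reference, predicted):
    """Count TP, FP, TN, FN for paired boolean labels (True = finding present)."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    fp = sum(1 for r, p in pairs if not r and p)
    tn = sum(1 for r, p in pairs if not r and not p)
    fn = sum(1 for r, p in pairs if r and not p)
    return tp, fp, tn, fn


def screening_metrics(reference, predicted):
    """Return accuracy, sensitivity, specificity, and Cohen's kappa."""
    tp, fp, tn, fn = confusion_counts(reference, predicted)
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no cases to evaluate")

    accuracy = (tp + tn) / total
    # Sensitivity and specificity are undefined without positives/negatives.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total**2
    kappa = (accuracy - p_expected) / (1 - p_expected) if p_expected != 1 else math.nan

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# HYPOTHETICAL data: 70 ECGs, 5 with the finding per expert consensus.
reference = [True] * 5 + [False] * 65
# A model that never reports the finding (always answers "normal").
predicted = [False] * 70

result = screening_metrics(reference, predicted)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed numbers, the always-"normal" model is correct on 65 of 70 images (about 93% accuracy) with perfect specificity, yet it detects none of the five true cases, and its chance-corrected agreement with the reference is zero. That is why accuracy alone can flatter a model when a finding is uncommon.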
Speed and consistency
Performance was uneven on practical measures too. Response times varied widely between models; according to the study, one ChatGPT version took a median of roughly 4.6 minutes to begin answering, while Gemini responded in a median of about 36 seconds. That spread, combined with variable accuracy, underscores how differently these general-purpose systems behave on the same task.
The takeaway
The authors conclude that current generalist VE-LLMs deliver performance that is inconsistent across models and diagnostic categories and is insufficient to support clinical deployment for ECG interpretation. The finding fits a broader pattern in medical AI: general-purpose multimodal models can describe images fluently but tend to underperform tools purpose-built and validated for a specific clinical task. A separate comparative study reported that a dedicated ECG AI substantially outperformed general LLMs at detecting myocardial infarction from ECG images. For now, the results argue for keeping specialized tools and expert over-read central to ECG diagnosis, and treating LLM outputs as exploratory rather than diagnostic.
Key Points
- WHAT: A JMIR study tested six general-purpose vision-enabled LLMs (including ChatGPT, Gemini, Copilot, and Claude) on 70 real-world 12-lead ECG images.
- WHY: Against expert consensus, the models showed moderate accuracy but mostly low sensitivity and inconsistent results across diagnostic categories.
- SO WHAT: The authors conclude generalist VE-LLMs are not yet reliable enough for clinical ECG interpretation, keeping dedicated tools and expert review essential.
Scoring Rationale
This exploratory evaluation of VE-LLMs on ECG images is relevant to clinical AI validation and to developers, but it is not a major model release or policy event.
