Multimodal LLMs Produce Diagnostic Errors in Radiology

Researchers at NYITCOM, led by Milan Toma, published a 2026 study in Algorithms that tested five multimodal LLMs (GPT-5, Gemini 3 Pro, Llama 4 Maverick, Grok 4, Claude Opus 4.5 Extended) on a CT brain scan. The study found a 20% rate of fundamental diagnostic errors and wide interpretive variability. The paper reports inconsistencies in stroke characterization and in cross-model grading, and concludes that LLMs are unsuitable for autonomous diagnosis and require expert oversight.
Key Points
- Report finds 20% fundamental diagnostic error rate across five multimodal LLMs on one CT brain scan
- Shows high variability in timing, alternative diagnoses, and affected regions despite some correct primary findings
- Implies LLMs are unsuitable for autonomous radiologic diagnosis; require expert oversight and task-specific tools
Scoring Rationale
Peer-reviewed evidence of notable LLM diagnostic errors across major models, limited by single-case testing and narrow dataset.
