Multimodal LLMs Produce Diagnostic Errors in Radiology

Researchers at NYITCOM, led by Milan Toma, published a 2026 study in Algorithms that tested five multimodal LLMs (GPT-5, Gemini 3 Pro, Llama 4 Maverick, Grok4, Claude Opus 4.5 Extended) on a CT brain scan. The study found a 20 percent rate of fundamental diagnostic errors and wide interpretive variability. The paper reports inconsistencies in stroke characterization and cross-model grading, and concludes that LLMs are unsuitable for autonomous diagnosis and require expert oversight.
Scoring Rationale
Peer-reviewed evidence of notable LLM diagnostic errors across major models, limited by single-case testing and a narrow dataset.