LLMs Match Radiologists Using Scoring Model

Researchers at Sun Yat-sen University retrospectively evaluated ChatGPT-4o and Claude 3.5 Sonnet on ultrasound-detected gallbladder polyps ≥1.0 cm, using data from January 2011 to January 2022, with 223 patients (48 adenomas) and a 100-patient external test set. A text-based scoring strategy produced higher accuracy than the guideline (0.34–0.35 for radiologists and LLMs vs 0.22) and reduced unnecessary resections (nonneoplastic resection rate 82–83% vs 100%), while image-based LLM analysis showed lower sensitivity.
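The guideline figures above can be checked with a short calculation. The sketch below rests on assumptions that are not stated in the summary: that the guideline recommends resection for every polyp ≥1.0 cm, that adenoma is the only neoplastic (positive) class, and that accuracy is computed over the 223-patient cohort. The function names are illustrative only.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def nonneoplastic_resection_rate(fp: int, tn: int) -> float:
    """Fraction of nonneoplastic polyps that would still be resected."""
    return fp / (fp + tn)


# Cohort counts taken from the summary.
total_patients = 223
adenomas = 48
# Assumption: every non-adenoma polyp is nonneoplastic.
nonneoplastic = total_patients - adenomas

# Assumption: the guideline flags every polyp >= 1.0 cm for resection,
# so all adenomas are true positives and all other polyps are false positives.
tp, fn = adenomas, 0
fp, tn = nonneoplastic, 0

guideline_accuracy = accuracy(tp, fp, tn, fn)
guideline_rate = nonneoplastic_resection_rate(fp, tn)

print(f"guideline accuracy: {guideline_accuracy:.2f}")
print(f"guideline nonneoplastic resection rate: {guideline_rate:.0%}")
```

Under these assumptions the guideline accuracy is 48/223, which rounds to the reported 0.22, and the nonneoplastic resection rate is 100%. Any strategy that correctly spares some nonneoplastic polyps moves true negatives above zero, which raises accuracy and lowers the resection rate together.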
Key Points
- Scoring-model LLMs reach accuracy similar to that of radiologists in classifying polyps ≥1.0 cm
- The scoring model reduces unnecessary surgeries compared with the guideline, lowering the nonneoplastic resection rate from 100% to ~82–83%
- Clinics can adopt text-based scoring workflows; image-based LLM interpretation still requires improvement
Scoring Rationale
Solid peer-reviewed evaluation with an actionable scoring workflow; limited novelty and a focus on a single medical domain reduce generalizability.
