LLaMA 3.1 Extracts Structured Information from Brain MRI Reports

Per the arXiv preprint 2606.07721, researchers evaluated an open-weight large language model, LLaMA 3.1, on 947 Dutch brain MRI reports from a tertiary memory clinic (2016-2021). Medical-student annotators labeled thirty variables, and 100 reports were double-annotated to measure inter-rater reliability, according to the paper. The authors report strong zero-shot performance on visual rating scores, for example Medial Temporal Atrophy at 90% (95% CI 77-100%) for the left side and 96% (95% CI 94-99%) for the right, and 93% accuracy (95% CI 92-95%) in detecting microbleed mentions. Numerical counts were weaker in zero-shot but improved with few-shot prompting: the paper reports microbleed-count accuracy rising from 80% to 92% (95% CI 90-93%) with structural-similarity-based example selection. Editorial analysis: The study suggests that open-weight LLMs can perform accurate clinical extraction on non-English radiology text, and that few-shot example selection materially helps numeric extraction.
What happened
Per the arXiv preprint 2606.07721, the authors analyzed 947 brain MRI reports written by consultant neuroradiologists at a tertiary memory clinic between 2016 and 2021. Medical students annotated thirty target variables and double-annotated 100 reports for inter-rater reliability. The paper evaluates the open-weight model LLaMA 3.1 on the Dutch reports and on English translations, using zero-shot and few-shot prompting with different example-selection strategies.
Technical details
Per the preprint, evaluation metrics included balanced accuracy for categorical labels, accuracy and mean absolute error for counts, and text-similarity measures for free-text outputs. The team compared zero-shot performance to few-shot prompting where examples were selected via structural similarity among candidate reports.
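The three count and category metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration under standard definitions, not the authors' evaluation code; the function names are hypothetical, and balanced accuracy is taken here as the mean of per-class recall.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that occur in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def count_accuracy(y_true, y_pred):
    """Share of reports where the extracted count matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between extracted and annotated counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Balanced accuracy matters for rating scales because a model that always predicts the most common score can look good on plain accuracy while recovering none of the rarer scores.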
Results
The preprint reports high zero-shot performance on visual rating scales: Medial Temporal Atrophy left 90% (95% CI 77-100%) and right 96% (95% CI 94-99%), Global Cortical Atrophy 87% (95% CI 83-91%), and Fazekas 94% (95% CI 93-96%). Detection of microbleed mentions reached 93% accuracy (95% CI 92-95%); infarct mentions 82% (95% CI 80-84%). Text similarity for lesion location achieved 0.95 (95% CI 0.95-0.96). Numerical extraction was weaker in zero-shot: number of microbleeds 80% (95% CI 78-82%) and infarct counts 66% (95% CI 63-68%). The authors report that few-shot prompting with structural similarity selection improved numerical extraction to 92% (95% CI 90-93%) for microbleeds and 81% (95% CI 77-85%) for infarcts. English translations produced comparable results, per the paper.
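The summary here does not say how the 95% confidence intervals were computed. One common choice for accuracy-style metrics is a percentile bootstrap over reports, sketched below as an assumption rather than the paper's method; `bootstrap_ci` and its parameters are hypothetical names.

```python
import random


def bootstrap_ci(correct_flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for an accuracy.

    correct_flags holds 1 for each correctly extracted report and 0 otherwise.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(correct_flags)
    stats = []
    for _ in range(n_resamples):
        # Resample reports with replacement and recompute accuracy.
        sample = [correct_flags[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Intervals from this procedure widen as the number of reports with a given label shrinks, which is one plausible reading of the wider interval on the left-side Medial Temporal Atrophy score.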
Editorial analysis - technical context
Studies that apply open-weight LLMs to clinical text offer practical reproducibility advantages over closed models. Industry-pattern observations: in projects that use LLMs for structured extraction, categorical labels and named-entity detection tend to reach high accuracy more easily than exact numeric counts or highly granular location details. The reported improvement from targeted few-shot example selection aligns with prior work showing that retrieval- or similarity-based example choice helps with low-frequency or numeric tasks.
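Similarity-based example selection can be sketched as follows. The paper's structural-similarity measure is not specified in this summary, so the sketch swaps in Python's `difflib.SequenceMatcher` character-level ratio as a stand-in; `select_examples`, `build_prompt`, and the prompt layout are hypothetical.

```python
from difflib import SequenceMatcher


def select_examples(target_report, candidate_pool, k=3):
    """Return the k annotated reports most similar to the target.

    candidate_pool is a list of (report_text, annotation) pairs.
    Similarity is a stand-in measure, not the paper's.
    """
    ranked = sorted(
        candidate_pool,
        key=lambda pair: SequenceMatcher(None, target_report, pair[0]).ratio(),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(target_report, examples, instruction):
    """Assemble a few-shot prompt: instruction, worked examples, then the target."""
    parts = [instruction]
    for text, annotation in examples:
        parts.append(f"Report: {text}\nAnswer: {annotation}")
    parts.append(f"Report: {target_report}\nAnswer:")
    return "\n\n".join(parts)
```

The idea is that a report phrased like the target, with its count already annotated, shows the model how that phrasing maps to a number, which is where zero-shot prompting was weakest.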
Context and significance
Industry context: For practitioners curating research cohorts or building registries from radiology text, the results imply that open-weight LLMs like LLaMA 3.1 can automate many visual-rating and mention-detection tasks on non-English reports, while numeric extraction may need tuned prompting or additional post-processing.
What to watch
Follow-up indicators include replication on larger multi-center datasets, head-to-head comparisons with clinical-domain tuned models, and evaluation of downstream dataset bias or error propagation into research cohorts. The authors note comparable performance on English translations, which suggests adaptability across languages but invites broader validation.
Scoring Rationale
This is a single early-stage arXiv preprint showing that an open-weight LLaMA model can extract structured ratings from non-English (Dutch) neuroradiology reports. It is solid applied clinical-NLP work but narrow in scope and unreplicated, which places it in the interesting-research tier rather than a major release.