Large Language Models Match Logistic Regression Diagnostic Accuracy

In 2026, researchers benchmarked multiple large language models (LLMs) on classifying Parkinson disease from natural-language prompts derived from structured clinical variables in the Parkinson's Progression Markers Initiative (PPMI). On a 122-participant test set, logistic regression (LR) achieved a macro F1 of 0.960 (accuracy 0.975), while LLM few-shot prompting reached a macro F1 of up to 0.987 (accuracy 0.992). On a 31-participant temporal validation set, LLMs achieved up to 0.968 F1 versus 0.903 for LR. The study reports prompt sensitivity, stochastic variability, and a limited temporal sample size.
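The summary reports both macro F1 and accuracy, which can diverge: macro F1 averages the per-class F1 scores with equal weight, so errors on a smaller class pull it down more than they pull down accuracy. The following is a minimal sketch of how the two metrics are computed. The function name `macro_f1_and_accuracy` and the toy labels are hypothetical illustrations, not the study's code or data.

```python
from collections import Counter


def macro_f1_and_accuracy(y_true, y_pred):
    """Return (macro F1, accuracy) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # class p was predicted but wrong
            fn[t] += 1   # class t was missed

    # Per-class F1 = 2*TP / (2*TP + FP + FN); macro F1 is the unweighted mean.
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)

    macro_f1 = sum(f1_scores) / len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return macro_f1, accuracy


# Hypothetical toy labels (NOT study data): 1 = Parkinson disease, 0 = control.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
macro_f1, accuracy = macro_f1_and_accuracy(y_true, y_pred)
print(f"macro F1 = {macro_f1:.3f}, accuracy = {accuracy:.3f}")
# prints "macro F1 = 0.804, accuracy = 0.900"
```

In this toy case a single misclassified control lowers accuracy to 0.900 but macro F1 to 0.804, which shows why the two figures in the summary need not coincide.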
Key Points
- LLMs achieve up to 0.987 macro F1 on a 122-participant test set
- Multiple LLMs sustain high performance, up to 0.968 F1, on a 31-participant temporal validation set
- Prompting, model choice, and fine-tuning materially affect diagnostic stability and deployment
Scoring Rationale
Relevant and peer-reviewed with moderate novelty; small temporal validation and exploratory design significantly limit generalizability.

