Clinical Reasoning Graphs evaluate LLM diagnostic consistency
A new UCSF study finds that large language models reaching 60-70% diagnostic accuracy on complex clinical cases do not necessarily reason consistently: across 750 diagnostic traces from five LLMs on 50 New England Journal of Medicine case-conference problems, graph-similarity scores were nearly identical whether cases were clinically similar or not, and nearly identical whether models got the diagnosis right (0.488) or wrong (0.484). The paper, a Spotlight Paper at an ICML 2026 health-data workshop, introduces "clinical reasoning graphs," a structured way to extract and compare a model's diagnostic reasoning steps rather than just its final answer. It also finds that prompting models to reflect increases explicit feature analysis by 33 percent without making reasoning more consistent across similar cases. The authors release their ontology, extraction pipeline, and data publicly.
For practitioners, the headline result is a warning about a specific evaluation blind spot: two models (or the same model on two cases) can reach the same diagnostic accuracy while reasoning in different, non-reproducible ways, and standard accuracy metrics cannot tell the difference. That matters directly for anyone validating a clinical LLM for deployment, because accuracy alone is not evidence that the model follows a stable, auditable diagnostic process.
What happened
Researcher Nisarg A. Patel (UCSF) introduces clinical reasoning graphs: structured graph representations extracted from free-text LLM diagnostic traces using a domain-grounded ontology with 5 node types and 7 edge types (arXiv:2606.29876). The pipeline was applied to 750 traces from five LLMs across 50 New England Journal of Medicine Clinicopathological Conference cases and three prompt conditions. The core test asked whether a model's reasoning graph looks more similar across clinically similar cases than across dissimilar ones (a "diagnostic schema"). Across 15 model-condition comparisons, within-cluster and between-cluster similarity were nearly equal, and no comparison survived multiple-testing correction. Graph similarity was also nearly identical for model pairs that were both correct (0.488) and both incorrect (0.484), meaning the reasoning-graph structure captures something accuracy does not. Structured reflection prompting increased explicit discriminating-feature analysis in traces by 33 percent but did not improve cross-case consistency. The work was accepted as a Spotlight Paper at the Workshop on Structured Data for Health at ICML 2026 in Seoul.
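To make "survived multiple-testing correction" concrete, here is a minimal sketch of a Holm-Bonferroni step-down procedure applied to 15 comparisons. The summary above does not say which correction the paper used, so Holm-Bonferroni is an assumption chosen for illustration, and the p-values are invented placeholders rather than results from the study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is tested against alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # step-down: every larger p-value also fails
    return rejected


# Invented p-values for 15 model-condition comparisons (illustrative only).
example_p = [0.004, 0.03, 0.04, 0.08, 0.12, 0.20, 0.25, 0.30,
             0.41, 0.50, 0.55, 0.60, 0.72, 0.85, 0.90]
survivors = holm_bonferroni(example_p)
print(f"{sum(survivors)} of {len(example_p)} comparisons survive correction")
```

In this toy input, three comparisons fall below an uncorrected 0.05 threshold, yet none survives: the smallest p-value (0.004) is above 0.05 / 15, so the procedure stops at the first step.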
Technical context
Clinical reasoning graphs sit between free-text chain-of-thought (hard to compare systematically) and a bare final-answer label (which discards all reasoning information). By typing each reasoning step as a node or edge in a fixed ontology, the method lets researchers run statistical tests on whether a model reuses the same diagnostic schema for similar presentations, the kind of consistency a human clinician is expected to show. The finding that graph similarity does not track correctness is the more surprising result: it suggests that models are not inconsistent because they are wrong; they are inconsistent whether or not they reach the right diagnosis.
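The schema test described above can be sketched in a few lines, under several assumptions. Each graph is reduced to a set of typed edges, similarity is plain Jaccard overlap, and significance comes from a label-shuffling permutation test. None of these choices is confirmed as the paper's method, and the node and edge type names (`finding`, `supports`, and so on), case IDs, and cluster labels are hypothetical placeholders, not the published ontology.

```python
import random
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity between two sets of typed edges."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def within_between(graphs, clusters):
    """Mean similarity over same-cluster pairs and over cross-cluster pairs."""
    within, between = [], []
    for c1, c2 in combinations(sorted(graphs), 2):
        score = jaccard(graphs[c1], graphs[c2])
        (within if clusters[c1] == clusters[c2] else between).append(score)
    # Assumes both groups are non-empty; mean() raises on an empty list.
    return mean(within), mean(between)


def permutation_p_value(graphs, clusters, n_perm=1000, seed=0):
    """One-sided p-value for the within-minus-between gap under shuffled labels."""
    rng = random.Random(seed)
    w, b = within_between(graphs, clusters)
    observed = w - b
    cases = sorted(clusters)
    labels = [clusters[c] for c in cases]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        w_s, b_s = within_between(graphs, dict(zip(cases, labels)))
        hits += (w_s - b_s) >= observed
    return (hits + 1) / (n_perm + 1)


# Toy graphs: each edge is (source_node_type, edge_type, target_node_type).
graphs = {
    "case_a": {("finding", "supports", "hypothesis"), ("hypothesis", "tested_by", "test")},
    "case_b": {("finding", "supports", "hypothesis"), ("finding", "rules_out", "hypothesis")},
    "case_c": {("hypothesis", "tested_by", "test")},
    "case_d": {("finding", "rules_out", "hypothesis"), ("hypothesis", "tested_by", "test")},
}
clusters = {"case_a": "cardiac", "case_b": "cardiac", "case_c": "renal", "case_d": "renal"}

w, b = within_between(graphs, clusters)
p = permutation_p_value(graphs, clusters)
print(f"within={w:.3f} between={b:.3f} p={p:.3f}")
```

A stable diagnostic schema would show up as within-cluster similarity clearly above between-cluster similarity with a small p-value; the study reports the two as nearly equal.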
For practitioners
Teams building or evaluating clinical LLMs should treat this as evidence that accuracy benchmarks alone are insufficient for deployment decisions in settings where auditability, clinician trust, or regulatory review matters. The authors' public release of the ontology, extraction pipeline, validation protocol, and extracted graphs lowers the bar for other teams to run the same process-level audit on their own models before or after deployment, rather than relying solely on standard accuracy or benchmark leaderboards.
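One simple form such an audit could take is to report graph similarity separately for model pairs that were both correct, both incorrect, or split, instead of reporting accuracy alone. The sketch below assumes similarity scores have already been computed for pairs of models on the same case; the function name, record layout, and toy values are hypothetical and are not taken from the authors' released pipeline.

```python
from collections import defaultdict
from statistics import mean


def similarity_by_correctness(pair_scores, correct):
    """Average pairwise graph similarity, grouped by diagnostic correctness.

    pair_scores: iterable of (case_id, model_a, model_b, similarity).
    correct: dict mapping (case_id, model) -> bool.
    """
    groups = defaultdict(list)
    for case_id, model_a, model_b, score in pair_scores:
        a_ok = correct[(case_id, model_a)]
        b_ok = correct[(case_id, model_b)]
        if a_ok and b_ok:
            groups["both_correct"].append(score)
        elif not a_ok and not b_ok:
            groups["both_incorrect"].append(score)
        else:
            groups["mixed"].append(score)
    return {name: mean(scores) for name, scores in groups.items()}


# Toy inputs (illustrative only, not data from the paper).
pair_scores = [
    ("c1", "m1", "m2", 0.5),
    ("c1", "m1", "m3", 0.25),
    ("c2", "m1", "m2", 0.75),
    ("c2", "m2", "m3", 0.25),
]
correct = {
    ("c1", "m1"): True, ("c1", "m2"): True, ("c1", "m3"): False,
    ("c2", "m1"): False, ("c2", "m2"): False, ("c2", "m3"): True,
}
report = similarity_by_correctness(pair_scores, correct)
print(report)
```

If the both-correct and both-incorrect averages come out close together, as the study reports (0.488 versus 0.484), reasoning structure is carrying information that accuracy does not.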
What to watch
Because this is a workshop paper rather than a full peer-reviewed journal publication, the next steps are external replication and validation of the ontology across broader clinical domains, languages, and case types beyond NEJM Clinicopathological Conference cases. Also watch whether graph-level consistency correlates with real-world clinical outcomes or expert agreement, and whether benchmark providers or regulators start incorporating process-level consistency metrics alongside accuracy when assessing clinical AI tools.
Key Points
- A UCSF study finds LLM diagnostic accuracy of 60-70 percent does not imply consistent clinical reasoning across similar cases.
- Reasoning-graph similarity was nearly identical for correct (0.488) and incorrect (0.484) diagnoses across five tested models.
- Reflection prompting raised explicit feature analysis by 33 percent but did not make cross-case reasoning more consistent.
Scoring Rationale
A methodologically solid, publicly reproducible evaluation paper (accepted as an ICML 2026 workshop Spotlight Paper) that identifies a real blind spot in how clinical LLMs are assessed: accuracy does not imply consistent reasoning. Notable for practitioners and healthcare-AI evaluators, though it is a single-author workshop paper pending broader external validation, not a large-scale or industry-wide result.

