Researchers Encode Humility into Clinical AI Responses

Leo Anthony Celi, an intensive care physician at Beth Israel Deaconess Medical Center and researcher at MIT, is working to make AI models more likely to say "I don't know," the Korea Times reports. Per the article, Celi and colleagues published a framework in 2026 in BMJ Health & Care Informatics called BODHI, short for Balanced, Open-minded, Diagnostic, Humble, Inquisitive. The Korea Times reports the team wrapped GPT-4.1-mini and GPT-4o-mini in BODHI and tested both on 1,000 clinical cases; with BODHI the models paused in 735 cases, often prompting for more information. The article also reports a tradeoff: both models scored about 12 percentage points lower on standard communication-quality tests when operating under BODHI. The piece cites broader concerns about automation bias and the scale of preventable medical errors in the United States.
What happened
According to the Korea Times, Leo Anthony Celi, an intensive care doctor at Beth Israel Deaconess Medical Center who leads research at MIT's Institute for Medical Engineering and Science and teaches at Harvard Medical School, and his team proposed a framework called BODHI (Balanced, Open-minded, Diagnostic, Humble, Inquisitive) and published it in 2026 in BMJ Health & Care Informatics. The report says the researchers wrapped GPT-4.1-mini and GPT-4o-mini in BODHI and tested both models on 1,000 clinical cases. Under BODHI, the models paused in 735 cases (73.5%), often asking for more context. The study also observed a drop of about 12 percentage points on standard communication-quality benchmarks when the models used BODHI, the report says. The article quotes Celi: "Right now, we use AI as an oracle. We could use it as a coach."
Technical details
Per the Korea Times summary of the published work, BODHI requires the model to assess its confidence and identify unknowns before answering. The article describes the implementation as a wrapper around existing LLM endpoints, applied to GPT-4.1-mini and GPT-4o-mini and evaluated on clinical vignettes. The reported metrics include a count of pause-or-ask responses (735 of 1,000 cases) and communication-quality benchmark scores, which fell by about 12 percentage points under BODHI.
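The article does not describe BODHI's prompts, thresholds, or output format, so the following is only a minimal Python sketch of the general assess-then-answer wrapper pattern it describes, not the authors' implementation. The names call_model, ASSESS_PROMPT, and min_confidence, and the JSON self-report format, are all assumptions introduced for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical prompt: the real BODHI prompts are not given in the article.
ASSESS_PROMPT = (
    "Before answering, reply with JSON only: "
    '{"confidence": <0 to 1>, "unknowns": [<missing facts>]}\n\nCase: '
)


@dataclass
class WrappedReply:
    paused: bool                 # True when the wrapper withholds an answer
    text: str                    # answer, or a request for more information
    unknowns: List[str] = field(default_factory=list)


def humble_answer(
    case: str,
    call_model: Callable[[str], str],  # assumed: takes a prompt, returns text
    min_confidence: float = 0.8,       # assumed threshold, not from the study
) -> WrappedReply:
    """Ask the model to self-assess first; answer only if it seems confident."""
    raw = call_model(ASSESS_PROMPT + case)
    try:
        assessment = json.loads(raw)
        confidence = float(assessment.get("confidence", 0.0))
        unknowns = [str(u) for u in assessment.get("unknowns", [])]
    except (ValueError, TypeError, AttributeError):
        # An unparseable self-assessment is treated as "not confident".
        confidence, unknowns = 0.0, ["self-assessment could not be parsed"]

    if confidence < min_confidence or unknowns:
        if not unknowns:
            unknowns = ["confidence below threshold"]
        request = "I need more information before answering: " + "; ".join(unknowns)
        return WrappedReply(paused=True, text=request, unknowns=unknowns)

    # Second pass: only reached when the model reported no gaps.
    return WrappedReply(paused=False, text=call_model("Answer the case: " + case))
```

Passing call_model in as a plain callable keeps the sketch independent of any particular vendor SDK; in practice it would wrap whatever endpoint is being evaluated.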
Editorial analysis
Industry context: Research that makes models more humble typically trades polished-sounding output for explicit uncertainty. Companies and hospitals deploying clinical AI face a known automation-bias problem, in which clinicians defer to confident-seeming systems. Similar research suggests that benchmarks often reward fluency and apparent certainty, which can mask clinical risk. This study, as reported, exemplifies that tension: encouraging inquiry lowers apparent competence on standard communication metrics while potentially reducing harmful overconfidence.
What to watch
Practitioners and implementers should watch whether the peer-reviewed paper reports downstream measures of clinical safety or changes in clinician decisions, and whether external evaluations replicate the 735/1,000 pause rate and the 12 percentage point benchmark drop. Also watch how benchmark designers adapt metrics to reward appropriate uncertainty rather than surface fluency.
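One way to judge whether a replication matches the reported pause rate is to attach a confidence interval to 735/1,000. The Python sketch below uses the standard Wilson score interval and assumes the 1,000 cases can be treated as independent trials, which the article does not confirm; the function name is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Reported figures: the models paused in 735 of 1,000 cases under BODHI.
low, high = wilson_interval(735, 1000)
print(f"pause rate 73.5%, interval {low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from roughly 71% to 76%, so a replication on a similar case mix that lands well outside that band would merit a closer look.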
Scoring Rationale
This study addresses a concrete safety tradeoff in clinical AI that matters to practitioners building deployed systems. It is notable for empirically measuring uncertainty behavior, but it is specialized to healthcare and not a frontier-model breakthrough.


