
Researchers Encode Humility into Clinical AI Responses

June 10, 2026 | By LDS Team

Relevance Score: 7.0


Leo Anthony Celi, an intensive care physician at Beth Israel Deaconess Medical Center and researcher at MIT, is working to make AI models more likely to say "I don't know," the Korea Times reports. Per the article, Celi and colleagues published a framework in 2026 in BMJ Health & Care Informatics called BODHI, short for Balanced, Open-minded, Diagnostic, Humble, Inquisitive. The Korea Times reports the team wrapped GPT-4.1-mini and GPT-4o-mini in BODHI and tested both on 1,000 clinical cases; with BODHI the models paused in 735 cases, often prompting for more information. The article also reports a tradeoff: both models scored about 12 percentage points lower on standard communication-quality tests when operating under BODHI. The piece cites broader concerns about automation bias and the scale of preventable medical errors in the United States.

What happened

The Korea Times reports that Leo Anthony Celi, an intensive care doctor at Beth Israel Deaconess Medical Center who leads research at MIT's Institute for Medical Engineering and Science and teaches at Harvard Medical School, and his team proposed a framework called BODHI and published it in 2026 in BMJ Health & Care Informatics. The Korea Times reports the framework name stands for Balanced, Open-minded, Diagnostic, Humble, Inquisitive. The same report says the researchers wrapped GPT-4.1-mini and GPT-4o-mini in BODHI and tested both models on 1,000 clinical cases, finding the models paused in 735 cases under BODHI and often asked for more context. The Korea Times also reports the study observed about a 12 percentage point drop on standard communication-quality benchmarks when the models used BODHI. The article quotes Celi: "Right now, we use AI as an oracle. We could use it as a coach."

Technical details

Per the Korea Times summary of the published work, BODHI requires the model to assess its confidence and identify unknowns before answering. The article describes the implementation as a wrapper around existing LLM endpoints, applied to GPT-4.1-mini and GPT-4o-mini, and evaluated on clinical vignettes. The reported metric set includes pause-or-ask behavior counts (735/1,000) and communication-quality benchmarks where scores fell by ~12 percentage points when the model expressed uncertainty.
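The assess-then-answer pattern described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the prompt wording, the JSON self-assessment format, the 0.8 confidence threshold, and every name (`humble_wrapper`, `WrappedReply`, `stub_model`, `pause_rate`) are assumptions made for the example. The model is passed in as a plain callable so the sketch runs without any vendor API.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# All names and prompt text here are hypothetical, not taken from the BODHI paper.
ASSESS_PROMPT = (
    "Before answering, reply with JSON only: "
    '{"confidence": <0 to 1>, "unknowns": [<missing facts>]}\n\nCase: '
)
ANSWER_PROMPT = "Answer the clinical question.\n\nCase: "


@dataclass
class WrappedReply:
    paused: bool  # True when the wrapper asked for more information
    text: str


def humble_wrapper(model: Callable[[str], str], case: str,
                   min_confidence: float = 0.8) -> WrappedReply:
    """Ask the model to self-assess first; pause if it is unsure or facts are missing."""
    try:
        assessment = json.loads(model(ASSESS_PROMPT + case))
        confidence = float(assessment["confidence"])
        unknowns = [str(u) for u in assessment["unknowns"]]
    except (ValueError, KeyError, TypeError):
        # An unparseable self-assessment is treated as "not confident".
        confidence, unknowns = 0.0, ["self-assessment could not be parsed"]

    if confidence < min_confidence or unknowns:
        gaps = "; ".join(unknowns) or "no specific gaps listed"
        return WrappedReply(True, "I don't know yet. I need more information: " + gaps)
    return WrappedReply(False, model(ANSWER_PROMPT + case))


def pause_rate(replies: List[WrappedReply]) -> float:
    """Fraction of cases in which the wrapper paused instead of answering."""
    if not replies:
        raise ValueError("no replies to summarize")
    return sum(r.paused for r in replies) / len(replies)


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch is runnable offline."""
    if prompt.startswith("Before answering"):
        if "no labs" in prompt:
            return '{"confidence": 0.4, "unknowns": ["lactate", "blood cultures"]}'
        return '{"confidence": 0.95, "unknowns": []}'
    return "stub answer"


cases = ["fever, no labs", "ankle sprain, x-ray clear"]
replies = [humble_wrapper(stub_model, c) for c in cases]
print(pause_rate(replies))
```

In this toy run the wrapper pauses on the first case and answers the second. By the same calculation, the reported 735 pauses in 1,000 cases correspond to a pause rate of 73.5%.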

Editorial analysis

Industry context: Research that boosts model humility typically trades polished-sounding outputs for explicit uncertainty. Companies and hospitals deploying clinical AI face a known automation-bias problem, where clinicians can defer to confident-seeming systems. Observed patterns in similar research show that benchmarks often reward fluency and apparent certainty, which can mask clinical risk. This study, as reported, exemplifies that tension: encouraging inquiry reduces apparent competence on standard communication metrics while potentially reducing harmful overconfidence.

What to watch

For practitioners and implementers, monitor whether the peer-reviewed paper reports downstream measures of clinical safety or clinician decision changes, and whether external evaluations replicate the 735/1,000 pause rate and the 12 percentage point benchmark drop. Also watch how benchmark designers adapt metrics to reward appropriate uncertainty rather than surface fluency.
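One way to picture a metric that rewards appropriate uncertainty is an abstention-aware scoring rule. The Python sketch below is a hypothetical rubric, not one used in the study or reported by the Korea Times: the `Outcome` fields, the point values, and the `wrong_penalty` default are all assumptions chosen to illustrate the idea that a confident wrong answer should cost more than a justified pause.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outcome:
    paused: bool           # model asked for more information instead of answering
    correct: bool          # answer judged correct (ignored when paused)
    needs_more_info: bool  # reviewers judged the case underspecified


def humility_aware_score(outcomes: Iterable[Outcome],
                         wrong_penalty: float = 2.0) -> float:
    """Mean score per case under a hypothetical rubric:
    justified pause +1, unnecessary pause 0,
    correct answer +1, wrong answer -wrong_penalty.
    """
    total = 0.0
    n = 0
    for o in outcomes:
        n += 1
        if o.paused:
            total += 1.0 if o.needs_more_info else 0.0
        else:
            total += 1.0 if o.correct else -wrong_penalty
    if n == 0:
        raise ValueError("no outcomes to score")
    return total / n


example = [
    Outcome(paused=True, correct=False, needs_more_info=True),    # justified pause
    Outcome(paused=False, correct=True, needs_more_info=False),   # correct answer
    Outcome(paused=False, correct=False, needs_more_info=True),   # confident and wrong
    Outcome(paused=True, correct=False, needs_more_info=False),   # over-cautious pause
]
print(humility_aware_score(example))
```

Under a rubric like this, a model that pauses on underspecified cases is not penalized for sounding less fluent, while over-cautious pauses still earn nothing. The right penalty weights would have to come from clinical judgment, not from this sketch.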

Key Points

1. Researchers published the BODHI framework, wrapping GPT-4.1-mini and GPT-4o-mini and triggering pauses in 735 of 1,000 clinical cases, showing measurable uncertainty.
2. Introducing explicit uncertainty reduced communication-quality scores by about 12 percentage points, highlighting a benchmark mismatch between fluency and safe clinical behavior.
3. Industry context: Studies encouraging model humility often trade perceived competence for safer behavior, so evaluation metrics must evolve to value appropriate caution.

Scoring Rationale

This study addresses a concrete safety tradeoff in clinical AI that matters to practitioners building deployed systems. It is notable for empirically measuring uncertainty behavior, but it is specialized to healthcare and not a frontier-model breakthrough.


Sources

Public references used for this report.

koreatimes.co.kr: "When humility becomes code: How AI is learning to say 'I don't know'"
