AI Chatbots Deliver Problematic Health Advice Half the Time

A multi-institution study tested five popular chatbots across 250 medical queries and found that roughly 50% of responses were flagged as problematic, including about 20% judged highly problematic. The evaluated models (ChatGPT, Gemini, Grok, Meta AI, and DeepSeek) produced poor or fabricated reference lists, with a median reference completeness of 40%. Performance varied by domain: vaccines and cancer fared better, while nutrition and athletic performance produced the weakest answers. Open-ended prompts amplified risk, with 32% of open responses rated highly problematic versus 7% for closed questions. Only 2 of 250 requests were refused. The results underscore systemic safety and reliability gaps in current generative chat interfaces used for health advice.
What happened
A team of researchers published a systematic evaluation in BMJ Open showing that generative chatbots frequently give unsafe or misleading medical advice. The study posed 250 clinical and health questions across five domains to five widely used models and found roughly 50% of answers were flagged as problematic, with about 20% rated highly problematic.
Technical details
The study tested ChatGPT, Gemini, Grok, Meta AI, and DeepSeek on 50 questions spanning cancer, vaccines, stem cells, nutrition, and athletic performance, yielding 250 responses in total. Two domain experts independently rated every response for clinical risk, factual accuracy, and citation quality. Key quantitative findings: the median reference completeness score was 40%, only 2 responses were outright refused, and open-ended prompts produced 32% highly problematic answers versus 7% for closed prompts. Model-level performance clustered, with Grok flagged worst at 58% problematic responses, ChatGPT at 52%, and Meta AI at 50%.
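For readers who want to replicate this kind of evaluation internally, the sketch below shows one way the harness could be structured. It is illustrative only: `query_model`, `rate_response`, the rating labels, and the aggregation are assumptions, not the study's actual protocol or code.

```python
from statistics import median

# Illustrative evaluation harness, not the study's code.
# Assumed interfaces (hypothetical):
#   query_model(model, question) -> answer text from the chatbot
#   rate_response(rater, answer) -> {"risk": <label>, "reference_completeness": <percent>}

MODELS = ["ChatGPT", "Gemini", "Grok", "Meta AI", "DeepSeek"]

def evaluate(questions_by_domain, query_model, rate_response,
             raters=("expert_a", "expert_b")):
    """Pose every question to every model and collect two independent expert ratings."""
    results = []
    for model in MODELS:
        for domain, questions in questions_by_domain.items():
            for question in questions:
                answer = query_model(model, question)
                ratings = [rate_response(rater, answer) for rater in raters]
                results.append({"model": model, "domain": domain,
                                "question": question, "ratings": ratings})
    return results

def summarize(results):
    """Aggregate headline metrics: share of flagged responses and median
    reference completeness, mirroring the figures reported in the study."""
    flagged = [r for r in results
               if any(rating["risk"] != "ok" for rating in r["ratings"])]
    completeness = [rating["reference_completeness"]
                    for r in results for rating in r["ratings"]]
    return {"pct_problematic": 100 * len(flagged) / len(results),
            "median_reference_completeness": median(completeness)}
```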
What practitioners need to know
- The evaluation captured both factual errors and unsafe recommendations, including claims that sounded plausible and authoritative but were unsupported.
- Reference lists were often incomplete or fabricated, reducing the ability to triage or verify advice.
- Domains with well-structured evidence, such as vaccines and cancer, produced fewer errors but still carried substantial risk.
Context and significance
The findings reinforce an emerging pattern: high fluency and persuasive language do not guarantee clinical accuracy. This study complements prior work that found LLMs can outperform clinicians on style or empathy metrics while still producing medically incorrect outputs. For deployed health features, the combination of confident hallucinations and weak citation fidelity raises regulatory, ethical, and clinical safety issues. The disparity between closed and open prompts highlights a usability risk: typical patient queries are open and ambiguous, increasing the chance of unsafe guidance.
Why it matters for product and risk teams
If your product routes users to chat-based health help or uses LLMs for triage, assume substantial residual risk without explicit guardrails. Reference quality cannot be relied on for verification, and refusal behavior is rare. Design controls should include clear disclaimers, structured follow-up questions to narrow scope, retrieval-augmented verification pipelines, and human-in-the-loop thresholds for any recommendation with clinical consequence.
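One way these controls could fit together is sketched below. The `classify_risk`, `retrieve_evidence`, and `generate_answer` callables, the threshold, and the disclaimer text are all placeholders for whatever classifier, retrieval layer, and model client a product actually uses; this is a design illustration, not a prescribed implementation.

```python
from dataclasses import dataclass

# Hypothetical guardrail pipeline for an LLM-backed health feature (illustrative).
DISCLAIMER = ("This is general information, not medical advice. "
              "Consult a qualified clinician for decisions about your health.")

@dataclass
class Response:
    text: str
    needs_human_review: bool

def answer_health_query(query, classify_risk, retrieve_evidence, generate_answer,
                        risk_threshold=0.5):
    # classify_risk returns an estimated probability that the query has clinical consequence.
    risk = classify_risk(query)

    # High-risk queries are escalated to a human rather than answered freely.
    if risk >= risk_threshold:
        return Response(
            text=DISCLAIMER + " This question needs review by a clinician.",
            needs_human_review=True,
        )

    # Retrieval-augmented step: ground the answer in vetted sources so citations
    # can be verified, instead of trusting model-generated reference lists.
    evidence = retrieve_evidence(query)
    draft = generate_answer(query, evidence)
    return Response(
        text=f"{draft}\n\nSources: {evidence}\n\n{DISCLAIMER}",
        needs_human_review=False,
    )
```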
What to watch
Expect follow-up studies that replicate across more models and languages, and regulatory attention focused on consumer-facing medical claims. Practitioners should track improvements in citation grounding, retrieval-augmented generation adoption, and whether vendors expose uncertainty estimates or calibrated refusal policies.
Bottom line
These chat interfaces are conversational and persuasive but not yet dependable clinical advisors. Treat them as information aids requiring verification, not as substitutes for professional medical judgment.
Scoring Rationale
The study exposes systemic safety and reliability issues across multiple mainstream LLMs that many practitioners integrate into consumer-facing products. It is notable for its practical implications for deployment, risk mitigation, and regulatory attention, but it does not introduce a new technical capability or paradigm shift.