AI Chatbots Recommend Risky Cancer Alternatives

A BMJ Open stress test found five popular chatbots delivered medically problematic answers about cancer and other health topics. Researchers probed Gemini, ChatGPT, Meta AI, Grok, and DeepSeek with 50 targeted prompts across five categories in February 2025 and rated responses for accuracy and potential harm. Overall, about 50% of answers to evidence-based questions were judged "somewhat" or "highly" problematic; cancer-specific prompts produced 30% somewhat problematic and 19.6% highly problematic responses. The audit shows chatbots can create false balance between science and non-science and even point users to concrete alternatives to chemotherapy. The study warns that continued deployment without public education and oversight risks amplifying misinformation, underlining the need for stronger safety scaffolding, clearer disclaimers, and clinician-in-the-loop workflows when AI is used for health advice.
What happened
A BMJ Open study evaluated five widely used generative chatbots and found a substantial share of medical answers were inaccurate, incomplete, or potentially harmful. Researchers from the Lundquist Institute and Harbor-UCLA stress-tested Gemini, ChatGPT, Meta AI, Grok, and DeepSeek in February 2025 with 50 prompts spanning cancer, vaccines, stem cells, nutrition, and athletic performance. Overall, roughly 50% of responses to evidence-based questions were rated "somewhat" or "highly" problematic; cancer treatment prompts alone produced 30% somewhat problematic and 19.6% highly problematic answers.
Technical details
The study combined closed and open prompts designed to stress model safety and expose false-balance behavior. Closed prompts required specific, consensus-aligned answers; open prompts asked for list-style responses where models could introduce alternatives. Outputs were scored against pre-defined objective criteria as non-problematic, somewhat problematic, or highly problematic, where a problematic response could plausibly direct lay users toward ineffective or harmful actions. Auditors flagged examples where chatbots supplied concrete guidance about alternative cancer treatments or suggested sources and places to obtain unproven therapies. The researchers highlighted model behavior that equates scientific evidence with anecdote, a failure mode driven by training on mixed-quality web signals and reward objectives that do not sufficiently weight clinical consensus.
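To make the rubric concrete, here is a minimal sketch of how such an audit could be tabulated in Python, assuming a three-level severity scale and per-response rater labels. All names and the toy data below are hypothetical illustrations, not the study's actual dataset or code.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    NON_PROBLEMATIC = "non-problematic"
    SOMEWHAT_PROBLEMATIC = "somewhat problematic"
    HIGHLY_PROBLEMATIC = "highly problematic"

@dataclass
class RatedResponse:
    model: str          # e.g. "gemini", "chatgpt", "grok"
    category: str       # e.g. "cancer", "vaccines", "nutrition"
    prompt_type: str    # "closed" (consensus answer) or "open" (list-style)
    severity: Severity  # rater judgment against pre-defined criteria

def tabulate(ratings: list[RatedResponse], category: str | None = None) -> dict[str, float]:
    """Share of responses at each severity level, optionally filtered by category."""
    pool = [r for r in ratings if category is None or r.category == category]
    counts = Counter(r.severity for r in pool)
    total = len(pool) or 1  # avoid division by zero on an empty pool
    return {s.value: counts[s] / total for s in Severity}

# Toy data only; the study's real responses and labels are not reproduced here.
sample = [
    RatedResponse("chatgpt", "cancer", "closed", Severity.NON_PROBLEMATIC),
    RatedResponse("grok", "cancer", "open", Severity.HIGHLY_PROBLEMATIC),
    RatedResponse("deepseek", "vaccines", "closed", Severity.SOMEWHAT_PROBLEMATIC),
]
print(tabulate(sample, category="cancer"))
```

A full replication would also track inter-rater agreement and report closed-prompt accuracy separately from open-prompt completeness, since the study's design distinguishes the two failure modes.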
Why it matters
Health queries are a canonical high-risk use case for generative models because users act on advice and clinicians may not be involved. The study demonstrates two systemic issues: models produce both omission errors (incomplete, clinically unsafe summaries) and commission errors (actively recommending or pointing to harmful alternatives). That combination raises the probability of delayed care or abandonment of proven treatments. The results are consistent with prior work on hallucinations and misinformation, but the concrete examples of directing users to places offering alternative therapies increase the immediacy and regulatory salience of the problem.
Implications for practitioners and deployers
Product teams and ML safety engineers must treat health verticals as adversarial and high-consequence. Practical mitigations include stricter medical intent detection, calibrated refusal policies, provenance and citation requirements, and mandatory clinician escalation paths for high-risk queries. Health systems integrating chatbots need developer controls that enforce disclaimers, evidence-graded answers, and logging for audit. Regulators and journal editors will likely demand transparency about dataset sources, safety tuning, and post-deployment monitoring.
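As an illustration of the mitigation pattern, here is a minimal guardrail sketch in Python, assuming a keyword heuristic standing in for a trained medical-intent classifier. Every name here (route_health_query, HIGH_RISK_TERMS, GuardrailDecision) is hypothetical and does not correspond to any vendor's API.

```python
from dataclasses import dataclass

# Hypothetical guardrail: in production, intent detection would be a trained
# classifier rather than a keyword list, and escalation would feed a real
# clinician review queue plus an audit log.
HIGH_RISK_TERMS = {"chemotherapy", "tumor", "cancer", "dosage", "vaccine"}

DISCLAIMER = ("This is general information, not medical advice. "
              "Consult a licensed clinician before acting on it.")

@dataclass
class GuardrailDecision:
    allow: bool                  # whether to answer at all
    escalate_to_clinician: bool  # flag for human review of high-risk queries
    response_prefix: str         # mandatory disclaimer text, if any

def route_health_query(query: str) -> GuardrailDecision:
    """Toy policy: detect high-risk medical intent, attach disclaimer, escalate."""
    is_high_risk = any(term in query.lower() for term in HIGH_RISK_TERMS)
    if is_high_risk:
        return GuardrailDecision(
            allow=True,                  # answer, but with evidence-graded content
            escalate_to_clinician=True,  # log and queue for clinician review
            response_prefix=DISCLAIMER,
        )
    return GuardrailDecision(allow=True, escalate_to_clinician=False,
                             response_prefix="")

decision = route_health_query("What are alternatives to chemotherapy?")
print(decision)
```

In a deployed system the same decision object would also be written to an audit log, satisfying the disclaimer-enforcement and logging controls described above.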
Limitations and caveats
The stress tests were performed in February 2025; model updates and safety patches since then may change behavior. The study used curated prompts that intentionally probed weaknesses; real-world user queries vary. Nevertheless, the findings expose design choices in dataset curation and reward shaping that will reappear across generations of models until addressed explicitly.
What to watch
Monitor vendor safety advisories and policy updates, replication studies that expand prompt sets and temporal coverage, and regulatory action requiring guardrails for consumer-facing medical advice. Expect vendors to iterate on intent classifiers, citation systems, and clinician-in-the-loop integrations as immediate corrective measures. "Continued deployment of these chatbots without public education and oversight risks amplifying misinformation," the researchers warn.
Scoring Rationale
This is a significant safety finding because it documents harmful outputs from multiple mainstream chatbots on high-stakes medical topics. The work is widely relevant to deployers, clinicians, and regulators, and it pressures vendors to harden medical advice pipelines.