mpathic Debuts mPACT Benchmark to Evaluate Chatbots' Crisis Safety

Business Insider and a GlobeNewswire press release report that Seattle-based mpathic launched mPACT, a clinician-led benchmarking suite that tests how major AI chatbots handle high-risk conversations, including Suicide Risk, Eating Disorders, and Misinformation. Axios reports that the suicide benchmark used 300 multi-turn role plays, each 10-15 turns long, created by 50 licensed clinicians. Initial results, reported by Business Insider and Axios, show that models generally avoid overtly harmful replies but perform unevenly at delivering clinically helpful interventions; `Claude Sonnet 4.5`, `GPT-5.2`, and `Gemini 2.5 Flash` ranked among the stronger performers on different dimensions. Editorial analysis: For practitioners, mPACT highlights the gap between harm-avoidance and clinically adequate support in extended, subtle-risk conversations.
What happened
Business Insider and a GlobeNewswire press release report that Seattle-based mpathic launched mPACT (mpathic Psychologist-led AI Clinical Tests), a clinician-led benchmark intended to evaluate how leading chatbots respond in high-risk conversation categories including Suicide Risk, Eating Disorders, and Misinformation. Axios reports that the suicide benchmark comprises 300 multi-turn role plays, each 10-15 turns long, and that the scenarios were designed by 50 licensed clinicians. Business Insider and Axios report initial rankings showing `Claude Sonnet 4.5`, `GPT-5.2`, and `Gemini 2.5 Flash` among top performers on different metrics of safety and clinical helpfulness.
Technical details
Editorial analysis - technical context: The mPACT methodology departs from single-prompt evaluations by using clinician-authored, multi-turn role plays scored by human experts, a format that stresses longitudinal context, escalation dynamics, and subtle behavioral cues. This style of evaluation captures failure modes automated metrics often miss, such as responses that are superficially supportive yet clinically counterproductive. For ML teams, datasets and rubrics that emphasize multi-turn context increase annotation cost but produce higher-fidelity signals for downstream model iteration.
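To ground this, below is a minimal sketch of what a multi-turn, human-scored evaluation record could look like. The `Scenario`, `Turn`, and `RubricScore` classes, their field names, and the 0-4 score ranges are all illustrative assumptions; mPACT's actual data format and rubric have not been published.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str

@dataclass
class RubricScore:
    annotator_id: str
    harm_avoidance: int        # hypothetical 0-4 rating from a licensed clinician
    clinical_helpfulness: int  # hypothetical 0-4 rating

@dataclass
class Scenario:
    scenario_id: str
    domain: str  # e.g. "suicide_risk" or "eating_disorders"
    turns: list[Turn] = field(default_factory=list)
    scores: list[RubricScore] = field(default_factory=list)

    def mean_scores(self) -> dict[str, float]:
        """Average each rubric dimension across clinician annotators."""
        return {
            "harm_avoidance": mean(s.harm_avoidance for s in self.scores),
            "clinical_helpfulness": mean(s.clinical_helpfulness for s in self.scores),
        }
```

The key design point is that the unit of evaluation is the whole multi-turn transcript, not a single prompt-response pair, which is what allows escalation dynamics across 10-15 turns to be scored at all.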
What the tests found
Business Insider and Axios report that models generally avoided explicitly dangerous answers in clear crisis prompts but showed weaker, more variable performance when risk cues were indirect or emerged over multiple turns. Axios quotes mpathic co-founder and chief business officer Danielle Schlosser saying, "Many of these systems do fairly well when the risk is very explicit." Business Insider and the GlobeNewswire release indicate that no single model dominated all dimensions; for example, `Claude Sonnet 4.5` had the highest composite safety-and-helpfulness score in the suicide benchmark, `GPT-5.2` was noted for consistent harm-avoidance, and `Gemini 2.5 Flash` placed among top performers on some measures.
Context and significance
Editorial analysis: Clinically grounded, human-scored benchmarks like mPACT matter because they map more directly to real-world harms and the ethical obligations of products used for emotional support. For practitioners, this points to two operational imperatives commonly observed in the sector: invest in multi-turn evaluation resources and combine automated metrics with expert human judgment when assessing behavior in safety-critical domains. Public reporting also notes that the models struggled more on the eating-disorder benchmark, where cues are often indirect and embedded in culturally normalized language about diet or fitness.
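One common operational pattern that follows from this (our illustration, not a description of mpathic's or any lab's pipeline) is tiered triage: let an automated risk score clear the unambiguous cases and reserve expert clinician review for the ambiguous middle band, where indirect cues live.

```python
def triage(risk_score: float, low: float = 0.2, high: float = 0.8) -> str:
    """Route one conversation by automated risk score.

    The thresholds are illustrative and would need per-domain calibration;
    the upstream risk classifier is assumed, not specified here.
    """
    if risk_score >= high:
        return "escalate"       # explicit crisis signal: immediate safety response
    if risk_score <= low:
        return "auto_pass"      # clearly benign: conserve clinician time
    return "human_review"       # indirect or emerging cues: expert judgment

# Example with hypothetical classifier outputs:
for score in (0.05, 0.45, 0.93):
    print(f"{score:.2f} -> {triage(score)}")
```

This concentrates scarce clinician hours exactly where the mPACT results suggest models are weakest: conversations whose risk is indirect or emerges over multiple turns.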
Limitations and caveats
Business Insider and Axios note that mpathic is a for-profit company that offers consulting and red-teaming services; Axios explicitly calls out that the company is paid to consult with leading labs. The initial mPACT results are early and based on selected scenarios and scoring rubrics created by clinicians; differences in rubric design, annotator training, and scenario selection can materially affect model rankings. Editorial analysis: Observers should treat benchmark rankings as a starting point for investigation rather than a definitive, deployment-ready certification.
What to watch
Editorial analysis: Practitioners and product teams will want to track:
- whether major labs adopt clinician-authored, multi-turn benchmarks into their internal evaluation pipelines
- how labs respond in technical changelogs or mitigation papers when gaps are documented
- the emergence of standardized rubrics that enable cross-benchmark comparability

Also watch for published details about inter-annotator agreement, rubric definitions, and whether mPACT expands to additional domains or releases deidentified test data for external validation.
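Inter-annotator agreement is worth watching because it bounds how much signal any rubric can carry. A standard summary statistic is Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch, assuming two annotators assigning one categorical label per role play (the label set and data below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_exp = sum((counts_a[label] / n) * (counts_b[label] / n)
                for label in set(labels_a) | set(labels_b))
    return (p_obs - p_exp) / (1 - p_exp)

# Two hypothetical clinicians rating six role plays:
a = ["safe", "safe", "borderline", "unsafe", "safe", "borderline"]
b = ["safe", "borderline", "borderline", "unsafe", "safe", "safe"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # kappa = 0.45
```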
Practical takeaway for ML teams
Editorial analysis: For teams building user-facing conversational agents, the mPACT findings reinforce the value of human-in-the-loop safety evaluation for nuanced, longitudinal risk. Investing in domain experts for benchmark design and scoring increases evaluation fidelity but also requires clear documentation to ensure reproducibility and to guide fine-tuning or safety-layer interventions.
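One lightweight way to meet that documentation bar (an illustration of the practice, not a standard and not mpathic's format) is to version an "evaluation card" alongside every scored run, so later fine-tuning or safety-layer decisions can cite the exact rubric and scenario set used. Every key and value below is hypothetical.

```python
import json

# Hypothetical evaluation card bundling rubric definitions with run metadata.
evaluation_card = {
    "rubric_version": "2025.1",
    "dimensions": {
        "harm_avoidance": "0-4; penalizes responses endorsing or enabling harm",
        "clinical_helpfulness": "0-4; rewards validation plus concrete referral",
    },
    "annotators": {"count": 3, "qualification": "licensed clinician"},
    "scenario_set": "suicide_risk_v1: 300 role plays, 10-15 turns each",
    "model_under_test": "example-model-id",
}

with open("evaluation_card.json", "w") as f:
    json.dump(evaluation_card, f, indent=2)
```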
Scoring Rationale
A clinician-grounded, multi-turn benchmark covering suicide and eating disorders is immediately relevant to safety teams, product managers, and researchers assessing conversational models; it is notable but not a frontier model release. The score reflects actionable relevance for practitioners and the early-stage nature of the results.