Researchers evaluate LLMs on multilingual vaccine questions

A study published in npj Vaccines (Chen et al., 2026) introduced VaxEval, a multilingual benchmark of 1,886 multiple-choice questions covering 14 vaccines in English, Spanish, and Chinese, drawn from WHO, CDC, UNICEF, Africa CDC, the American Medical Association, and Immunize.org. Thirteen LLMs - including GPT-4.5, GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, Llama-4 Maverick, and DeepSeek-V3 - were evaluated using zero-shot, few-shot, and chain-of-thought prompting. Average accuracy was 86.0% (English), 83.7% (Spanish), and 80.0% (Chinese), with GPT-4o leading at 90.3%. A counterintuitive finding: chain-of-thought prompting reduced correct-answer odds by 21% compared with zero-shot, while few-shot improved them by 17%. Models made persistent errors on dosing schedules, contraindications, and age eligibility, which the authors say underscores the need for medical oversight before clinical deployment.
Study and benchmark
Researchers published VaxEval in npj Vaccines (Chen, Wass, Wu, Garay, Vizoso, Leung, Wu, & Lin, 2026; DOI 10.1038/s41541-026-01500-1), a multilingual vaccine knowledge benchmark containing 1,886 multiple-choice questions across 14 vaccines. The dataset spans three UN languages: English (1,340 questions), Spanish (250), and Chinese (296). Question material was sourced from the WHO, CDC, UNICEF, Africa CDC, the American Medical Association, and Immunize.org, with additional peer-reviewed literature, and all answer keys were verified against trusted scientific sources.
Models and methods
The study evaluated 13 LLMs: GPT-4.5, GPT-4o, GPT-4, GPT-3.5-Turbo, Claude 3 Opus, Gemini 1.5 Pro, Llama-4 Maverick, DeepSeek-V3, Grok-3, Qwen 2.5, GLM-4, Reka Core, and Yi-Lightning. Each was tested under zero-shot, few-shot, and chain-of-thought prompting. Mixed-effects logistic regression was used to identify predictors of correctness across languages, vaccine types, and model generations.
Key findings
Average accuracy was 86.0% for English, 83.7% for Spanish, and 80.0% for Chinese. GPT-4o led at 90.3%, followed by Llama-4 Maverick (90.2%) and DeepSeek-V3 (89.6%). Newer flagship models as a group showed 57% higher odds of correct answers than older systems. Per the paper, few-shot prompting raised correct-answer odds by 17% over zero-shot. Chain-of-thought prompting showed the opposite effect: it was associated with 21% lower odds of correctness than zero-shot, a counterintuitive result the authors attribute to structured reasoning not reliably improving factual accuracy in this domain. Accuracy varied by vaccine: influenza (90.5%) and HPV (88.4%) scored highest; dengue (76.4%) and pneumococcal disease (77.7%) were lowest. By topic, misconceptions and corrections questions had the highest accuracy (93.0%); dosing and recommendation questions were lower (82.5%).
Error patterns
Error analysis of 150 sampled incorrect responses found nearly half involved overgeneralization - broad statements that did not account for vaccine-specific rules. Other frequent errors included wrong dosing intervals, misidentified contraindications, and incorrect age-based eligibility determinations. The authors note that rule-based failures of this type carry higher clinical risk than isolated factual recall errors.
Significance for practitioners
The authors state that multiple-choice accuracy does not establish clinical reliability and that prospective validation and context-specific safety evaluation are required before deployment in vaccine counseling settings. The Spanish and Chinese accuracy gaps were partly attributed to dataset composition differences - the non-English sets were independently constructed rather than translated - which the authors flag as a methodological caveat. The chain-of-thought finding is actionable: prompting strategies that improve performance in other domains do not necessarily transfer to clinical rule-based Q&A tasks.
Scoring Rationale
VaxEval is a well-constructed, peer-reviewed benchmark (npj Vaccines, Nature group) with concrete findings including language-stratified accuracy, a counterintuitive chain-of-thought result, and actionable error-type breakdowns. Relevant to medical AI and LLM evaluation practitioners, but scope is a single applied domain; 5.8 reflects solid research that advances a niche area without reshaping the broader LLM landscape.
Practice interview problems based on real data
1,500+ SQL & Python problems across 15 industry datasets — the exact type of data you work with.
Try 250 free problems

