Study finds Centaur memorizes rather than understands tasks

A July 2025 study published in Nature introduced an AI model called Centaur that reportedly matched human behavior across 160 cognitive tasks, according to ScienceDaily. A follow-up paper by researchers at Zhejiang University, published in National Science Open and summarized by ScienceDaily, challenges those claims, arguing that Centaur may be overfitting to its training data rather than demonstrating genuine task understanding. In new evaluations, the authors replaced the original task prompts with neutral instructions such as "Please choose option A" and observed that Centaur still produced the previously reported "correct" answers, a pattern the Zhejiang University team interprets as memorization. Both the paper and the ScienceDaily coverage call for stronger controls against dataset leakage and more rigorous out-of-distribution probes when evaluating models that aim to simulate human cognition.
What happened
A July 2025 study published in Nature introduced an AI model called Centaur that was reported to replicate human responses across 160 cognitive tasks, according to ScienceDaily. A newer paper from researchers at Zhejiang University, published in National Science Open and reported by ScienceDaily, re-evaluates Centaur and finds behavior consistent with overfitting and memorization rather than task-general understanding. The Zhejiang University team constructed alternate evaluation scenarios, including replacing task-specific prompts with a neutral instruction such as "Please choose option A," and reports that Centaur continued to return the answers that matched the original dataset's expected responses.
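The coverage does not include the team's evaluation code, so the sketch below is only a minimal illustration of the probe described above. Everything in it is assumed for illustration: `query_model`, the trial fields `stimulus` and `expected_answer`, and the agreement metric are hypothetical stand-ins, not the paper's actual harness.

```python
# Minimal sketch of the neutral-instruction probe described above.
# Assumptions (not from the papers): each trial carries a stimulus and
# the dataset's originally expected answer, and query_model wraps the
# model call, returning its answer as a string.

NEUTRAL_INSTRUCTION = "Please choose option A."

def neutral_instruction_probe(trials, query_model):
    """Strip each trial's task-specific prompt, append a neutral
    instruction, and measure how often the model still returns the
    dataset's originally expected answer.

    An instruction-following model should mostly answer "A"; continued
    agreement with the original labels is the memorization signal the
    Zhejiang University team reports.
    """
    agree = 0
    for trial in trials:
        # Keep the stimulus and options, drop the task-specific prompt.
        probe_prompt = f"{trial['stimulus']}\n{NEUTRAL_INSTRUCTION}"
        response = query_model(probe_prompt).strip()
        if response == trial["expected_answer"]:
            agree += 1
    return agree / len(trials)
```

A useful companion measurement is the rate at which the model literally answers "A" under the same prompts: a high "A" rate with low agreement to the old labels indicates instruction following, while the reverse pattern matches the memorization reading.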
Editorial analysis - technical context
Industry-pattern observations: Overfitting and dataset leakage are well-known failure modes for large language models, and a standing concern for claims of cognitive simulation. Researchers commonly use interventions such as prompt perturbation, held-out experimental stimuli, and synthetic out-of-distribution tests to check whether a model has learned the underlying task or merely memorized training examples. The Zhejiang University study applies this class of probes and reports behavior consistent with memorization, in line with prior work showing that superficially human-like outputs can arise from pattern matching over large training corpora.
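To make the class of probes concrete, here is a hypothetical perturbation battery of the kind this paragraph describes. The perturbation functions and `query_model` are placeholders, and the accuracy-gap metric is one common choice, not the study's own.

```python
# Illustrative perturbation battery: compare accuracy on original
# prompts against perturbed variants. Perturbation functions and
# query_model are hypothetical placeholders, not from the papers.

def accuracy(trials, query_model, transform=lambda p: p):
    """Fraction of trials where the model's answer matches the label,
    after optionally transforming each prompt."""
    hits = sum(
        query_model(transform(t["prompt"])).strip() == t["expected_answer"]
        for t in trials
    )
    return hits / len(trials)

def perturbation_gaps(trials, query_model, perturbations):
    """Accuracy drop under each named perturbation.

    A model that learned the task should be roughly invariant to
    meaning-preserving paraphrase but should abandon the original
    labels under meaning-destroying edits; staying pinned to the
    training labels in both cases points to memorization.
    """
    baseline = accuracy(trials, query_model)
    return {
        name: baseline - accuracy(trials, query_model, fn)
        for name, fn in perturbations.items()
    }
```

In practice such a battery pairs meaning-preserving edits (paraphrase, reordering of answer options) with meaning-destroying ones like the neutral instruction above, and reads the two gaps together: robustness to the former and sensitivity to the latter is the signature of genuine task learning.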
Context and significance
For researchers attempting to model human cognitive processes with large language models, the episode highlights the gap between reproducing aggregate task-level statistics and demonstrating mechanistic, generalizable competence. Public reporting framed Centaur as a step toward unified cognitive modeling; the Zhejiang University paper, as covered by ScienceDaily, urges caution and more robust evaluation methodology before equating multi-task performance with human-like understanding.
What to watch
Observers should look for independent replications of the Zhejiang University experiments, disclosures of the training and evaluation datasets used for Centaur, and whether future cognitive-modeling papers adopt stricter out-of-distribution tests or pre-registered evaluation protocols. Increased use of adversarial or instruction-agnostic probes will be a useful indicator that the field is adjusting evaluation standards.
Scoring rationale
The critique matters to researchers using large language models for cognitive modeling and to practitioners designing robust evaluation pipelines. It is notable but not industry-shaking because it refines methodology rather than introducing a new capability.