Researchers Report LLMs Outperform Humans in Turing Test

A peer-reviewed study in the Proceedings of the National Academy of Sciences (PNAS) by Cameron R. Jones and Benjamin K. Bergen of UC San Diego reports that GPT-4.5, when prompted to adopt a humanlike persona, was judged to be the human 73% of the time in controlled three-party Turing tests, more often than the real human was selected. Across two preregistered, randomized experiments using five-minute text chats, LLaMA-3.1-405B was judged human 56% of the time, while the baselines GPT-4o and ELIZA scored 21% and 23%, respectively. The work first appeared as an arXiv preprint in March 2025 and has now cleared peer review. The authors and outside commentators stress that the test measures conversational indistinguishability, not reasoning or understanding.
What happened
The Proceedings of the National Academy of Sciences (PNAS) has published a peer-reviewed study, "Large language models pass a standard three-party Turing test," by Cameron R. Jones and Benjamin K. Bergen of UC San Diego. The paper reports that GPT-4.5, when prompted to adopt a humanlike persona, was judged to be the human 73% of the time, significantly more often than interrogators selected the actual human participant. The work first circulated as an arXiv preprint in March 2025 (arXiv:2503.23674) and has since completed peer review.
Results
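Per the paper, the study evaluated four systems. GPT-4.5 was judged human 73% of the time and LLaMA-3.1-405B 56%, while the baselines GPT-4o and the 1960s chatbot ELIZA were judged human 21% and 23% of the time, respectively. Because the interrogator must pick exactly one of the two witnesses as the human, a 73% rate for GPT-4.5 means the real human was chosen in only 27% of those matchups. The authors describe the GPT-4.5 result as the first empirical demonstration that an artificial system passes a standard three-party Turing test.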
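For readers who want to see how a "judged human" rate relates to the 50% chance level, here is a minimal Python sketch. It takes the four reported rates, derives how often the real human was picked instead, and runs an exact two-sided binomial test against chance. The trial count (HYPOTHETICAL_TRIALS = 100) is an assumption chosen purely for illustration and is not the study's sample size, so the resulting p-values say nothing about the paper's own statistics. The function names are likewise illustrative.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null success probability p.
    """
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # Small tolerance so floating-point ties are counted as ties.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))


# Rates at which each system was judged to be the human, as reported.
reported_rates = {
    "GPT-4.5": 0.73,
    "LLaMA-3.1-405B": 0.56,
    "GPT-4o": 0.21,
    "ELIZA": 0.23,
}

# ASSUMPTION: illustrative number of matchups per system, not from the paper.
HYPOTHETICAL_TRIALS = 100


def summarize(rates: dict, trials: int) -> dict:
    """For each system, derive the human-chosen rate and a p-value vs. 50%."""
    summary = {}
    for name, rate in rates.items():
        wins = round(rate * trials)  # matchups where the AI was picked as human
        summary[name] = {
            # Three-party setup: the real human is picked whenever the AI is not.
            "human_chosen_rate": round(1 - rate, 2),
            "p_value_vs_chance": binomial_two_sided_p(wins, trials),
        }
    return summary


summary = summarize(reported_rates, HYPOTHETICAL_TRIALS)
for name, row in summary.items():
    print(f"{name}: human chosen {row['human_chosen_rate']:.0%}, "
          f"p vs. chance = {row['p_value_vs_chance']:.4g}")
```

The test is two-sided because a system can differ from chance in either direction: being picked as human far less than half the time is as informative as being picked far more often.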
Methodology
The experiments used a classic three-party imitation game: an interrogator held simultaneous five-minute text conversations with one human and one AI system, then judged which was human. The authors report two randomized, controlled and preregistered runs on independent participant populations, and note that the best-performing configuration prompted the model to adopt a humanlike persona.
Editorial analysis - what it does and does not show
Passing a three-party Turing test under controlled conditions is a notable result because the classic setup is stricter than many two-party or single-witness configurations. As a general matter, however, the Turing test measures whether a system's short-form conversation is distinguishable from a human's, not whether it reasons, plans or understands. Independent commentary reproduced by outlets such as The Conversation and LiveScience emphasizes that indistinguishability in brief chats does not imply human-level cognition.
Caveats
The findings rest on short five-minute exchanges, specific prompting, and particular participant pools, which constrain how broadly they generalize. Some commentators have also used the result to question how the field and reviewers interpret Turing-style claims, underscoring that careful framing matters.
What to watch
Useful follow-ups would vary conversation length, participant demographics and prompting regimes, and would separate surface-level linguistic mimicry from tasks requiring sustained planning, world modeling or grounded interaction. Product and safety teams will also weigh implications for detection, disclosure and deception in conversational systems.
Scoring Rationale
Peer-reviewed publication in PNAS elevates a widely discussed 2025 preprint into a citable, landmark result for LLM evaluation, with a clean 73% headline figure. Impact stays in the notable rather than paradigm-shifting range because the underlying finding was already known from the preprint, applies only to short controlled chats, and measures indistinguishability rather than capability.

