
Researchers Report LLMs Outperform Humans in Turing Test

June 9, 2026 | By LDS Team

Relevance Score: 6.5


A peer-reviewed study in the Proceedings of the National Academy of Sciences (PNAS) by Cameron R. Jones and Benjamin K. Bergen of UC San Diego reports that GPT-4.5, when prompted to adopt a humanlike persona, was judged to be the human 73% of the time in controlled three-party Turing tests, more often than interrogators picked the actual human. Across two preregistered, randomized experiments using five-minute text chats, LLaMA-3.1-405B was judged human 56% of the time, while the baselines GPT-4o and ELIZA were judged human 21% and 23% of the time, respectively. The work first appeared as an arXiv preprint in March 2025 and has now cleared peer review. The authors and outside commentators stress that the test measures conversational indistinguishability, not reasoning or understanding.

What happened

The Proceedings of the National Academy of Sciences (PNAS) has published a peer-reviewed study, "Large language models pass a standard three-party Turing test," by Cameron R. Jones and Benjamin K. Bergen of UC San Diego. The paper reports that GPT-4.5, when prompted to adopt a humanlike persona, was judged to be the human 73% of the time, significantly more often than interrogators selected the actual human participant. The work first circulated as an arXiv preprint in March 2025 (arXiv:2503.23674) and has since completed peer review.

Results

Per the paper, the study evaluated four systems. GPT-4.5 was judged human 73% of the time and LLaMA-3.1-405B 56%, while the baselines GPT-4o and the 1960s chatbot ELIZA were judged human 21% and 23% of the time, respectively. The authors describe the GPT-4.5 result as the first empirical demonstration that an artificial system passes a standard three-party Turing test.

Methodology

The experiments used a classic three-party imitation game: an interrogator held simultaneous five-minute text conversations with one human and one AI system, then judged which was human. The authors report two randomized, controlled and preregistered runs on independent participant populations, and note that the best-performing configuration prompted the model to adopt a humanlike persona.
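To make the arithmetic behind a three-party result concrete, here is a minimal sketch, not the authors' analysis. It assumes each trial forces the interrogator to pick exactly one of the two witnesses as human, so the AI's "judged human" rate and the real human's rate sum to 100%, and chance is 50%. The trial count of 100 is a hypothetical value chosen for illustration and is not taken from the paper. The function name is also an illustrative choice.

```python
from math import comb


def binomial_p_value_above_chance(wins: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= wins) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(wins, trials + 1)
    )


# HYPOTHETICAL trial count, not from the paper; used only to illustrate the test.
trials = 100
# Number of trials in which the AI was picked as the human, chosen so the
# rate matches the reported 73% under the assumed trial count.
ai_wins = 73

ai_rate = ai_wins / trials
# In a forced two-way choice, the real human is picked in all remaining trials.
human_rate = 1 - ai_rate
p_value = binomial_p_value_above_chance(ai_wins, trials)

print(f"AI judged human: {ai_rate:.0%}")
print(f"Real human judged human: {human_rate:.0%}")
print(f"One-sided p-value against 50% chance: {p_value:.2e}")
```

Under these assumptions, a rate near 50% would mean interrogators could not tell the witnesses apart, while a rate well above 50% means the AI was picked as human more often than the person it was paired with. The actual sample sizes and statistical tests are those reported in the paper.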

Editorial analysis - what it does and does not show

Passing a three-party Turing test under controlled conditions is a notable result because the classic setup is stricter than many two-party or single-witness configurations. As a general matter, however, the Turing test measures whether a system's short-form conversation is distinguishable from a human's, not whether it reasons, plans or understands. Independent commentary reproduced by outlets such as The Conversation and LiveScience emphasizes that indistinguishability in brief chats does not imply human-level cognition.

Caveats

The findings rest on short five-minute exchanges, specific prompting, and particular participant pools, which constrain how broadly they generalize. Some commentators have also used the result to question how the field and reviewers interpret Turing-style claims, underscoring that careful framing matters.

What to watch

Useful follow-ups would vary conversation length, participant demographics and prompting regimes, and would separate surface-level linguistic mimicry from tasks requiring sustained planning, world modeling or grounded interaction. Product and safety teams will also weigh implications for detection, disclosure and deception in conversational systems.

Key Points

1. A PNAS study reports GPT-4.5, prompted to act human, was judged the human 73% of the time in three-party Turing tests.
2. The peer-reviewed result is presented as the first showing an AI passing a controlled three-party Turing test against a live human.
3. Turing-test success measures conversational indistinguishability, not reasoning or understanding, so treat it as evidence of surface realism.

Scoring Rationale

Peer-reviewed publication in PNAS elevates a widely discussed 2025 preprint into a citable, landmark result for LLM evaluation, with a clean 73% headline figure. Impact stays in the notable rather than paradigm range because the underlying finding is already known, applies to short controlled chats, and measures indistinguishability rather than capability.


Sources

Public references used for this report.


pnas.org: Large language models pass a standard three-party Turing test

arxiv.org: [2503.23674] Large Language Models Pass the Turing Test

today.ucsd.edu: AI Can Seem More Human Than Real Humans in a Classic Turing Test, Study Finds

