Why this matters for practitioners
The important finding is not that AI "fails" a psychology test - it is the architectural reason it fails. A June 2026 PNAS Nexus study provides controlled evidence that transformer-based LLMs lack an explicit top-down inhibitory mechanism for conflict resolution. That gap becomes critical in any deployment scenario requiring a model to maintain and enforce a non-default rule across long context or under competing signals: strict output-format constraints, multi-step agent instructions, adversarial robustness, or any task where the model must suppress a strongly trained prior (such as word reading, common-sense defaults, or learned phrasings) to follow an instruction that conflicts with it.
The study
"Deficient executive control in transformer attention" (Suketu Chandrakant Patel, Hongbin Wang, and Jin Fan; published in PNAS Nexus, DOI: 10.1093/pnasnexus/pgag149) tested GPT-4o and Claude 3.5 Sonnet on the color Stroop task, in which participants (or models) see color words printed in mismatched ink and must name the ink color rather than read the word. Humans handle long lists by applying top-down executive control to suppress automatic reading. According to the study, the models lack that mechanism.
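The task setup can be sketched in a few lines of Python. This is a minimal illustration, not the authors' stimulus code: the four-color palette, the (word, ink) pair representation, and the function names make_stroop_list, accuracy, and word_reading_baseline are all assumptions introduced here.

```python
import random

# Hypothetical palette; the colors used in the study are not stated in this article.
COLORS = ["red", "green", "blue", "yellow"]


def make_stroop_list(n_items, congruent=False, seed=0):
    """Return n_items (word, ink) pairs.

    In the incongruent condition the ink color always differs from the word.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n_items):
        word = rng.choice(COLORS)
        if congruent:
            ink = word
        else:
            ink = rng.choice([c for c in COLORS if c != word])
        trials.append((word, ink))
    return trials


def accuracy(trials, responses):
    """Fraction of positions where the response names the ink color."""
    if not trials or len(trials) != len(responses):
        raise ValueError("need one response per trial and at least one trial")
    correct = sum(
        response.strip().lower() == ink
        for (_, ink), response in zip(trials, responses)
    )
    return correct / len(trials)


def word_reading_baseline(trials):
    """The 'automatic' response: read the word and ignore the ink."""
    return [word for word, _ in trials]
```

On a purely incongruent list, the word-reading baseline scores zero by construction, which is the prepotent response the task asks a participant to suppress.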
Results
GPT-4o reached 91 percent accuracy on five-word incongruent lists, showing a typical Stroop-style response. As lists grew to 20 and 40 words, accuracy collapsed: researcher Suketu Patel, quoted directly by PsyPost, said GPT-4o "plummeted to just 1 percent on both the twenty-word and forty-word lists." Neuroscience News, citing the same paper, reported 57 percent at 10 words and 15 percent at 40 words for the pure-incongruent condition; the 1 percent and 15 percent figures at 40 words likely reflect different test conditions. Claude 3.5 Sonnet remained stable slightly longer but dropped to 10-24 percent at 40 words, depending on condition. In mixed trials (matching and mismatched colors combined), performance collapsed to near-zero for all models. Identical failure patterns were confirmed in GPT-5, Claude Opus 4.1, and Gemini 2.5.
What it reveals architecturally
Per Patel (PsyPost interview): "The study shows that, at the signal level, the ability to detect and resolve the conflict degrades because transformer attention can only impose a soft constraint on that automatic reading, rather than the hard one that an executive control mechanism would provide." Biological attention has three dissociable systems - alerting, orienting, and executive control; transformers implement a form of the first two but not the third. As Patel explains: "Scaling to larger models toward ASI, the implicit wager is that this gating mechanism - what neuroscience calls executive control - will emerge from more scale and data without any dedicated architecture."
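The soft-versus-hard distinction in the quote can be illustrated with a toy calculation. This is a sketch of the general property of softmax weighting, not the authors' analysis: the two-signal setup, the scores 3.0 and 5.0, and the function names are assumptions made up for illustration. A softmax assigns strictly positive weight to every input with a finite score, so the competing "word" signal is down-weighted but never removed; a hard gate removes it outright.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_weights(word_score, ink_score):
    """Soft constraint: both signals keep nonzero weight."""
    return softmax([word_score, ink_score])


def hard_masked_weights(word_score, ink_score):
    """Hard constraint: the word signal is masked out before weighting.

    word_score is deliberately ignored; masking replaces it with -inf.
    """
    return softmax([-math.inf, ink_score])


# Hypothetical scores: the ink signal is favored, the word signal still competes.
soft = soft_weights(3.0, 5.0)
hard = hard_masked_weights(3.0, 5.0)
print("soft [word, ink]:", soft)
print("hard [word, ink]:", hard)
```

In the soft case the word signal retains a small share of the weight at every position, which is one informal way to picture how residual interference could persist across a long list.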
The scaffolding workaround and its limits
Some frontier models can write and run code to pass the Stroop task - GPT-5 in Thinking mode does this. The researchers call it "just avoiding it, papering over a deficiency at the signal level." Code generation and chain-of-thought bypass the test but do not fix the structural conflict-resolution gap. The authors are exploring how executive control could be built directly into transformer architecture for long-horizon instruction following.
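To see why the code route sidesteps the conflict instead of resolving it, consider a minimal sketch of what such generated code might look like. The text encoding "WORD (ink: color)" and the name name_inks are hypothetical; the article does not say how stimuli were presented or what code the models actually wrote.

```python
import re

# Assumed stimulus encoding for illustration: "RED (ink: blue), GREEN (ink: red)"
ITEM_PATTERN = re.compile(r"(?P<word>[A-Za-z]+)\s*\(ink:\s*(?P<ink>[A-Za-z]+)\)")


def name_inks(stimulus_text):
    """Return the ink color of every item, in order, ignoring the word itself."""
    return [m.group("ink").lower() for m in ITEM_PATTERN.finditer(stimulus_text)]


print(name_inks("RED (ink: blue), GREEN (ink: red), BLUE (ink: yellow)"))
```

Once the list is parsed into fields, the program reads the ink field directly and the word never competes for the answer, so list length stops mattering. That fits the researchers' description of the approach as avoidance: nothing in it inhibits a competing response.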
The carwash prompt and viral demonstrations
Neurologist Steven Novella (The Skeptics' Guide to the Universe podcast) described a reproducible informal test: asking ChatGPT whether to walk or drive to a carwash 100 meters away returned opposite answers depending on subtle wording differences. Covered by Cybernews, IBM Think, and Newsweek, the carwash example illustrates the same surface-level prompt-sensitivity, though it is informal and anecdotal rather than controlled. WND/RealClearWire columnist Ross Pomeroy cited both the Stroop study and the carwash example to dispute Marc Andreessen's claim that AGI has already arrived - a framing that is editorial rather than scientific.
Key Points
- A PNAS Nexus study (Patel et al., CUNY) shows GPT-4o's Stroop-task accuracy collapsed from 91 percent at five words to near-zero by 40 words, replicating across GPT-5, Claude Opus 4.1, and Gemini 2.5.
- The failure is architectural: transformer attention applies only soft constraints on prepotent word-reading priors, lacking the hard top-down inhibitory control human executive attention sustains across long sequences.
- Code generation and chain-of-thought can bypass the Stroop task but do not fix the structural gap in conflict resolution - a problem that carries into any long-horizon goal-maintenance deployment.
Scoring Rationale
A controlled PNAS Nexus study identifies a structural executive-control gap in transformer architectures, with near-zero Stroop-task accuracy at 40-word lists, replicating across frontier models including GPT-5 and Claude Opus 4.1. It is relevant to practitioners working on goal maintenance, constraint adherence, and agent robustness. The score was lowered from 5.8 to 5.5 because the Stroop task is an oblique proxy for real deployment failures, the trigger article is opinion/editorial, and the carwash example is informal; the underlying paper is the genuine value.

