Critics Highlight AI Failures on Simple Tasks

A peer-reviewed PNAS Nexus study (Patel, Wang, and Fan, CUNY) documents a structural gap in transformer architecture that practitioners should understand: LLMs lack the hard top-down inhibitory mechanism needed to suppress strongly trained priors under extended cognitive load. The study used the color Stroop task - naming the ink color of a color word when the word itself names a different color - to measure executive control. GPT-4o held 91 percent accuracy at five incongruent words, then collapsed by 40 words (to approximately 1 percent per the researcher's quotes to PsyPost, or 15 percent in the pure-incongruent condition per Neuroscience News); Claude 3.5 Sonnet dropped to roughly 10-24 percent at 40 words depending on condition. The same catastrophic failure replicated on the frontier models GPT-5, Claude Opus 4.1, and Gemini 2.5. A separate viral "carwash prompt" - where ChatGPT gives opposite walk-or-drive answers to near-identical questions about a 100-meter trip - informally illustrates a similar surface phenomenon. A WND/RealClearWire opinion piece by Ross Pomeroy used these examples to dispute Marc Andreessen's claim that AGI is already here.

Why this matters for practitioners

The important finding is not that AI "fails" a psychology test - it is the architectural reason it fails. The PNAS Nexus study (June 2026) provides controlled evidence that transformer-based LLMs lack an explicit top-down inhibitory mechanism for conflict resolution. That gap becomes critical in any deployment that requires a model to maintain and enforce a non-default rule across long context or under competing signals: strict output-format constraints, multi-step agent instructions, adversarial robustness, or any task where the model must suppress a strongly trained prior (such as word reading, common-sense defaults, or learned phrasings) to follow an instruction that conflicts with it.

The study

"Deficient executive control in transformer attention" (Suketu Chandrakant Patel, Hongbin Wang, and Jin Fan; published in PNAS Nexus, DOI: 10.1093/pnasnexus/pgag149) tested GPT-4o and Claude 3.5 Sonnet on the color Stroop task - participants (or models) see color words printed in mismatched ink and must name the ink color, not read the word. Humans handle long lists by applying top-down executive control to suppress automatic reading. Models do not have that mechanism.

Results

GPT-4o reached 91 percent accuracy on five-word incongruent lists, showing a typical Stroop-style response. As lists grew to 20 and 40 words, accuracy collapsed: per researcher Suketu Patel, quoted directly by PsyPost, GPT-4o "plummeted to just 1 percent on both the twenty-word and forty-word lists." Neuroscience News, citing the same paper, reported 57 percent at 10 words and 15 percent at 40 words for the pure-incongruent condition; the two sets of figures likely reflect different test conditions. Claude 3.5 Sonnet held up slightly longer but dropped to 10-24 percent at 40 words depending on condition. In mixed trials (matching and mismatched colors combined), performance collapsed to near-zero for all models. The same failure pattern was confirmed in GPT-5, Claude Opus 4.1, and Gemini 2.5.
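
For readers who want to probe the list-length effect themselves, the task and its scoring can be sketched in a few lines of Python. This is a minimal sketch, not the authors' protocol: the four-color set, the text encoding of "ink" (a phrase such as `the word "RED" printed in blue ink`), and the function names `make_stroop_list`, `render_prompt`, and `score` are all assumptions made for illustration. The source does not say how the stimuli were presented to the models.

```python
import random

# Assumed color set; the paper's actual stimuli are not given in this article.
COLORS = ["red", "green", "blue", "yellow"]

def make_stroop_list(n_words, incongruent=True, seed=0):
    """Build n_words (word, ink) pairs; incongruent pairs never match."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_words):
        ink = rng.choice(COLORS)
        if incongruent:
            word = rng.choice([c for c in COLORS if c != ink])
        else:
            word = ink
        trials.append((word, ink))
    return trials

def render_prompt(trials):
    """Encode the list as text (hypothetical format, one item per line)."""
    lines = [
        f'{i + 1}. the word "{word.upper()}" printed in {ink} ink'
        for i, (word, ink) in enumerate(trials)
    ]
    return "Name the ink color of each item, not the word:\n" + "\n".join(lines)

def score(trials, answers):
    """Fraction of items answered with the ink color.

    Missing answers count as wrong because we divide by len(trials).
    """
    correct = sum(
        answer.strip().lower() == ink
        for (_, ink), answer in zip(trials, answers)
    )
    return correct / len(trials)
```

Generating lists of 5, 10, 20, and 40 items with `make_stroop_list`, sending `render_prompt(trials)` to a model, and passing its parsed answers to `score` gives one accuracy value per list length, which is the shape of the comparison reported above.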

What it reveals architecturally

Per Patel (PsyPost interview): "The study shows that, at the signal level, the ability to detect and resolve the conflict degrades because transformer attention can only impose a soft constraint on that automatic reading, rather than the hard one that an executive control mechanism would provide." Biological attention has three dissociable systems - alerting, orienting, and executive control; transformers implement a form of the first two but not the third. As Patel explains: "Scaling to larger models toward ASI, the implicit wager is that this gating mechanism - what neuroscience calls executive control - will emerge from more scale and data without any dedicated architecture."
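
The soft-versus-hard distinction in Patel's quote can be made concrete with a toy calculation. The sketch below is only an illustration of the terminology, not the paper's model or a real transformer layer: the two scores are made-up numbers standing in for an ink-color signal and a word-text signal, and `hard_gate` is a hypothetical helper. The point is that a softmax always leaves some weight on the competing signal, while a hard gate removes it entirely.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_gate(scores, allowed):
    """Mask disallowed entries to -inf before softmax, so they get weight 0."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    return softmax(masked)

# Made-up scores: index 0 = ink-color signal, index 1 = word-text signal.
scores = [2.0, 1.0]

# Soft constraint: the distractor is down-weighted but never eliminated.
soft = softmax(scores)

# Hard constraint: the distractor is excluded outright.
hard = hard_gate(scores, [True, False])
```

Here `soft[1]` stays strictly positive however the scores are set (short of infinite separation), whereas `hard[1]` is exactly zero. Whether and how that residual weight produces the observed collapse on long lists is the paper's claim, not something this toy demonstrates.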

The scaffolding workaround and its limits

Some frontier models can write and run code to pass the Stroop task - GPT-5 in Thinking mode does this. The researchers call it "just avoiding it, papering over a deficiency at the signal level." Code generation and chain-of-thought bypass the test but do not fix the structural conflict-resolution gap. The authors are exploring how executive control could be built directly into transformer architecture for long-horizon instruction following.
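
To see why the researchers treat this as avoidance rather than a fix, consider what such generated code might look like. The sketch below is an assumption-laden illustration, not the code any model actually produced: it presumes the stimuli arrive as text in a hypothetical format such as `the word "RED" printed in blue ink`, and `name_ink_colors` is an invented name.

```python
import re

# Hypothetical text encoding of a Stroop item; the real format is not
# described in this article.
ITEM_PATTERN = re.compile(
    r'the word "(?P<word>\w+)" printed in (?P<ink>\w+) ink',
    re.IGNORECASE,
)

def name_ink_colors(prompt_text):
    """Return the ink color of every item found, ignoring the word itself."""
    return [m.group("ink").lower() for m in ITEM_PATTERN.finditer(prompt_text)]

example = (
    'the word "RED" printed in blue ink\n'
    'the word "GREEN" printed in red ink'
)
inks = name_ink_colors(example)  # ['blue', 'red']
```

A parser like this never experiences the conflict at all: it reads one field and discards the other, so list length is irrelevant. That is why passing the task this way says nothing about whether the model itself can suppress the competing word.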

The carwash prompt and viral demonstrations

Neurologist Steven Novella (The Skeptics' Guide to the Universe podcast) described a reproducible informal test: asking ChatGPT whether to walk or drive to a carwash 100 meters away returned opposite answers depending on subtle wording differences. Covered by Cybernews, IBM Think, and Newsweek, the carwash example illustrates the same surface-level prompt-sensitivity, though it is informal and anecdotal rather than controlled. WND/RealClearWire columnist Ross Pomeroy cited both the Stroop study and the carwash example to dispute Marc Andreessen's claim that AGI has already arrived - a framing that is editorial rather than scientific.
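
Practitioners who want something more systematic than a single anecdote can measure this kind of wording sensitivity directly. The following is a minimal sketch under stated assumptions: `ask_model` is a hypothetical callable that takes a prompt string and returns an answer string (no real API is implied), and `consistency` is an invented helper that reports how often paraphrases of one question get the same normalized answer.

```python
from collections import Counter

def consistency(ask_model, paraphrases, normalize=lambda a: a.strip().lower()):
    """Ask every paraphrase and report agreement with the most common answer.

    ask_model is a hypothetical callable: prompt string in, answer string out.
    Returns (most common answer, share of paraphrases giving it, all answers).
    """
    answers = [normalize(ask_model(p)) for p in paraphrases]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers), answers

# Stand-in model for demonstration only: it keys on surface wording,
# mimicking the flip described above. Replace with a real client call.
def stub_model(prompt):
    return "Walk" if "should i walk" in prompt.lower() else "Drive"

paraphrases = [
    "The carwash is 100 meters away. Should I walk or drive?",
    "Should I walk or drive to a carwash 100 meters away?",
    "Is it better to drive or walk to a carwash that is 100 meters away?",
]
top, agreement, answers = consistency(stub_model, paraphrases)
```

An agreement score well below 1.0 across paraphrases that mean the same thing flags prompt sensitivity. Exact-match normalization is a deliberate simplification; free-text answers from a real model would need a more careful mapping to "walk" or "drive".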

Key Points

  • A PNAS Nexus study (Patel et al., CUNY) shows GPT-4o's Stroop-task accuracy collapsed from 91 percent at five words to near-zero by 40 words, replicating across GPT-5, Claude Opus 4.1, and Gemini 2.5.
  • The failure is architectural: transformer attention applies only soft constraints on prepotent word-reading priors, lacking the hard top-down inhibitory control human executive attention sustains across long sequences.
  • Code generation and chain-of-thought can bypass the Stroop task but do not fix the structural gap in conflict resolution - a problem that carries into any long-horizon goal-maintenance deployment.

Scoring Rationale

A controlled PNAS Nexus study identifies a structural executive-control gap in transformer architectures, with near-zero Stroop-task accuracy at 40-word lists, replicating across frontier models including GPT-5 and Claude Opus 4.1. It is relevant to practitioners working on goal maintenance, constraint adherence, and agent robustness. The score was lowered from 5.8 to 5.5 because the Stroop task is an oblique proxy for real deployment failures, the trigger article is opinion/editorial, and the carwash example is informal; the underlying paper is the genuine value.
