Security & Riskchain of thoughtadversarial attacksllmsmodel robustness

Researchers Demonstrate Chain-of-Thought Spoofing Against LLM Reasoners

|July 3, 2026|By LDS Team

7.4

Relevance Score

Researchers Demonstrate Chain-of-Thought Spoofing Against LLM Reasoners — Photo: hackaday.com · rights & takedowns

MIT-affiliated researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell showed that a technique called CoT Forgery tricks large language models into treating injected text as their own trusted chain-of-thought reasoning, achieving up to 80% attack success on frontier models including the GPT-5 family and gpt-oss-20b/gpt-oss-120b. The paper, accepted at ICML 2026, traces the flaw to role confusion: models judge whether text is trustworthy reasoning by its writing style rather than its role tag (<think> vs <user>); stripping the distinctive reasoning style collapsed attack success from 61% to 10%, confirming style rather than tags drives the vulnerability. For teams building agents or auditing systems that treat model-generated reasoning as trustworthy, the findings argue for provenance checks beyond the token stream.

For teams building AI agents or safety evaluations, this research puts a number on a vulnerability practitioners have long suspected: reasoning models can be talked into ignoring their own guardrails, not through the crude "ignore previous instructions" prompt injections of the past, but by producing text that simply sounds like the model's own internal reasoning. The gap between what a role tag says and what a model's internal representations actually track is the core finding, and it generalizes well beyond chain-of-thought - the same team measured a near-monotonic relationship between "role confusion" and attack success across 1,000 standard agent-hijacking attempts.

What happened

Independent researcher Charles Ye, with Jasmine Cui and MIT's Dylan Hadfield-Menell, published a paper - "Prompt Injection as Role Confusion," accepted at ICML 2026 - introducing CoT Forgery: an attack that injects fabricated reasoning into a user prompt or a tool output (such as a scraped webpage), causing the model to treat the forged text as its own chain-of-thought conclusion rather than untrusted input. Tested zero-shot, with no model weight access and no iterative prompt engineering, against six frontier models (gpt-oss-20b, gpt-oss-120b, o4-mini, and the GPT-5 family), CoT Forgery achieved a 60% attack success rate on the StrongREJECT harmful-request benchmark, versus near-0% for the same requests without forged reasoning. The gpt-oss models and o4-mini exceeded 80% success; even the hardened GPT-5 series reached 17-52%.

Timeline

February 12, 2026
AI red-teaming vendor Giskard documents an attack it calls "CoT Forgery," tracing the technique to community red-teaming research and classifying it under OWASP's LLM01 prompt-injection category.
April 17, 2026
Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell publish "Prompt Injection as Role Confusion," formalizing CoT Forgery with role probes and benchmarking it at 56-80% attack success across six frontier models.
July 2, 2026
Hackaday covers the paper, bringing broader public attention to the role-confusion mechanism.

Technical context

The researchers trace the failure to what they call role confusion: models are meant to use tags like <think>, <user>, and <tool> to separate trusted internal reasoning from untrusted external text, but their internal representations key off writing style instead. An ablation makes this concrete: rephrasing a forged reasoning block to strip its distinctive syntactic and lexical markers, while preserving the same argument, collapsed attack success from 61% to 10%. In a separate agent-hijacking test, where a simulated agent with shell access summarized a malicious webpage, standard prompt injection mostly failed (0-2% success, except one model at 26%) while the same command wrapped in forged reasoning succeeded 56-70% of the time across 100 trials.

For practitioners

CoT Forgery is not an entirely new concept - AI red-teaming vendor Giskard documented an attack under the same name in February 2026, tracing it to a Kaggle red-teaming writeup and classifying it under OWASP's LLM01 prompt-injection category. What this new, ICML-accepted research adds is a rigorous, cross-model measurement of the effect and a proposed diagnostic (role probes) for detecting it before generation. Teams that log, surface, or act on model reasoning - for audit trails, agent decision-making, or compliance workflows - should treat reasoning-style text as spoofable and add style-based forgery to red-team suites rather than assuming reasoning output is inherently trustworthy.

What to watch

The paper's code and role-probe methodology are public on GitHub, so expect follow-up work testing CoT Forgery against additional model families and any hardening that model providers introduce to separate role perception from writing style. The authors' broader claim - that role confusion also predicts success in generic agent-hijacking attacks, from 2% at the lowest confusion quantile to 70% at the highest - suggests defenses will need to address representational role-tracking, not just pattern-match known attack strings.

Key Points

1CoT Forgery lets attackers inject fabricated reasoning that models mistake for their own internal chain-of-thought, bypassing safety training across six frontier LLMs.
2The exploit works because models infer trust from writing style rather than the role tags meant to separate untrusted input from internal reasoning.
3Practitioners relying on model-generated reasoning for audits or agent decisions should add style-based spoofing to red-team tests instead of trusting reasoning text alone.

Scoring Rationale

An ICML 2026-accepted paper from MIT-affiliated researchers, now independently confirmed via the official ICML conference listing, its own code repository, and an independent vendor's earlier documentation of the same attack class, demonstrates chain-of-thought spoofing achieving 56-80% attack success across six frontier models including the GPT-5 family. This is a rigorous, peer-reviewed structural security finding with broad cross-model applicability, placing it at the upper end of notable-to-major rather than an active in-the-wild incident.

MoreLLMs news

Sources

Primary source and supporting public references used for this report.

5 sources

Primary sourcehackaday.comChain-of-Thought Spoofing Targets Reasoning AI Models

View 4 more sources

Practice with real Ad Tech data

90 SQL & Python problems · 15 industry datasets

Used by DS/ML engineers at top companies

Active Search Campaigns by BudgetEasy

High CPC Clicks & Poor Landing PagesMedium

Campaign ROAS by Attribution ModelHard

250 free problems · No credit card

See all Ad Tech problems