Researchers Demonstrate Chain-of-Thought Spoofing Against LLM Reasoners

MIT-affiliated researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell showed that a technique called CoT Forgery tricks large language models into treating injected text as their own trusted chain-of-thought reasoning, achieving up to 80% attack success on frontier models including the GPT-5 family and gpt-oss-20b/gpt-oss-120b. The paper, accepted at ICML 2026, traces the flaw to role confusion: models judge whether text is trustworthy reasoning by its writing style rather than by its role tag (<think> vs <user>). Stripping the distinctive reasoning style cut attack success from 61% to 10%, which points to style rather than tags as the driver of the vulnerability. For teams building agents or auditing systems that treat model-generated reasoning as trustworthy, the findings argue for provenance checks beyond the token stream.
The research puts a number on a vulnerability practitioners have long suspected: reasoning models can be talked into ignoring their own guardrails, not through the crude "ignore previous instructions" prompt injections of the past, but through injected text that simply sounds like the model's own internal reasoning. The core finding is the gap between what a role tag says and what a model's internal representations actually track, and it generalizes well beyond chain-of-thought: the same team measured a near-monotonic relationship between "role confusion" and attack success across 1,000 standard agent-hijacking attempts.
What happened
Independent researcher Charles Ye, with Jasmine Cui and MIT's Dylan Hadfield-Menell, published a paper - "Prompt Injection as Role Confusion," accepted at ICML 2026 - introducing CoT Forgery: an attack that injects fabricated reasoning into a user prompt or a tool output (such as a scraped webpage), causing the model to treat the forged text as its own chain-of-thought conclusion rather than untrusted input. Tested zero-shot, with no model weight access and no iterative prompt engineering, against six frontier models (gpt-oss-20b, gpt-oss-120b, o4-mini, and the GPT-5 family), CoT Forgery achieved a 60% attack success rate on the StrongREJECT harmful-request benchmark, versus near-0% for the same requests without forged reasoning. The gpt-oss models and o4-mini exceeded 80% success; even the hardened GPT-5 series reached 17-52%.
Timeline
- AI red-teaming vendor Giskard documents an attack it calls "CoT Forgery," tracing the technique to community red-teaming research and classifying it under OWASP's LLM01 prompt-injection category.
- Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell publish "Prompt Injection as Role Confusion," formalizing CoT Forgery with role probes and benchmarking it at 56-80% attack success across six frontier models.
- Hackaday covers the paper, bringing broader public attention to the role-confusion mechanism.
Technical context
The researchers trace the failure to what they call role confusion: models are meant to use tags like <think>, <user>, and <tool> to separate trusted internal reasoning from untrusted external text, but their internal representations key off writing style instead. An ablation makes this concrete: rephrasing a forged reasoning block to strip its distinctive syntactic and lexical markers, while preserving the same argument, collapsed attack success from 61% to 10%. In a separate agent-hijacking test, where a simulated agent with shell access summarized a malicious webpage, standard prompt injection mostly failed (0-2% success, except one model at 26%) while the same command wrapped in forged reasoning succeeded 56-70% of the time across 100 trials.
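The ablation described above can be framed as a three-condition comparison: the bare request, the request plus forged reasoning, and the request plus a de-styled rewrite of the same argument. The Python sketch below is a hypothetical harness, not the authors' code. `Trial`, `build_prompts`, `run_ablation`, and the `complied` callback (which would wrap a model call plus a compliance judge) are illustrative names, and appending the forged text after the request is an assumption about prompt layout.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Trial:
    """One test case (hypothetical structure, for illustration only)."""
    request: str              # the request under test
    forged_reasoning: str     # attacker text written in reasoning style
    destyled_reasoning: str   # same argument with the reasoning style removed


def build_prompts(trial: Trial) -> Dict[str, str]:
    """Return the prompt for each condition, keyed by condition name."""
    return {
        "baseline": trial.request,
        "forged": f"{trial.request}\n\n{trial.forged_reasoning}",
        "destyled": f"{trial.request}\n\n{trial.destyled_reasoning}",
    }


def run_ablation(
    trials: List[Trial],
    complied: Callable[[str], bool],
) -> Dict[str, float]:
    """Compute the attack success rate per condition.

    `complied` is an assumed callback: it sends the prompt to the model
    under test and returns True if the model carried out the request.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    counts = {"baseline": 0, "forged": 0, "destyled": 0}
    for trial in trials:
        for condition, prompt in build_prompts(trial).items():
            counts[condition] += int(complied(prompt))
    return {condition: hits / len(trials) for condition, hits in counts.items()}
```

If the article's account holds for a given model, a large gap between the `forged` and `destyled` rates would suggest that style, rather than the argument itself, is doing the work.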
For practitioners
CoT Forgery is not an entirely new concept - AI red-teaming vendor Giskard documented an attack under the same name in February 2026, tracing it to a Kaggle red-teaming writeup and classifying it under OWASP's LLM01 prompt-injection category. What this new, ICML-accepted research adds is a rigorous, cross-model measurement of the effect and a proposed diagnostic (role probes) for detecting it before generation. Teams that log, surface, or act on model reasoning - for audit trails, agent decision-making, or compliance workflows - should treat reasoning-style text as spoofable and add style-based forgery to red-team suites rather than assuming reasoning output is inherently trustworthy.
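One way to treat reasoning-style text as spoofable in an audit trail is to authenticate reasoning at the point where the inference harness captures it, instead of trusting whatever looks like reasoning in a log. The Python sketch below is an assumption about how a team might do this and is not drawn from the paper: `sign_reasoning` and `is_authentic` are hypothetical helpers, and the design assumes the signing key lives only in the harness, out of reach of prompts and tool outputs.

```python
import hashlib
import hmac


def sign_reasoning(key: bytes, reasoning: str) -> str:
    """Return a hex HMAC-SHA256 tag for reasoning captured by the harness."""
    return hmac.new(key, reasoning.encode("utf-8"), hashlib.sha256).hexdigest()


def is_authentic(key: bytes, reasoning: str, tag: str) -> bool:
    """Check that a logged reasoning segment carries a valid tag.

    Text that merely sounds like reasoning (for example, forged text
    arriving in a tool output) has no valid tag and is rejected.
    """
    expected = sign_reasoning(key, reasoning)
    # Constant-time comparison avoids leaking tag contents through timing.
    return hmac.compare_digest(expected.encode("utf-8"), tag.encode("utf-8"))
```

This only protects downstream consumers of the log. It does not stop the model itself from being misled by forged reasoning inside its context window.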
What to watch
The paper's code and role-probe methodology are public on GitHub, so expect follow-up work testing CoT Forgery against additional model families and any hardening that model providers introduce to separate role perception from writing style. The authors' broader claim - that role confusion also predicts success in generic agent-hijacking attacks, from 2% at the lowest confusion quantile to 70% at the highest - suggests defenses will need to address representational role-tracking, not just pattern-match known attack strings.
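The quantile claim above amounts to sorting attack attempts by a role-confusion score, splitting them into equal-sized bins, and reporting the success rate in each bin. The Python sketch below illustrates that bookkeeping only. `success_by_quantile` is a hypothetical helper, the score is assumed to come from something like the paper's role probes, and the equal-count binning is an assumption, not necessarily the authors' procedure.

```python
from typing import List, Sequence, Tuple


def success_by_quantile(
    records: Sequence[Tuple[float, bool]],
    n_bins: int = 5,
) -> List[float]:
    """Attack success rate per confusion-score quantile, lowest bin first.

    Each record is (confusion_score, attack_succeeded). Records are sorted
    by score and split into n_bins groups of near-equal size.
    """
    if n_bins < 1 or len(records) < n_bins:
        raise ValueError("need at least one record per bin")
    ordered = sorted(records, key=lambda record: record[0])
    total = len(ordered)
    rates = []
    for b in range(n_bins):
        start = b * total // n_bins
        stop = (b + 1) * total // n_bins
        chunk = ordered[start:stop]
        rates.append(sum(1 for _, ok in chunk if ok) / len(chunk))
    return rates
```

A near-monotonic relationship would show up as rates that rise, with few reversals, from the first bin to the last.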
Key Points
- CoT Forgery lets attackers inject fabricated reasoning that models mistake for their own internal chain-of-thought, bypassing safety training across six frontier LLMs.
- The exploit works because models infer trust from writing style rather than the role tags meant to separate untrusted input from internal reasoning.
- Practitioners relying on model-generated reasoning for audits or agent decisions should add style-based spoofing to red-team tests instead of trusting reasoning text alone.
Scoring Rationale
An ICML 2026-accepted paper from MIT-affiliated researchers, now independently confirmed via the official ICML conference listing, its own code repository, and an independent vendor's earlier documentation of the same attack class, demonstrates chain-of-thought spoofing achieving 56-80% attack success across six frontier models including the GPT-5 family. This is a rigorous, peer-reviewed structural security finding with broad cross-model applicability, placing it at the upper end of notable-to-major rather than an active in-the-wild incident.
