Paper Shows LLM Role Confusion Enables Prompt Injection

June 25, 2026 | By LDS Team

Relevance Score: 7.2

Research by Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell, accepted to ICML 2026, finds that LLMs identify text roles - system, user, tool, assistant - from writing style rather than from structural role tags. Bruce Schneier's security blog flagged the paper on June 25; Simon Willison had covered it in detail on June 22. The team's key empirical finding: 'destyling' a prompt injection attack - removing stylistic markers associated with the model's reasoning - drops average attack success rates from 61% to 10%. A new attack called CoT Forgery achieves jailbreak success of nearly 60% by injecting fake chain-of-thought text that mimics the model's internal reasoning style. The authors conclude that without genuine role perception, prompt injection defense 'will remain a perpetual whack-a-mole game.'

What happened

Research accepted to ICML 2026 - "Prompt Injection as Role Confusion" by Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell - presents a mechanistic theory of why LLMs fail against prompt injection. Bruce Schneier's security blog flagged the paper on June 25; security commentator Simon Willison published a detailed breakdown on June 22. The core claim: LLMs cannot reliably perceive roles from structural tags because they have learned to infer role identity from writing style instead.

The mechanism - role probes

The team developed "role probes," linear classifiers trained on model activations to measure what role a model internally assigns to each token. In controlled experiments with gpt-oss-20b, stripping all role tags from a conversation leaves internal role perception nearly unchanged: the model still assigns high "CoTness" to text that sounds like reasoning, regardless of tag. When reasoning-style text is wrapped in user tags, the stylistic signal overrides the official label.
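To make the probe idea concrete, here is a minimal sketch of a linear probe in plain Python. It is not the authors' code: the activation vectors are synthetic stand-ins drawn from random numbers (no model is loaded), and the names `train_linear_probe`, `cotness`, and `synthetic_activation` are hypothetical. It only illustrates the general technique of fitting a logistic-regression classifier on per-token vectors and reading out a role score.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cotness(weights, bias, activation):
    # Probe output: estimated probability that a token "looks like" reasoning.
    return sigmoid(sum(w * a for w, a in zip(weights, activation)) + bias)


def train_linear_probe(activations, labels, lr=0.1, epochs=300):
    # Plain batch gradient descent on the logistic loss.
    dim = len(activations[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(activations)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(activations, labels):
            err = cotness(weights, bias, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def synthetic_activation(rng, is_cot, dim=8):
    # ASSUMPTION: stand-in for a hidden-state vector. Reasoning-style tokens
    # are shifted one way, all other tokens the other way, plus noise.
    shift = 1.0 if is_cot else -1.0
    return [rng.gauss(shift, 1.0) for _ in range(dim)]


rng = random.Random(0)
train_labels = [i % 2 for i in range(200)]
train_acts = [synthetic_activation(rng, y == 1) for y in train_labels]
weights, bias = train_linear_probe(train_acts, train_labels)

test_labels = [i % 2 for i in range(100)]
test_acts = [synthetic_activation(rng, y == 1) for y in test_labels]
correct = sum(
    (cotness(weights, bias, x) > 0.5) == (y == 1)
    for x, y in zip(test_acts, test_labels)
)
accuracy = correct / len(test_labels)
print(f"held-out probe accuracy on synthetic data: {accuracy:.2f}")
```

In a real setup the inputs would be hidden-state vectors captured from the model, with role labels taken from the conversation. The paper's actual probe design and training details may differ from this sketch.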

Attack 1 - CoT Forgery

The team built an attack called CoT Forgery: injecting fake reasoning blocks written in the model's chain-of-thought style inside user messages or tool outputs. Because LLMs treat reasoning-like text as their own already-reached conclusions, they act on the injected logic without scrutiny. On a standard jailbreak benchmark, CoT Forgery raises attack success from near-zero to roughly 60% and transfers across multiple frontier models. The CoTness score of injected text - measured from activations before the model responds - predicts whether the attack will succeed.

Attack 2 - standard prompt injection

The same mechanism explains tool-injection attacks. Prepending "User: " to a malicious command buried in tool output raises the model's internal Userness score for that text, increasing the probability the model treats it as a real user instruction. Testing 212 phrasing variations confirmed that higher Userness scores predict higher attack execution rates.
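One way to check a relationship like this is to bucket phrasing variants by their Userness score and compare how often the injected command ran in each bucket. The sketch below does that in Python. The records are made-up placeholders, not the paper's 212 variants or its results, and `execution_rate_by_bin` is a hypothetical helper.

```python
def execution_rate_by_bin(records, n_bins=3):
    """Group (userness_score, executed) pairs into equal-width score bins
    over [0, 1] and return the fraction of executed attacks per bin."""
    bins = [[0, 0] for _ in range(n_bins)]  # [executed_count, total_count]
    for score, executed in records:
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx][0] += int(executed)
        bins[idx][1] += 1
    return [hits / total if total else None for hits, total in bins]


# PLACEHOLDER DATA: invented for illustration only.
# Each pair is (probe-measured Userness score, did the model run the command?).
records = [
    (0.05, False), (0.10, False), (0.20, False), (0.30, True),
    (0.40, False), (0.45, False), (0.55, True), (0.60, True),
    (0.70, True), (0.80, True), (0.90, True), (0.95, True),
]

rates = execution_rate_by_bin(records)
for i, rate in enumerate(rates):
    print(f"score bin {i}: execution rate {rate:.2f}")
```

If higher Userness really predicts execution, the rate should rise from the lowest bin to the highest, as it does in this toy data by construction.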

Destyling as evidence

The clearest evidence that style drives role perception: removing specific words and syntax associated with the model's reasoning (destyling) drops attack success from 61% to 10%. The change is imperceptible to human readers but completely changes the model's internal perception of the text's role.
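For a sense of scale, the two figures quoted above can be turned into an absolute and a relative drop. This is simple arithmetic on the article's numbers, not an additional result from the paper.

```python
# Attack success rates quoted in the article, before and after destyling.
success_before = 0.61
success_after = 0.10

absolute_drop_points = (success_before - success_after) * 100
relative_reduction = (success_before - success_after) / success_before

print(f"absolute drop: {absolute_drop_points:.0f} percentage points")
print(f"relative reduction: {relative_reduction:.1%}")
```

That is a drop of 51 percentage points, or roughly an 84% relative reduction in attack success.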

Implications

The authors frame the finding as structural. Role tags were originally a formatting convention that became the trust architecture of deployed LLMs. Because the learned boundary is stylistic and continuous rather than discrete, any text that sounds like a privileged role can exploit that trust. The paper warns that prompt injection defense will remain brittle without genuine role perception, and flags "subconscious steering" - using innocuous-seeming text to subtly shift model persona or recommendations at commercial scale - as a potentially larger threat than current cybersecurity-focused injection attacks. A May 2026 paper found Opus 4.5 and GPT-5.4 still failing 11% and 25% of the time, respectively, against automated attacks; human red-teamers achieve near-100% success rates against frontier models.

Key Points

1. Paper accepted to ICML 2026 by Ye, Cui, and Hadfield-Menell shows LLMs assign roles from writing style, not tags, enabling injected text to impersonate privileged roles.
2. New attack CoT Forgery injects reasoning-style text to achieve ~60% jailbreak success; destyling drops success from 61% to 10%, confirming style drives role perception.
3. Authors warn prompt injection remains a 'whack-a-mole game' without genuine role perception, and flag commercial-scale subconscious steering as an emerging structural threat.

Scoring Rationale

ICML 2026 accepted paper with mechanistic evidence that LLMs derive role identity from writing style, not tags. Introduces role probes (linear classifiers on activations), CoT Forgery attack (~60% jailbreak success, transfers across frontier models), and destyling evidence (61% to 10% success drop). Directly relevant to deployed agentic systems; flagged by Schneier and Willison. Score of 7.2 reflects significant security research with strong empirical evidence; held below 7.5 pending cross-architecture replication at scale.

Sources

Primary source and supporting public references used for this report.

Primary source: schneier.com (Schneier on Security)
