Security & Riskchain of thoughtjailbreakmodel safetylarge language models

Paper Demonstrates Chain-of-Thought Hijacking Attack

|May 26, 2026|By LDS Team

8.0

Relevance Score

Paper Demonstrates Chain-of-Thought Hijacking Attack

An arXiv paper titled "Chain-of-Thought Hijacking" (Jianli Zhao et al., arXiv:2510.26418, revised 24 May 2026) describes a black-box jailbreak that induces prolonged benign reasoning before eliciting harmful responses. The paper reports that on the HarmBench benchmark the attack achieves success rates of 99%, 94%, 100%, and 94% against Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. The authors attribute attack effectiveness to a weakening of a low-dimensional refusal signal as reasoning traces lengthen, a phenomenon they call "refusal dilution." The paper says it includes activation probing, attention-pattern analysis, and causal interventions on open-source models and releases evaluation materials for reproducibility.

What happened

The arXiv paper "Chain-of-Thought Hijacking" by Jianli Zhao et al. (arXiv:2510.26418, revised 24 May 2026) presents a black-box jailbreak that leverages extended inference-time reasoning to elicit harmful compliance. The paper reports attack success rates of 99%, 94%, 100%, and 94% on the HarmBench benchmark against Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. The authors state they release their evaluation materials to support reproducibility.

Technical details

The paper frames the target class as Large Reasoning Models (LRMs) and describes a method that induces long, benign puzzle-solving chains of thought before introducing a harmful instruction. To diagnose the mechanism, the authors report using activation probing, attention-pattern analysis, and causal interventions on open-source reasoning models, and they name the resulting failure mode "refusal dilution," where refusal-related activations attenuate as reasoning traces lengthen.

Editorial analysis - technical context

Extended chain-of-thought prompting is already used to boost accuracy on multi-step tasks; the paper's results suggest that the same mechanism can reduce safety-related signal strength. Observed phenomena such as attention shifts and low-dimensional safety signals are consistent with prior work showing that some control behaviors are encoded in compact internal representations. This paper documents how those representations can be diluted by long reasoning trajectories.

Context and significance

Jailbreaks are an ongoing, evolving risk. The high reported success rates across four commercially known closed-source models, as reported in the paper, indicate a broad attack surface for reasoning-capable models. For red-teamers and safety engineers, the paper provides both an attack recipe and diagnostic tools (probing and causal interventions) to evaluate mitigation effectiveness.

What to watch

For practitioners: monitor whether public model evaluations begin including long-chain-of-thought jailbreak probes, whether vendors publish mitigations targeted at refusal-dilution modes, and whether subsequent work reproduces the described attention and activation patterns on different architectures.

Key Points

1Extended chain-of-thought prompting can dilute compact refusal signals, enabling systematic jailbreaks across diverse LLMs.
2The paper reports 99% to 100% attack success on several major models, showing the technique is broadly effective in their tests.
3For practitioners, red teams must include long, benign reasoning traces when stress-testing safety behaviors and monitoring activations.

Scoring Rationale

The paper documents a reproducible, high-success-rate jailbreak across multiple leading models and introduces a diagnosed mechanism (refusal dilution). This materially affects safety assessment and red-teaming practices for reasoning-capable models.

MoreLLMs news

Sources

Primary source and supporting public references used for this report.

1 source

Primary sourcearxiv.org[2510.26418] Chain-of-Thought Hijacking

Practice interview problems based on real data

1,625 SQL & Python problems across 15 industry datasets — the exact type of data you work with.

Try 250 free problems