Format Confound Distorts Chain-of-Thought Corruption Studies

According to an arXiv paper by Gabriel Garcia (arXiv:2605.10799), common "corruption" protocols used to evaluate chain-of-thought (CoT) faithfulness systematically confound format with computation: when chains end with explicit answer statements, corruption studies detect where the answer text appears rather than where the model actually computed it. On standard GSM8K chains, the paper reports, removing only the final answer statement collapses suffix sensitivity by about 19x at 3B (N=300, p=0.022). Conflicting-answer injections at 7B reduce causal-consumption (CC) accuracy to near-zero (<=0.02) across five architecture families, with a followed-wrong rate of 0.63-1.00 at 3B-7B and attenuation at larger scales (e.g., 0.300 at Phi-4-14B, ~0.01 at 32B). The paper proposes a three-prerequisite protocol as a minimum standard for corruption-based faithfulness studies.
What happened
According to the arXiv paper by Gabriel Garcia (arXiv:2605.10799), standard corruption studies for chain-of-thought (CoT) faithfulness are systematically confounded by explicit terminal answer statements. The paper reports that, on chains of the form "the answer is X," corruption methods detect the textual location of the answer rather than the site of computation.
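The confound the paper describes hinges on chains that end with an explicit statement like "the answer is X." A minimal sketch of the kind of format control this implies (stripping the terminal answer statement before corrupting) might look like the following; the regex pattern and function name are illustrative assumptions, not the paper's implementation:

```python
import re

# Illustrative format control: remove an explicit terminal answer statement
# of the form "... the answer is X" before running a corruption sweep, so
# that corruption effects cannot simply track the answer text's location.
# The pattern below is an assumption; the paper's exact matcher is not given.
ANSWER_STMT = re.compile(r"\s*the answer is\s+[-\d.,]+\s*$", re.IGNORECASE)

def strip_terminal_answer(chain: str) -> str:
    """Return the chain with any trailing explicit answer statement removed."""
    return ANSWER_STMT.sub("", chain)
```

Chains without an explicit terminal answer statement pass through unchanged, which is what lets this act as a control rather than a transformation of the reasoning itself.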
Technical findings
The paper documents several empirical patterns on standard benchmarks. On GSM8K, removing only the final answer statement collapses suffix sensitivity by approximately 19x at 3B (N=300, p=0.022), per the submission. Conflicting-answer experiments at 7B drive causal-consumption accuracy to near-zero (<=0.02) across five architecture families, and the reported followed-wrong rate spans 0.63-1.00 at 3B-7B, attenuating with scale (for example, 0.300 at Phi-4-14B and ~0.01 at 32B). The authors report within-dataset and within-scale replications (e.g., 9.3x attenuation at 7B, N=76, p=7.8e-3; Qwen3-8B N=299, p=0.004) and a similar recovery pattern on MATH (DeepSeek-R1-7B: 10.9x suffix-survival recovery). Generation-time probes reportedly show early commitment below 5%, while consumption-time behavior strongly follows explicit answer text. The paper proposes a three-prerequisite protocol: question-only control, format characterization, and all-position sweep.
Editorial analysis
This result is a methodological correction rather than a claim about model internals. For researchers using corruption or intervention sweeps, the paper highlights that answer-format artifacts can masquerade as computation-localization signals. Comparable evaluation protocols that omit format controls risk conflating format salience with causal importance.
For practitioners designing faithfulness evaluations or red-team probes, the paper's recommended minimum protocol provides practical checks to distinguish format-driven effects from genuine computational dependencies. Observers should treat past corruption-study conclusions that relied on terminal-answer chains with caution until re-evaluated under the proposed controls.
What to watch
Replication across more architectures, alternative datasets without explicit answer suffixes, and incorporation of the three-prerequisite protocol into benchmark suites will determine how broadly this confound reshapes CoT faithfulness claims.
Scoring Rationale
This paper identifies a concrete, replicable methodological confound that affects many CoT corruption studies, making it notable for researchers and evaluators. It does not present a new model or capability release, so its impact is important but domain-specific.
