Anthropic Fixes Claude Code Performance Regression

Anthropic identified and fixed three product-layer changes that degraded Claude Code performance for some users. The company traced the regressions to a default reasoning-effort downgrade, a session-caching bug that cleared prior "thinking" every turn, and a system-prompt change that reduced verbosity and harmed code quality. Anthropic says the underlying API and inference layer were not affected and rolled back the changes in a patch (v2.1.116) on April 20, 2026, while resetting usage limits. The episode highlights how UI, agent harnesses, and prompt engineering can cause outsized user-facing regressions even when core model parameters remain unchanged.
What happened
Anthropic investigated complaints that Claude Code had become noticeably worse and found three separate product-layer changes that together produced a perceptible quality drop. The company confirmed the fixes were deployed in patch v2.1.116 on April 20, 2026, and said the API and inference layer were not impacted. "We never intentionally degrade our models," Anthropic wrote in its post-mortem.
Technical details
The three root causes were a change to default settings, a session-caching bug, and an added system prompt. Each affected a different slice of traffic and set of model variants, producing an inconsistent user experience.
- On March 4, Anthropic changed Claude Code's default reasoning effort from high to medium to reduce latency spikes in high mode; the tradeoff reduced reasoning depth for Sonnet 4.6 and Opus 4.6 and was reverted on April 7.
- On March 26, a background change intended to clear older "thinking" state for idle sessions introduced a bug that cleared that state on every turn for the whole session, making the agent forget prior context and repeat itself; it was fixed on April 10.
- On April 16, Anthropic added a system-prompt instruction to reduce verbosity; combined with other prompt edits, it materially harmed coding quality and was reverted on April 20. This change affected Sonnet 4.6, Opus 4.6, and Opus 4.7.
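The session-caching failure in the second bullet is a classic dropped-guard bug. A minimal sketch of the failure mode, using hypothetical names and structure (not Anthropic's actual code):

```python
from dataclasses import dataclass, field
import time

IDLE_THRESHOLD_S = 30 * 60  # hypothetical: clear thinking only after 30 min idle

@dataclass
class Session:
    thinking: list = field(default_factory=list)  # cached reasoning from prior turns
    last_active: float = field(default_factory=time.time)

def on_turn_buggy(session: Session) -> None:
    # Bug: intended as "clear stale thinking for idle sessions", but the
    # idle check is missing, so prior reasoning is wiped on EVERY turn.
    session.thinking.clear()
    session.last_active = time.time()

def on_turn_fixed(session: Session) -> None:
    # Fix: only clear cached thinking when the session has actually been idle.
    if time.time() - session.last_active > IDLE_THRESHOLD_S:
        session.thinking.clear()
    session.last_active = time.time()

s = Session(thinking=["plan: refactor module"])
on_turn_fixed(s)
assert s.thinking == ["plan: refactor module"]  # active session keeps context
on_turn_buggy(s)
assert s.thinking == []  # context lost on every turn -> agent repeats itself
```

The agent-visible symptom (forgetting prior context and repeating itself) follows directly: each turn starts from an empty reasoning cache even mid-conversation.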
Why these changes matter for practitioners
These are product-harness issues rather than model-weight changes. That means a stable inference stack can still produce degraded outputs when upstream wrappers, default parameters, session management, or system prompts are altered. Third-party testers and developers reported measurable drops; VentureBeat cited a BridgeMind benchmark slide showing Opus 4.6 accuracy falling from 83.3% to 68.3%, illustrating how user-facing regressions can be large and sudden when multiple product-layer adjustments interact.
Operational and engineering lessons
The incident exposes several brittle areas in model deployment pipelines. First, default parameter changes (for example, reasoning effort) are a high-leverage knob and should be canaried separately across traffic slices and evals. Second, session hygiene and caching require stricter end-to-end tests; a change meant to improve latency created a persistent correctness bug. Third, system prompts and verbosity constraints need targeted functional tests that include domain-specific evals such as code synthesis and long-form reasoning. Anthropic reset usage limits and said it will change its processes to reduce the chance of similar regressions.
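Canarying a default-parameter change as described above reduces to two pieces: deterministic bucketing and an eval-score gate. The following sketch is illustrative; the bucketing scheme, fraction, and regression threshold are assumptions, not any vendor's actual pipeline:

```python
import hashlib

def assign_arm(user_id: str, canary_fraction: float = 0.05) -> str:
    # Deterministic bucketing: the same user always lands in the same arm,
    # so canary metrics are not confounded by users flipping between arms.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < canary_fraction * 10_000 else "control"

def gate_rollout(control_score: float, canary_score: float,
                 max_regression: float = 0.02) -> bool:
    # Hold the rollout if the canary's eval score regresses more than the
    # allowed margin relative to control on the same eval suite.
    return canary_score >= control_score - max_regression

# A drop on the scale reported for Opus 4.6 (0.833 -> 0.683) fails the gate.
assert gate_rollout(0.833, 0.683) is False
assert gate_rollout(0.833, 0.825) is True
```

The key design choice is gating on domain-specific eval scores (code synthesis, long-form reasoning) rather than latency alone; the March 4 change optimized latency and would have passed a latency-only gate.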
Context and significance
The episode reinforces a recurring theme in production LLM systems: the distinction between model-core stability and product-layer fragility. Competitors and customers watch such incidents closely because they affect trust, SLAs, and benchmark claims. For teams building agents, the takeaway is to instrument and benchmark not only the model API but the full harness, including prompt mutations, default-effort semantics, session state management, and UI-level defaults.
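Instrumenting the full harness, as suggested above, means logging every product-layer input alongside eval results so a score drop can be attributed to a specific prompt or default change. A hypothetical sketch of that fingerprinting idea:

```python
import hashlib
import json
import time

def harness_fingerprint(system_prompt: str, defaults: dict) -> str:
    # Hash everything the product layer contributes to a request, so eval
    # results can be attributed to a specific harness configuration even
    # when the underlying model and API are unchanged.
    blob = json.dumps({"prompt": system_prompt, "defaults": defaults},
                      sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

def record_eval(score: float, system_prompt: str, defaults: dict) -> dict:
    return {
        "ts": time.time(),
        "fingerprint": harness_fingerprint(system_prompt, defaults),
        "score": score,
    }

before = record_eval(0.83, "Be concise.", {"reasoning_effort": "high"})
after = record_eval(0.68, "Be concise.", {"reasoning_effort": "medium"})
assert before["fingerprint"] != after["fingerprint"]  # change visible in logs
```

With fingerprints attached to eval runs, a regression like the ones described here shows up as a score drop correlated with a fingerprint change rather than an unexplained model failure.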
What to watch
Expect Anthropic to publish follow-up process changes and improved canarying and telemetry. Independent benchmarks and developer reports will be the next signal to confirm restoration of prior quality. Also watch whether other vendors adopt stricter guardrails for default-effort knobs and system-prompt rollouts, since these are common vectors for regressions.
Scoring Rationale
This is a notable operational incident: it did not change model weights but exposed high-risk product-layer failure modes that affect developer trust and enterprise deployments. It does not rise to industry-shaking model research news, but the practical implications for productionizing LLMs are significant.