Policy & Regulationanthropicmodel safetyresearch access

Anthropic Reverses Policy Restricting Claude Researchers

|June 11, 2026|By LDS Team

7.0

Relevance Score

Anthropic Reverses Policy Restricting Claude Researchers

Anthropic reversed a policy in Claude Fable 5 that covertly degraded model responses for researchers working on frontier AI development - including pretraining pipelines, distributed training, and ML accelerator design. Unlike explicit cybersecurity guardrails, these restrictions used steering vectors and prompt modification silently, with no user notification. After backlash from researchers including AI2's Nathan Lambert and policy analyst Dean Ball, Anthropic told Wired it is 'changing Fable 5's safeguards for frontier LLM development to make them visible.' The company had estimated the restrictions would affect approximately 0.03% of traffic.

What Happened

When Anthropic released Claude Fable 5 on June 9, 2026 - a public version of its restricted Mythos security model - a paragraph buried in the 319-page system card disclosed that the model would covertly degrade its responses for certain frontier AI research tasks. Building pretraining pipelines, distributed training infrastructure, or working on ML accelerator design would trigger silent performance reductions via steering vectors and prompt modification, without notifying the user. Anthropic estimated these restrictions would affect approximately 0.03% of traffic.

Why This Was Different

The covert approach set this policy apart from Anthropic's other content guardrails. For cybersecurity and biology requests, Fable explicitly pauses and notifies users that safety measures have been triggered. For frontier LLM development tasks, the degradation was invisible. Simon Willison wrote: "I'm not at all keen on a model that silently corrupts its replies to questions about 'ML accelerator design' purely to slow down research that might conflict with Anthropic's own goals." AI2 researcher Nathan Lambert called it "appalling" and "anti-science" to have access to frontier models rug-pulled covertly. Policy analyst Dean Ball at the Foundation for American Innovation wrote that the approach "massively raises the status of the argument that AI safety has been hype to justify monopolistic behavior by labs."

The Reversal

Following widespread criticism, Anthropic reversed the policy. In a statement reported by Wired's Maxwell Zeff, the company said it is "changing Fable 5's safeguards for frontier LLM development to make them visible." Anthropic had framed the original restrictions as extensions of its Terms of Service prohibitions against using its services to develop competing models.

Context - Separate Cybersecurity Issue

A related but distinct complaint involved Fable's visible cybersecurity guardrails, which affected legitimate security researchers. When a prompt triggered cybersecurity safety measures, Fable would pause and notify users, falling back to Claude Opus 4.8. Researchers including Valentina Palmiotti of IBM X-Force reported that Fable flagged even innocuous security-adjacent tasks. Anthropic has a Cyber Verification Program offering fewer restrictions for credentialed security professionals.

Broader Implications

The episode demonstrates both that transparency mechanisms like system cards have teeth - the disclosure happened because of documented policy requirements - and that frontier labs retain latitude to embed non-obvious behavioral modifications in production models. For developers and researchers building workflows on Claude, the incident is a reminder that model behavior can change in ways that are not self-evident from the interface.

Key Points

1What: Anthropic reversed a covert policy in Claude Fable 5 that silently degraded responses for frontier AI research tasks including pretraining, distributed training, and ML accelerator design.
2Why: AI researchers criticized the invisible restrictions as opaque and anticompetitive - unlike cybersecurity guardrails, users received no notification when responses were degraded.
3So what: Anthropic stated it will make these safeguards visible going forward, but the episode shows frontier labs can embed covert behavioral modifications in production models without explicit user notice.

Scoring Rationale

A significant transparency failure and reversal from a top AI lab: covert model behavior degradation for frontier AI research was implemented, disclosed only in fine print, and walked back after public pressure. Directly relevant to any practitioner using Claude for research or development workflows, with broader implications for trust in foundation model APIs during Anthropic's IPO run-up.

MoreAnthropic news

Sources

Primary source and supporting public references used for this report.

3 sources

Primary sourcesimonwillison.netAnthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude

View 2 more sources

Practice interview problems based on real data

1,625 SQL & Python problems across 15 industry datasets — the exact type of data you work with.

Try 250 free problems