One of the quotes Anthropic chose to publish in its own report came from an engineer describing how their job had changed. "I started leaning hard into Claudifying about a year ago," the employee wrote. "It's now been about five months since I last wrote any code myself."
That sentence appears inside a piece titled "When AI builds itself," published on June 5 by the Anthropic Institute and co-authored by company co-founder Jack Clark. It is not a manifesto or a product launch. It is a data dump, drawn from public benchmarks and previously unreported numbers from inside the company, and it makes one argument with unusual directness: AI is already speeding up the development of AI, the trend shows no sign of bending, and the world is not ready for where it points.
The destination has a name. Recursive self-improvement is the point at which an AI system can fully, autonomously design and develop its own successor. Anthropic is careful to say it is not there yet and that the outcome is not inevitable. Then it lays out the evidence that the distance left to cover is shrinking fast.
The Numbers Anthropic Put on the Table
The headline figure is about Anthropic's own engineering. As of May 2026, more than 80% of the code merged into Anthropic's production codebase was written by Claude. Before Claude Code launched in research preview in February 2025, that share was in the low single digits. Company leadership has publicly put the number at 90% or higher when counting scripts and experimental code; the 80% figure is the more conservative measure of lines that actually reach production.
The shift shows up in output per person. In the second quarter of 2026, the typical Anthropic engineer was merging eight times as much code per day as in 2024. The curve was flat for the company's first four years, started climbing in 2025 when Claude began running code rather than just suggesting it, and steepened again in 2026 as models began working autonomously over longer stretches.
The rest of the evidence comes from how fast the underlying capability is moving:
| Measure | Then | Now |
|---|---|---|
| Anthropic code written by Claude | low single digits (early 2025) | more than 80% (May 2026) |
| Code merged per engineer per day | baseline (2024) | 8x (Q2 2026) |
| Task length AI completes reliably | about 4 minutes (Claude Opus 3, March 2024) | about 12 hours (Claude Opus 4.6) |
| Speed-up on a fixed code-optimization task | about 3x (Claude Opus 4, May 2025) | about 52x (Mythos Preview, April 2026) |
| Success on open-ended coding tasks | about 26% (late 2025) | 76% (May 2026) |
Each row is its own small shock. The task-length number comes from METR, an independent evaluation group, which finds that the length of work an AI can finish reliably on its own is now doubling roughly every four months, a faster pace than the earlier doubling time of about seven months. By METR's measure, Claude Opus 3 could handle tasks that take a person about four minutes in March 2024. Two years later, Anthropic's Mythos Preview could work, in METR's words, for "at least" 16 hours, "at the upper end of what METR can measure." Standard software-engineering benchmarks like SWE-bench went from low single-digit scores to effectively solved in two years.
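For readers who want to check the doubling-time arithmetic against the endpoints quoted above, here is a minimal sketch. It is not METR's methodology, which fits a trend across many models; it only computes the average doubling time implied by two measurements. The function name `implied_doubling_time` and the 25-month gap (March 2024 to April 2026) are assumptions made for illustration.

```python
import math


def implied_doubling_time(start_minutes, end_minutes, elapsed_months):
    """Average doubling time in months implied by two task-length measurements."""
    if start_minutes <= 0 or end_minutes <= start_minutes or elapsed_months <= 0:
        raise ValueError("need a positive start, a larger end, and a positive elapsed time")
    # Number of times the task length doubled between the two measurements.
    doublings = math.log2(end_minutes / start_minutes)
    return elapsed_months / doublings


# Endpoints quoted in the article: about 4 minutes (Claude Opus 3, March 2024)
# and "at least" 16 hours (Mythos Preview). The 25-month gap is an assumption.
months = implied_doubling_time(4, 16 * 60, 25)
print(f"{months:.1f} months per doubling, averaged over the whole period")
```

A two-point average like this will not match a fitted trend exactly, and because 16 hours is a lower bound, the result is best read as a ceiling on the average doubling time over the period.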
The Capability Curve Keeps Bending the Right Way
Two results in the paper go beyond raw speed and into the thing that actually defines research: judgment.
In April 2026, Anthropic handed Claude-powered agents an open AI-safety problem, roughly whether a weaker model can reliably supervise a stronger one, and left them to solve it. The agents proposed hypotheses, tested them, shared findings across parallel runs, and iterated. Two human researchers working the same problem over about a week recovered roughly 23% of the available performance gap. The agents recovered 97% over 800 cumulative hours, burning about $18,000 in compute. Humans still chose the problem and wrote the scoring rubric, and the result did not transfer cleanly to production-scale models. Within those limits, though, the agents designed every experiment themselves.
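The article does not spell out how "performance gap recovered" was scored. A common definition in weak-to-strong supervision work, assumed here, is the share of the distance between a weak baseline and a strong ceiling that a method closes. The sketch below uses that definition; the function name and all three scores are hypothetical, chosen only so the outputs line up with the 23% and 97% figures.

```python
def gap_recovered(weak_score, strong_score, achieved_score):
    """Fraction of the weak-to-strong performance gap that a method closes.

    Assumed definition: (achieved - weak) / (strong - weak).
    """
    gap = strong_score - weak_score
    if gap <= 0:
        raise ValueError("strong_score must exceed weak_score")
    return (achieved_score - weak_score) / gap


# Hypothetical scores, not taken from the paper.
WEAK, STRONG = 0.60, 0.80
print(f"humans: {gap_recovered(WEAK, STRONG, 0.646):.0%}")
print(f"agents: {gap_recovered(WEAK, STRONG, 0.794):.0%}")
```

Under this definition a result of 100% would mean the weakly supervised model matched the strong ceiling, which is why a jump from 23% to 97% matters more than the raw score difference suggests.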
The same pattern shows up in smaller decisions. Anthropic took real Claude Code sessions where a researcher had taken a wrong turn, showed various models only the work up to that point, and asked what they would do next. A separate Claude that could see how the session actually ended judged the answers. In November 2025, the best model beat the human's next move 51% of the time. By April 2026, that figure had risen to 64%. The company is blunt that these moments were chosen because the human had room to improve, so it is not a clean head-to-head. Anthropic reads the result as an early signal that models are getting better at the chain of judgment calls research is made of.
There is also a maintenance dividend that any engineer will recognize. In April 2026, Claude shipped more than 800 fixes that cut one class of API errors by a factor of 1,000. The engineer overseeing it estimated a human would have needed four years, because the work meant holding an enormous amount of unfamiliar context at once. Anthropic now runs an automated Claude reviewer on every proposed change, and a retrospective found it would have caught about a third of the bugs behind past claude.ai incidents before they reached production. Claude is catching mistakes made by some of the best systems engineers in the world, an echo of the self-checking ability Anthropic showcased when Claude Opus 4.8 started catching its own bugs.
The One Job the Data Says Is Left
Read together, the paper's findings describe the human role narrowing at every step. Writing the code, running the experiment, producing the result: those now cost almost nothing in human time. What remains, for now, is taste. Deciding which problem is worth solving, which result to trust, and when an approach is a dead end is the work Anthropic says its models are still worst at.
That has a concrete consequence for how teams operate, and it is the part practitioners should sit with. Once AI writes code as well as a human and far faster, human review becomes the bottleneck. Anthropic says it has already hit this wall internally: it can generate change faster than people can read it. The constraint on an AI-heavy team stops being how fast you can build and becomes how fast you can verify. The skill that compounds is no longer typing. It is knowing what to ask for and recognizing when the answer is wrong.
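The bottleneck argument can be sketched as a simple queue: when changes arrive faster than reviewers can clear them, unreviewed work piles up without limit. The model below is a deliberately crude illustration, not anything from the paper. The function name and every rate are hypothetical, with the 8x multiplier borrowed from the article's per-engineer output figure.

```python
def review_backlog(changes_per_day, reviews_per_day, days):
    """Return the unreviewed-change backlog at the end of each day."""
    backlog = 0.0
    history = []
    for _ in range(days):
        # New changes arrive, reviewers clear what they can, backlog never goes negative.
        backlog = max(0.0, backlog + changes_per_day - reviews_per_day)
        history.append(backlog)
    return history


# Hypothetical team: reviewers can read 6 changes a day.
before = review_backlog(changes_per_day=5, reviews_per_day=6, days=10)
after = review_backlog(changes_per_day=5 * 8, reviews_per_day=6, days=10)
print("backlog after 10 days at the old rate:", before[-1])
print("backlog after 10 days at 8x the old rate:", after[-1])
```

In this toy setup the backlog stays at zero while generation is below review capacity and grows by the same amount every day once it is above, which is the sense in which verification speed, not build speed, sets the pace.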
This is the same trajectory Clark sketched last month when he put 60% odds on AI building its own successor by 2028. The June 5 paper is the data behind that bet. Anthropic is not the only company seeing the shift, either: the AI coding startup Cognition recently told investors its own agent writes 89% of its code.
The Other Side
Anthropic spends a full section arguing against itself, and the objections are worth taking seriously.
The cleanest one is measurement. Lines of code is a famously bad proxy for value, and the company admits the 8x figure "is almost certainly an overstatement of the true productivity gain." It also cites outside research from METR showing developers tend to overestimate how much AI speeds them up. So the productivity numbers should be read as direction, not magnitude.
The deeper objection is that the curve may be an S, not an exponential. The judgment that separates a competent researcher from a great one might be exactly the capability that does not emerge from scaling up compute and data, in which case progress stalls until someone invents a successor to the Transformer architecture every frontier model still runs on. Even the supply of chips and electricity could bind before intelligence does. Anthropic says it does not find the stall scenario likely, but it lists it.
Then there is the uncomfortable optics question the paper does not raise but its critics will. This is a company finalizing an IPO at a valuation near a trillion dollars, a valuation that rests on the premise that its models will keep getting dramatically more powerful. It is now publishing a paper arguing that this exact trajectory might be dangerous enough to require a coordinated slowdown. A company can believe both things at once. It can also benefit from being the firm that warned everyone first.
The Bottom Line
The paper's real proposal is a way to stop. Anthropic wants the industry, and eventually governments, to build verification systems so that frontier labs could credibly slow or pause development together, each able to confirm the others actually did. It compares the problem to Cold War arms control and concedes it is harder, because a training run is far easier to hide than a missile silo. A pause by one lab alone, it notes, just changes who leads.
The metaphor that stuck, in coverage from CNN and others, was a "brake pedal." As Clark put it, "the world needs to do some thinking and we need to eventually develop some new regulations that allow us to be confident in these systems." Anthropic says it will spend the coming months convening policymakers, researchers, and rival labs to figure out what such a system would even look like, and will publish what comes out.
For the engineers whose jobs this paper is quietly describing, the more honest summary may be the other quote Anthropic chose to print, from an employee watching the work change underneath them. "On days where everything works well, I can't help but think nothing I do matters, everything is automated and better and faster than I ever will be," they wrote. "But then there are days where everything breaks and I don't understand why, and I realize I have no idea what I've been up to anymore."
Sources
- When AI builds itself (The Anthropic Institute, by Marina Favaro and Jack Clark, June 5, 2026)
- Anthropic warns that AI will soon be able to improve itself without human intervention (CNN Business, June 5, 2026)
- Anthropic warns that AI needs a 'brake pedal' (UPI, June 5, 2026)
- Anthropic urges AI industry to develop 'brake pedal' as self-improving systems approach (MacDailyNews, June 5, 2026)
- Measuring AI Ability to Complete Long Tasks (METR)