On Wednesday morning in San Francisco, Anthropic's Chief Product Officer Ami Vora walked onto the Code with Claude stage and told a room of developers that their AI agents were about to start dreaming.
She meant it almost literally. The new feature, available in research preview on the Claude Platform, is called dreaming. It is a scheduled process that runs between agent sessions, reviews everything an agent did in its last job, pulls patterns out of those sessions, and writes new memory entries that the next session can use. Anthropic compares it to hippocampal memory consolidation, the way a human brain replays the day's events during sleep and decides what to keep.
Legal-AI startup Harvey, one of Anthropic's most visible enterprise customers, had been running dreaming on its own agents before the public launch. According to Anthropic's official announcement, Harvey's task completion rates rose roughly 6x in internal testing once dreaming was turned on. The problem Harvey kept hitting before then was small and ordinary: agents kept forgetting filetype quirks and tool-specific workarounds between sessions, so the same legal-drafting jobs failed in the same way over and over. With dreaming, the workarounds stuck.
The Conference That Was Mostly About Memory
Code with Claude San Francisco ran on May 6, the first of three stops on Anthropic's developer-conference tour. London follows on May 19. Tokyo follows on June 10. The keynote slot belonged to Vora, who replaced Mike Krieger as Anthropic's product chief earlier this year. Krieger now co-leads Anthropic Labs.
The day produced a stack of announcements, but the through-line was unmistakable. Anthropic wants Claude agents to do more work without a human watching, and the company has decided the bottleneck is memory.
Dreaming was one of three Managed Agents features that shipped on Wednesday. The others were outcomes, a self-grading loop in which a separate evaluator scores an agent's output against a written rubric and tells it what to fix, and multiagent orchestration, which lets a lead agent fan a job out to specialist subagents running in parallel. Outcomes and multiagent orchestration both went into public beta. Dreaming stayed in research preview, with developers needing to request access.
The company's framing of the system was direct. "Memory lets each agent capture what it learns as it works," Anthropic wrote in the official launch post. "Dreaming refines that memory between sessions, pulling shared learnings across agents and keeping it up-to-date."
How Dreaming Actually Works
Dreaming does not change Claude's model weights. That part matters. The technique is closer to a structured note-taking ritual than to training.
When a developer enables dreaming on a Managed Agent, Anthropic schedules a background process that periodically reads through the agent's recent sessions and its memory store. The process looks for three kinds of patterns: recurring mistakes the agent keeps making, workflows the agent converges on across different jobs, and preferences that have emerged across a team of agents. It then rewrites the memory store, condensing what is now stale, promoting what is now load-bearing.
Developers can let dreaming update memory automatically, or they can require human review before any change lands.
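Anthropic has not published the implementation, but the described behavior reduces to a familiar consolidation pass. The sketch below is purely illustrative: the log schema, the `min_count` threshold, and the review prompt are all assumptions, not Anthropic's API.

```python
import json
from collections import Counter
from pathlib import Path

def consolidate(session_logs, memory_path, min_count=2, require_review=False):
    """Dreaming-style pass: scan recent sessions, promote recurring
    patterns into the persistent memory store.

    Illustrative only -- the log format and thresholds are assumptions.
    """
    mistakes, workflows = Counter(), Counter()
    for log in session_logs:
        for event in log["events"]:
            if event["type"] == "error":
                mistakes[event["summary"]] += 1
            elif event["type"] == "tool_sequence":
                workflows[" -> ".join(event["tools"])] += 1

    # Promote anything seen in at least `min_count` sessions;
    # everything else is left out of the rewritten store (stale notes drop).
    proposed = {
        "recurring_mistakes": [m for m, n in mistakes.items() if n >= min_count],
        "converged_workflows": [w for w, n in workflows.items() if n >= min_count],
    }

    if require_review:
        # Mirrors the optional human-review gate before any change lands.
        print("Proposed memory update:", json.dumps(proposed, indent=2))
        if input("Apply? [y/N] ").lower() != "y":
            return None

    Path(memory_path).write_text(json.dumps(proposed, indent=2))
    return proposed
```

The next session would then load the rewritten store into its context, which is the whole trick: the notes, not the weights, carry the learning forward.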
The behavior matters most for what Anthropic calls long-running work: jobs that span many sessions, days, or operators, where the same agent or fleet of agents keeps returning to the same problem space. The classic failure mode for an LLM agent in that setting is that every session starts fresh, so every session relearns the same recurring quirks. Dreaming attacks that failure mode directly by giving the agent a curated set of notes from its own past.
The published evidence so far is limited to customer testimonials. Anthropic released no independent benchmark with the launch.
What Outcomes And Multiagent Orchestration Add
The second feature, outcomes, addresses a different agent failure mode: the agent finishes a task, but the output does not meet the developer's quality bar, and nobody catches it until a human reviews the result hours later.
With outcomes, the developer writes a rubric, in plain language, describing what a successful output looks like. The agent does its work. Then a separate grader, running in its own context window so its judgment is not contaminated by the agent's reasoning, scores the output against the rubric and tells the agent exactly what to fix. The agent then takes another pass.
Anthropic's internal benchmarks claim outcomes improved task success by up to 10 points over a standard prompting loop. The biggest gains were on the hardest problems. File-generation quality, the company's data showed, rose by 8.4% on .docx outputs and 10.1% on .pptx outputs. Developers can also wire outcomes to a webhook, so the agent runs, the grader signs off, and the developer gets notified only when the output is good enough.
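The rubric-grade-revise loop is simple enough to sketch in a few lines. Everything here is an assumed shape, not Anthropic's API: `agent` and `grader` stand in for separate model calls, and `notify` stands in for the webhook fired only on success.

```python
def outcomes_loop(task, agent, grader, rubric, max_rounds=3, notify=None):
    """Self-grading loop: a separate grader scores each draft against a
    plain-language rubric and feeds its critique back to the agent.

    Illustrative sketch -- callables stand in for model calls.
    """
    draft = agent(task, feedback=None)
    for _ in range(max_rounds):
        verdict = grader(draft, rubric)  # grader runs in its own context
        if verdict["pass"]:
            if notify:
                notify(draft)            # e.g. a webhook, fired only on success
            return draft
        draft = agent(task, feedback=verdict["fixes"])
    return draft  # best effort after max_rounds
```

The design point the article describes is the isolation: because the grader sees only the draft and the rubric, its judgment is not contaminated by the agent's own reasoning about why the draft should pass.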
Multiagent orchestration is the third piece. A lead agent breaks a complex job into chunks. It hands each chunk to a specialist subagent with its own model, prompt, and tools. The specialists run in parallel on a shared filesystem, contribute back to the lead agent's context, and remain individually traceable in the Claude Console.
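The fan-out shape itself is ordinary parallel dispatch. A minimal sketch, assuming the subagents are independent and their results can be merged by the lead (the `split`, `subagent`, and `merge` callables are placeholders for model calls, not real platform primitives):

```python
from concurrent.futures import ThreadPoolExecutor

def orchestrate(job, split, subagent, merge, max_workers=4):
    """Lead-agent pattern: split a job into chunks, run a specialist
    subagent on each chunk in parallel, merge results back into the
    lead agent's context. Illustrative sketch, not Anthropic's API.
    """
    chunks = split(job)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(subagent, chunks))  # subagents run side by side
    return merge(results)
```

On the actual platform each subagent would carry its own model, prompt, and tools, and each run would remain individually traceable; the sketch only captures the dispatch shape.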
The shape of that workflow is now visible inside Netflix's platform team. According to Anthropic, the team built an analysis agent that processes build logs from hundreds of source repositories. Multiagent orchestration lets the lead agent fan the batch out to subagents that scan in parallel and report back only on the patterns worth acting on. Spiral, a writing tool built by the publication Every, uses a similar structure: a Haiku-based lead agent fields incoming requests and delegates drafting to Opus-based subagents. When a user asks for multiple drafts, the Opus subagents run side by side. The drafts return to the user only if they clear an outcomes rubric scored against Every's editorial principles.
Wisedocs, a document-review startup, reported that its reviews have run 50% faster since it adopted outcomes for grading against internal guidelines.
The Customer Numbers Anthropic Disclosed
| Customer | Feature in Use | Reported Result |
|---|---|---|
| Harvey | Dreaming | ~6x rise in task completion rates in internal tests |
| Netflix platform team | Multiagent orchestration | Analyzes build logs from hundreds of sources in parallel |
| Spiral by Every | Multiagent orchestration + outcomes | Lead agent on Haiku, drafting subagents on Opus; drafts gated by outcomes rubric |
| Wisedocs | Outcomes | 50% faster reviews against internal guidelines |
| Internal Anthropic benchmarks | Outcomes | +10 pts task success; +8.4% docx; +10.1% pptx |
The 6x claim from Harvey is the headline number, but it is also the one Anthropic is publishing without an external benchmark to back it up. Practitioners will want to know whether Harvey's tests are representative of broader agent workflows or specific to the long-form legal-drafting tasks where the company already had a clear pre-dreaming failure mode.
The Skeptical Read
Not everyone buys the framing. Multiple AI engineers writing about the launch noted that dreaming sounds more novel than it is. The technique reduces to scheduled memory pruning and pattern extraction, applied to the LLM agent's persistent notes rather than to anything inside the model itself.
That framing is not wrong. The model weights do not change. The agent is not learning in the strict sense of gradient updates. What the agent gets is a curated set of plain-text notes that summarize what worked, what failed, and what to try next.
The harder question is whether that distinction matters in practice. If an agent that runs every day on the same workflow can avoid the same recurring failure for the eleventh time because dreaming wrote a note last night, the engineer who was paged about the previous ten failures will not care what the technique reduces to.
A second concern, raised by some security researchers, is that giving agents structured persistent memory expands the attack surface for prompt-injection and memory-poisoning attacks. If a malicious input can convince an agent that the wrong instruction is the right one, dreaming may consolidate that wrong instruction into the agent's long-term memory store, where it will be applied to future sessions automatically. Anthropic's documentation flags the issue and recommends human review on memory updates for high-stakes workflows.
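The recommended mitigation, human review of memory updates, amounts to a gate between dreaming's proposals and the live memory store. A minimal sketch of that gate, with an assumed entry format and an `approve` callable standing in for whatever review process a team runs:

```python
def gated_update(memory, proposed_entries, approve):
    """Hold dreaming-style proposed memory entries behind a review gate,
    so a poisoned instruction never lands in long-term memory
    automatically. Illustrative sketch only.
    """
    applied, held = [], []
    for entry in proposed_entries:
        (applied if approve(entry) else held).append(entry)
    memory.extend(applied)   # only approved entries reach the live store
    return applied, held     # held entries wait for a human decision
```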
Harvey, the startup that piloted dreaming before the public launch, made the same point in its own framing. The company told Anthropic that dreaming worked best when paired with a tight outcomes rubric, so that any drift in memory would be caught by the grader on the next run.
What This Means For Practitioners
For ML engineers and AI engineers building production agents, the practical change is straightforward: three pieces of plumbing that previously had to be assembled by hand (persistent memory, self-grading, and parallel subagent dispatch) are now first-class primitives on Anthropic's platform.
Teams that were already building agent systems on Claude with bespoke memory layers will need to decide whether to migrate to Managed Agents or keep their own stack. Teams that postponed agent work because of the operational complexity now have a faster on-ramp.
The competitive implication is that OpenAI's Responses API and Google's Gemini agent tooling now have a sharper benchmark to beat. Both companies have shipped agent-memory features in research previews of their own. Neither has paired memory with a self-grading loop and parallel subagent dispatch in a single integrated platform.
Anthropic's launch lands the same week the company closed a $200 billion Google Cloud commitment and unveiled a pre-built agent stack for financial services with Jamie Dimon's public blessing. The product strategy is now visible. The model layer keeps shipping, but the platform layer is where Anthropic is widening its lead. Compare that posture to the Claude Opus 4.7 release in April, where the company shipped a frontier model and immediately pointed to a more powerful unreleased one. The model arms race is no longer the story. The agent runtime is.
A scheduled background process that prunes memory and surfaces patterns is not a new idea in machine learning. What is new is shipping it as a first-class platform primitive next to a self-grading loop and parallel subagent dispatch, then pointing at a customer who saw a six-fold rise in task completion. Harvey is one data point, and the broader benchmarks are still Anthropic's own.
The deeper signal is what Anthropic is treating as the bottleneck. The company is not racing OpenAI on raw model capability this week. It is racing on the layer above the model, the layer where agents fail in production for boring reasons that no benchmark captures. As one Anthropic engineer put it on stage: agents do their best work when they know what good looks like.
Sources
- New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration (Anthropic Official Blog, May 6, 2026)
- Anthropic introduces "dreaming," a system that lets AI agents learn from their own mistakes (VentureBeat, May 6, 2026)
- Anthropic updates Claude Managed Agents with three new features (9to5Mac, May 7, 2026)
- Anthropic will let its managed agents dream (The New Stack, May 7, 2026)
- Code with Claude San Francisco 2026 (Anthropic Event Page, May 6, 2026)
- Anthropic brings dreaming, outcomes, and multiagent orchestration to Claude agents (Crypto Briefing, May 7, 2026)
- Live blog: Code w/ Claude 2026 (Simon Willison, May 6, 2026)