Skip to content

Anthropic Taught Claude's Agents to Dream. Harvey's Tests Showed a 6x Jump in Completion Rates.

DS
LDS Team
Let's Data Science
11 min
Anthropic launched a feature called dreaming on Wednesday, a scheduled process that lets agents review their own past sessions between jobs. Legal-AI startup Harvey ran the pilot and saw task completion rates climb roughly six times. Multiagent orchestration and a self-grading "outcomes" loop shipped the same day.

On Wednesday morning in San Francisco, Anthropic's Chief Product Officer Ami Vora walked onto the Code with Claude stage and told a room of developers that their AI agents were about to start dreaming.

She meant it almost literally. The new feature, available in research preview on the Claude Platform, is called dreaming. It is a scheduled process that runs between agent sessions, reviews everything an agent did in its last job, pulls patterns out of those sessions, and writes new memory entries that the next session can use. Anthropic compares it to hippocampal memory consolidation, the way a human brain replays the day's events during sleep and decides what to keep.

Legal-AI startup Harvey, one of Anthropic's most visible enterprise customers, had been running dreaming on its own agents before the public launch. According to Anthropic's official announcement, Harvey's task completion rates rose roughly 6x in internal testing once dreaming was turned on. The catch the startup kept hitting before that was small and ordinary: agents kept forgetting filetype quirks and tool-specific workarounds between sessions, so the same legal-drafting jobs failed in the same way over and over. With dreaming, the workarounds stuck.

The Conference That Was Mostly About Memory

Code with Claude San Francisco ran on May 6, the first of three stops on Anthropic's developer-conference tour. London follows on May 19. Tokyo follows on June 10. The keynote slot belonged to Vora, who replaced Mike Krieger as Anthropic's product chief earlier this year. Krieger now co-leads Anthropic Labs.

The day produced a stack of announcements, but the through-line was unmistakable. Anthropic wants Claude agents to do more work without a human watching, and the company has decided the bottleneck is memory.

Dreaming was one of three Managed Agents features that shipped on Wednesday. The others were outcomes, a self-grading loop in which a separate evaluator scores an agent's output against a written rubric and tells it what to fix, and multiagent orchestration, which lets a lead agent fan a job out to specialist subagents running in parallel. Outcomes and multiagent orchestration both went into public beta. Dreaming stayed in research preview, with developers needing to request access.

Vora's framing of the system was direct. "Memory lets each agent capture what it learns as it works," Anthropic wrote in the official launch post. "Dreaming refines that memory between sessions, pulling shared learnings across agents and keeping it up-to-date."

How Dreaming Actually Works

Dreaming does not change Claude's model weights. That part matters. The technique is closer to a structured note-taking ritual than to training.

When a developer enables dreaming on a Managed Agent, Anthropic schedules a background process that periodically reads through the agent's recent sessions and its memory store. The process looks for three kinds of patterns: recurring mistakes the agent keeps making, workflows the agent converges on across different jobs, and preferences that have emerged across a team of agents. It then rewrites the memory store, condensing what is now stale, promoting what is now load-bearing.

Developers can let dreaming update memory automatically, or they can require human review before any change lands.

The behavior matters most for what Anthropic calls long-running work: jobs that span many sessions, days, or operators, where the same agent or fleet of agents keeps returning to the same problem space. The classic failure mode for an LLM agent in that setting is that every session starts fresh, so every session relearns the same recurring quirks. Dreaming attacks that failure mode directly by giving the agent a curated set of notes from its own past.

The published evidence so far is limited to customer testimonials. Anthropic released no independent benchmark with the launch.

What Outcomes And Multiagent Orchestration Add

The second feature, outcomes, addresses a different agent failure mode: the agent finishes a task, but the output does not meet the developer's quality bar, and nobody catches it until a human reviews the result hours later.

With outcomes, the developer writes a rubric, in plain language, describing what a successful output looks like. The agent does its work. Then a separate grader, running in its own context window so its judgment is not contaminated by the agent's reasoning, scores the output against the rubric and tells the agent exactly what to fix. The agent then takes another pass.

Anthropic's internal benchmarks claim outcomes improved task success by up to 10 points over a standard prompting loop. The biggest gains were on the hardest problems. File-generation quality, the company's data showed, rose by 8.4% on .docx outputs and 10.1% on .pptx outputs. Developers can also wire outcomes to a webhook, so the agent runs, the grader signs off, and the developer gets notified only when the output is good enough.

Multiagent orchestration is the third piece. A lead agent breaks a complex job into chunks. It hands each chunk to a specialist subagent with its own model, prompt, and tools. The specialists run in parallel on a shared filesystem, contribute back to the lead agent's context, and remain individually traceable in the Claude Console.

The shape of that workflow is now visible inside Netflix's platform team. According to Anthropic, the team built an analysis agent that processes build logs from hundreds of source repositories. Multiagent orchestration lets the lead agent fan the batch out to subagents that scan in parallel and report back only on the patterns worth acting on. Spiral, a writing tool built by the publication Every, uses a similar structure: a Haiku-based lead agent fields incoming requests and delegates drafting to Opus-based subagents. When a user asks for multiple drafts, the Opus subagents run side by side. The drafts only return to the user if they clear the outcomes rubric scored against Every's editorial principles.

Wisedocs, a document-review startup, reported that reviews now run 50% faster since adopting outcomes for grading.

The Customer Numbers Anthropic Disclosed

CustomerFeature in UseReported Result
HarveyDreaming~6x rise in task completion rates in internal tests
Netflix platform teamMultiagent orchestrationAnalyzes build logs from hundreds of sources in parallel
Spiral by EveryMultiagent orchestration + outcomesLead agent on Haiku, drafting subagents on Opus; drafts gated by outcomes rubric
WisedocsOutcomes50% faster reviews against internal guidelines
Internal Anthropic benchmarksOutcomes+10 pts task success; +8.4% docx; +10.1% pptx

The 6x claim from Harvey is the headline number, but it is also the one Anthropic is publishing without an external benchmark to back it up. Practitioners will want to know whether Harvey's tests are representative of broader agent workflows or specific to the long-form legal-drafting tasks where the company already had a clear pre-dreaming failure mode.

How It Unfolded This Week

APRIL 9, 2026
Claude Managed Agents launches
Anthropic opens Managed Agents, a cloud-hosted runtime for building and deploying Claude-powered agents, to enterprise customers.
APRIL 30, 2026
Claude Security enters public beta
A new product line aimed at SOC teams launches in public beta, expanding the Claude Platform footprint.
MAY 6, 2026
Code with Claude SF: dreaming, outcomes, and multiagent orchestration ship
Ami Vora keynotes the developer conference. Dreaming launches in research preview, outcomes and multiagent orchestration go to public beta.
MAY 7, 2026
Microsoft 365 integration announced
Anthropic ships native Claude integrations for Excel, PowerPoint, Word, and Outlook on the same conference tour.
MAY 19 & JUNE 10, 2026
Conference tour continues
Code with Claude London on May 19; Code with Claude Tokyo on June 10. Same announcements, regional access.

The Skeptical Read

Not everyone buys the framing. Multiple AI engineers writing about the launch noted that dreaming sounds more novel than it is. The technique reduces to scheduled memory pruning and pattern extraction, applied to the LLM agent's persistent notes rather than to anything inside the model itself.

That framing is not wrong. The model weights do not change. The agent is not learning in the strict sense of gradient updates. What the agent gets is a curated set of plain-text notes that summarize what worked, what failed, and what to try next.

The harder question is whether that distinction matters in practice. If an agent that runs every day on the same workflow can avoid the same recurring failure for the eleventh time because dreaming wrote a note last night, the engineer who was paged about the previous ten failures will not care what the technique reduces to.

A second concern, raised by some security researchers, is that giving agents structured persistent memory expands the attack surface for prompt-injection and memory-poisoning attacks. If a malicious input can convince an agent that the wrong instruction is the right one, dreaming may consolidate that wrong instruction into the agent's long-term memory store, where it will be applied to future sessions automatically. Anthropic's documentation flags the issue and recommends human review on memory updates for high-stakes workflows.

The company that built the legal AI startup that pioneered dreaming made the same point in its own framing. Harvey told Anthropic that dreaming worked best when paired with a tight outcomes rubric, so that any drift in memory would be caught by the grader on the next run.

What This Means For Practitioners

For ML engineers and AI engineers building production agents, the practical change is straightforward. Three pieces of plumbing that previously had to be assembled by hand, persistent memory, self-grading, and parallel subagent dispatch, are now first-class primitives on Anthropic's platform.

Teams that were already building agent systems on Claude with bespoke memory layers will need to decide whether to migrate to Managed Agents or keep their own stack. Teams that postponed agent work because of the operational complexity now have a faster on-ramp.

The competitive implication is that OpenAI's Responses API and Google's Gemini agent tooling now have a sharper benchmark to beat. Both companies have shipped agent-memory features in research previews of their own. Neither has paired memory with a self-grading loop and parallel subagent dispatch in a single integrated platform.

Anthropic's launch lands the same week the company closed a $200 billion Google Cloud commitment and unveiled a pre-built agent stack for financial services with Jamie Dimon's public blessing. The product strategy is now visible. The model layer keeps shipping, but the platform layer is where Anthropic is widening its lead. Compare that posture to the Claude Opus 4.7 release in April, where the company shipped a frontier model and immediately pointed to a more powerful unreleased one. The model arms race is no longer the story. The agent runtime is.

The Bottom Line

A scheduled background process that prunes memory and surfaces patterns is not a new idea in machine learning. What is new is shipping it as a first-class platform primitive next to a self-grading loop and parallel subagent dispatch, then pointing at a customer who saw a six-fold rise in task completion. Harvey is one data point, and the broader benchmarks are still Anthropic's own.

The deeper signal is what Anthropic is treating as the bottleneck. The company is not racing OpenAI on raw model capability this week. It is racing on the layer above the model, the layer where agents fail in production for boring reasons that no benchmark captures. As one Anthropic engineer put it on stage: agents do their best work when they know what good looks like.

Sources

Practice interview problems based on real data

1,500+ SQL & Python problems across 15 industry datasets — the exact type of data you work with.

Try 250 free problems
Free Career Roadmaps8 PATHS

Step-by-step roadmaps from zero to job-ready — curated courses, salary data, and the exact learning order that gets you hired.

Explore all career paths