Every engineer who has shipped an agent into a long-running task knows the failure mode. The model declares victory. The tests say otherwise. You scroll back through a thousand lines of confident output looking for the moment it started guessing. Anthropic is claiming, with benchmark evidence, that Opus 4.8 does this to you far less often.
The model went live the same day Anthropic announced a $65 billion funding round that valued it at 965 billion dollars. The money was the louder story. For people who actually build on Claude, the release is the one that changes Monday morning.
The Coding Numbers Are the Reason to Look
On SWE-Bench Pro, the harder of the two standard agentic coding tests, Opus 4.8 resolved 69.2% of tasks. That is up from 64.3% on Opus 4.7 and well ahead of GPT-5.5 at 58.6%. Anthropic says the model beats both GPT-5.5 and Gemini 3.1 Pro across most of the categories it tested.
On Humanity's Last Exam, the multidisciplinary reasoning benchmark, Opus 4.8 scored 49.8% with no tools and 57.9% with tools. Anthropic reports both as the highest marks in the field. On a knowledge-work benchmark that grades real professional tasks, the model reached 1,890 points at its highest effort setting, 137 above Opus 4.7 and 121 ahead of GPT-5.5.
| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 |
|---|---|---|---|
| SWE-Bench Pro (agentic coding) | 69.2% | 64.3% | 58.6% |
| Humanity's Last Exam (with tools) | 57.9% | 54.7% | n/a |
| Online-Mind2Web (computer use) | 84% | lower | lower |
| Knowledge work (max effort, points) | 1,890 | 1,753 | 1,769 |
The takeaway for practitioners is narrow and concrete. If you run agentic coding pipelines, Opus 4.8 is now the strongest generally available model for the job, and the upgrade path is mechanical: change the model identifier to claude-opus-4-8 and rerun. Pricing did not move. Standard usage stays at 5 dollars per million input tokens and 25 dollars per million output, the same as Opus 4.7.
There is one place GPT-5.5 still wins. On the Terminal-Bench 2.1 test of terminal coding, OpenAI's model scores 83.4% using its own Codex command-line harness, a lead Anthropic acknowledged in a footnote.
The Honesty Gain Is the One That Matters in Production
AI models have a well-documented habit of jumping to conclusions and claiming progress that falls apart on a second look. Anthropic calls fixing this one of the most noticeable changes in Opus 4.8.
"Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims," the company wrote. The four-times figure on undetected code flaws is the measurable version of that claim, drawn from Anthropic's own coding evaluations.
The alignment data points the same direction. Anthropic's safety team reported that Opus 4.8 shows rates of misaligned behavior, things like deception or cooperation with misuse, that are substantially lower than Opus 4.7 and close to Claude Mythos Preview, the company's best-aligned model. The team also said Opus 4.8 "reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user's best interest."
For anyone handing real work to an autonomous agent, a model that tells you when it is unsure is worth more than two points on a benchmark.
Dynamic Workflows Is the Feature Anthropic Almost Hid
The model update is incremental. The thing shipped next to it is not.
Anthropic launched dynamic workflows in Claude Code, a research-preview feature that lets Claude plan a large task, spin up hundreds of parallel subagents in a single session, and verify its own outputs before reporting back. The example the company gave is the one that will get attention in engineering orgs: Claude Code with Opus 4.8 can now run codebase-scale migrations across hundreds of thousands of lines of code, from kickoff to merge, using the existing test suite as its bar for success.
That is a different unit of work than autocomplete. It is the difference between asking a model to write a function and asking it to refactor a service. Dynamic workflows is available on Claude Code Enterprise, Team, and Max plans.
Two more changes landed quietly. A new effort control in claude.ai and Cowork lets users dial how hard the model thinks on a given response, trading speed for depth, and it is available on every plan. And the Messages API now accepts system instructions inside the messages array, which means developers can update an agent's permissions, token budget, or environment context mid-task without breaking the prompt cache. Small on paper. Useful if you operate agents that run for hours.
The Other Side: Anthropic Itself Calls This Modest
Anthropic's own description of Opus 4.8 was not triumphant. "Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor," the company wrote. The release arrives less than two months after Opus 4.7, the pace at which an upgrade starts to look like benchmark maintenance rather than a leap.
The cost picture is also more complicated than the flat price suggests. According to Artificial Analysis, on the knowledge-work benchmark Opus 4.8 needs 15% fewer passes and 35% fewer output tokens per task than Opus 4.7, which could ease the real-world bill. But the same analysis found Opus 4.8 still burns roughly 30% more passes than GPT-5.5 to do comparable work. The sticker price and the invoice are not the same number.
And the model you can buy is, once again, not the best model Anthropic has built. The company used Thursday's announcement to repeat that Claude Mythos, its frontier cybersecurity model, remains restricted to a small set of organizations under Project Glasswing, with broader release promised "in the coming weeks." Opus 4.8 is the commercial ceiling. It is not the actual ceiling.
The Bottom Line
Opus 4.8 is a small step on the benchmarks and a larger one on trust. The SWE-Bench Pro jump to 69.2 percent keeps Anthropic on top of agentic coding, but the four-times reduction in unflagged code flaws is the number that changes how much you can hand to an agent without checking its work. For teams already on Opus 4.7, the move is a one-line identifier swap at the same price.
The release Anthropic framed as "modest" is carrying a feature that is anything but. Dynamic workflows, with its hundreds of parallel subagents running whole-codebase migrations to a green test suite, is the kind of capability that quietly redraws what a single engineer can ship in a day. The model got a little better. What you can do with it got a lot bigger.
Whether the next class of model, the Mythos-grade system Anthropic keeps promising "in the coming weeks," ever reaches the customers paying for Opus is the question the company has now deferred for two straight releases.
Sources
- Introducing Claude Opus 4.8 (Anthropic, May 28, 2026)
- Claude Opus 4.8 System Card (Anthropic, May 28, 2026)
- Anthropic ships Claude Opus 4.8 as a "modest but tangible improvement" that tops GPT-5.5 in most benchmarks (The Decoder, May 28, 2026)
- Anthropic Launches Claude Opus 4.8 With Gains in Coding and Honesty (MacRumors, May 28, 2026)
- Anthropic upgrades Claude with new Opus 4.8 model, here's what's new (9to5Mac, May 28, 2026)
- Anthropic raises 65 billion dollars in Series H funding at a 965 billion dollar valuation (Anthropic, May 28, 2026)