OpenAI Built a Model That Uses a Computer Better Than You Do. It Needed To.

LDS Team
Let's Data Science
GPT-5.4 scores 75% on OSWorld, surpassing human performance at 72.4%, as OpenAI fights to reclaim users lost to the Pentagon deal fallout and the #CancelChatGPT movement.

On Wednesday evening, while most of Silicon Valley was still digesting the week's drama around AI military contracts and user boycotts, OpenAI quietly pushed a button. GPT-5.4 went live across ChatGPT, the API, and Codex simultaneously. No countdown. No keynote. Just a blog post and a new model in the dropdown.

The timing was not accidental.

OpenAI has spent the past two weeks watching 1.5 million users walk out the door. The #CancelChatGPT movement sent ChatGPT uninstalls up 295%. Claude briefly hit number one on the App Store. CEO Sam Altman told staff the Pentagon deal backlash was "really painful," admitting publicly that the whole thing "looked opportunistic and sloppy." GPT-5.4, billed as "our most capable and efficient frontier model for professional work," is OpenAI's answer to all of it.

And on paper, it is a serious answer.

The First AI That Operates a Computer Better Than a Human

The headline number is striking and worth sitting with: 75% on OSWorld-Verified, a benchmark that tests whether an AI agent can actually operate a desktop computer. Open applications, click buttons, fill forms, switch between windows, complete multi-step workflows. The kind of work that every knowledge worker does for eight hours a day.

Human performance on that same benchmark sits at 72.4%.

GPT-5.4 is the first general-purpose large language model to cross that line. Its predecessor, GPT-5.2, scored 47.3%. That is not an incremental improvement. It is a 58% jump in a single generation, and it signals something the industry has been racing toward since Anthropic introduced computer use with Claude Opus 4.6 last year: AI agents that do not just talk about work, but do the work.

The computer-use capability is native, meaning it is baked into the model's architecture rather than bolted on through external tooling. GPT-5.4 processes screenshots, issues mouse commands, types keyboard inputs, and handles complex multi-application workflows autonomously. No special agent framework required. No custom infrastructure.
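As a rough mental model of what "native computer use" means in practice (every name here is hypothetical, not OpenAI's actual API), an agent of this kind reduces to a screenshot-act-observe loop: capture the screen, ask the model for one action, execute it, repeat until the model signals it is finished.

```python
# Hypothetical sketch of a screenshot -> action agent loop.
# `stub_model` is a stand-in, not the real OpenAI client.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "click", "type", or "done"
    payload: dict

def stub_model(screenshot: bytes, goal: str) -> Action:
    # A real model would ground its action in the pixels;
    # this stub terminates immediately, purely for illustration.
    return Action(kind="done", payload={})

def run_agent(goal: str, take_screenshot, execute, model=stub_model, max_steps=50):
    """Screenshot -> model -> execute, until the model signals completion."""
    trace = []
    for _ in range(max_steps):
        action = model(take_screenshot(), goal)
        trace.append(action.kind)
        if action.kind == "done":
            break
        execute(action)  # e.g. synthesize a click or keystrokes
    return trace
```

The point of "native" is that the loop's middle step is the model itself, with no external agent framework deciding when to screenshot or how to translate intent into clicks.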

On WebArena-Verified, which tests browser-based navigation, GPT-5.4 scored 67.3% using both DOM and screenshot-driven interaction, up from GPT-5.2's 65.4%.

Three Models in a Trench Coat

OpenAI shipped GPT-5.4 in three variants, each targeting a different use case.

The standard GPT-5.4 is the everyday model. It handles general queries, coding, analysis, and conversation. It is priced at $2.50 per million input tokens and $15 per million output tokens, a modest bump from GPT-5.2's $1.75 and $14.

GPT-5.4 Thinking is the reasoning variant. Think of it as the chain-of-thought model that works through problems step by step before answering. It scored 92.8% on GPQA Diamond (scientific reasoning) and 73.3% on ARC-AGI-2 (abstract reasoning), compared to GPT-5.2 Pro's 54.2% on the latter.

GPT-5.4 Pro is the heavy hitter, priced accordingly at $30 per million input tokens and $180 per million output tokens. It is built for sustained, high-stakes professional work: investment banking models, legal analysis, multi-hour research tasks. Pro pushes the reasoning scores even higher: 94.4% on GPQA Diamond, 83.3% on ARC-AGI-2, and 38.0% on FrontierMath Tier 4, a benchmark designed to stump professional mathematicians.

All three share a million-token context window, the largest OpenAI has ever offered. Prompts under 272,000 tokens get standard pricing. Go beyond that, and input costs double while output costs jump 1.5x for the full session.

Model              Input Cost (per 1M tokens)   Output Cost (per 1M tokens)   Best For
GPT-5.4            $2.50                        $15.00                        General use, coding, conversation
GPT-5.4 Thinking   $2.50                        $15.00                        Complex reasoning, math, science
GPT-5.4 Pro        $30.00                       $180.00                       Enterprise workflows, finance, law
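The tiered long-context pricing is easy to get wrong when budgeting, so here is a minimal cost sketch based on the figures above. Two caveats are assumptions on my part: the exact boundary behavior at 272,000 tokens, and the reading that the multipliers apply to the whole request.

```python
# Sketch of per-request GPT-5.4 cost under the tiered pricing
# described above. Rates are dollars per million tokens; the
# behavior exactly at the 272,000-token boundary is an assumption.

LONG_CONTEXT_THRESHOLD = 272_000  # tokens

def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 2.50, output_rate: float = 15.00) -> float:
    """Dollar cost of one request at GPT-5.4 standard-tier rates."""
    long_context = input_tokens > LONG_CONTEXT_THRESHOLD
    in_rate = input_rate * (2.0 if long_context else 1.0)    # input doubles
    out_rate = output_rate * (1.5 if long_context else 1.0)  # output jumps 1.5x
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate
```

For example, a 100,000-token prompt with a 10,000-token answer costs $0.40 at standard rates, while a 500,000-token prompt with the same answer length crosses the threshold and costs $2.73.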

The Benchmark Blitz

Beyond computer use, GPT-5.4's numbers paint a picture of broad improvement.

On GDPval, OpenAI's internal benchmark for knowledge work across 44 professional occupations, GPT-5.4 hit 83.0%. GPT-5.2 scored 70.9%. That 12-point gap matters because GDPval tests the kind of work that fills most white-collar days: drafting reports, analyzing data, summarizing documents, building presentations.

When human evaluators compared GPT-5.4's presentation output against earlier models, they preferred the new version 68% of the time.

Error rates dropped meaningfully. Individual claims in GPT-5.4 responses are 33% less likely to be wrong compared to GPT-5.2. Complete answers are 18% less likely to contain any errors at all.

On coding, GPT-5.4 consolidates the capabilities of GPT-5.3 Codex, the dedicated coding model OpenAI shipped in December. SWE-Bench Pro scores hit 57.7%, a modest tick above GPT-5.3 Codex's 56.8%, but the real story is that coding ability now ships inside the general model rather than as a separate product. Developers no longer need to route between different models for different tasks.

The model also introduced what OpenAI calls a "/fast" mode that boosts token generation speed up to 1.5x, addressing the persistent complaint that reasoning models feel slow.

Aug 2025: GPT-5 launches as OpenAI's first post-GPT-4 model. GDPval: 54.2%. The start of the GPT-5 generation.
Dec 2025: GPT-5.2 ships with improved reasoning. OSWorld: 47.3%, GDPval: 70.9%. First major reasoning upgrade.
Dec 2025: GPT-5.3 Codex launches as a dedicated coding model. SWE-Bench Pro: 56.8%. Separate model for code generation.
Mar 5, 2026: GPT-5.4 unifies coding, reasoning, and computer use. OSWorld: 75.0% (surpassing human 72.4%), GDPval: 83.0%. First model to beat humans at operating a computer.

Tool Search Cuts Token Bills by 47%

Buried in the announcement was a feature that will matter more to developers than any benchmark: tool search.

Modern AI applications often define dozens or hundreds of tools (API endpoints, functions, database queries) that a model can call. Until now, every tool's full schema had to be loaded into the prompt on every single request, eating thousands of tokens before the model even read the user's question. For applications with large tool ecosystems, this overhead was brutal.

GPT-5.4's tool search flips the approach. The API receives a lightweight list of available tools, and the model looks up full definitions only when it decides to use one. OpenAI reports a 47% reduction in token consumption for tool-heavy applications.

For teams running thousands of API calls per day, that translates directly to lower bills, even at GPT-5.4's slightly higher per-token pricing.
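To see where the savings come from, here is a conceptual illustration of the difference between eagerly loading every tool schema and a tool-search style lookup. None of this is the actual OpenAI API; the tool definitions and the 4-characters-per-token estimate are invented for illustration.

```python
# Conceptual illustration: eager schema loading vs. tool search.
# Hypothetical tools and a crude token estimate, not the real API.

import json

TOOLS = {
    "get_weather": {
        "summary": "Current weather for a city",
        "schema": {"type": "object",
                   "properties": {"city": {"type": "string"},
                                  "units": {"type": "string", "enum": ["C", "F"]}},
                   "required": ["city"]},
    },
    "query_db": {
        "summary": "Run a read-only SQL query",
        "schema": {"type": "object",
                   "properties": {"sql": {"type": "string"},
                                  "timeout_s": {"type": "number"}},
                   "required": ["sql"]},
    },
}

def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude 4-chars-per-token estimate

def eager_overhead() -> int:
    """Old style: every full schema rides along on every request."""
    return rough_tokens(json.dumps(TOOLS))

def tool_search_overhead(used: str) -> int:
    """New style: send name + one-line summary, fetch one schema on demand."""
    index = {name: spec["summary"] for name, spec in TOOLS.items()}
    return rough_tokens(json.dumps(index)) + rough_tokens(json.dumps(TOOLS[used]["schema"]))
```

Even with only two tools, the deferred lookup carries fewer prompt tokens; with hundreds of tools, where the index grows linearly but only one or two schemas are ever fetched, the gap is what OpenAI's 47% figure is measuring.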

Wall Street Gets Its Own AI

Alongside GPT-5.4, OpenAI launched ChatGPT for Excel in beta, an add-in that brings GPT-5.4 directly into Microsoft Excel workbooks. Users can build financial models, run scenario analysis, and generate outputs using natural language commands inside their spreadsheets.

The timing is pointed. OpenAI simultaneously announced financial data integrations with Moody's, Dow Jones Factiva, MSCI, Third Bridge, and MT Newswire, with FactSet coming soon. The message to Wall Street is clear: stop copy-pasting between ChatGPT and your spreadsheets.

On OpenAI's internal investment banking benchmark, which tests tasks like building three-statement financial models with proper formatting and citations, GPT-5.4 Thinking scored 87.3%. The original GPT-5 scored 43.7% on the same test. That is a doubling of capability in seven months.

The Excel add-in is initially available to Plus, Team, Enterprise, and Edu subscribers in the United States, Canada, and Australia.

The Competitive Scoreboard Gets Messier

GPT-5.4 enters a market where no single model dominates across all dimensions.

On abstract reasoning, Google's Gemini 3.1 Pro still leads with 94.3% on GPQA Diamond, edging GPT-5.4's 92.8% and Anthropic's Claude Opus 4.6 at 91.3%. On ARC-AGI-2, Gemini again tops the chart at 77.1%, ahead of Opus 4.6's 75.2% and GPT-5.4's 73.3%.

But GPT-5.4 owns computer use and knowledge work. No other model touches 75% on OSWorld. And the 83% GDPval score puts it firmly ahead of both competitors on professional task completion.

On coding, Claude Opus 4.6 still holds the crown with 80.8% on SWE-Bench Verified and strong marks on production-grade bug fixing. GPT-5.4's 57.7% on SWE-Bench Pro is respectable, but the gap remains real.

Benchmark            GPT-5.4   Claude Opus 4.6   Gemini 3.1 Pro
OSWorld-Verified     75.0%     N/A               N/A
GPQA Diamond         92.8%     91.3%             94.3%
ARC-AGI-2            73.3%     75.2%             77.1%
SWE-Bench Verified   N/A       80.8%             N/A
GDPval               83.0%     N/A               N/A

The picture that emerges is one of specialization. GPT-5.4 is the best model for autonomous computer operation and professional knowledge work. Claude Opus 4.6 remains the coding leader. Gemini 3.1 Pro offers the strongest pure reasoning at the best price point ($2 per million input tokens, cheapest of the three).

The Cybersecurity Elephant in the Room

GPT-5.4's system card contains a notable first: it is the first general-purpose model that OpenAI has classified as "High Capability" for cybersecurity. That is not a boast. It is a warning label.

The classification means GPT-5.4 is capable enough at offensive cybersecurity tasks that OpenAI built dedicated mitigations directly into the model. A two-tier monitoring system runs in real time: a fast topic classifier identifies whether a query touches cybersecurity territory, and a secondary AI security analyst determines whether the specific response falls within acceptable bounds.

OpenAI trained GPT-5.4 to provide helpful guidance on defensive cybersecurity while refusing operational instructions for malware creation, credential theft, and chained exploitation. The company deployed expanded monitoring, trusted access controls, and asynchronous blocking for high-risk requests.
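OpenAI has not published how the two tiers are implemented; as a purely illustrative sketch of the shape (a cheap first-pass gate so most traffic never pays for the expensive second-pass review), with toy rules invented for this example:

```python
# Illustrative two-tier filter in the shape described above.
# The term list and "analyst" rule are toy stand-ins, not
# anything OpenAI has disclosed.

CYBER_TERMS = {"exploit", "malware", "credential", "payload", "shellcode"}

def fast_topic_classifier(query: str) -> bool:
    """Tier 1: cheap check -- does the query touch cybersecurity at all?"""
    words = set(query.lower().split())
    return bool(words & CYBER_TERMS)

def security_analyst(query: str) -> bool:
    """Tier 2 stand-in: a slower, smarter review. Here, a toy rule that
    allows defensive questions and blocks operational 'build it' asks."""
    q = query.lower()
    return not ("build" in q or "write" in q)

def allow(query: str) -> bool:
    if not fast_topic_classifier(query):
        return True               # most traffic never reaches tier 2
    return security_analyst(query)
```

The design choice worth noting is the asymmetry: the fast classifier can afford false positives because the only cost is a second-pass review, while the second tier is the one that actually has to distinguish defensive guidance from operational instructions.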

This is the tension at the heart of frontier AI development. The same capability that lets GPT-5.4 autonomously operate a computer also makes it a more potent tool for bad actors. OpenAI's response is to ship the capability with guardrails rather than withhold it entirely.

Developers Are Cautiously Impressed

The developer community's reaction has been measured. On forums and social media, the dominant sentiment is not excitement about raw intelligence gains but appreciation for practical improvements.

"The raw logical reasoning does not feel dramatically smarter with each version," wrote developer Lars Nietvelt on DEV Community. "But what does improve is how much better these models get at understanding what you are asking for. That is useful. But it is not the same as becoming more intelligent."

The million-token context window drew genuine enthusiasm. Developers building AI tooling noted that feeding entire codebases, documentation chains, and log files into a single query "unlocks things that were not feasible before."

Multiple developers reported that GPT-5.4 finally resolved the persistent "lazy model" problem, where earlier versions would stall halfway through complex tasks or skip steps. Execution speed reportedly doubled compared to GPT-5.3.

OpenAI researcher Noam Brown pushed back on any narrative of slowing progress. "We see no wall," Brown stated, "and expect AI capabilities to continue to increase dramatically this year."

A Win OpenAI Desperately Needed

Gizmodo's headline captured the subtext that OpenAI could not say out loud: "OpenAI, in Desperate Need of a Win, Launches GPT-5.4."

The context matters. OpenAI's Pentagon partnership triggered the largest user exodus in the company's history. Anthropic's very public refusal to compromise its safety guardrails for military work created a stark contrast that users rewarded with their wallets and their app downloads. Altman acknowledged to staff that the backlash was "really painful" and that the deal "looked opportunistic and sloppy."

GPT-5.4 does not make any of that go away. But it does give OpenAI something concrete to point to: a model that can operate a computer better than a person, that hallucinates less, that costs less per task despite higher per-token pricing, and that comes bundled with the kind of enterprise financial tools that generate real revenue.

GPT-5.2 Thinking will remain available for three months before being retired on June 5, 2026, giving teams time to migrate.

The Bottom Line

GPT-5.4 is a genuinely strong model wrapped in a genuinely complicated moment. The benchmarks are real. The computer-use capability is a legitimate milestone, one that puts OpenAI ahead of Anthropic and Google in the specific race to build AI agents that can operate software autonomously. The financial integrations signal a company pivoting hard toward enterprise revenue.

But technology does not exist in a vacuum. OpenAI is shipping its most capable model at the exact moment its brand is most damaged. The question is whether performance can outrun reputation, whether developers and enterprises care more about OSWorld scores than Pentagon contracts.

Noam Brown says there is no wall. The users who left suggest there might be one that benchmarks cannot measure.
