
DeepSeek V4 Matched Frontier Models on Three Benchmarks. Its API Costs One-Twentieth as Much as Claude Opus.

LDS Team · Let's Data Science · 11 min read
V4-Pro hit 93.5 on LiveCodeBench and 80.6% on SWE-bench Verified on Friday, ahead of Gemini 3.1 Pro and within two-tenths of a point of Claude Opus 4.6. Open weights shipped under the MIT License on Hugging Face. Output tokens run $3.48 per million, roughly one-seventh the Claude Opus 4.7 price and a 98% discount to GPT-5.5 Pro.

On Friday, April 24, 2026, the DeepSeek team posted two model checkpoints to Hugging Face and a 600-word announcement on its API docs. The post was titled "DeepSeek-V4 Preview Release." Inside the documentation were links to a 1.6-trillion-parameter mixture-of-experts model, a 284-billion-parameter sibling, and a tech report claiming both could process one million tokens of context.

By Saturday morning, every major benchmark site had posted preliminary scores. By Sunday, three of them were ranking V4-Pro ahead of Google's Gemini 3.1 Pro on coding tasks and within a fraction of a point of Anthropic's Claude Opus 4.6 on real-world software engineering, the model whose release was covered in Claude Opus 4.6: Anthropic Just Dropped Its Most Intelligent Model.

The bigger surprise sat in DeepSeek's pricing table.

V4-Pro charges $1.74 per million input tokens and $3.48 per million output tokens. Claude Opus 4.7 charges $5 for input and $25 for output. GPT-5.5 charges $5 for input and $30 for output. Per output token, that puts V4-Pro at roughly one-seventh the price of Opus 4.7 and one-ninth the price of GPT-5.5, with the gap widening further against premium tiers such as GPT-5.5 Pro.

For data scientists running production inference at meaningful volume, that math is the story.

What V4 Actually Hit

DeepSeek's V4-Pro is a sparse mixture-of-experts model with 1.6 trillion total parameters and 49 billion active per token. V4-Flash is a smaller sibling with 284 billion total and 13 billion active. Both target a one million token context window, the largest publicly documented context for an open-weights model under permissive license.
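
To make the total-versus-active distinction concrete, here is a toy top-k mixture-of-experts layer in Python. The dimensions, routing, and expert count are illustrative only, not DeepSeek's architecture; the point is that every expert's weights count toward total parameters, while each token only pays compute for the k experts it is routed to.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy top-k MoE layer: all experts hold parameters ("total"),
    but each token only runs through k of them ("active")."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for token in range(x.size(0)):
            for slot in range(self.k):
                e = int(idx[token, slot])          # only k experts run per token
                out[token] += weights[token, slot] * self.experts[e](x[token])
        return out

layer = ToyMoELayer()
_ = layer(torch.randn(4, 64))
total = sum(p.numel() for p in layer.parameters())
active = layer.k * sum(p.numel() for p in layer.experts[0].parameters()) \
    + layer.router.weight.numel()
print(f"total params: {total:,}  active per token: {active:,}")
```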

The headline benchmark scores, taken from DeepSeek's tech report and from Vals AI, Arena.ai, and SWE-bench's leaderboards through April 26:

| Benchmark | V4-Pro | Closest Frontier | Notes |
|---|---|---|---|
| LiveCodeBench | 93.5 | Gemini 3.1 Pro at 91.7 | V4-Pro leads open and closed |
| Codeforces (rating) | 3206 | GPT-5.4 at 3168 | Real-world competitive code |
| SWE-bench Verified | 80.6% | Claude Opus 4.6 at 80.8% | 0.2 pt behind |
| SWE-bench Pro | 55.4% | Claude Opus 4.7 at 64.3% | Still a gap on hard agent tasks |
| MMLU-Pro | 87.5% | GPT-5.5 at ~88% | Within margin of error |
| GPQA Diamond | 90.1% | Claude Opus 4.6 at ~90% | Near-frontier reasoning |

The benchmarks where V4-Pro leads are unusually concentrated in coding and reasoning. The benchmarks where it trails are mostly long-horizon agent tasks where Anthropic's Opus line is still the leader.

Vals AI summarized V4 as "overwhelmingly" topping the open-source field in its Vibe Code Benchmark, beating closed-source Gemini 3.1 Pro and producing what its analysts called "about a tenfold performance leap" over V3.2. Arena.ai placed V4-Pro in thinking mode third among open-source coding models and fourteenth overall in its code arena.

The Architecture Story

The V4 jump did not come from scaling parameters alone. DeepSeek's tech report describes three architectural changes that explain how a 1.6T parameter model runs cheaper per token than a 600B parameter dense competitor.

The first change is the attention layer. V4 uses a hybrid of two new mechanisms, Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). The pair targets the long-context regime: at one million tokens, V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache that V3.2 needed at the same context length.

For practitioners, that means long-context inference, the use case where most frontier models fall over, becomes economically viable on V4-Pro at prices an order of magnitude below the alternatives.
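
The tech report's 10% KV-cache figure is easy to put in concrete terms. A back-of-the-envelope footprint calculation, using made-up hyperparameters rather than DeepSeek's published config, shows why compression is the difference between a million-token context fitting on a node or not:

```python
def kv_cache_gib(context_tokens, n_layers, n_kv_heads, head_dim,
                 bytes_per_elem=2, compression=1.0):
    """Rough KV-cache footprint: two tensors (K and V) per layer per token.
    `compression` < 1.0 models a compressed-attention scheme that stores
    only a fraction of the uncompressed cache."""
    raw = 2 * context_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return raw * compression / 1024**3

# Illustrative (made-up) hyperparameters, not DeepSeek's published config.
baseline = kv_cache_gib(1_000_000, n_layers=60, n_kv_heads=16, head_dim=128)
compressed = kv_cache_gib(1_000_000, n_layers=60, n_kv_heads=16, head_dim=128,
                          compression=0.10)  # the ~10% figure from the tech report
print(f"uncompressed: {baseline:.0f} GiB, compressed: {compressed:.0f} GiB")
```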

The second change is Manifold-Constrained Hyper-Connections (mHC), a residual-connection variant that DeepSeek says improves signal stability across very deep layers. The third is a switch to the Muon optimizer, which the team claims gave faster convergence and more stable training behavior than AdamW at this scale.
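
Muon, at least, is documented outside DeepSeek: its core step replaces the raw momentum update for a 2-D weight matrix with an approximately orthogonalized version computed by a few Newton-Schulz iterations. A minimal single-matrix sketch, adapted from the public Muon reference implementation (mHC has no public reference, so it is not sketched here):

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration that pushes G toward the nearest
    # semi-orthogonal matrix; coefficients follow the public Muon reference.
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:                       # keep the Gram matrix small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    # Minimal Muon-style update for one 2-D parameter: accumulate momentum,
    # orthogonalize it, then take a plain SGD-like step along the result.
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    weight.add_(update, alpha=-lr)

# Tiny usage example on a random weight matrix.
w = torch.randn(512, 256)
buf = torch.zeros_like(w)
muon_step(w, torch.randn_like(w), buf)
```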

DeepSeek released the open weights for both V4-Pro and V4-Flash under the MIT License, with the full tech report at huggingface.co/deepseek-ai/DeepSeek-V4-Pro. The license permits commercial use without attribution.

How DeepSeek's API Pricing Compares

| Model | Input ($/M tokens) | Output ($/M tokens) | Open weights? |
|---|---|---|---|
| DeepSeek V4-Flash | 0.14 | 0.28 | Yes (MIT) |
| DeepSeek V4-Pro | 1.74 | 3.48 | Yes (MIT) |
| Claude Sonnet 4.5 | 3.00 | 15.00 | No |
| GPT-5.5 Mini | 0.40 | 1.60 | No |
| Gemini 3.1 Flash | 0.30 | 2.50 | No |
| Claude Opus 4.7 | 5.00 | 25.00 | No |
| GPT-5.5 | 5.00 | 30.00 | No |

V4-Pro is closer in price to a fast model like GPT-5.5 Mini than it is to its actual benchmark peers. V4-Flash undercuts every closed frontier-class flash variant on the table.

For teams running inference workloads in the millions of tokens per day, the implications are direct. A workload that costs $30,000 per month on Claude Opus 4.7 output tokens runs closer to $4,200 on V4-Pro and roughly $340 on V4-Flash, assuming similar output volumes.
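
The arithmetic behind those figures is just the output prices from the table applied to the same token volume; a quick sanity check in Python:

```python
# Monthly output-token cost at the prices in the table above.
PRICE_PER_M_OUTPUT = {
    "claude-opus-4.7": 25.00,
    "deepseek-v4-pro": 3.48,
    "deepseek-v4-flash": 0.28,
}

# Token volume implied by a $30,000/month Opus 4.7 output bill.
monthly_output_tokens = 30_000 / 25.00 * 1_000_000

for model, price in PRICE_PER_M_OUTPUT.items():
    cost = monthly_output_tokens / 1_000_000 * price
    print(f"{model:>18}: ${cost:,.0f}/month")
# claude-opus-4.7: $30,000   deepseek-v4-pro: $4,176   deepseek-v4-flash: $336
```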

How V4 Reached Friday

JANUARY 2025
DeepSeek R1 reset the field
R1 launched at a fraction of GPT-4 training cost and triggered a market rerating of US AI hardware spend.
DECEMBER 2025
V3.2 ships
DeepSeek V3.2 lands as the highest-scoring open MoE model on coding tasks, a position it holds until April 2026.
APRIL 13, 2026
Smuggled-chip story breaks
Reports allege DeepSeek trained its newest models on Nvidia chips routed through third-country buyers, a story covered in earlier LDS reporting.
APRIL 22, 2026
Closed beta seen on Arena.ai
An anonymous model dubbed "speedmaster-pro" begins crushing the open-source code arena. The community correctly guesses it is V4.
APRIL 24, 2026
V4 preview goes public
DeepSeek posts V4-Pro and V4-Flash to Hugging Face under MIT License. API turns on at one-twentieth Opus pricing.

The Reception

VentureBeat called V4 "near state-of-the-art intelligence at one-sixth the cost." MIT Technology Review's Will Douglas Heaven argued the release "closes the gap" with frontier closed models on most measurable axes.

DeepSeek's internal team went further. The company said V4 has become its primary model for agentic coding among employees, with evaluation feedback claiming the experience exceeds Claude Sonnet 4.5 and approaches Opus 4.6 in non-thinking mode.

NVIDIA published a developer post the same Friday explaining how to deploy V4-Pro on Blackwell GPUs and through GPU-accelerated endpoints, a sign that the largest US chip vendor sees V4 as a serious enterprise inference target.

Not everyone is sold. Tech commentator Michael Anti, posting on X, said V4-Flash's actual experience did not surpass the well-developed V3.2 and described the upgrade as disappointing for established users. The launch also showed visible rough edges, including chat instances reportedly self-identifying as V3, confusion about model availability across endpoints, and incomplete third-party benchmark coverage.

There is also the gap on hardest tasks. SWE-bench Pro, the most challenging agent benchmark, still favors Claude Opus 4.7 at 64.3% over V4-Pro at 55.4%, a roughly nine-point spread that matters for production agentic workloads.

The Other Side: What the Cheap Tokens Hide

Pricing this aggressive raises three questions practitioners should consider before swapping providers.

The first is throughput. DeepSeek's API has historically rate-limited at much tighter thresholds than OpenAI or Anthropic, especially during peak hours in Asia-Pacific. Several analysts, including SemiAnalysis founder Dylan Patel, have flagged the V4 launch capacity as "constrained" in the first two weeks.

The second is data residency. DeepSeek's hosted API runs out of mainland China. For US enterprises with regulated data, that is a hard stop. The MIT-licensed weights solve this only if the buyer is willing to self-host, which means owning enough Blackwell or Hopper GPUs to hold a 1.6-trillion-parameter MoE (49 billion active per token) in memory at acceptable latency.

The third is provenance. A separate LDS investigation documented in DeepSeek Trained Its Trillion-Parameter Model on Smuggled Nvidia Chips reported that US officials concluded V4-class training runs were powered by Nvidia chips routed through third-country buyers in apparent violation of export controls. Several US government contractors have already informed staff that DeepSeek-hosted endpoints are off-limits while the issue is reviewed.

The benchmark scores do not change. The procurement implications might.

What It Means for Practitioners

The first practical question is: where does V4 actually save money?

For long-context retrieval, document analysis, and code search workloads, V4-Pro at $3.48 per million output tokens replaces Opus 4.7 at $25 almost line for line on accuracy. The savings are real and immediate.

For agentic coding workloads where the model has to plan, execute, and revise across long horizons, Claude Opus 4.7 still produces better outcomes per task. The price premium buys success rate on the hardest 15-20% of tasks where V4 falls short.

For self-hosted deployment, V4-Flash is the most interesting option. At 284 billion total parameters with 13 billion active, it can run on a single eight-GPU Blackwell node with 80GB memory per GPU, putting it within reach of any enterprise that already runs Llama 4 or Mistral Large at scale.
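
The napkin math behind that claim: weights alone for 284 billion parameters fit comfortably in 8 × 80 GB at FP8 and only just at BF16, leaving headroom (or not) for KV cache and activations. This ignores quantization overhead, parallelism layout, and routing buffers, so treat it as a rough bound rather than a deployment plan.

```python
def weight_footprint_gib(total_params_billion, bytes_per_param):
    # Memory for model weights only, ignoring KV cache and activations.
    return total_params_billion * 1e9 * bytes_per_param / 1024**3

node_memory_gib = 8 * 80  # eight GPUs at 80 GB each

for precision, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
    weights = weight_footprint_gib(284, bytes_per_param)
    headroom = node_memory_gib - weights
    print(f"{precision}: weights ~{weights:.0f} GiB, "
          f"headroom ~{headroom:.0f} GiB for KV cache and activations")
```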

For Hugging Face users, the open weights are available now under MIT, which means downstream fine-tunes will start appearing within days. By mid-May, expect domain-specialized variants from Together AI, Fireworks, DeepInfra, and several research labs.

The Bottom Line

DeepSeek shipped a model that in most public benchmarks lands somewhere between GPT-5.4 and Claude Opus 4.6. It did so on open weights, under an MIT License, at output prices that are 98% lower than GPT-5.5 Pro and roughly 86% lower than Claude Opus 4.7.

For frontier closed labs, the message is unambiguous. The price floor on inference just dropped by an order of magnitude on every workload that does not require the very last benchmark point. The same dynamic showed up two weeks ago when OpenAI shipped GPT-5.5 six weeks after 5.4 and doubled the price. DeepSeek's V4 walks straight into that gap. Anthropic, OpenAI, and Google can now charge premium prices only on the workloads where their models genuinely beat V4. Everywhere else, the price will follow DeepSeek down.

For ML engineers, the message is simpler. Run a side-by-side eval this week. The math has changed.
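
If you want a starting point for that side-by-side, DeepSeek's hosted API speaks the OpenAI-compatible chat format, so one harness can hit both providers. The model identifiers and prompts below are placeholders to adapt, not verified names:

```python
from openai import OpenAI

# Hypothetical side-by-side harness. Model names and the DeepSeek base URL
# are placeholders -- check each provider's docs for the identifiers you use.
PROVIDERS = {
    "deepseek-v4-pro": OpenAI(base_url="https://api.deepseek.com", api_key="..."),
    "gpt-5.5":         OpenAI(api_key="..."),  # default OpenAI endpoint
}

PROMPTS = [
    "Write a SQL query that returns the top 5 customers by 2025 revenue.",
    "Refactor this function to run in O(n log n): ...",
]

def run(client, model, prompt):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    return resp.choices[0].message.content

for prompt in PROMPTS:
    for model, client in PROVIDERS.items():
        print(f"--- {model} ---")
        print(run(client, model, prompt)[:400])
```

Swap in your own task set and score the outputs however you already score your current provider; the point is to measure the quality gap against the price gap on your workload, not on a leaderboard.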

VentureBeat's analysis put it most directly: V4 delivers "near state-of-the-art intelligence at one-sixth the cost." For every workload that does not need the very last percentage point on a benchmark, that math is already settled.

