
DeepSeek Made Its Models 85% Faster Without New Chips. The Code Is Free.

LDS Team
Let's Data Science
DeepSeek and Peking University open-sourced DSpark, a speculative decoding framework that speeds per-user generation by 60 to 85 percent and lifts throughput by as much as 661 percent. It runs on the same Nvidia chips US export controls were meant to keep scarce, and the training code ships under an MIT license.

The most consequential AI release of the week was not a new model. It was a way to make existing ones run far faster on the hardware a lab already owns.

On June 27, DeepSeek and researchers at Peking University published DSpark, a framework that speeds up how quickly a model answers a single user by 60 to 85 percent on DeepSeek-V4-Flash. They did not train a larger network. They did not buy a single new graphics card. They changed the way the model writes its answers, then posted the whole thing to GitHub under an MIT license, free for anyone to copy.

For a Chinese lab the United States has spent two years trying to cut off from advanced chips, that combination is the entire story.

The practical question for anyone who pays a cloud bill to serve a model is simple. If the same GPU can now produce answers up to 85 percent faster at the same quality, the cost of every token you generate just dropped, and the queue of users a single server can handle just grew. DeepSeek says the gains are large enough to "enable performance tiers that were previously unattainable, shifting the Pareto frontier of our serving system."

The Bottleneck Was Never the Model. It Was One Word at a Time.

Large language models write the way a person types with one finger. They generate text one token at a time, and each new token has to wait for the previous one to finish. That serial process leaves expensive GPUs sitting idle between steps and makes long answers slow to arrive.

DSpark attacks that bottleneck with a technique called speculative decoding: a small, fast "draft" model guesses several upcoming tokens, and the large, accurate model then checks those guesses in a single batched pass instead of producing each token one step at a time. When the guesses are right, the system skips ahead several tokens at once. When a guess is wrong, the system keeps the tokens accepted up to that point, takes the large model's own token at the failed position, and drafts again from there. The output matches what the big model would have written alone, because the big model still has the final say on every token.
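The draft-then-verify loop is easier to see in code. What follows is a minimal sketch of greedy speculative decoding in general, not DSpark's implementation. The functions `target_next` and `draft_next` are hypothetical toy stand-ins for the large and small models, and tokens are plain integers. A real serving system would score every drafted position in one batched forward pass; the sketch checks them in a loop for readability.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical 'large model': a deterministic toy rule over the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical 'draft model': agrees with the target except at every 4th position."""
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 3 else guess


def plain_decode(prompt: List[Token], n_new: int, target: NextFn) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(prompt: List[Token], n_new: int, k: int,
                       target: NextFn, draft: NextFn) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, number of verify rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        rounds += 1
        # Step 1: the cheap model drafts k tokens, conditioning on its own guesses.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # Step 2: the target checks each drafted position in order.
        accepted: List[Token] = []
        for tok in proposal:
            correct = target(out + accepted)
            if tok == correct:
                accepted.append(tok)
            else:
                # First wrong guess: keep the target's token and discard the rest.
                accepted.append(correct)
                break
        else:
            # Every guess was right, so the verify pass also yields one extra token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + n_new], rounds


if __name__ == "__main__":
    tokens, rounds = speculative_decode([1, 2, 3], 20, 4, target_next, draft_next)
    print(tokens == plain_decode([1, 2, 3], 20, target_next), rounds)
```

Every token that lands in the output is one the target model itself would have chosen, which is why the result matches plain decoding. The saving comes from needing fewer verify rounds than tokens generated.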

Speculative decoding is not new. DeepSeek's contribution is how aggressively it pushes the idea. Instead of proposing one token at a time, the DSpark drafter generates small groups of tokens together using what the paper calls a semi-autoregressive design. A confidence-based scheduler then decides how deeply to verify each guess depending on how busy the GPU is, so the system stops wasting compute double-checking proposals it is already confident about.
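The article does not spell out the scheduler's actual rule, so the snippet below is only one plausible policy, written to illustrate the idea of load-aware verification depth. Everything in it is an assumption: the function name `tokens_to_verify`, the `max_depth` and `min_joint_conf` parameters, and the rule itself, which shrinks the verification budget as GPU load rises and stops where the drafter's cumulative confidence drops below a threshold.

```python
from typing import List


def tokens_to_verify(confidences: List[float], gpu_load: float,
                     max_depth: int = 8, min_joint_conf: float = 0.3) -> int:
    """Hypothetical policy: how many drafted tokens to send to the target model.

    confidences: the drafter's own probability for each drafted token, in order.
    gpu_load:    0.0 (idle) to 1.0 (saturated).
    """
    if not 0.0 <= gpu_load <= 1.0:
        raise ValueError("gpu_load must be between 0.0 and 1.0")
    # A busier GPU gets a smaller verification budget, but never less than one token.
    budget = max(1, round(max_depth * (1.0 - gpu_load)))
    depth = 0
    joint = 1.0
    for conf in confidences[:budget]:
        joint *= conf  # rough estimate that the whole prefix so far is accepted
        if joint < min_joint_conf:
            break
        depth += 1
    # Verify at least one token when any were drafted.
    return min(len(confidences), max(1, depth))


if __name__ == "__main__":
    draft_conf = [0.9, 0.9, 0.9, 0.9]
    print(tokens_to_verify(draft_conf, gpu_load=0.0))   # idle GPU
    print(tokens_to_verify(draft_conf, gpu_load=0.75))  # busy GPU
```

Under this assumed policy, an idle GPU verifies the full four-token draft while a heavily loaded one checks only the first two, trading some potential speedup for lower contention.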

The result is a model that spends less time waiting and more time working.

The Numbers, and What Comes With Them

On DeepSeek-V4-Flash, DSpark raised per-user generation speed by 60 to 85 percent, according to VentureBeat's reporting on the release. On the larger DeepSeek-V4-Pro, the gain was 57 to 78 percent. Measured as raw throughput, the tokens a server pushes out across all users at once, DeepSeek's own charts show improvements reaching 661 percent over its previous serving baseline under live traffic.
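Percentage gains are easy to misread, so a quick conversion helps. The sketch below turns the reported percentages into multipliers and shows what they mean for the wait on a single answer. The 50 tokens-per-second baseline and the 1,000-token answer are illustrative assumptions, not figures from DeepSeek.

```python
def speedup_factor(percent_gain: float) -> float:
    """Convert 'X percent faster' into a multiplier on speed."""
    return 1.0 + percent_gain / 100.0


def seconds_for_answer(tokens: int, base_tokens_per_s: float, percent_gain: float) -> float:
    """Time to generate `tokens` at a baseline speed improved by `percent_gain`."""
    return tokens / (base_tokens_per_s * speedup_factor(percent_gain))


if __name__ == "__main__":
    # Assumed baseline: 50 tokens/s per user, 1,000-token answer.
    for gain in (0, 60, 85):
        print(gain, round(seconds_for_answer(1000, 50.0, gain), 1))
    # The 661 percent throughput figure, expressed as a multiplier.
    print(round(speedup_factor(661), 2))
```

On those assumptions the answer arrives in 20 seconds at baseline, 12.5 seconds at a 60 percent gain, and about 10.8 seconds at 85 percent. In other words, "85 percent faster" cuts the wait by roughly 46 percent, not 85, while a 661 percent throughput improvement corresponds to about 7.6 times the baseline.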

DeepSeek did not stop at its own models. It also released DeepSpec, a full-stack codebase for training and evaluating the small draft models that make speculative decoding work, under the same MIT license. DeepSpec ships with three drafter algorithms and a benchmark suite that covers math, code, and chat.

Component | What It Is | License
DSpark | The new speculative decoding drafter and serving method | MIT
DeepSpec | Training and evaluation codebase for draft models | MIT
DeepSeek-V4-Pro-DSpark | The Pro model packaged with the drafter, on Hugging Face | MIT

The framework includes three draft-model designs that practitioners can train and compare directly: DSpark, the prior DFlash design, and Eagle3, a widely used open speculative-decoding method. Across math benchmarks like GSM8K and MATH-500, coding tests like HumanEval and LiveCodeBench, and chat evaluations, DeepSeek reports that its DSpark drafter accepts more tokens per round than either alternative.

Critically for adoption, the team tested the approach on models it did not build, including Google's open Gemma family and Alibaba's Qwen. The gains held. That matters because it means DSpark is not a trick that works only on DeepSeek's architecture. Any team serving an open-weight model could, in principle, train a drafter and capture similar speedups. For engineers already running quantized models on constrained hardware, the technique stacks on top of the kind of local-inference work that put a 120-billion-parameter model on a laptop.

Why a Software Update Lands in Washington

To understand why a decoding framework counts as geopolitical news, look at what the United States has been doing to DeepSeek.

Since October 2022, US export controls have restricted China's access to Nvidia's most powerful data-center GPUs, the H100 and later the H800 class, treating compute as the chokepoint that would slow Chinese AI. DeepSeek has repeatedly found ways around the squeeze, training its trillion-parameter V4 model on chips it was not supposed to have and shipping an API that undercut Western frontier models by an order of magnitude.

DSpark attacks the chokepoint from a different direction. Export controls limit how many chips a lab can buy. They do nothing about how much work each chip can do. A pure software optimization that wrings 60 to 85 percent more speed out of existing hardware reduces the number of GPUs a lab needs to serve the same demand, which blunts the advantage chips were supposed to provide. The same logic helps the European Union, which also trails the US in data-center buildout and high-end silicon.
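The fleet-size argument comes down to one division. The sketch below assumes, as a simplification, that the per-user speedup carries over one-for-one to tokens per second per GPU; the article reports throughput gains separately, so treat this as a lower-bound illustration. The demand figure and per-GPU rate are hypothetical.

```python
import math


def gpus_needed(demand_tokens_per_s: float, per_gpu_tokens_per_s: float,
                percent_gain: float = 0.0) -> int:
    """GPUs required to serve a fixed demand after a software speedup."""
    effective_rate = per_gpu_tokens_per_s * (1.0 + percent_gain / 100.0)
    return math.ceil(demand_tokens_per_s / effective_rate)


if __name__ == "__main__":
    # Hypothetical workload: 100,000 tokens/s of demand, 1,000 tokens/s per GPU.
    for gain in (0, 60, 85):
        print(gain, gpus_needed(100_000, 1_000, gain))
```

With those made-up numbers, the same demand that needed 100 GPUs fits on 63 at a 60 percent gain and 55 at 85 percent.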

The decision to open-source it under MIT amplifies the effect. DeepSeek is not just helping itself. It is handing every lab and startup outside the US a free recipe for doing more with less, much as the open release of GLM-5.2's weights put frontier-class coding within reach of anyone with a download link.

The Other Side: Faster Chips Can Mean More Chips

The cleanest counterargument comes from economics, not engineering. It is called the Jevons paradox, and The Decoder raised it directly in its coverage: when a resource becomes more efficient to use, total consumption often rises rather than falls.

Cheaper, faster inference per query does cut the chips needed for a fixed amount of work. But freed-up capacity rarely sits idle. It gets absorbed by more users, longer context windows, and new applications that were too expensive to run before. DeepSeek's own language points the same way. The company frames DSpark as unlocking "performance tiers that were previously unattainable," which is a description of doing more, not buying less. Over a long enough horizon, total chip demand could stay flat or even grow.

There is a narrower technical caveat too. Speculative decoding delivers its biggest wins on long, predictable outputs and structured tasks where the draft model guesses well. On short answers or highly unpredictable generation, the speedup shrinks. The headline figure is a ceiling, not a guarantee, and the training pipeline is demanding: DeepSpec's documentation warns that preparing the target cache for even a small drafter can require roughly 38 terabytes of storage.
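A standard back-of-the-envelope model shows why the speedup depends so heavily on how well the drafter guesses. It assumes each drafted token is accepted independently with the same probability, which is a simplification rather than anything measured by DeepSeek. Under that assumption, a round that drafts `draft_len` tokens yields on average (1 - a^(draft_len + 1)) / (1 - a) tokens, where `a` is the acceptance rate.

```python
def expected_tokens_per_round(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per verify round, assuming independent acceptance.

    Counts the accepted draft tokens plus the one token the target model
    always contributes (a correction or a bonus token).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0.0 and 1.0")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    for rate in (0.9, 0.5, 0.0):
        print(rate, round(expected_tokens_per_round(rate, draft_len=4), 2))
```

With four drafted tokens, a 90 percent acceptance rate gives about 4.1 tokens per round, 50 percent gives about 1.9, and a drafter that never guesses right gives exactly one, which is no better than plain decoding before counting the drafter's own cost.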

The Bottom Line

DeepSeek did not announce a smarter model this week. It announced a cheaper one to run, and then refused to keep the method to itself.

That is the part Washington cannot easily counter. Export controls were built on the premise that limiting chips limits AI. DSpark is a reminder that the same chip can be made to do dramatically more work through software alone, and that the people best motivated to find those efficiencies are precisely the ones being denied the hardware. Each percentage point of speed DeepSeek squeezes out is a percentage point of advantage the chip restrictions lose.

The drafters are on GitHub. The Pro model is on Hugging Face. The license costs nothing. For ML engineers, the takeaway is immediate and apolitical: the cost of serving an open model just fell, and the code to capture that is sitting in a public repository. As DeepSeek put it, the frontier that matters is no longer only how big a model can get. It is how much a single GPU can be made to give.
