Every LLM Writes One Word at a Time. Google's New Model Writes 256 at Once.

LDS Team
Let's Data Science
DiffusionGemma generates text the way image models generate pictures, denoising a 256-token block all at once instead of typing left to right. It hits more than 1,000 tokens per second on a single H100, fits on an 18GB gaming GPU, and ships under Apache 2.0. The catch is that it is not as smart as the model it was built from.

Open up ChatGPT, Claude, or Gemini and watch the cursor. The text arrives one word at a time, left to right, each word chosen only after the one before it is locked in place. That is not a quirk of the interface. It is how almost every large language model since GPT-2 has worked. The model predicts the next token, appends it, then predicts the next, in a strict sequence that cannot be parallelized.

On June 10, Google DeepMind released a model that throws that rule out.

It is called DiffusionGemma, and instead of typing, it prints. The model starts with a canvas of 256 random placeholder tokens and refines them all at once, making repeated passes that lock in the right words and use them as clues to fix the rest, until coherent text emerges from the noise. The technique is borrowed almost directly from image generators like Stable Diffusion, which turn visual static into a photograph through successive denoising steps. DiffusionGemma does the same thing to language.
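To make the refinement loop concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not Google's implementation: the `toy_denoise` function, the `<mask>` placeholder, the four-step schedule, and the random confidence scores are all hypothetical, and the known `target` sentence stands in for what a real model would predict.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, chosen for illustration


def toy_denoise(target, steps=4, seed=0):
    """Toy parallel refinement loop.

    `target` stands in for the model's predictions. A real diffusion model
    would predict every position in one forward pass and keep the tokens it
    is most confident about; here random scores play the role of confidence.
    """
    rng = random.Random(seed)
    canvas = [MASK] * len(target)  # start from a fully masked block
    history = []
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        # Spread the remaining positions evenly over the remaining steps.
        remaining_steps = steps - step
        keep = -(-len(masked) // remaining_steps)  # ceiling division
        scores = {i: rng.random() for i in masked}
        chosen = sorted(masked, key=lambda i: scores[i], reverse=True)[:keep]
        for i in chosen:
            canvas[i] = target[i]  # "lock in" this position
        history.append(list(canvas))
    return canvas, history


sentence = "the quick brown fox jumps over the lazy dog".split()
final, history = toy_denoise(sentence)
for snapshot in history:
    print(" ".join(snapshot))
```

Each printed line is one pass over the whole block, and positions fill in out of order rather than left to right. That ordering is the behavior the typewriter model cannot reproduce.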

The payoff is speed. Google reports up to 4x faster generation on dedicated GPUs, clocking more than 1,000 tokens per second on a single NVIDIA H100 and over 700 tokens per second on a consumer RTX 5090. And because it is a 26-billion-parameter Mixture-of-Experts model that activates only 3.8 billion parameters at inference, the quantized version fits inside 18GB of VRAM, the kind of memory you find on a high-end gaming card. This is one of the first credible signals that the autoregressive design underneath every chatbot you use is not the only way to build a language model at meaningful scale.
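The 18GB figure is easier to judge with some back-of-the-envelope arithmetic. The sketch below estimates weight storage only. It assumes all 26 billion parameters must sit in memory (in a Mixture-of-Experts model the inactive experts typically still have to be loaded) and that every weight uses the same bit width. The helper name `weight_memory_gb` is ours, not part of any release.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Weight storage in gigabytes (10^9 bytes), ignoring all runtime overhead."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count stated in the article

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

At 4 bits the weights alone come to about 13 GB. The rest of the 18GB budget has to cover everything this estimate ignores, such as activations and quantization metadata. The announcement does not break the number down, so treat the split as a guess.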

A Printing Press Instead of a Typewriter

The clearest way to understand DiffusionGemma is through the analogy Google's own research scientists use in the announcement. A standard language model is a typewriter. It strikes one key, waits, strikes the next. DiffusionGemma is a printing press. It stamps an entire block of text in a single motion.

The reason this matters comes down to how a GPU spends its time. When a typical model runs in the cloud, servers batch thousands of user requests together so the hardware is always busy. But when you run that same model locally, for a single user, the chip sits mostly idle, waiting for the next token before it can do anything. Generating word by word leaves a powerful GPU underused, bottlenecked not by raw compute but by memory bandwidth.

DiffusionGemma flips that. By working on all 256 tokens of a block in every forward pass, it hands the processor a large chunk of work at once and keeps it saturated. The model shifts the bottleneck from memory bandwidth to compute, which is exactly the resource a modern GPU has in abundance.
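A rough roofline-style calculation shows why the number of tokens per pass decides which resource runs out first. The sketch assumes a dense model where each pass reads every weight once and spends about two floating-point operations per weight per token. It ignores attention, caches, and Mixture-of-Experts routing. The bandwidth and compute figures are round illustrative numbers, not a datasheet for any particular GPU, and the function names are hypothetical.

```python
def ridge_tokens(bytes_per_param, bandwidth_bytes_s, peak_flops_s):
    """Tokens per pass at which weight-read time equals matmul time.

    Read time  = params * bytes_per_param / bandwidth
    Math time  = 2 * params * tokens / peak_flops
    Setting them equal, the parameter count cancels out.
    """
    return bytes_per_param * peak_flops_s / (2 * bandwidth_bytes_s)


def bottleneck(tokens_per_pass, bytes_per_param, bandwidth_bytes_s, peak_flops_s):
    """Name the limiting resource for a pass over `tokens_per_pass` tokens."""
    ridge = ridge_tokens(bytes_per_param, bandwidth_bytes_s, peak_flops_s)
    return "compute" if tokens_per_pass > ridge else "memory bandwidth"


# Illustrative assumptions only: 8-bit weights, 3 TB/s memory, 1 PFLOP/s compute.
HW = dict(bytes_per_param=1.0, bandwidth_bytes_s=3.0e12, peak_flops_s=1.0e15)

print("ridge point (tokens per pass):", round(ridge_tokens(**HW), 1))
for tokens in (1, 256):
    print(f"{tokens:>3} tokens per pass -> limited by {bottleneck(tokens, **HW)}")
```

Under these assumed numbers a single-token pass is far below the ridge point and a 256-token pass is above it. Different hardware or weight precision moves the ridge, so the crossover is a property of the whole setup rather than of the model alone.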

That architectural choice unlocks a second property that left-to-right models simply cannot have: bidirectional attention. Because every token in the block is generated together, every token can see every other token, not just the ones that came before it. For ordinary prose that is a modest advantage. For non-linear structures it is a real one. The model can close a complicated piece of markdown formatting cleanly, infill code in the middle of a function rather than only at the end, or generate output where a later token constrains an earlier one.
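The difference between the two attention patterns fits in a few lines. This is a generic illustration of causal versus bidirectional masks, not DiffusionGemma's actual attention code, and the helper names are ours. The source does not describe how a block attends to earlier context, so the sketch covers only positions inside one block.

```python
def causal_mask(n):
    """Position i may attend to position j only if j is at or before i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position in the block."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """List the positions that position i is allowed to see."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


n = 4
print("causal, position 1 sees:       ", visible_positions(causal_mask(n), 1))
print("bidirectional, position 1 sees:", visible_positions(bidirectional_mask(n), 1))
```

With the causal mask, position 1 sees only positions 0 and 1. With the bidirectional mask it sees all four, which is what lets a later token constrain an earlier one.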

To prove the point, the team at Unsloth fine-tuned DiffusionGemma to play Sudoku, a task autoregressive models are notoriously bad at because each cell depends on cells that have not been filled in yet. With bidirectional attention, the puzzle became tractable.

The Numbers That Make It a Workstation Model

Here is what the release actually puts in a developer's hands.

Spec: DiffusionGemma
Architecture: 26B Mixture-of-Experts, 3.8B active at inference
Generation method: Discrete text diffusion, 256 tokens per pass
Speed: 1,000+ tokens/sec on H100, 700+ tokens/sec on RTX 5090
Memory footprint: Fits within 18GB VRAM when quantized
License: Apache 2.0 (weights downloadable now)
Runtimes on day onevLLM, Hugging Face Transformers, MLX

The model is built on the Gemma 4 backbone, inheriting that family's intelligence-per-parameter, with a new diffusion head bolted on top and lessons carried over from Google's earlier Gemini Diffusion research. It arrived with the kind of ecosystem support that usually takes weeks to materialize: day-one serving through vLLM with integration help from Red Hat, native Hugging Face Transformers support, and MLX for Apple Silicon. Google also worked with NVIDIA to ship NVFP4 4-bit kernels that let the model run on RTX 5090 and 4090 cards as well as DGX Spark and RTX PRO systems, with llama.cpp support promised soon.

For a small team or a solo developer, that combination is the story. A model that runs locally, generates faster than most cloud APIs, fits on hardware you can buy at retail, and carries a license that lets you modify and deploy it freely is a different tier of capability than was available to that same developer a year ago. It pairs naturally with the recent wave of local-first releases, like the 120-billion-parameter model NVIDIA squeezed onto a laptop, that are quietly moving serious inference off the cloud and onto the desk.

The Honest Caveat Google Printed Itself

Google did not bury the downside, which is unusual enough to note. The company states plainly that DiffusionGemma's overall output quality is lower than standard Gemma 4 on benchmarks including MMLU and coding evaluations. This is an experimental, speed-optimized model, not a quality upgrade. For any application where accuracy is the priority, Google's own recommendation is to keep using the autoregressive Gemma 4.

There is a second limit worth understanding before anyone rewrites their stack around it. The speed advantage is designed for local and low-concurrency use. In high-throughput cloud serving, where autoregressive models can batch requests to saturate the hardware anyway, DiffusionGemma's parallel decoding offers diminishing returns and can actually cost more to serve. The win is strongest at low-to-medium batch sizes on a single accelerator, which is to say on a workstation, not in a data center.
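A toy throughput model shows why batching erodes the advantage. It is a sketch under loud assumptions. The model is treated as dense with 3.8 billion weights touched per pass, and a pass takes as long as the slower of reading the weights and doing the arithmetic. The hardware numbers are round illustrative values, and the 16 denoising passes per block is a hypothetical figure the source does not state. The outputs are not measurements and do not reproduce Google's reported speedups.

```python
def pass_time_s(params, tokens_in_pass, bytes_per_param, bandwidth, peak_flops):
    """Crude pass time: the slower of reading weights and doing the math."""
    read_time = params * bytes_per_param / bandwidth
    math_time = 2 * params * tokens_in_pass / peak_flops
    return max(read_time, math_time)


def autoregressive_tps(batch, **hw):
    """Tokens per second when each pass yields one new token per request."""
    return batch / pass_time_s(tokens_in_pass=batch, **hw)


def diffusion_tps(batch, block=256, steps=16, **hw):
    """Tokens per second when each pass touches a whole block per request
    and a block needs `steps` passes (hypothetical value) to finish."""
    tokens_out_per_pass = batch * block / steps
    return tokens_out_per_pass / pass_time_s(tokens_in_pass=batch * block, **hw)


# Illustrative assumptions only, not real hardware or model measurements.
HW = dict(params=3.8e9, bytes_per_param=1.0, bandwidth=3.0e12, peak_flops=1.0e15)

for batch in (1, 8, 64, 512):
    ratio = diffusion_tps(batch, **HW) / autoregressive_tps(batch, **HW)
    print(f"batch {batch:>3}: diffusion / autoregressive throughput = {ratio:.2f}")
```

In this toy the ratio starts well above 1 for a single user and falls below 1 at large batch sizes. Once batching alone keeps the hardware busy, the repeated passes over every block become extra compute per generated token instead of free parallelism.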

Even the hardware benefit is conditional. Google notes that unified-memory architectures like Apple Silicon Macs, which are often bandwidth-bound rather than compute-bound during inference, may not see the same acceleration over Gemma 4 at all. The printing press only outruns the typewriter when the press has compute to burn.

The Skeptics Have a Point About "First"

Diffusion-based text generation is not new, and anyone presenting DiffusionGemma as a from-nothing breakthrough is overselling it. The research community has explored text diffusion for years, and Google itself shipped Gemini Diffusion research before this. Startups like Inception Labs have pushed commercial diffusion language models, and the academic literature on discrete diffusion for text stretches back well before 2026. What DiffusionGemma changes is not the idea but the packaging: a frontier-lab model family, open weights, a permissive license, and real serving support, applied to diffusion at a scale most prior attempts never reached.

The harder question is whether speed without quality is worth much. A model that is faster but measurably less accurate is a niche tool, not a replacement, and the use cases Google highlights, like in-line editing and code infilling, are real but narrow. The optimistic case rests on a bet: that diffusion text generation will close the quality gap over the next few model generations while keeping its speed advantage. If that happens, the economics of running capable AI locally on consumer hardware change, because the sequential, one-token-at-a-time bottleneck that has always slowed local inference would be gone. If it does not, DiffusionGemma will be remembered as a clever experiment that ran fast and thought slow.

For developers deciding whether to care, some context helps. The open-source model field in 2026 already offers more choices than any team can evaluate, and understanding how tokenization and sequential generation shape model speed explains exactly what DiffusionGemma is trying to escape. The model also sits in the lineage of Google's Gemma 4 open release, whose backbone it borrows.

The Bottom Line

For five years, the answer to "how does a language model generate text" has been the same: one token at a time, in order, no exceptions. DiffusionGemma marks the first time a major lab has open-sourced a credible alternative at scale and handed it to anyone with a gaming GPU and an internet connection. It is faster, it is local, it is free to modify, and it is honestly worse at the things models are usually measured on.

That tradeoff is the entire point. Google is not claiming diffusion beats autoregression today. It is claiming the architecture is worth exploring, and it put 26 billion parameters and a full serving stack behind the claim so the rest of the field can test it. The interesting question is no longer whether text diffusion works. It clearly does. The question is whether the quality gap closes before the novelty wears off.

The typewriter has had a five-year monopoly on how machines write. For the first time, there is a printing press in the room, and the weights are already on Hugging Face.
