
Google Dropped Gemma 4 With a New License. The Community Found the Catches in 24 Hours.

LDS Team
Let's Data Science
7 min
Google's new open-source model family ranks #3 on Arena AI, supports 140+ languages, and ships under Apache 2.0 for the first time. But developers testing the 31B model discovered it runs at one-fifth the speed of Alibaba's Qwen 3.5 on the same hardware.

On April 2, a Google researcher named Clement Farabet published a blog post with a title that read more like a dare than an announcement: "Gemma 4: Byte for byte, the most capable open models." Within hours, developers had downloaded the weights from Hugging Face, loaded them onto GPUs, and started running the benchmarks Google didn't highlight.

What they found was more complicated than the press release suggested.

The 31B dense model did rank #3 on the Arena AI text leaderboard with a score of 1452. The 26B mixture-of-experts variant hit #6 with 1441. On paper, Gemma 4 outperformed models 20 times its size. But the first developers to actually run inference on their own hardware noticed something Google's announcement omitted: the MoE model generated text at 11 tokens per second on the same GPU where Alibaba's Qwen 3.5 hit 60+.

The benchmarks told one story. The production experience told another.

The License Change That Matters More Than the Models

Previous Gemma releases carried restrictions that made enterprise lawyers nervous. Commercial deployment required navigating terms that varied by use case. Redistribution was complicated. Fine-tuned derivatives lived in a gray area.

Gemma 4 ships under Apache 2.0. That single change removes the legal friction that kept Gemma out of production systems where Llama and Qwen thrived. Any developer can now build commercial products on Gemma 4, redistribute modified versions, and integrate the models into proprietary pipelines without requesting permission.

Google built Gemma 4 on technology from Gemini 3 Pro, its flagship closed model released in late 2025. The inheritance shows in the architecture. Four sizes cover every deployment scenario: 2B and 4B effective parameters for smartphones and laptops, a 26B MoE that activates only 4B parameters per token for efficiency, and a 31B dense model for servers. Context windows reach 128K tokens on edge models and 256K on the larger variants. All four handle images and video. The smaller models also process audio.
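The efficiency claim behind the MoE variant comes down to simple arithmetic: per-token compute scales with active parameters, not total. A back-of-the-envelope sketch (the ~2 FLOPs per active parameter per generated token figure is a common rule of thumb, not a Gemma-specific number):

```python
def flops_per_token(active_params_billions):
    """Rough decode-time compute: ~2 FLOPs per active parameter per token."""
    return 2 * active_params_billions * 1e9

dense_31b = flops_per_token(31)  # ~6.2e10 FLOPs per token
moe_26b = flops_per_token(4)     # ~8.0e9: only 4B of 26B parameters fire per token
print(f"MoE does ~{dense_31b / moe_26b:.1f}x less compute per token")
```

On paper, that is a nearly 8x compute advantage for the MoE model, which is why the throughput numbers below surprised early testers.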

The 400+ million downloads of earlier Gemma versions prove the demand exists. Apache 2.0 removes the last barrier between demand and deployment.

The Benchmarks Tell a Competitive Story

Gemma 4's numbers are strong. They are not dominant.

| Benchmark | Gemma 4 31B | Qwen 3.5 27B | Winner |
| --- | --- | --- | --- |
| MMLU Pro | 85.2% | ~84% | Gemma 4 |
| AIME 2026 (Math) | 89.2% | ~85% | Gemma 4 |
| LiveCodeBench v6 | 80.0% | ~78% | Gemma 4 |
| GPQA Diamond (Science) | 84.3% | 85.5% | Qwen 3.5 |
| MMMLU (Multilingual) | 88.4% | 85.9% | Gemma 4 |
| Codeforces ELO | 2150 | ~2100 | Gemma 4 |
Gemma 4 wins more categories than it loses. But the margins are thin enough that a single benchmark update could flip the leaderboard. April 2026 is the most crowded month in open-source AI history: Alibaba released Qwen 3.6-Plus on the same day with a 1M token context window. Meta's Llama 4 Scout already offered 10M tokens. Google entered a knife fight, not a coronation.

The Arena AI preference scores add a twist. Humans consistently preferred Gemma 4's responses over competitors even when automated benchmark scores were nearly identical. The 31B's Arena AI ELO of 1452 placed it above models with far more parameters. Something about how Gemma 4 structures its answers resonates with human evaluators in ways that accuracy metrics don't capture.

The Speed Cliff Nobody Expected

Twenty-four hours after release, the complaints started piling up on developer forums.

The 26B MoE model was supposed to be efficient. MoE architectures activate only a fraction of their total parameters on each token, which theoretically means faster inference. In practice, community testers measured 11 tokens per second on GPUs where Qwen 3.5 produced 60+ tokens per second. The dense 31B fared better but still landed at 18-25 tokens per second on dual GPUs.
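Numbers like these are easy to reproduce with a crude harness. A sketch, assuming `generate` is any callable that runs a model and returns how many tokens it produced; the stub below stands in for a real model call:

```python
import time

def tokens_per_second(generate, prompt, runs=3):
    """Average decode throughput over several runs of a generate() callable
    that returns the number of tokens it produced."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        rates.append(n_tokens / (time.perf_counter() - start))
    return sum(rates) / len(rates)

# Stub standing in for model.generate; a real harness would decode here.
def fake_generate(prompt):
    time.sleep(0.05)  # pretend decoding took 50 ms
    return 5          # and produced 5 tokens, i.e. ~100 tok/s

print(round(tokens_per_second(fake_generate, "hello")))
```

The same harness run against two models on the same GPU is all it takes to surface an 11-versus-60 gap.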

Memory told a similar story. On identical hardware, Qwen 3.5's 27B model supported a 190K token context window. Gemma 4 fit far less. The 31B requires 58GB in BF16 precision or 17GB quantized to Q4_0 format, leaving little room for long contexts on consumer GPUs.
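Those memory figures follow from weight count times precision. A rule-of-thumb estimator for the weights alone (real footprints also include KV cache and runtime overhead; the 4.5 bits/weight value is llama.cpp's Q4_0 layout of 4-bit weights plus a per-block scale):

```python
def weight_footprint_gb(params_billions, bits_per_param):
    """Weights-only memory: params x bits / 8, in decimal GB."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

print(round(weight_footprint_gb(31, 16), 1))   # BF16: ~62 GB
print(round(weight_footprint_gb(31, 4.5), 1))  # Q4_0: ~17.4 GB
```

The naive BF16 estimate slightly overshoots the reported 58GB, as rule-of-thumb math tends to; the Q4_0 estimate lands right on the reported ~17GB.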

One community member summarized the findings bluntly on the LMArena Discord:

"Gemma 4 ties with Qwen, if not Qwen slightly ahead. Qwen 3.5 is more compute efficient too."

The speed gap forced an immediate trade-off calculation. Teams that prioritize throughput and memory efficiency still have reason to choose Qwen. Teams that prioritize multilingual quality and human-preference alignment have reason to choose Gemma 4. Neither model won on every axis.

The Multilingual Advantage Changes Calculus for Global Teams

If speed were the only measure, Gemma 4 would be a hard sell. But multilingual performance complicates the picture.

Community testers evaluating German, Arabic, Vietnamese, and French outputs reported that Gemma 4 outperformed Qwen 3.5 consistently in non-English tasks. One researcher called the translation quality "in a tier of its own." The MMMLU win (88.4% vs. 85.9%) confirmed the gap quantitatively.

For teams building products in Southeast Asia, Latin America, Europe, or Africa, this changes the math. A model that runs slower but produces markedly better output in the user's native language may justify the latency penalty. The 140+ language support and 256K context window mean longer documents in any language can fit in a single pass.

A few months ago, a 9B model on a phone beat a 120B cloud model by being in the right place at the right time. Gemma 4's edge models follow that pattern. The 2B and 4B variants run on-device with 4x faster inference and 60% lower battery consumption than Gemma 3. For mobile-first markets where most users don't have access to cloud GPUs, that's the real breakthrough.

Fine-Tuning Hit a Wall on Day One

The speed problems were manageable. The tooling problems were not.

Hugging Face Transformers didn't recognize the Gemma 4 architecture on release day. PEFT (Parameter-Efficient Fine-Tuning) couldn't handle the new Gemma4ClippableLinear layers. A novel mm_token_type_ids field required custom workarounds that most teams didn't have time to build.

Teams that planned to fine-tune Gemma 4 for domain-specific tasks found themselves waiting for upstream library updates. In a competitive landscape where Qwen and Llama models work out of the box with existing tooling, that delay matters. Early adopters reported additional stability issues: infinite loops in some edge cases, jailbreak vulnerabilities, and hard crashes on Mac hardware under sustained load.
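Until upstream libraries catch up, the usual stopgap is to drop inputs an older model class rejects before the forward pass. A minimal sketch (the mm_token_type_ids name comes from the community reports above; the supported-key set is illustrative):

```python
def strip_unsupported(batch, supported_keys):
    """Drop batch entries (e.g. the new mm_token_type_ids field) that an
    older model class would reject as unexpected keyword arguments."""
    return {k: v for k, v in batch.items() if k in supported_keys}

batch = {
    "input_ids": [[1, 2, 3]],
    "attention_mask": [[1, 1, 1]],
    "mm_token_type_ids": [[0, 0, 0]],  # new field older releases choke on
}
clean = strip_unsupported(batch, {"input_ids", "attention_mask"})
print(sorted(clean))  # ['attention_mask', 'input_ids']
```

Workarounds like this keep training loops running, at the cost of silently discarding whatever signal the new field carries.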

Google shipped function calling, structured JSON output, and system instructions for agentic workflows. The features exist. The ecosystem support to use them reliably does not exist yet.
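Until that support matures, teams can at least validate the structured output themselves. A defensive parser sketch; the {"name": ..., "arguments": ...} shape is an assumed convention for illustration, not Gemma 4's documented format:

```python
import json

def parse_tool_call(raw):
    """Parse and minimally validate a model's JSON tool call.
    Assumed shape: {"name": str, "arguments": dict}."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"model emitted invalid JSON: {e}") from e
    if (not isinstance(call, dict)
            or not isinstance(call.get("name"), str)
            or not isinstance(call.get("arguments"), dict)):
        raise ValueError("malformed tool call")
    return call["name"], call["arguments"]

name, args = parse_tool_call('{"name": "get_weather", "arguments": {"city": "Hanoi"}}')
print(name, args)  # get_weather {'city': 'Hanoi'}
```

Guard rails like this are exactly what the ecosystem normally provides for free once a model family is fully supported.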

The Other Side: Benchmarks Aren't Deployment

The skeptical case against Gemma 4 hype is straightforward: models that score well on benchmarks don't always win in production.

China's open-source AI ecosystem has been gaining ground precisely because models like Qwen prioritize practical deployment characteristics over leaderboard positions. Qwen 3.5 runs faster. It uses less memory. Its tooling works. It already has a production user base that trusts it.

Gemma 4 enters a market where "best on paper" isn't enough. The Apache 2.0 license removes a genuine barrier, but it doesn't solve the speed problem, the memory problem, or the fine-tuning compatibility problem. Those are engineering debts that Google will need to pay down over the coming weeks.

The counterargument in Gemma 4's favor: Google has a track record of shipping rough early releases that improve rapidly. Gemma 3 had similar complaints at launch and became a solid choice within months. The Gemini 3 Pro foundation underneath Gemma 4 gives Google a deep well of architectural improvements to draw from. And the Arena AI preference data suggests that something about Gemma 4's output quality resonates with users in ways that raw accuracy scores miss.

The Bottom Line

Google released its most capable open model family into the most competitive open-source landscape that has ever existed. The benchmarks are strong. The license is right. The multilingual performance is genuinely excellent. The speed is genuinely bad.

Developers choosing between Gemma 4, Qwen 3.5, and Llama 4 in April 2026 aren't picking a winner. They're making trade-offs. Multilingual quality versus throughput. Benchmark accuracy versus deployment simplicity. Apache 2.0 freedom versus battle-tested tooling.

The 400+ million downloads of earlier Gemma versions say the audience exists. The 24-hour community findings say the product isn't finished. Google gave developers the license they wanted and the benchmarks they respect. Now it needs to give them the speed, the stability, and the tooling to actually ship.

As one early tester put it: the model is impressive. The experience of running it is not.
