A 9B Model on Your Phone Just Beat a 120B Cloud Model

LDS Team
Let's Data Science
Alibaba's Qwen 3.5 Small series scores 81.7 on GPQA Diamond — outperforming OpenAI's 120-billion-parameter open model — while running on a budget Android phone with 6GB of RAM.

On March 2, 2026, Alibaba's Qwen team released four small language models — Qwen3.5-0.8B, Qwen3.5-2B, Qwen3.5-4B, and Qwen3.5-9B — all designed to run locally on phones, laptops, and edge devices without an internet connection. The flagship 9B model then did something that stopped the AI community cold: it scored 81.7 on GPQA Diamond, a graduate-level science reasoning benchmark, beating OpenAI's gpt-oss-120B (80.1) — a model with more than thirteen times the parameters.

A 9-billion-parameter model available for free, running offline on your laptop, outscoring a closed 120-billion-parameter model on PhD-level science questions. That is the state of on-device AI in early 2026.

The Qwen 3.5 Small Series Packs Serious Architecture

The four models share a common foundation rooted in Qwen3.5's architecture, which introduces a hybrid attention design: Gated DeltaNet linear-attention blocks interleaved with full-attention blocks at a 3:1 ratio, dramatically lowering the memory and compute cost of processing long contexts. The 9B model uses 32 layers, a 4096-dimension hidden space, and a 248,320-token vocabulary covering 201 languages and dialects.
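To make the 3:1 ratio concrete, here is a minimal sketch of what that layer layout looks like over the 9B model's 32 layers. Only the ratio and the layer count come from the published specs; the exact interleaving order (three linear blocks, then one full-attention block) is an assumption for illustration.

```python
NUM_LAYERS = 32  # depth of the 9B model, per the released specs

def hybrid_layout(num_layers: int) -> list[str]:
    """Three Gated DeltaNet (linear) blocks for every full-attention block.

    The repeating order is assumed; only the 3:1 ratio is from the specs.
    """
    pattern = ["linear", "linear", "linear", "full"]
    return [pattern[i % len(pattern)] for i in range(num_layers)]

layout = hybrid_layout(NUM_LAYERS)
print(layout.count("linear"), layout.count("full"))  # 24 linear, 8 full
```

Because full attention is quadratic in sequence length while linear attention is not, keeping only a quarter of the layers quadratic is what makes the long context windows below affordable.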

All four models support a 262,144-token native context window — that is roughly 200,000 words, or several full-length novels — extensible to one million tokens using a technique called YaRN (Yet another RoPE extensioN). Every model in the series handles text, images, and video natively in the same architecture.
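In the Hugging Face transformers ecosystem, YaRN extension is typically enabled through a `rope_scaling` entry in the model's config. The fragment below is illustrative only; the field values (a 4x scaling factor over the 262,144-token native window) are assumptions, not Alibaba's published configuration.

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144
  }
}
```

A factor of roughly 4 over the native window is what would stretch the context toward the one-million-token figure quoted above.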

The series also inherits a "thinking mode" from larger Qwen3.5 models, where the model can reason through a problem step by step before producing an answer. Developers can toggle this on or off depending on whether they need deeper reasoning or faster responses.
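Qwen3 models expose this toggle as a soft switch: appending `/think` or `/no_think` to a user turn flips reasoning on or off (transformers users can achieve the same via the `enable_thinking` argument to `apply_chat_template`). Assuming Qwen3.5 keeps the same convention, the toggle can be sketched as:

```python
def make_prompt(user_text: str, thinking: bool) -> str:
    """Append Qwen's soft-switch tag to toggle step-by-step reasoning.

    Assumes Qwen3.5 keeps the /think and /no_think convention from Qwen3.
    """
    suffix = "/think" if thinking else "/no_think"
    return f"{user_text} {suffix}"

fast = make_prompt("What is the capital of France?", thinking=False)
deep = make_prompt("Walk through this physics problem step by step.", thinking=True)
```

Disabling thinking trades reasoning depth for latency, which matters on phone-class hardware where every generated token costs real time.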

Apache 2.0 licensing covers all four models, meaning commercial use is fully permitted with no user-count restrictions.

The Benchmark Numbers Tell a Striking Story

GPQA Diamond is a set of just 198 multiple-choice questions in biology, physics, and chemistry, written by PhD-holding domain experts specifically to be difficult for non-experts, including AI systems. It measures graduate-level scientific reasoning, not general trivia. PhD-level experts score around 65 percent within their own specialty, while skilled non-experts with web access score roughly 34 percent. Scoring 81.7 means Qwen3.5-9B is answering questions that stump most scientists.

Here is how the 9B model compares on key benchmarks:

| Model | Parameters | GPQA Diamond | MMLU-Pro | MMMU-Pro (Vision) | Where It Runs |
| --- | --- | --- | --- | --- | --- |
| Qwen3.5-9B | 9B | 81.7 | 82.5 | 70.1 | Laptop (INT4, ~5GB RAM) |
| gpt-oss-120B | 120B | 80.1 | 80.8 | n/a | Cloud API |
| Qwen3.5-4B | 4B | 76.2 | 79.1 | 66.3 | Phone (6GB RAM) |
| Qwen3.5-2B | 2B | 51.6 | 55.3 | 47.7 | Any modern phone |
| Qwen3.5-0.8B | 0.8B | n/a | n/a | n/a | IoT, embedded devices |
| GPT-5-Nano | n/a | 57.2 | n/a | n/a | Cloud API |
| Gemini 2.5 Flash-Lite | n/a | 59.7 | n/a | n/a | Cloud API |

On video understanding (Video-MME with subtitles), Qwen3.5-9B scores 84.5 against Gemini 2.5 Flash-Lite's 74.6 — a gap of nearly ten points. On MathVision (mathematical visual reasoning), the 9B scores 78.9 compared to 62.2 for GPT-5-Nano.

The 9B also beats Alibaba's previous-generation Qwen3-30B-A3B-Thinking-2507 on GPQA Diamond (81.7 vs. 73.4), MMLU-Pro (82.5 vs. 80.9), and long-document reasoning via LongBench v2 (55.2 vs. 44.8). That predecessor model has been outperformed by a model that fits on a gaming laptop.

The Models Actually Run on Phones

Benchmark scores mean nothing if the hardware requirements are out of reach. Qwen3.5 small models run on consumer hardware that most people already own.

At four-bit quantization (INT4) — a compression technique that reduces model precision to shrink file size without much accuracy loss — the memory footprints drop sharply:

  • 0.8B model: approximately 0.5GB, runs on smartphones and IoT devices
  • 2B model: approximately 1.5GB, runs on any modern phone with 4GB RAM
  • 4B model: approximately 3GB, runs on laptops and M1/M2 Macs
  • 9B model: approximately 5GB, runs on an RTX 3090, RTX 4090, or M2 Pro MacBook
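The footprints above follow directly from the arithmetic of INT4: four bits is half a byte per parameter. A quick weights-only estimate (the quoted figures run slightly higher because the runtime also needs KV-cache and activation buffers):

```python
def int4_weights_gb(params_billions: float) -> float:
    """Weights-only memory estimate: 4 bits = 0.5 bytes per parameter."""
    return params_billions * 0.5

# Quoted footprints are a bit larger than these weights-only numbers
# because inference also allocates KV-cache and activation buffers.
for size in (0.8, 2.0, 4.0, 9.0):
    print(f"{size}B weights at INT4: ~{int4_weights_gb(size):.1f} GB")
```

For the 9B model that is 4.5GB of weights, which is why roughly 5GB of RAM is enough once runtime overhead is added.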

A developer on Hacker News shared a project running Qwen3.5-2B on budget Android phones, using llama.cpp compiled as a native Android library via the NDK. The 2B model produced roughly 8 tokens per second with full vision capability enabled. The app ran in airplane mode; no data left the phone. Community members reported testing the implementation on devices like the Poco X6 Pro, a $300 Android phone.

On Apple hardware, the 0.8B and 2B models run on iPhone via the MLX Swift framework, which uses Apple Silicon's unified memory to share compute between CPU and GPU. The 0.8B model achieves over 22 tokens per second on an iPhone — faster than most people read.

For developers, deployment options include Ollama (one-line install), LM Studio (graphical desktop interface), llama.cpp (maximum control), and vLLM or SGLang for production server deployment. All model weights are on Hugging Face at huggingface.co/Qwen and on Alibaba's ModelScope hub.
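With Ollama, the model becomes a local HTTP service on port 11434. The sketch below builds a request body for Ollama's `/api/generate` endpoint; the model tag `qwen3.5:9b` is a guess at the eventual name (check `ollama list` for what your install actually pulls), and the network call itself is left commented out.

```python
import json

# Hypothetical model tag; verify with `ollama list` on your machine.
MODEL = "qwen3.5:9b"

def build_generate_request(prompt: str, stream: bool = False) -> dict:
    """Build the JSON body for Ollama's local /api/generate endpoint."""
    return {"model": MODEL, "prompt": prompt, "stream": stream}

body = json.dumps(build_generate_request("Summarize this document: ..."))
# To send, with an Ollama server running locally:
# requests.post("http://localhost:11434/api/generate", data=body)
```

Everything stays on localhost, which is the whole point of the privacy argument in the next section.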

On-Device AI Solves Problems That Cloud Models Cannot

Running a capable AI model entirely on your own hardware is not just a technical novelty. It addresses three problems that cloud-based AI cannot solve.

Privacy. When a doctor dictates patient notes, when a lawyer drafts a contract, when a journalist works with confidential sources, sending that text to a remote server is a liability. A model running on-device processes data that never leaves the machine. No API logs, no training data harvesting, no potential breach.

Latency. Cloud model round-trips add 200 to 800 milliseconds per request under normal conditions, more during peak demand. On-device inference is bounded only by local hardware — essential for real-time applications like live transcription, robotics control, or document scanning in the field.

Cost. Self-hosting Qwen3.5-9B costs approximately $0.05 to $0.15 per million tokens in electricity and hardware amortization. Frontier model APIs charge between 10 and 30 dollars per million tokens. For high-volume workloads — document processing, customer support automation, code review pipelines — the economics of local inference become compelling fast.
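Using the midpoints of the two ranges quoted above, the gap is stark for a hypothetical pipeline processing 500 million tokens a month (the volume is an illustrative assumption):

```python
def monthly_cost_usd(tokens_millions: float, rate_per_million: float) -> float:
    """Total monthly cost at a flat per-million-token rate."""
    return tokens_millions * rate_per_million

VOLUME = 500  # million tokens/month; illustrative assumption
local = monthly_cost_usd(VOLUME, 0.10)  # midpoint of the $0.05-$0.15 estimate
api = monthly_cost_usd(VOLUME, 20.0)    # midpoint of the $10-$30 API range
print(f"local ~${local:,.0f}/mo vs API ~${api:,.0f}/mo")
```

At those midpoint rates the API bill is 200 times the self-hosting cost, which is why the economics "become compelling fast" at volume.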

The Qwen3.5 series also enables a new category of offline-capable AI products: apps that work on a flight, in a hospital with strict network policies, in areas with unreliable connectivity, or on devices deliberately air-gapped from the internet.

Qwen's Place in Alibaba's Broader Strategy

Qwen3.5 Small is not an isolated product. It is part of a systematic push by Alibaba to blanket every tier of the compute spectrum with competitive open-weight models.

The full Qwen3.5 family spans from the 0.8B small model to a 397-billion-parameter mixture-of-experts flagship. The strategy mirrors what Meta accomplished with Llama — build developer trust and ecosystem adoption with free, open-weight models, then monetize through Alibaba Cloud API services for users who want managed hosting.

Alibaba's Qwen3.5 release documentation describes the training approach as "scalable RL on million-agent environments," meaning reinforcement learning was applied at a scale that previous model generations could not reach. This training method, rather than brute-force parameter scaling, accounts for much of the small models' outperformance relative to their size.

Where the 9B Falls Short

The GPQA Diamond score is real. But the XDA Developers analysis of Qwen3.5-9B makes a point worth hearing: "the kind of work most people do with language models — writing, summarizing, coding, brainstorming — doesn't look anything like picking option F out of ten choices on a physics problem."

Benchmarks measure specific capabilities. On real-world tasks where context, nuance, and extended reasoning chains matter, gpt-oss-120B still outperforms in documented tests. Qwen3.5-9B wins where it was trained to win: academic knowledge, multilingual comprehension, and structured document understanding.

Independent document processing benchmarks expose specific weaknesses. On table extraction (GrITS), Qwen3.5-9B scores 76.6 versus 96.4 for Gemini 3.1 Pro and 96.3 for Claude Sonnet 4.6, a gap of nearly 20 points. The identical 76.6 score on both the 4B and 9B variants suggests a structural training limitation rather than a capacity issue that more parameters would fix. Handwriting recognition, at 65.5, lags the category leader by 17 points. If your pipeline depends on table extraction or handwriting, Qwen3.5-9B is not ready for production.

The community reception on Hacker News has been enthusiastic but measured. The Android implementation has drawn 780-plus GitHub stars and 2,000-plus downloads, and developers are excited about the hardware access story. Several comments flag that benchmark supremacy does not guarantee production reliability: users report the model handles routine tasks well but occasionally struggles with complex multi-step instructions that larger models handle cleanly.

The Bottom Line

Qwen3.5-9B scores 81.7 on GPQA Diamond — higher than OpenAI's 120-billion-parameter open model — and runs in 5GB of RAM on a laptop you already own. The 2B model runs on a $300 Android phone, in airplane mode, for free.

Alibaba has not built a GPT-4 killer. It has built something arguably more interesting: a family of models that brings graduate-level science reasoning, multilingual fluency, and vision capability to hardware that billions of people already carry in their pockets. For privacy-sensitive applications, offline use cases, and cost-driven deployments, the small model calculus has shifted decisively.

Cloud AI now has competition from a model that fits in your palm.
