
Google LiteRT-LM Accelerates Gemma 4 Local Inference

June 5, 2026 | By LDS Team

Relevance Score: 6.8


Google has added native support for Gemma 4 Multi-Token Prediction (MTP) to LiteRT-LM, its on-device LLM runtime built on LiteRT (formerly TensorFlow Lite), InfoQ reports. According to Google, MTP raises decoding speed by 1.6x for Gemma 4 E2B and 2.2x for Gemma 4 E4B, and the runtime delivers 1.8x to 3.7x faster prefill and decode than competing frameworks Google names, including llama.cpp, MLX, Cactus, and ONNX. InfoQ reports LiteRT-LM combines advanced quantization, accelerated XNNPACK and MLDrift kernels, speculative decoding for MTP, and pipelines that reduce CPU-GPU transfers. The runtime now exposes Swift and JavaScript APIs in addition to Kotlin and C++, per InfoQ.

What happened

Google added native support for Gemma 4 Multi-Token Prediction (MTP) to LiteRT-LM, its on-device LLM runtime built on LiteRT (formerly TensorFlow Lite), InfoQ reports. According to Google, MTP raises decoding speed by 1.6x for Gemma 4 E2B and 2.2x for Gemma 4 E4B versus single-token baselines, and the runtime delivers 1.8x to 3.7x faster prefill and decode than competing frameworks Google names, including llama.cpp, MLX, Cactus, and ONNX. The release also adds Swift and JavaScript bindings alongside existing Kotlin and C++ support, per InfoQ.

How it works

Per InfoQ's coverage of Google's description, LiteRT-LM layers an orchestration runtime on top of LiteRT and combines advanced quantization with accelerated XNNPACK and MLDrift kernels. It uses speculative decoding for MTP, shares a local KV cache, and optimizes pipelines to minimize CPU-GPU memory transfers, a common bottleneck for mobile LLMs. Session-management features support multi-turn use on device.
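The draft-and-verify idea behind speculative decoding can be illustrated with a toy example. This is a minimal sketch of the generic greedy variant, not LiteRT-LM's implementation: `target_next`, `draft_next`, and `speculative_decode` are hypothetical names, and both "models" are stand-in arithmetic functions rather than Gemma 4 or its MTP drafter.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the large model: deterministic next token."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context: List[Token]) -> Token:
    """Toy stand-in for a cheap drafter: agrees with the target
    except when the context length is a multiple of 4."""
    guess = target_next(context)
    return (guess + 1) % 50 if len(context) % 4 == 0 else guess


def greedy_decode(prompt: List[Token], steps: int, model: NextToken) -> List[Token]:
    """Plain one-token-per-step decoding, used as the reference."""
    out = list(prompt)
    for _ in range(steps):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(
    prompt: List[Token], steps: int, target: NextToken, draft: NextToken, k: int = 4
) -> Tuple[List[Token], int]:
    """Return (generated tokens, number of verification passes)."""
    out = list(prompt)
    generated = 0
    passes = 0
    while generated < steps:
        # 1. The drafter proposes k tokens autoregressively.
        proposal: List[Token] = []
        ctx = list(out)
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # 2. The target checks the proposal. A real runtime would score all
        #    positions in one batched forward pass; here we count it as one pass.
        passes += 1
        accepted: List[Token] = []
        ctx = list(out)
        for tok in proposal:
            expected = target(ctx)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong draft token
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target(ctx))  # all drafts accepted: one bonus token

        accepted = accepted[: steps - generated]
        out.extend(accepted)
        generated += len(accepted)
    return out[len(prompt):], passes
```

Because every accepted token is one the target would have chosen itself, the output matches plain greedy decoding; the gain comes from needing fewer target passes than tokens whenever the drafter guesses well.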

Why it matters

On-device decoding speedups of this size affect latency, battery use, and the size of model a phone or laptop can run locally, expanding what is feasible without a server round trip.

Editorial analysis - industry pattern

Pairing speculative multi-token drafting with shared local caches is an emerging technique for lifting on-device throughput without moving to larger models. The reported speedups are measured against baselines Google selected, so independent benchmarking on representative hardware is the usual next step before teams rely on the figures for capacity planning.
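For teams that want to check vendor figures themselves, a small timing harness is usually enough to start. The sketch below is an assumption-laden outline: `tokens_per_second` and `speedup` are hypothetical helpers, and `generate` is any callable you supply that runs your runtime of choice and returns the number of tokens it produced. No LiteRT-LM or llama.cpp API is assumed.

```python
import statistics
import time
from typing import Callable


def tokens_per_second(
    generate: Callable[[int], int],
    n_tokens: int,
    runs: int = 5,
    warmup: int = 1,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Median decode throughput of `generate` over several timed runs."""
    for _ in range(warmup):
        generate(n_tokens)  # discard cold-start effects (caches, JIT, etc.)

    rates = []
    for _ in range(runs):
        start = clock()
        produced = generate(n_tokens)
        elapsed = clock() - start
        if elapsed <= 0:
            raise ValueError("elapsed time was not positive; use more tokens")
        rates.append(produced / elapsed)
    return statistics.median(rates)


def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    return candidate_tps / baseline_tps
```

Using the median rather than the mean limits the effect of one-off stalls such as thermal throttling, which matter on phones; running the same prompts and token counts on each runtime keeps the comparison like for like.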

Key Points

1. LiteRT-LM adds native Gemma 4 multi-token prediction, raising on-device decoding speed 1.6x (E2B) to 2.2x (E4B) per Google, cutting per-inference latency.
2. Google reports 1.8x to 3.7x faster prefill and decode than llama.cpp, MLX, Cactus, and ONNX, plus new Swift and JavaScript bindings.
3. Pairing speculative multi-token drafting with shared local KV caches is an emerging way to lift on-device throughput without moving to larger models.

Scoring Rationale

A concrete, benchmarked improvement to on-device LLM inference that meaningfully affects latency, battery, and feasible local model size for mobile and edge developers. It is a notable engineering and tooling release with real adoption implications, but not a frontier-model launch, so it sits in the notable range.

Sources

Public references used for this report.

infoq.com: Google LiteRT-LM Speeds Up Local Inference Up to 2.2x With Gemma 4

developers.googleblog.com: Blazing fast on-device GenAI with LiteRT-LM

blog.google: Accelerating Gemma 4: faster inference with multi-token prediction drafters

ai.google.dev: Speed-up Gemma 4 with Multi-Token Prediction
