Google LiteRT-LM Accelerates Gemma 4 Local Inference

Google has added native support for Gemma 4 Multi-Token Prediction (MTP) to LiteRT-LM, its on-device LLM runtime built on LiteRT (formerly TensorFlow Lite), InfoQ reports. According to Google, MTP raises decoding speed by 1.6x for Gemma 4 E2B and 2.2x for Gemma 4 E4B, and the runtime delivers 1.8x to 3.7x faster prefill and decode than competing frameworks Google names, including llama.cpp, MLX, Cactus, and ONNX. InfoQ reports LiteRT-LM combines advanced quantization, accelerated XNNPACK and MLDrift kernels, speculative decoding for MTP, and pipelines that reduce CPU-GPU transfers. The runtime now exposes Swift and JavaScript APIs in addition to Kotlin and C++, per InfoQ.
What happened
Google added native support for Gemma 4 Multi-Token Prediction (MTP) to LiteRT-LM, its on-device LLM runtime built on LiteRT (formerly TensorFlow Lite), InfoQ reports. According to Google, MTP raises decoding speed by 1.6x for Gemma 4 E2B and 2.2x for Gemma 4 E4B versus single-token baselines, and the runtime delivers 1.8x to 3.7x faster prefill and decode than competing frameworks Google names, including llama.cpp, MLX, Cactus, and ONNX. The release also adds Swift and JavaScript bindings alongside existing Kotlin and C++ support, per InfoQ.
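To put the reported decode multipliers in latency terms, the arithmetic can be sketched as follows. This is an illustration, not part of Google's announcement: it assumes the multipliers are throughput ratios (tokens per second), and the `decode_share` parameter, the fraction of a request's time spent decoding, is a hypothetical input that depends on prompt and output length.

```python
def decode_time_fraction(speedup: float) -> float:
    """Fraction of the original decode time left after a throughput speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def end_to_end_speedup(decode_share: float, decode_speedup: float) -> float:
    """Amdahl-style estimate: only the decode portion of a request gets faster.

    decode_share is an assumed fraction (0..1) of total request time spent
    in decoding before the speedup; the rest (e.g. prefill) is unchanged.
    """
    if not 0.0 <= decode_share <= 1.0:
        raise ValueError("decode_share must be between 0 and 1")
    if decode_speedup <= 0:
        raise ValueError("decode_speedup must be positive")
    return 1.0 / ((1.0 - decode_share) + decode_share / decode_speedup)


if __name__ == "__main__":
    # Reported decode multipliers from the article.
    for name, s in [("Gemma 4 E2B", 1.6), ("Gemma 4 E4B", 2.2)]:
        saved = 1.0 - decode_time_fraction(s)
        print(f"{name}: {s}x decode -> {saved:.1%} less decode time per token")
```

Under these assumptions a 1.6x decode speedup leaves 62.5% of the original per-token decode time, and the end-to-end gain shrinks as the decode share of a request falls.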
How it works
Per InfoQ's coverage of Google's description, LiteRT-LM layers an orchestration runtime on top of LiteRT and combines advanced quantization with accelerated XNNPACK and MLDrift kernels. It uses speculative decoding for MTP, shares a local KV cache, and optimizes pipelines to minimize CPU-GPU memory transfers, a common bottleneck for mobile LLMs. Session-management features support multi-turn use on device.
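The draft-and-verify idea behind speculative decoding can be sketched in a few lines. This is a generic, greedy toy version and not LiteRT-LM's implementation or API: `toy_target` and `toy_draft` are hypothetical stand-ins for the full model and a cheap drafter, the verification step is counted as one target pass even though this sketch calls the target once per position, and the usual bonus token after a fully accepted draft is omitted for brevity.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # context -> next token id


def greedy_decode(target_next: NextToken, prompt: List[int], n: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    n: int,
    draft_len: int = 4,
) -> Tuple[List[int], int]:
    """Greedy draft-and-verify decoding. Returns (new_tokens, target_passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n
    target_passes = 0
    while len(tokens) < goal:
        # 1. Draft: cheaply propose up to draft_len tokens.
        k = min(draft_len, goal - len(tokens))
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Verify: a real runtime would score all k positions in one
        #    batched target pass; we count it as one pass here.
        target_passes += 1
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                tokens.append(draft[i])   # accept the drafted token
            else:
                tokens.append(expected)   # take the target's token, stop
                break
    return tokens[len(prompt):], target_passes


def toy_target(ctx: List[int]) -> int:
    """Hypothetical 'full model': counts upward modulo 50."""
    return (ctx[-1] + 1) % 50


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except on every 5th position."""
    nxt = (ctx[-1] + 1) % 50
    return nxt if len(ctx) % 5 else (nxt + 1) % 50


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target, toy_draft, [0], 12)
    print(out, "target passes:", passes)
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding; any gain comes from needing fewer target passes when the drafter's guesses are accepted.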
Why it matters
Decode speedups of this size lower response latency, can reduce battery drain per generated token, and make larger models practical to run on a phone or laptop, which expands what is feasible without a server round trip.
Editorial analysis - industry pattern
Pairing speculative multi-token drafting with shared local caches is an emerging technique for lifting on-device throughput without moving to larger models. The reported speedups are measured against baselines Google selected, so independent benchmarking on representative hardware is the usual next step before teams rely on the figures for capacity planning.
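For teams that want to check such figures themselves, a minimal throughput harness might look like the sketch below. It is runtime-agnostic and assumes nothing about LiteRT-LM's API: `generate` is a hypothetical callable that you would wrap around whichever runtime is under test, taking a token budget and returning the number of tokens it actually produced.

```python
import statistics
import time
from typing import Callable


def tokens_per_second(
    generate: Callable[[int], int],
    n_tokens: int,
    warmup: int = 1,
    repeats: int = 5,
) -> float:
    """Median decode throughput over several timed runs.

    generate(n_tokens) is a hypothetical wrapper around the runtime under
    test; it must return the number of tokens it actually produced.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    # Warm-up runs absorb one-time costs such as model load and cache setup.
    for _ in range(warmup):
        generate(n_tokens)
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        produced = generate(n_tokens)
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid divide-by-zero
        rates.append(produced / elapsed)
    # The median is less sensitive to a single slow run than the mean.
    return statistics.median(rates)


def fake_generate(n_tokens: int) -> int:
    """Stand-in for a real runtime call: sleeps briefly, then reports n_tokens."""
    time.sleep(0.001)
    return n_tokens


if __name__ == "__main__":
    print(f"{tokens_per_second(fake_generate, 64):.0f} tokens/s (fake runtime)")
```

Running the same harness with identical prompts, output lengths, and device conditions for each runtime keeps the comparison like for like.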
Scoring Rationale
This is a concrete, benchmarked improvement to on-device LLM inference that meaningfully affects latency, battery use, and feasible local model size for mobile and edge developers. It is a notable engineering and tooling release with real adoption implications, but it is not a frontier-model launch, so it sits in the notable range.


