Xiaomi MiMo Hits 1,000 Tokens-Per-Second Inference

June 12, 2026 | By LDS Team

Relevance Score: 7.0

Memeburn reports that Xiaomi's MiMo-V2.5-Pro-UltraSpeed reached 1,000 tokens per second (TPS) of inference on standard, rentable cloud GPUs, making it, by the outlet's account, the first trillion-parameter model to exceed that rate. According to Memeburn, the flagship is a 1.02-trillion-parameter Mixture-of-Experts (MoE) model that reached the milestone by combining three techniques: FP4 expert-layer quantization, DFlash block-level speculative decoding, and the TileRT persistent GPU runtime. Memeburn also reports a limited API trial for enterprise and professional developers running June 9-23, 2026, and says the FP4-DFlash checkpoint has been open-sourced on Hugging Face for independent testing. Industry observers note that runtime and quantization gains of this kind commonly reduce per-token cost and broaden access to high-throughput LLM inference on commodity GPUs.

What happened

Xiaomi, in collaboration with inference partner TileRT, released MiMo-V2.5-Pro-UltraSpeed on June 8, 2026, claiming it is the first trillion-parameter model to exceed 1,000 tokens per second on a standard 8-GPU commodity node, with peak throughput near 1,200 tokens per second. The base model, MiMo-V2.5-Pro, a 1.02-trillion-parameter Mixture-of-Experts (MoE) architecture released April 22, 2026, is unchanged; UltraSpeed is a high-speed serving mode layered on top.

Technical approach

The throughput gains combine three co-designed layers. First, FP4 (MXFP4) quantization is applied only to the MoE expert layers, the modules that hold most of the parameters and tolerate reduced precision best, while other modules retain higher precision. Quantization-Aware Training keeps benchmark performance "essentially on par" with the original, per Xiaomi's own benchmarks. Second, DFlash speculative decoding fills an entire block of masked positions in one forward pass rather than drafting token by token, using a Sliding Window Attention draft model to keep per-step compute constant. In coding scenarios the average acceptance length reaches 6.30 tokens per verification round, meaning that on average 6 to 7 of every 8 draft tokens are accepted. Third, TileRT's persistent GPU kernel eliminates the per-operator launch overhead that fragments execution at microsecond scale, using Warp Specialization to overlap data movement and compute continuously.
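To make the acceptance-length figure concrete, here is a minimal sketch of greedy block verification, the generic form of speculative decoding. It is not Xiaomi's DFlash implementation, which the sources do not describe at code level. The function names, the token IDs, and the convention that the target model supplies one extra token after the draft block are all assumptions for illustration; whether Xiaomi's 6.30 figure counts that extra token is not stated.

```python
from typing import List, Sequence, Tuple


def accepted_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count the leading draft tokens that match the target model's choices."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def verify_block(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one verification round.

    Assumption: `target` holds the target model's greedy token at every draft
    position plus one more, so len(target) == len(draft) + 1.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must be exactly one token longer than draft")
    n = accepted_length(draft, target)
    # Keep the matching prefix, then the target's own token at the first
    # mismatch (or its extra token when the whole block was accepted).
    return list(draft[:n]) + [target[n]]


def mean_accepted_length(rounds: Sequence[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average number of accepted draft tokens per verification round."""
    return sum(accepted_length(d, t) for d, t in rounds) / len(rounds)


if __name__ == "__main__":
    # Hypothetical token IDs: an 8-token draft block diverging at position 6.
    draft = [5, 9, 2, 7, 7, 1, 3, 8]
    target = [5, 9, 2, 7, 7, 1, 4, 6, 0]
    print(accepted_length(draft, target))  # 6
    print(verify_block(draft, target))     # [5, 9, 2, 7, 7, 1, 4]
```

In this scheme every round emits at least one token even when the first draft token is rejected, which is why a higher average acceptance length translates directly into fewer target-model forward passes per generated token.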

Industry context

Achieving 1,000+ tokens per second (TPS) on a 1T-parameter model via software co-design on commodity GPUs, rather than custom silicon such as Cerebras's wafer-scale hardware or Groq's on-chip SRAM, is significant for teams that cannot access specialized hardware. The MoE architecture keeps per-token active compute lower than equivalent dense models, which made the selective FP4-experts strategy tractable without broad quality regression.
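The FP4 expert quantization referred to above can be illustrated with a small pure-Python sketch of MXFP4-style block quantization. It assumes the commonly documented Microscaling layout: 32-element blocks, 4-bit E2M1 elements, and one shared 8-bit power-of-two scale per block. Xiaomi's actual kernels, rounding mode, and Quantization-Aware Training procedure are not described in the sources, so the function names and the tie-breaking rule here are illustrative only.

```python
import math
from typing import List, Tuple

# Magnitudes a 4-bit E2M1 element can represent (the sign is a separate bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32        # elements that share one scale
SCALE_BITS = 8         # one power-of-two scale per block
ELEMENT_BITS = 4
E2M1_MAX_EXPONENT = 2  # largest magnitude is 1.5 * 2**2 = 6.0


def quantize_block(block: List[float]) -> Tuple[int, List[float]]:
    """Quantize one block to (shared exponent, list of signed E2M1 values)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Choose the power-of-two scale so the largest value lands in E2M1 range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_MAX_EXPONENT
    scale = 2.0 ** shared_exp
    quantized = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        # Nearest representable magnitude; ties fall to the smaller value here,
        # which is a simplification and may differ from hardware rounding.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        quantized.append(math.copysign(nearest, x))
    return shared_exp, quantized


def dequantize_block(shared_exp: int, quantized: List[float]) -> List[float]:
    """Reconstruct approximate values from the shared exponent and elements."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


def bits_per_weight() -> float:
    """Storage cost per weight, including the amortized shared scale."""
    return ELEMENT_BITS + SCALE_BITS / BLOCK_SIZE


if __name__ == "__main__":
    exp, q = quantize_block([12.0, -3.0, 1.0, 0.3])
    print(exp, q, dequantize_block(exp, q))
    print(bits_per_weight())  # 4.25
```

Under these assumptions each quantized weight costs 4.25 bits rather than 16 for a BF16 weight, which is why restricting the format to the parameter-heavy expert layers captures most of the memory and bandwidth savings.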

Access and open source

The UltraSpeed API trial runs June 9-23, 2026, is application-based and limited to enterprises and professional developers, and is priced at 3x the standard MiMo-V2.5-Pro rate for roughly 10x the generation speed. Xiaomi has open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face, including quantized weights and DFlash parameters, enabling independent community benchmarking.
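The 3x price for roughly 10x speed trade-off can be worked through with simple arithmetic. The sketch below is illustrative only: the base price and base throughput are placeholder inputs, not published figures, and the function name is hypothetical. Only the 3x and 10x multipliers come from the reporting above.

```python
from typing import Dict


def compare_tiers(
    tokens: int,
    base_price_per_million: float,  # placeholder price, any currency unit
    base_tps: float,                # placeholder standard-tier throughput
    price_multiplier: float = 3.0,  # UltraSpeed price vs. standard, as reported
    speed_multiplier: float = 10.0, # UltraSpeed speed vs. standard, as reported
) -> Dict[str, Dict[str, float]]:
    """Return cost and generation time for the same output on both tiers."""
    standard_cost = tokens / 1_000_000 * base_price_per_million
    standard_seconds = tokens / base_tps
    return {
        "standard": {"cost": standard_cost, "seconds": standard_seconds},
        "ultraspeed": {
            "cost": standard_cost * price_multiplier,
            "seconds": tokens / (base_tps * speed_multiplier),
        },
    }


if __name__ == "__main__":
    # Hypothetical inputs: 1M output tokens, price 1.0 per million, 100 TPS.
    result = compare_tiers(1_000_000, base_price_per_million=1.0, base_tps=100.0)
    print(result["standard"])    # {'cost': 1.0, 'seconds': 10000.0}
    print(result["ultraspeed"])  # {'cost': 3.0, 'seconds': 1000.0}
```

With these placeholder inputs the same output costs three times as much but arrives in a tenth of the time, so the tier pays off only where wall-clock latency is worth more than the per-token premium.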

What to watch

Independent community benchmarks of the open-sourced FP4-DFlash checkpoint for quality regressions; replication of the 1,000 TPS figure on common GPU instances beyond the trial; acceptance-length performance in open-ended conversation scenarios where current results are weaker; and whether TileRT's persistent kernel approach ports to standard serving stacks.

Key Points

1. Trillion-parameter MoE plus FP4 quantization, DFlash speculative decoding, and a persistent GPU runtime can unlock 1,000+ TPS on a standard 8-GPU commodity node, potentially lowering per-token inference cost at scale.
2. Open-sourcing the FP4-DFlash checkpoint on Hugging Face enables independent validation, tuning, and faster community-driven performance improvements.
3. UltraSpeed is priced at 3x the standard MiMo-V2.5-Pro rate for roughly 10x the generation speed, per Xiaomi, making it a distinct cost-performance tier for latency-critical workloads such as coding agents and real-time decision loops.

Scoring Rationale

A vendor-reported but well-documented inference milestone: 1,000+ TPS on a 1T MoE model using only commodity GPUs via FP4, DFlash, and a persistent runtime. Relevant to practitioners tracking inference cost and latency. Scored as notable rather than major given the limited, gated trial, vendor-only benchmarks pending independent replication, and weaker acceptance rates in open-ended conversation.

Sources

Public references used for this report.

memeburn.com: Xiaomi MiMo Is Now 15x Faster Than ChatGPT: Here's What That Actually Means

gizmochina.com: Xiaomi MiMo-V2.5-Pro gets UltraSpeed Mode, breaks 1000 tokens per second

mimo.xiaomi.com: MiMo-V2.5-Pro-UltraSpeed: Pushing 1T-Parameter Model Generation Speed to 1000 TPS

platform.xiaomimimo.com: MiMo-V2.5-Pro-UltraSpeed - Official API Documentation
