Xiaomi MiMo Hits 1,000 Tokens-Per-Second Inference

Memeburn reports that Xiaomi's MiMo-V2.5-Pro-UltraSpeed achieved 1,000 tokens-per-second inference on standard, rentable cloud GPUs, which the outlet describes as a first for a trillion-parameter model. According to Memeburn, the flagship is a 1.02-trillion-parameter Mixture-of-Experts (MoE) model and reached the milestone by combining three techniques: FP4 expert-layer quantization, DFlash block-level speculative decoding, and the TileRT persistent GPU runtime. Memeburn also reports a limited API trial for enterprise and professional developers running June 9-23, 2026, and says the FP4-DFlash checkpoint has been open-sourced on Hugging Face for independent testing. Industry observers note that runtime and quantization gains of this kind commonly reduce per-token cost and broaden access to high-throughput LLM inference on commodity GPUs.
What happened
Xiaomi, in collaboration with inference partner TileRT, released MiMo-V2.5-Pro-UltraSpeed on June 8, 2026, claiming it is the first trillion-parameter model to exceed 1,000 tokens per second on a standard 8-GPU commodity node, with peak throughput near 1,200 tokens per second. The base model, MiMo-V2.5-Pro, is a 1.02-trillion-parameter Mixture-of-Experts (MoE) architecture released April 22, 2026, and is unchanged; UltraSpeed is a high-speed serving mode layered on top.
Technical approach
The throughput gains come from three co-designed layers. First, FP4 (MXFP4) quantization is applied selectively to the MoE expert layers only, the modules that hold most of the parameters and tolerate reduced precision best, while other modules retain higher precision. Quantization-aware training (QAT) keeps benchmark performance "essentially on par" with the original, per Xiaomi's own benchmarks. Second, DFlash speculative decoding fills an entire block of masked positions in one forward pass rather than generating token by token, using a Sliding Window Attention draft model to keep per-step compute constant. In coding scenarios the average acceptance length reaches 6.30 per verification round, meaning that, on average, between 6 and 7 of 8 draft tokens are accepted. Third, TileRT's persistent GPU kernel eliminates the per-operator launch overhead that fragments execution at microsecond scale, using warp specialization to overlap data movement and compute continuously.
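The verification step in block-level speculative decoding can be illustrated with a generic greedy acceptance check. This is a minimal sketch under stated assumptions, not DFlash's actual acceptance rule, which the report does not describe: the function names and toy token IDs are hypothetical, and it assumes the target model's greedy choice for every position in the block comes back from a single verification pass.

```python
from typing import List, Sequence


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count how many leading draft tokens match the target's greedy choices."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def verify_round(draft_block: Sequence[int], target_choices: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one verification round.

    target_choices must hold len(draft_block) + 1 entries: entry i is the
    target's greedy token given the prompt plus draft_block[:i]. Entry n
    (right after the accepted prefix) is always valid, so it is kept as a
    correction (on mismatch) or a bonus token (if the whole block matched).
    """
    if len(target_choices) != len(draft_block) + 1:
        raise ValueError("need one target choice per draft position, plus one")
    n = accepted_prefix_length(draft_block, target_choices)
    return list(draft_block[:n]) + [target_choices[n]]


def mean_acceptance_length(accepted_counts: Sequence[int]) -> float:
    """Average number of accepted draft tokens per verification round."""
    return sum(accepted_counts) / len(accepted_counts)


# Toy round with a block of 8 draft tokens (hypothetical token IDs).
draft = [5, 9, 2, 7, 7, 1, 4, 8]
target = [5, 9, 2, 7, 7, 1, 3, 8, 6]
emitted = verify_round(draft, target)
print(accepted_prefix_length(draft, target), emitted)
```

In this toy round six of the eight draft tokens are accepted and the target supplies a seventh, so one verification pass emits seven tokens where plain autoregressive decoding would emit one. Whether the reported 6.30 figure counts such a target-supplied token is not stated in the source.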
Industry context
Achieving 1,000+ tokens per second (TPS) on a 1T-parameter model through software co-design on commodity GPUs, rather than custom silicon such as Cerebras's Wafer-Scale Engine or Groq's on-chip SRAM design, is significant for teams that cannot access specialized hardware. The MoE architecture keeps per-token active compute lower than in an equivalently sized dense model, which made the selective FP4-experts strategy tractable without broad quality regression.
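To make the FP4-experts idea concrete, here is a minimal pure-Python sketch of MXFP4-style block quantization as generally described in the OCP Microscaling formats specification: 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 number. It is an illustration only, not Xiaomi's or TileRT's implementation; the function names are hypothetical, and the rounding is a simple nearest-value pick rather than whatever rounding mode a production kernel uses.

```python
import math
from typing import List, Sequence, Tuple

# Magnitudes representable by a 4-bit E2M1 float (sign is a separate bit).
FP4_E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements sharing one scale in MXFP4
E2M1_EMAX = 2     # exponent of the largest E2M1 binade (6.0 = 1.5 * 2**2)

# Storage cost: 32 four-bit elements plus one 8-bit shared scale.
BITS_PER_ELEMENT = (BLOCK_SIZE * 4 + 8) / BLOCK_SIZE


def quantize_block(values: Sequence[float]) -> Tuple[int, List[float]]:
    """Return (shared_exponent, FP4 element values) for one block."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        mag = min(abs(v) / scale, FP4_E2M1_VALUES[-1])  # clamp to the FP4 maximum
        nearest = min(FP4_E2M1_VALUES, key=lambda q: abs(q - mag))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements


def dequantize_block(shared_exp: int, elements: Sequence[float]) -> List[float]:
    """Rebuild approximate values from the shared exponent and FP4 elements."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in elements]


# Toy block of 32 "weights" (hypothetical values).
weights = [0.75, -0.375, 0.1, 0.0] + [0.05] * 28
shared_exp, elements = quantize_block(weights)
restored = dequantize_block(shared_exp, elements)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(shared_exp, BITS_PER_ELEMENT, max_error)
```

The shared scale lets small and large blocks both use the full FP4 range, at an effective cost of 4.25 bits per weight versus 16 for BF16, which is why quantizing the parameter-heavy expert layers accounts for most of the memory and bandwidth savings.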
Access and open source
The UltraSpeed API trial runs June 9-23, 2026, is application-based and limited to enterprises and professional developers, and is priced at 3x the standard MiMo-V2.5-Pro rate for roughly 10x the generation speed. Xiaomi has open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face, including quantized weights and DFlash parameters, enabling independent community benchmarking.
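The stated trade-off reduces to simple arithmetic. The sketch below works in normalized units because the report gives only the multipliers (3x price, roughly 10x speed), not absolute prices or baseline throughput; the function name and the unit baseline values are placeholders, not published figures.

```python
from typing import Tuple


def job_estimate(tokens: int, price_per_token: float, tokens_per_second: float) -> Tuple[float, float]:
    """Return (cost, wall-clock seconds) for generating `tokens` tokens."""
    return tokens * price_per_token, tokens / tokens_per_second


# Normalized baseline: 1 price unit per token, 1 token per second (placeholders).
BASE_PRICE = 1.0
BASE_SPEED = 1.0
PRICE_MULTIPLIER = 3.0    # "3x the standard MiMo-V2.5-Pro rate"
SPEED_MULTIPLIER = 10.0   # "roughly 10x the generation speed"

tokens = 1_000_000
std_cost, std_time = job_estimate(tokens, BASE_PRICE, BASE_SPEED)
fast_cost, fast_time = job_estimate(
    tokens, BASE_PRICE * PRICE_MULTIPLIER, BASE_SPEED * SPEED_MULTIPLIER
)

cost_ratio = fast_cost / std_cost   # how much more the same job costs
time_ratio = fast_time / std_time   # fraction of the original wall-clock time
print(cost_ratio, time_ratio)
```

Under these multipliers the same job costs three times as much and finishes in about a tenth of the time, so the tier pays off only where latency, rather than per-token cost, is the binding constraint.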
What to watch
Independent community benchmarks of the open-sourced FP4-DFlash checkpoint for quality regressions; replication of the 1,000 TPS figure on common GPU instances beyond the trial; acceptance-length performance in open-ended conversation scenarios where current results are weaker; and whether TileRT's persistent kernel approach ports to standard serving stacks.
Key Points
- Trillion-parameter MoE plus FP4 quantization, DFlash speculative decoding, and a persistent GPU runtime can unlock 1,000+ TPS on a standard 8-GPU commodity node, potentially lowering per-token inference cost at scale.
- Open-sourcing the FP4-DFlash checkpoint on Hugging Face enables independent validation, tuning, and faster community-driven performance improvements.
- UltraSpeed is priced at 3x the standard MiMo-V2.5-Pro rate for roughly 10x the generation speed, per Xiaomi, making it a distinct cost-performance tier for latency-critical workloads such as coding agents and real-time decision loops.
Scoring Rationale
A vendor-reported but well-documented inference milestone: 1,000+ TPS on a 1T MoE model using only commodity GPUs via FP4, DFlash, and a persistent runtime. Relevant to practitioners tracking inference cost and latency. Scored as notable rather than major given the limited, gated trial, vendor-only benchmarks pending independent replication, and weaker acceptance lengths in open-ended conversation.
