Xiaomi MiMo Hits 1,000 Tokens-Per-Second Inference

Memeburn reports that Xiaomi's MiMo-V2.5-Pro-UltraSpeed reached 1,000 tokens per second of inference throughput on standard, rentable cloud GPUs, making it, by the outlet's account, the first trillion-parameter model to exceed that rate. According to Memeburn, the flagship is a 1.02-trillion-parameter Mixture-of-Experts (MoE) model and reached the milestone by combining three techniques: FP4 expert-layer quantization, DFlash block-level speculative decoding, and the TileRT persistent GPU runtime. Memeburn also reports a limited API trial for enterprise and professional developers running June 9-23, 2026, and says the FP4-DFlash checkpoint has been open-sourced on Hugging Face for independent testing. Editorial analysis: Industry observers note that runtime and quantization gains of this kind commonly reduce per-token cost and broaden access to high-throughput LLM inference on commodity GPUs.
What happened
Xiaomi, in collaboration with inference partner TileRT, released MiMo-V2.5-Pro-UltraSpeed on June 8, 2026, claiming it is the first trillion-parameter model to exceed 1,000 tokens per second on a standard 8-GPU commodity node, with peak throughput near 1,200 tokens per second. The base model, MiMo-V2.5-Pro, is a 1.02-trillion-parameter Mixture-of-Experts (MoE) architecture released April 22, 2026, and it is unchanged; UltraSpeed is a high-speed serving mode layered on top.
Technical approach
The throughput gains come from three co-designed layers. First, FP4 (MXFP4) quantization is applied selectively to the MoE expert layers only, the modules that hold most of the parameters and tolerate reduced precision best, while other modules retain higher precision. Quantization-Aware Training keeps benchmark performance "essentially on par" with the original, per Xiaomi's own benchmarks. Second, DFlash speculative decoding fills an entire block of masked positions in one forward pass rather than drafting token by token, using a Sliding Window Attention draft model to keep per-step compute constant. In coding scenarios the average acceptance length reaches 6.30 per verification round, meaning that on average 6 to 7 of the 8 draft tokens are accepted. Third, TileRT's persistent GPU kernel eliminates the per-operator launch overhead that fragments execution at microsecond scale, using warp specialization to overlap data movement and compute continuously.
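To make the acceptance-length figure concrete, the following is a minimal Python sketch of how a block of draft tokens might be checked against the target model and how the average is computed. It assumes simple greedy verification, where a draft token is kept only while it matches the target model's own choice; that is a common simplification and not necessarily DFlash's actual acceptance rule. The function names and toy token IDs are hypothetical, and the article does not say whether 6.30 counts the extra token the verifier itself emits.

```python
from typing import Sequence, Tuple, List


def accepted_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count the leading draft tokens that match the target model's choices."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # the first mismatch ends the accepted prefix
        n += 1
    return n


def mean_accepted_length(rounds: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average accepted-prefix length over (draft, target) verification rounds."""
    lengths = [accepted_length(d, t) for d, t in rounds]
    return sum(lengths) / len(lengths)


BLOCK_SIZE = 8        # draft tokens per round, implied by the article's "6-7 of 8"
REPORTED_MEAN = 6.30  # reported average acceptance length in coding scenarios

# Average share of drafted tokens that survive verification.
survival_fraction = REPORTED_MEAN / BLOCK_SIZE

# Toy data, made up for illustration: three verification rounds.
toy_rounds = [
    ([1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 6, 7, 8]),  # all 8 match
    ([1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 6, 9, 8]),  # mismatch at position 7
    ([1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 0, 7, 8]),  # mismatch at position 6
]
toy_mean = mean_accepted_length(toy_rounds)

print(f"survival fraction: {survival_fraction:.4f}")
print(f"toy mean accepted length: {toy_mean:.2f}")
```

Note that tokens after the first mismatch are discarded even if they happen to match, which is why acceptance length rather than raw match count is the figure that determines how many tokens each verification pass yields.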
Industry context
Achieving 1,000+ TPS on a 1T model via software co-design on commodity GPUs, rather than on custom silicon such as Cerebras's wafer-scale hardware or Groq's on-chip SRAM designs, is significant for teams that cannot access specialized hardware. The MoE architecture keeps per-token active compute lower than in a dense model of equivalent size, which makes the selective FP4-experts strategy tractable without broad quality regression.
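A rough back-of-the-envelope sketch shows why quantizing only the expert layers can matter for fitting a model of this size on a commodity node. Only the 1.02-trillion parameter count comes from the article. The share of parameters held by experts (`EXPERT_FRACTION`) and the 16-bit precision for the baseline and for non-expert modules are assumptions chosen for illustration, not reported values. The 4.25 bits per weight reflects the MXFP4 layout of 4-bit elements with one shared 8-bit scale per 32-element block.

```python
TOTAL_PARAMS = 1.02e12   # from the article
EXPERT_FRACTION = 0.95   # ASSUMPTION: share of parameters in MoE expert layers
HIGH_PRECISION_BITS = 16 # ASSUMPTION: precision of baseline and non-expert weights

MXFP4_BLOCK = 32                    # elements sharing one scale in MXFP4
MXFP4_BITS = 4 + 8 / MXFP4_BLOCK    # 4-bit element plus amortized 8-bit scale


def weight_bytes(params: float, bits_per_param: float) -> float:
    """Storage in bytes for `params` weights at the given bit width."""
    return params * bits_per_param / 8


def footprint_gb(expert_bits: float, other_bits: float) -> float:
    """Total weight footprint in GB (1e9 bytes) for a mixed-precision split."""
    expert = weight_bytes(TOTAL_PARAMS * EXPERT_FRACTION, expert_bits)
    other = weight_bytes(TOTAL_PARAMS * (1 - EXPERT_FRACTION), other_bits)
    return (expert + other) / 1e9


baseline_gb = footprint_gb(HIGH_PRECISION_BITS, HIGH_PRECISION_BITS)
mixed_gb = footprint_gb(MXFP4_BITS, HIGH_PRECISION_BITS)

print(f"all weights at 16-bit:     {baseline_gb:,.0f} GB")
print(f"experts at MXFP4, rest 16: {mixed_gb:,.0f} GB")
print(f"reduction factor:          {baseline_gb / mixed_gb:.2f}x")
```

Under these assumptions the weights alone shrink by roughly a factor of three; the real figure depends on the actual expert share and the precision of the original checkpoint, neither of which the article states.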
Access and open source
The UltraSpeed API trial runs June 9-23, 2026, is application-based and limited to enterprises and professional developers, and is priced at 3x the standard MiMo-V2.5-Pro rate for roughly 10x the generation speed. Xiaomi has open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face, including quantized weights and DFlash parameters, enabling independent community benchmarking.
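The pricing trade-off can be worked through with a short Python sketch. The 3x price and 10x speed multipliers come from the article; the base price, base throughput, and job size are made-up placeholders, and the sketch assumes billing is per output token, which the article does not confirm.

```python
PRICE_MULTIPLIER = 3.0   # from the article: 3x the standard rate
SPEED_MULTIPLIER = 10.0  # from the article: roughly 10x generation speed


def compare_job(output_tokens: int, base_price_per_million: float, base_tps: float) -> dict:
    """Cost and generation time for one job on the standard and UltraSpeed tiers."""
    base_cost = output_tokens / 1e6 * base_price_per_million
    base_seconds = output_tokens / base_tps
    return {
        "standard_cost": base_cost,
        "ultraspeed_cost": base_cost * PRICE_MULTIPLIER,
        "standard_seconds": base_seconds,
        "ultraspeed_seconds": base_seconds / SPEED_MULTIPLIER,
    }


# Hypothetical job: 200,000 output tokens, a placeholder price of 1.00 per
# million tokens, and a placeholder standard speed of 100 tokens per second.
job = compare_job(200_000, 1.00, 100.0)

for key, value in job.items():
    print(f"{key}: {value:.2f}")
```

Under per-token billing the same output costs three times as much but arrives in about a tenth of the time, so the tier mainly pays off where latency, not cost per token, is the binding constraint.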
What to watch
Points to watch include independent community benchmarks of the open-sourced FP4-DFlash checkpoint for quality regressions; replication of the 1,000 TPS figure on common GPU instances beyond the trial; acceptance-length performance in open-ended conversation scenarios, where current results are weaker; and whether TileRT's persistent-kernel approach ports to standard serving stacks.
Scoring Rationale
A vendor-reported but well-documented inference milestone: 1,000+ TPS on a 1T MoE model using only commodity GPUs via FP4, DFlash, and a persistent runtime. Relevant to practitioners tracking inference cost and latency. Scored as notable rather than major given the limited, gated trial, vendor-only benchmarks pending independent replication, and weaker acceptance rates in open-ended conversation.

