Xiaomi MiMo Code claims edge over Claude Code on 200-step tasks
Xiaomi's MiMo Code is the most direct open-source attack yet on Claude Code's long-horizon niche, and its architecture - not its benchmark table - is the part practitioners should study. Xiaomi released MiMo Code V0.1 around June 10-11, 2026 under an MIT license: a terminal-native coding agent forked from OpenCode, with a persistent memory system (project memory, session checkpoints, task-progress logs) maintained by an independent record-keeping subagent. In Xiaomi's own controlled experiment with both harnesses running the same MiMo model, MiMo Code scored 73% vs Claude Code's 68% on Terminal-Bench 2 and 62% vs 57% on SWE-Bench Pro; some launch coverage circulated larger gaps (86.7% vs 65.4%) from different configurations. Xiaomi also cites a 576-developer internal A/B evaluation where MiMo Code's win rate passed 65% beyond 200 steps. All figures are vendor-run and await independent replication.
Why this matters
The endurance problem - agents degrading as tasks stretch past dozens of steps - is the current frontier of practical coding-agent engineering. MiMo Code's design answers it with engineering rather than model scale: a dedicated subagent owns record-keeping, checkpointing state and summarizing context before the window fills, so the main agent never starts from scratch. That pattern is portable to any agent framework, which makes this release worth attention even from teams that will never run a Xiaomi model.
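The record-keeping pattern can be sketched in a few lines. This is a minimal illustration, not MiMo Code's implementation: the `RecordKeeper` class, the 80% threshold, the whitespace token estimate, and the string-based "summary" (a stand-in for a model-generated summary) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Checkpoint:
    step: int
    summary: str


@dataclass
class RecordKeeper:
    """Hypothetical record-keeping subagent: it owns checkpoints, the main agent does not."""
    token_budget: int        # assumed context-window size, in tokens
    threshold: float = 0.8   # assumed fill fraction that triggers compaction
    checkpoints: List[Checkpoint] = field(default_factory=list)

    def should_compact(self, used_tokens: int) -> bool:
        return used_tokens >= self.threshold * self.token_budget

    def compact(self, step: int, transcript: List[str]) -> List[str]:
        # Placeholder for a real summarization call made by the subagent.
        summary = f"{len(transcript)} entries compacted; latest: {transcript[-1]}"
        self.checkpoints.append(Checkpoint(step, summary))
        # The main agent continues from the summary instead of the full history.
        return [f"[checkpoint @ step {step}] {summary}"]


def estimate_tokens(transcript: List[str]) -> int:
    # Crude whitespace count; a real harness would use the model's tokenizer.
    return sum(len(message.split()) for message in transcript)


def run_task(steps: int, keeper: RecordKeeper) -> List[str]:
    transcript: List[str] = []
    for step in range(1, steps + 1):
        transcript.append(f"step {step}: edited file_{step}.py and ran tests")
        if keeper.should_compact(estimate_tokens(transcript)):
            transcript = keeper.compact(step, transcript)
    return transcript
```

Calling `run_task(200, RecordKeeper(token_budget=50))` keeps the working transcript under the budget for the whole run, while the full trail of checkpoints stays with the keeper rather than in the main agent's context.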
What happened
Xiaomi released and open-sourced MiMo Code V0.1, a terminal-native AI coding agent built as a fork of the open-source OpenCode project, under an MIT license. Launch coverage from The New Stack and VentureBeat dates the release to June 10-11, 2026; Xiaomi's official announcement page describes the release and its architecture. The agent installs via a single terminal command or npm on Windows, ships with Xiaomi's MiMo-V2.5 model built in (free for a limited time, and described by Xiaomi as comparable to Claude Sonnet 4.6), and supports third-party models including DeepSeek, Kimi, and GLM.
The benchmark numbers - read them carefully
Xiaomi's official announcement reports a controlled experiment in which MiMo Code and Claude Code ran the exact same MiMo model, isolating the harness itself: MiMo Code scored 73% vs Claude Code's 68% on Terminal-Bench 2, and 62% vs 57% on SWE-Bench Pro. Some launch coverage circulated larger figures (86.7% vs 65.4% on Terminal-Bench 2.0, and 57.2% on SWE-Bench Pro) drawn from different configurations; where accounts differ, the official controlled-experiment numbers are the conservative, like-for-like comparison. Separately, Xiaomi cites an internal beta with a human A/B evaluation across 576 developers working in real private repositories, where MiMo Code's win rate against Claude Code rose above 65% once tasks passed 200 execution steps. Every figure here is vendor-run; none has been independently replicated yet.
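To see how far apart the two sets of figures are, it helps to separate the gap in percentage points from the relative lift. The sketch below only does arithmetic on the scores quoted above; the function name and the dictionary labels are made up for the example.

```python
def harness_gap(mimo_score, claude_score):
    """Return (gap in percentage points, relative lift in percent), rounded to 1 decimal."""
    points = mimo_score - claude_score
    relative = 100 * points / claude_score
    return round(points, 1), round(relative, 1)


# Scores as reported in the text: (MiMo Code, Claude Code).
REPORTED = {
    "Terminal-Bench 2, same model (official)": (73.0, 68.0),
    "SWE-Bench Pro, same model (official)": (62.0, 57.0),
    "Terminal-Bench 2.0, circulated (different configurations)": (86.7, 65.4),
}

for label, (mimo, claude) in REPORTED.items():
    points, relative = harness_gap(mimo, claude)
    print(f"{label}: +{points} points, +{relative}% relative")
```

Both same-model comparisons come out at a 5-point gap, while the circulated Terminal-Bench figures imply a gap several times larger, which is why the configuration behind each number matters.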
Technical details
The memory system uses layered persistence - project memory, conversation checkpoints, and per-task progress logs - and VentureBeat's teardown documents an implementation built on SQLite FTS5 full-text search, including a scratch-notes layer. An independent subagent saves state automatically and produces a clean summary when the context window nears capacity, so the main agent continues instead of restarting. A /dream command runs every seven days, merging, deduplicating, and compressing accumulated memory into a compact current state. A Compose mode (triggered with the Tab key) runs a full design-plan-code-test-review loop from a single instruction.
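A layered memory store with full-text recall and a consolidation pass might look roughly like the following. This is a sketch under stated assumptions, not MiMo Code's schema: the table name, the layer labels, and the `remember`/`recall`/`dream` helpers are hypothetical, consolidation here only removes exact duplicates, and the seven-day scheduling is left out. It uses Python's standard `sqlite3` module and requires an SQLite build with FTS5 enabled.

```python
import sqlite3


def open_memory(path=":memory:"):
    conn = sqlite3.connect(path)
    # One FTS5 table; `layer` is stored but not indexed for search.
    # Assumed layer labels: 'project', 'checkpoint', 'progress', 'scratch'.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS memory "
        "USING fts5(layer UNINDEXED, body)"
    )
    return conn


def remember(conn, layer, body):
    conn.execute("INSERT INTO memory (layer, body) VALUES (?, ?)", (layer, body))


def recall(conn, query, limit=5):
    # Full-text match on `body`, best matches first.
    rows = conn.execute(
        "SELECT layer, body FROM memory WHERE memory MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


def dream(conn):
    """Hypothetical consolidation pass: drop notes that duplicate an earlier one."""
    seen, duplicates = set(), []
    rows = conn.execute(
        "SELECT rowid, layer, body FROM memory ORDER BY rowid"
    ).fetchall()
    for rowid, layer, body in rows:
        key = (layer, body.strip().lower())
        if key in seen:
            duplicates.append((rowid,))
        else:
            seen.add(key)
    conn.executemany("DELETE FROM memory WHERE rowid = ?", duplicates)
    return len(duplicates)
```

A real consolidation step would also merge near-duplicates and compress older notes, most likely with a model call; exact-match deduplication is just the simplest piece to show.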
Context and significance
Harness-versus-harness competition is displacing model-versus-model comparison in coding agents; GitHub's recent Copilot token-efficiency work points the same direction. An MIT-licensed, documented memory stack lowers the cost of experimenting with these ideas, and OpenCode's permissive base makes vertical forks straightforward. The obvious caveats: vendor benchmarks routinely flatter the vendor, Terminal-Bench and SWE-Bench Pro configurations materially affect scores, and long-horizon win rates from an internal beta are not a public benchmark.
What to watch
- Independent Terminal-Bench 2 runs pitting MiMo Code against Claude Code and Codex CLI under matched configurations
- Community adoption of the memory-subagent pattern in other OpenCode forks and agent frameworks
- Whether Xiaomi publishes the evaluation harness behind its 576-developer A/B study
Key Points
1. Xiaomi open-sourced MiMo Code V0.1 (MIT license), an OpenCode fork with subagent-maintained persistent memory targeting 200-plus-step coding workflows.
2. Xiaomi's same-model controlled runs show 73% vs 68% on Terminal-Bench 2 and 62% vs 57% on SWE-Bench Pro; larger circulated gaps reflect different configurations.
3. The transferable idea is outsourcing memory to a dedicated subagent rather than relying on the model to take notes - worth testing in any agent stack.
Scoring Rationale
Notable MIT-licensed coding-agent release whose official controlled experiment (same model in both harnesses) shows a consistent harness advantage: 73 vs 68 on Terminal-Bench 2 and 62 vs 57 on SWE-Bench Pro. Directly relevant to practitioners building agentic tooling, but all figures are Xiaomi-run, larger circulated gaps come from differing configurations, and independent replication is pending, so the score stays at 6.8.
Sources
Public references used for this report.
- 04 Xiaomi's new open source, agentic AI coding harness MiMo Code beats Claude Code at ultra-long, 200+ step tasks (venturebeat.com)
- 05 Xiaomi MiMo Code executes 200-step agentic developer workflows (developer-tech.com)
- 06 Xiaomi's latest AI coding tool claims to outperform Claude Code on ... (indianexpress.com)
- 07 Xiaomi open-sources MiMo Code AI coding agent, claims it outperforms Claude Code on complex 200-step software tasks (gadgetsnow.indiatimes.com)
- 08 China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (decrypt.co)