Mistral Just Merged Its Coding and Reasoning Models Into One 128B Open-Weight Release.

LDS Team · Let's Data Science · 9 min read
Mistral Medium 3.5 is a single 128-billion-parameter dense model under a modified MIT license. It hits 77.6% on SWE-Bench Verified, replaces Devstral 2 in the Vibe CLI, and retires Magistral from Le Chat. It runs self-hosted on four GPUs and ships with cloud coding agents and a new Work mode for multi-step tasks.

The post landed on Mistral's blog on Thursday afternoon Paris time, with the lede quietly buried two paragraphs in. Mistral Medium 3.5, the new flagship, is "our first flagship merged model." Translation: the dedicated reasoning model (Magistral) and the dedicated coding model (Devstral 2) are both being retired into a single dense 128B set of weights. One model, three jobs, configurable reasoning effort per request. Open weights. Modified MIT license.

For practitioners, that consolidation is the headline. For the open-source AI ecosystem, the license is the headline. For Mistral itself, the timing is the headline. The release shipped on April 30, 2026, fewer than thirty days after Mistral borrowed 830 million dollars to build a 13,800-GPU data center outside Paris. The new model is the first flagship trained on that incoming compute.

What Shipped

Mistral Medium 3.5 is dense, not sparse. 128 billion parameters, a 256K context window, a vision encoder trained from scratch to handle variable image sizes and aspect ratios, and a configurable reasoning-effort setting that lets the same weights turn around a quick chat reply or grind through a long agentic task. Mistral published the model card on Hugging Face under a modified MIT license, with API pricing at 1.50 dollars per million input tokens and 7.50 dollars per million output tokens. Self-hosting works on as few as four GPUs.
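
For teams sizing up a self-hosted deployment, the loading story is standard Hugging Face Transformers. A minimal sketch, where the repo id is a hypothetical stand-in (check the actual model card for the real one):

```python
# Minimal self-hosting sketch with Hugging Face Transformers.
# The repo id below is an assumption based on Mistral's usual naming,
# not a confirmed identifier from the launch post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Medium-3.5"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~256 GB of weights at 16-bit: four 80 GB cards
    device_map="auto",           # shard layers across all visible GPUs
)

prompt = "Write a Python function that merges two sorted lists."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

A production deployment would more likely sit behind vLLM or TGI with tensor parallelism across the four cards; the Transformers path above is the smoke test.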

The benchmark numbers Mistral chose to lead with say what the model is actually good at:

| Benchmark | Score | What It Measures |
|---|---|---|
| SWE-Bench Verified | 77.6% | Real-world software engineering bug fixes from open-source repos |
| τ³-Telecom | 91.4 | Multi-turn agentic tool calling in a telecom-domain test suite |
| MMLU-Pro / GPQA | Not disclosed in launch post | General knowledge and graduate-level science |

The 77.6% SWE-Bench Verified result is the operationally important one. It puts Mistral Medium 3.5 ahead of Mistral's own previous coding specialist, Devstral 2, and ahead of Qwen3.5 397B A17B (a much larger sparse mixture-of-experts model). It is also competitive with the proprietary leaders without being best-in-class. The frontier closed-weight models from Anthropic and OpenAI reported higher SWE-Bench Verified scores in their own releases earlier this year. The story Mistral is telling is not "we beat the frontier." It is "we got close enough at this size that you can run it yourself."

For context: SWE-Bench Verified is the de facto industry benchmark for real-world coding ability. Models are graded on whether their generated patches actually fix real bugs in real Python repos. A score in the high 70s on Verified is near the line where coding agents stop hallucinating fixes and start producing patches that pass test suites.

The Merger Is the Strategy

Mistral has been shipping specialized model lines for two years. Magistral was the reasoning specialist. Devstral 2 was the coding specialist. The base Mistral Medium series was the chat workhorse. Each had its own training run, its own tuning, its own integration story.

Medium 3.5 collapses that. The launch post says it directly: "Mistral Medium 3.5, a new flagship model that merges instruction-following, reasoning, and coding into a single 128B dense model." Magistral has been retired from Le Chat. Devstral 2 has been replaced as the default in the Vibe CLI. Le Chat's default model is now Medium 3.5 across the board.

That mirrors a pattern OpenAI and Anthropic followed earlier in 2026: consolidate model SKUs into fewer, more general models with configurable behavior at inference time. OpenAI shipped GPT-5.5 in mid-April with a single model replacing GPT-5 and o3-class reasoning. Anthropic collapsed Claude into a single Opus 4.7 SKU on April 16. Mistral has now done the same thing with the added wrinkle that the merged result is open-weight.

For ML engineers, the practical implication is concrete: one set of weights to evaluate, deploy, monitor, and fine-tune. No more routing logic to choose between a coder model and a reasoner model. Configurable reasoning effort means a single API call decides how hard the model thinks, rather than a model picker upstream.
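
In practice that looks like one request body with one extra knob. A sketch against Mistral's chat-completions endpoint, where the model id and the reasoning_effort field are assumptions (the launch post describes the capability, not the exact parameter name):

```python
# Sketch of a single API call that sets reasoning effort per request.
# The endpoint follows Mistral's chat-completions convention; the model id
# and the "reasoning_effort" key are assumptions for illustration.
import os
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-medium-3.5",  # assumed model id
        "reasoning_effort": "high",     # hypothetical field, e.g. "low" | "medium" | "high"
        "messages": [
            {"role": "user",
             "content": "Fix the failing test in this diff and explain the root cause."},
        ],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])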

The License Question

The "modified MIT license" framing is the single most contested detail in the launch.

A vanilla MIT license would allow commercial use, redistribution, and modification with essentially no restrictions. Mistral's modification (per the Hugging Face page) imposes additional terms around commercial use at scale and around derivative model release. The community spent the first 24 hours after release parsing what those modifications actually mean for downstream startups.

The Hacker News discussion thread that opened on release day filled with the same arguments that ran on Llama 3 and 4: a license that calls itself "open" but adds use-case restrictions is harder to justify in regulated industries that require clean upstream licensing. Defenders pointed out that Mistral's terms are still substantially more permissive than the Llama Community License, and that for self-hosted enterprise use the modifications do not bite.

The practical question for practitioners is whether a downstream startup can ship a fine-tune of Medium 3.5 as part of a commercial product. Reading the modified MIT carefully, the answer is yes for most cases, with attribution and notification requirements, plus restrictions on building competing hosted-API services. That is roughly the same shape as the deals Hugging Face has structured around other "almost-MIT" releases over the past year.

Vibe Goes Async

The model is half the announcement. The other half is Mistral Vibe's cloud-side overhaul.

Mistral Vibe, the company's coding agent product, used to live on the developer's laptop. Starting Thursday, Vibe sessions can run in the cloud, in parallel, while the developer steps away. A local CLI session can be "teleported" up to the cloud with full session history and approval state intact. The cloud agents plug into GitHub for code and pull requests, Linear and Jira for issues, Sentry for incidents, and Slack or Teams for reporting.

When the cloud agent is done, it opens a pull request on GitHub and sends a notification. The human reviews the result instead of every keystroke that produced it.

That product shape is the same one OpenAI shipped with the rebuilt Codex on April 17 and the same one Anthropic ships with Claude Code in cloud mode. Mistral's pitch is that it plugs into the same SDLC ecosystem (GitHub, Linear, Sentry) but uses an open-weight model under the hood that the customer can also self-host.

A new Work mode in Le Chat (Preview) extends the agent harness past coding into general productivity. Connectors are on by default. The agent reads and writes across email, calendar, docs, and tools, asks for explicit approval before sending or modifying data, and persists across multiple turns. It is Mistral's answer to ChatGPT Agent Mode and Anthropic's Claude Cowork.

What Practitioners Should Actually Look At

For ML engineers and data scientists evaluating Medium 3.5 for production use, three numbers matter most:

  • Self-hosting threshold: 4 GPUs. A 128B dense model in roughly half-precision fits on four H100s with capacity left for KV cache (worked through in the sketch after this list). That is well inside the budget of mid-sized enterprises that already have on-prem inference. The same model in 4-bit quantization will likely run on two H100s with reasonable throughput once community quantizations land.
  • API cost: 1.50 dollars per million input tokens and 7.50 dollars per million output tokens. That puts Medium 3.5 between Claude Sonnet (cheaper) and Claude Opus or GPT-5.5 (more expensive) in API pricing. The economic case for the open-weight version is strongest for customers running over roughly 50 million tokens per day, where the fixed-cost compute of self-hosting beats per-token API pricing.
  • SWE-Bench Verified: 77.6%. Below the proprietary frontier. Above the previous open-weight frontier. Roughly comparable to where Claude Sonnet 4.5 sat on the same benchmark in late 2025.
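
The first two numbers reduce to back-of-envelope arithmetic. The input/output token split below is an assumption for illustration; substitute your own rates:

```python
# Back-of-envelope math behind the self-hosting decision.
# The 80/20 input/output split is an assumption; prices come from the launch.
PARAMS = 128e9                 # dense parameter count from the model card

# --- Memory footprint ---
bf16_gb = PARAMS * 2 / 1e9     # 2 bytes/param at bf16  -> 256 GB
int4_gb = PARAMS * 0.5 / 1e9   # 0.5 bytes/param at 4-bit -> 64 GB
h100_gb = 80
print(f"bf16 weights: {bf16_gb:.0f} GB; four H100s leave "
      f"{4 * h100_gb - bf16_gb:.0f} GB for KV cache")
print(f"4-bit weights: {int4_gb:.0f} GB; fits on two H100s with headroom")

# --- API cost vs. fixed-cost self-hosting at 50M tokens/day ---
tokens_per_day = 50e6
input_share = 0.8              # assumed 80/20 input/output mix
api_cost = (tokens_per_day * input_share / 1e6 * 1.50
            + tokens_per_day * (1 - input_share) / 1e6 * 7.50)
print(f"API cost at 50M tokens/day: ${api_cost:,.0f}/day")   # -> $135/day

# Self-hosting wins when four GPUs cost less than that per day:
breakeven = api_cost / (4 * 24)
print(f"Break-even rate: ${breakeven:.2f}/GPU-hour for four cards")  # -> ~$1.41
```

At that volume the crossover is a GPU-rate question: amortized owned or reserved hardware usually lands under the break-even figure, while on-demand cloud rates often do not.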

For startups evaluating Medium 3.5 as a base model for fine-tuning, the bigger question is whether the modified MIT terms allow the specific commercial path the startup is on. Companies building hosted competitor services to Le Chat will hit license friction. Companies fine-tuning the model for an internal product or for a vertical customer will not.

The Other Side

Not every signal on Medium 3.5 is positive.

The early benchmark coverage from independent reviewers has been mixed. An 18-task evaluation in Towards AI scored Medium 3.5 ahead of Devstral 2 on coding but behind the retired Magistral on certain reasoning tasks where that model's specialization had been doing real work. That is the cost of merging: a single set of weights cannot beat dedicated specialists on every axis. Mistral's argument is that the operational simplicity and lower total cost outweigh the per-task ceiling. That tradeoff is real but customer-specific.

The second concern is the competitive field. Mistral is shipping a flagship the same week OpenAI shipped GPT-5.5-Cyber to thousands of cyber defenders, Anthropic is fighting an exclusion from the Pentagon's eight-company classified-network deal, and DeepSeek dropped V4, a frontier-class model at one-twentieth the cost of Claude Opus on the API. Medium 3.5 has to compete in that field, and the company's own framing of "near-frontier at a runnable size" is the most honest pitch available.

The third concern is one Mistral has not addressed: training data. The launch post says nothing about the dataset behind Medium 3.5, which is consistent with industry practice but unsatisfying for enterprise customers who need to evaluate the model for IP-clean output. The open weights are open. The training corpus is not.

The Bottom Line

Mistral Medium 3.5 is the most interesting open-weight release of the year so far, not because it crosses the proprietary frontier, but because it is the first time a Western lab has shipped a single dense 128B model that is genuinely good at coding, reasoning, and chat at the same time, and put the weights on Hugging Face under terms most enterprises can actually live with.

For ML practitioners, the calculus is straightforward. If you are running production AI workloads above 50 million tokens per day, Medium 3.5 changes the math on whether to keep paying frontier-lab API margins or to self-host. If you are running below that volume, the API at 1.50 in / 7.50 out keeps you on Mistral's infrastructure and on a model that is competitive on coding without being best-in-class.

The strategic message from Mistral is the more interesting one. The company spent two years differentiating on specialist models and just retired two of them in favor of one. That is a bet on consolidation, on configurable inference, and on the idea that "open-weight" is the only durable competitive moat a Western lab outside the top three can plausibly defend. The 4-GPU self-hosting threshold and the modified MIT license are how that bet gets paid out.

As the launch post quietly noted: "It is the model that made async cloud agents in Vibe practical to ship." The merger was not just a marketing simplification. It was the precondition for the product.
