
Zyphra Trained an 8B Reasoning Model Entirely on AMD Chips. It Beat Claude 4.5 Sonnet on Math.

LDS Team · Let's Data Science · 9 min read
ZAYA1-8B has 760 million active parameters, ships under Apache 2.0, and was pretrained on a cluster of 1,024 AMD Instinct MI300X GPUs with no Nvidia involvement. On HMMT 2025, it scored 89.6 against Claude 4.5 Sonnet's 88.3.

For the past three years, the unspoken rule of frontier AI training has been simple: if the cluster is not Nvidia, the model is not competitive. Every leading lab from OpenAI to Anthropic to Google DeepMind to Meta to Mistral has built its frontier runs on Hopper or Blackwell silicon. Even labs explicitly hedging on hardware have routed their biggest training jobs through Nvidia stacks.

On Wednesday, May 6, 2026, a startup called Zyphra published a model that broke the rule.

ZAYA1-8B is a Mixture-of-Experts language model with 8.4 billion total parameters and 760 million active. It was pretrained, midtrained, and supervised-fine-tuned end-to-end on 1,024 AMD Instinct MI300X GPUs connected via AMD Pensando Pollara networking, in a cluster Zyphra co-built with IBM. The model ships under the Apache 2.0 license on Hugging Face, with a serverless endpoint on Zyphra Cloud.

On the HMMT 2025 mathematics benchmark, ZAYA1-8B scored 89.6 against Claude 4.5 Sonnet's 88.3, with GPT-5-High reported at the same level, using a novel test-time compute scheme called Markovian RSA. On AIME 2025, the same scheme pushed it to 91.9. For comparison, DeepSeek-R1-0528, the highest-scoring open-weight reasoning model as of last quarter, sits at 87.5 on AIME 2025 with roughly 671 billion total parameters, nearly two orders of magnitude more than ZAYA1-8B.

For the practitioner audience, the headline is not the leaderboard score. It is what the leaderboard score implies about hardware lock-in.

What Zyphra Actually Shipped

Zyphra is a small team. The company was founded in 2021 by Krithik Puthalath, Beren Millidge, Tomas Figliolia, and Danny Martinelli. It closed a 100 million dollar Series A in June 2025 at a one-billion-dollar post-money valuation, led by Jaan Tallinn, an early backer of DeepMind and Anthropic.

Until this week, Zyphra was best known for the Zamba family of small hybrid Mamba-Transformer models, which had a research following but no enterprise traction. ZAYA1-8B is the company's first model targeted at the frontier reasoning leaderboard.

The release package on Hugging Face includes:

  • Base weights for ZAYA1-8B (8.4 billion total parameters, 760 million active across 64 experts)
  • Reasoning post-trained weights trained with a four-stage reinforcement learning cascade
  • Apache 2.0 license, including commercial use
  • A technical report on arXiv (2605.05365) covering pretraining, the routing architecture, and the RL cascade

The Architecture Choices Are the Story

Most reasoning model releases ship a benchmark table and a brief blog post. Zyphra published a 50-plus-page technical report on arXiv that argues, in detail, why ZAYA1 should be read as a research bet on three specific architectural ideas.

| Innovation | What it does | Why practitioners should care |
| --- | --- | --- |
| Compressed Convolutional Attention (CCA) | Performs sequence mixing in a compressed latent space, with an 8x KV-cache reduction versus standard multi-head attention | Long-context inference at a fraction of the memory footprint, the bottleneck most reasoning workloads hit first |
| MLP-based router with PID-style bias balancing (sketched below) | Replaces the standard linear router used in nearly every MoE with a multi-layer MLP plus control-theory-inspired load balancing | Stabler training at MoE scale and fewer dead experts, a failure mode that historically wastes 10 to 30 percent of an MoE's parameter budget |
| Learned residual scaling | Controls residual-stream norm growth through depth at near-zero parameter and FLOP cost | Cleaner gradient propagation in deep stacks, which the report ties directly to the model's reasoning gains |
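Of the three, the router is the most concrete to illustrate. The report does not ship reference code, but the PID-style bias idea reads close to published auxiliary-loss-free balancing schemes, in which a non-learned per-expert bias steers token assignment toward uniform load. The sketch below is a hypothetical PyTorch reconstruction under that reading; the class name, MLP shape, and gain constants are assumptions for illustration, not Zyphra's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPRouterWithPIDBias(nn.Module):
    """Hypothetical reconstruction: a small MLP scores experts, and a
    PID-style controller nudges non-learned per-expert biases so that
    average load tracks the uniform target. Not Zyphra's code."""

    def __init__(self, d_model, n_experts, top_k=8,
                 kp=1e-2, ki=1e-3, kd=1e-3):  # gains are assumed values
        super().__init__()
        self.n_experts, self.top_k = n_experts, top_k
        self.kp, self.ki, self.kd = kp, ki, kd
        self.mlp = nn.Sequential(                 # vs. the usual single nn.Linear
            nn.Linear(d_model, d_model // 2),
            nn.GELU(),
            nn.Linear(d_model // 2, n_experts),
        )
        # Controller state: updated out-of-band, never by backprop.
        self.register_buffer("bias", torch.zeros(n_experts))
        self.register_buffer("err_sum", torch.zeros(n_experts))
        self.register_buffer("err_prev", torch.zeros(n_experts))

    def forward(self, x):                         # x: (tokens, d_model)
        logits = self.mlp(x)
        # Bias shifts *which* experts win top-k, but the mixture weights
        # come from the raw logits, so the controller never skews gradients.
        idx = torch.topk(logits + self.bias, self.top_k, dim=-1).indices
        weights = F.softmax(logits.gather(-1, idx), dim=-1)

        if self.training:
            with torch.no_grad():
                ones = torch.ones(idx.numel(), device=x.device)
                load = torch.zeros(self.n_experts, device=x.device)
                load.scatter_add_(0, idx.reshape(-1), ones)
                err = load / idx.numel() - 1.0 / self.n_experts
                self.err_sum += err
                # Overloaded experts (err > 0) get their bias pushed down.
                self.bias -= (self.kp * err + self.ki * self.err_sum
                              + self.kd * (err - self.err_prev))
                self.err_prev.copy_(err)
        return weights, idx
```

The design point worth noticing is that the bias only affects expert selection and never enters the mixture weights, so load balancing cannot distort the gradient signal the way an auxiliary loss can.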

The combined claim from the report is that these three ideas, taken together, produce roughly two times the intelligence per active parameter of the next-best open-weight reasoning model. That claim will be tested in the next two months as labs reproduce the architecture on their own data.

Markovian RSA Is the Test-Time Story

The benchmark numbers that make the headline are not the base model. They are the base model paired with Markovian RSA, a test-time compute scheme Zyphra introduced in the same paper.

The intuition is straightforward. Most reasoning models, including DeepSeek-R1 and the Anthropic and OpenAI thinking models, generate one long chain of thought, run a verifier or self-consistency vote, and stop. If the chain gets too long, the context window fills up and the model loses the thread.

Markovian RSA splits the reasoning budget into fixed-duration chunks. Inside each chunk, the model runs Recursive Self-Aggregation across multiple parallel traces. Between chunks, it carries forward only a 4,000-token tail. The result is that performance scales with compute budget rather than hitting the ceiling that fixed-context reasoners hit.
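In pseudocode, the control loop looks roughly like this. Here `model.generate` and `model.aggregate` are hypothetical stand-ins for a sampler and a Recursive Self-Aggregation step, and the trace and chunk counts are assumed values; only the 4,000-token tail width comes from the paper.

```python
# Minimal sketch of the chunked scheme described above. Slicing is shown
# at the string level for brevity; the real scheme operates on tokens.

TAIL_TOKENS = 4_000   # carried state between chunks (from the paper)
N_TRACES = 8          # parallel traces per chunk (assumed)
N_CHUNKS = 6          # total compute budget in chunks (assumed)

def markovian_rsa(model, problem: str) -> str:
    tail = ""  # the only state that survives a chunk boundary
    for _ in range(N_CHUNKS):
        # Inside the chunk: several parallel traces, all conditioned on
        # the problem plus the carried tail...
        traces = [model.generate(problem + tail) for _ in range(N_TRACES)]
        # ...fused into one consolidated trace.
        merged = model.aggregate(problem, traces)
        # The Markov step: keep only a fixed-length tail, so context
        # never fills up no matter how many chunks the budget allows.
        tail = merged[-TAIL_TOKENS:]
    return tail  # the answer is read off the final tail
```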

Applied to ZAYA1-8B, the scheme delivers:

| Benchmark | Base ZAYA1-8B | With Markovian RSA | Reference: Claude 4.5 Sonnet | Reference: DeepSeek-V3.2 |
| --- | --- | --- | --- | --- |
| AIME 2025 | 89.1 | 91.9 | Comparable | Lower |
| HMMT 2025 | Lower base | 89.6 | 88.3 | Lower |
| APEX-shortlist (high compute) | Lower base | Surpasses DeepSeek-V3.2 | n/a | Reference point |

The numbers Zyphra published put a model with 760 million active parameters within striking distance of frontier closed-source models that have at least an order of magnitude more active parameters at inference time. If those numbers replicate, the cost-per-correct-answer math on reasoning workloads changes substantially.

What Made This Possible on AMD

The hardware story is the one that will move enterprise procurement.

ZAYA1-8B was trained on a custom cluster Zyphra built with IBM. The cluster runs 1,024 AMD Instinct MI300X GPUs wired together with AMD Pensando Pollara networking. The training stack includes Zyphra's own distributed-training kernels and a fork of Megatron tuned for ROCm, AMD's CUDA equivalent.

This matters for three reasons.

First, frontier reasoning training has historically required Nvidia interconnect plus CUDA-only kernels for the long tail of operations that PyTorch does not yet ship for ROCm. Zyphra's report describes FLOP utilization on the MI300X cluster competitive with what frontier labs report on equivalent Nvidia clusters.

Second, MI300X has 192 GB of HBM3 per accelerator, compared to 80 GB on H100. For an MoE workload like ZAYA1, that capacity advantage maps directly to fewer expert-parallel splits and lower cross-node communication, which is the exact bottleneck that has historically punished AMD on training benchmarks.

Third, MI300X clusters list at meaningfully lower per-FLOP cost than H100 clusters of equivalent size. Zyphra has not published its training compute budget, but ZAYA1-8B is the company's first frontier reasoning model and was trained while operating from a single 100 million dollar Series A round. By comparison, frontier reasoning runs at the largest closed-source labs are widely reported in the hundreds of millions to low billions of dollars.
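To make the memory point above concrete, here is a rough back-of-envelope under standard mixed-precision training assumptions. These are generic estimates, not figures from Zyphra's report.

```python
# Rough back-of-envelope, not figures from the report. Assumes standard
# mixed-precision training with Adam: bf16 weights and gradients, fp32
# master weights, and two fp32 optimizer moments, i.e. ~16 bytes/param.
# Activations and communication buffers are extra.

params = 8.4e9          # ZAYA1-8B total parameters
bytes_per_param = 16    # common mixed-precision + Adam estimate

state_gb = params * bytes_per_param / 1e9
print(f"weights + grads + optimizer: {state_gb:.0f} GB")  # ~134 GB
print(f"fits on one 192 GB MI300X?   {state_gb < 192}")   # True
print(f"fits on one 80 GB H100?      {state_gb < 80}")    # False
```

On that estimate, the full training state of an 8.4-billion-parameter model fits on a single MI300X but not on a single H100, which is roughly the difference between replicating experts and sharding them across nodes.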

The proof point that Zyphra's release delivers, more than the benchmark line, is that a small lab can train a competitive reasoning model on a non-Nvidia cluster on a Series A budget. Whether AMD can scale that pattern to thousands of customers is now the question.

The Other Side

Three credible critiques surfaced in the days after the release.

The benchmark methodology is the most pointed concern. Scores driven by novel test-time compute schemes are not directly comparable to standard pass@1 numbers reported for closed-source models like Claude 4.5 Sonnet and GPT-5-High. Markovian RSA spends substantially more inference compute per problem than a single chain-of-thought call. The 89.6 HMMT score should be read as "what is achievable at high compute" rather than "what the base model knows."

The reproducibility question has not yet been answered. As of publication, no third-party group has posted independent evaluations of ZAYA1-8B on a held-out math benchmark. Until labs reproduce the headline numbers on their own evaluation harnesses, the gains are Zyphra-reported numbers, not consensus numbers.

ML engineers writing on Hacker News raised a more pragmatic concern. The MI300X-trained checkpoint loads cleanly on Nvidia hardware, since the architecture is hardware-agnostic at inference time, but the training stack does not. Reproducing the training run on AMD hardware requires Zyphra's custom kernels, Pollara networking, and a cluster topology that is not generally available outside of IBM Cloud. The "you can train this on AMD" story, they argued, currently means "Zyphra and IBM can train this on AMD."

Zyphra has said it will release the training kernels and configuration files in a follow-up paper. Until that happens, the hardware-portability claim is provisional.

What This Means for Practitioners

For data scientists and ML engineers, the practical takeaways break into three buckets.

For inference workloads, ZAYA1-8B is the first Apache-licensed open-weight reasoning model in the sub-1 billion active parameter class that can plausibly substitute for paid Claude or GPT API calls on math, coding, and structured reasoning tasks. Inference fits comfortably on a single MI300X, a single H100, or even on consumer-grade hardware with 4-bit quantization. The cost per token at production scale should land at a small fraction of what Claude Opus or GPT-5-High charge.
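As a concrete starting point, a 4-bit load through the standard transformers and bitsandbytes path would look like the sketch below. The repo id is a guess from the company and model names, and a custom architecture like ZAYA1's will likely need `trust_remote_code=True`; verify both against the actual model card.

```python
# Hypothetical loading sketch; check Zyphra's model card for the real
# repo id and any architecture-specific requirements.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo = "Zyphra/ZAYA1-8B"  # assumed repo id
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, quantization_config=bnb, device_map="auto", trust_remote_code=True
)

prompt = "Prove that the sum of two odd integers is even."
out = model.generate(**tok(prompt, return_tensors="pt").to(model.device),
                     max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))
```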

For research workloads, the technical report's three architectural ideas are the more interesting part of the release. Compressed Convolutional Attention, in particular, is a candidate for the next generation of long-context reasoning models. Expect labs at Meta, Mistral, and Hugging Face to publish ablations on CCA inside a quarter.
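For anyone who wants to experiment before those ablations land, the generic compressed-KV pattern behind the 8x cache claim can be sketched in a few lines. To be clear about scope: this shows only the family of idea. ZAYA1's CCA adds convolutional sequence mixing that the sketch does not attempt to reproduce, and every dimension here is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedKVAttention(nn.Module):
    """Generic compressed-KV-cache attention. Illustrates only the
    memory pattern behind an ~8x cache reduction; ZAYA1's CCA also does
    convolutional sequence mixing, which is not reproduced here."""

    def __init__(self, d_model: int = 2048, n_heads: int = 16):
        super().__init__()
        # One latent vector replaces K and V (2 * d_model per token),
        # so a latent of d_model / 4 gives an 8x smaller cache.
        self.d_latent = d_model // 4
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, self.d_latent)  # only this is cached
        self.k_up = nn.Linear(self.d_latent, d_model)
        self.v_up = nn.Linear(self.d_latent, d_model)
        self.out = nn.Linear(d_model, d_model)
        self.n_heads = n_heads

    def forward(self, x, latent_cache=None):
        b, t, d = x.shape
        latent = self.kv_down(x)
        if latent_cache is not None:  # decode step: extend the latent cache
            latent = torch.cat([latent_cache, latent], dim=1)
        k, v = self.k_up(latent), self.v_up(latent)  # decompress on the fly

        def heads(z):  # (b, s, d) -> (b, n_heads, s, d_head)
            return z.view(b, z.shape[1], self.n_heads, -1).transpose(1, 2)

        # is_causal applies to square prefill attention; a single decoded
        # token attends to the whole cache.
        y = F.scaled_dot_product_attention(
            heads(self.q_proj(x)), heads(k), heads(v),
            is_causal=latent_cache is None)
        y = y.transpose(1, 2).reshape(b, t, d)
        return self.out(y), latent  # pass `latent` back in as the cache
```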

For procurement and capacity planning, the AMD training story is the one that will affect 2026 budgets. If a Series A startup can train a frontier-competitive reasoning model on 1,024 MI300X GPUs, the argument for paying the Nvidia premium on training-only clusters gets harder to make. AMD's MI355X and MI400, which slot in above MI300X on memory and bandwidth, become substantially more credible as training targets rather than just inference parts.

For more on how the cost calculus has been shifting on the closed-source side, see DeepSeek V4 matched frontier models on three benchmarks at one-twentieth the API cost of Claude Opus. For the broader picture on open-weight model momentum, see Mistral's 128B open-weight release that merged its coding and reasoning lines. For context on the AMD versus Nvidia training debate, see the OpenAI and Cerebras chip-order doubling that preceded the Cerebras IPO filing.

The Bottom Line

A small Series A startup just published a Mixture-of-Experts reasoning model that scores 89.6 on HMMT 2025, beats Claude 4.5 Sonnet on the same test, and was trained from first to last token on AMD silicon. The whole effort was funded out of a single 100 million dollar round. It ships under Apache 2.0.

If those facts hold up under peer reproduction, the consequences for the AI hardware market are larger than the consequences for the leaderboard. Nvidia's training monopoly has been the implicit assumption underwriting trillions of dollars in 2026 capex commitments. ZAYA1-8B is the first model in its weight class to challenge that assumption with a working artifact rather than a roadmap slide.

The technical report's last line is the one to remember. "Intelligence density," the authors write, "is now the design goal that matters most." On a 760-million-active-parameter model trained for a Series A budget, that line reads less like a slogan and more like a direct shot at the entire frontier-lab thesis: that the path to better models runs through bigger clusters, more Nvidia, and higher capital costs forever. ZAYA1-8B is an existence proof for the opposite bet.

Whether the bet pays off depends on what the next ten ZAYA-class releases look like. Wednesday was the first one.
