D-Matrix Ships Corsair Inference Accelerator, Claims 10x Token Speed

By LDS Team
Relevance Score: 7.2

d-Matrix announced on June 9, 2026 that its Corsair inference accelerator has entered full production and is shipping in volume to hyperscalers, neoclouds, and frontier AI labs. The company claims that rack-level Corsair-plus-GPU configurations generate tokens roughly 10x faster than GPU-only setups, at about 3x lower cost and up to 5x lower energy use, for certain latency-sensitive inference workloads. For ML infrastructure teams, Corsair's SRAM-first, memory-centric architecture offers a new point in the inference latency-cost tradeoff, shifting the bottleneck from DRAM bandwidth toward on-chip memory capacity and software locality. Independent testing by Gimlet Labs found response times cut from 24 seconds to under two seconds in one configuration that paired Corsair with GPUs for speculative decoding.

d-Matrix's Corsair launch is a concrete signal that memory-centric, SRAM-first architectures are moving from lab benchmarks to production hyperscaler racks, not just another accelerator announcement. For infrastructure teams evaluating 2026-2027 inference procurement, Corsair matters less for its headline 10x claim than for what it represents: a viable alternative memory hierarchy for the token-decode phase of LLM inference, backed by real production commitments and, per CNBC, by Microsoft as an investor.

What happened

d-Matrix said on June 9 that its Corsair inference accelerator has entered full production and begun shipping in volume to priority hyperscalers, neoclouds, and frontier AI labs, according to the company's announcement and a same-day PR Newswire release. Founder and CEO Sid Sheth said, "We built Corsair specifically for this moment, the Age of AI Inference." The company claims rack-level configurations pairing Corsair accelerators with GPUs generate tokens about 10x faster than GPU-only setups, at roughly 3x lower cost and up to 5x lower energy use for certain latency-sensitive inference workloads. Corsair is built on a TSMC 6-nanometer (N6) node in partnership with Alchip Technologies and is packaged as a server-pluggable, PCIe-based unit; d-Matrix's SquadRack reference design involves partners including Supermicro, Arista, and Broadcom.

Technical context

The core technical claim rests on d-Matrix's Digital In-Memory Computing architecture, which pairs large SRAM pools directly with compute logic rather than relying on HBM/DRAM and PCIe transfers for the decode phase of inference. That design targets the industry's memory-wall bottleneck for real-time, low-latency LLM serving. The tradeoff is lower on-chip capacity for very large models unless it is mitigated by chiplet scaling or model partitioning. Gimlet Labs, an independent partner, published its own testing showing a 2-10x reduction in end-to-end request latency for speculative decoding when pairing Corsair with GPUs, and reported a drop from a 24-second baseline response to under two seconds in one configuration. This is the closest available third-party validation of d-Matrix's performance claims to date.
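The memory-wall argument can be illustrated with a back-of-envelope estimate. At batch size 1, generating each token requires reading roughly every model weight once, so the decode rate is bounded above by memory bandwidth divided by model size in bytes. The sketch below is a simplified model under that assumption: it ignores KV-cache traffic, batching, and compute limits, and every number in it is a hypothetical placeholder, not a Corsair or GPU specification.

```python
def decode_tokens_per_second(param_count: float,
                             bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode rate when weight reads dominate.

    Assumes every weight is read once per generated token and nothing else
    (KV cache, activations, compute) limits throughput.
    """
    bytes_per_token = param_count * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical inputs chosen only to show the scaling, not vendor figures.
PARAMS = 70e9              # a 70B-parameter model
BYTES_PER_PARAM = 1.0      # 8-bit weights
SLOWER_TIER_BW = 2e12      # 2 TB/s memory tier (assumed)
FASTER_TIER_BW = 8e12      # 8 TB/s memory tier (assumed)

slow_rate = decode_tokens_per_second(PARAMS, BYTES_PER_PARAM, SLOWER_TIER_BW)
fast_rate = decode_tokens_per_second(PARAMS, BYTES_PER_PARAM, FASTER_TIER_BW)

print(f"slower tier bound: {slow_rate:.1f} tokens/s")
print(f"faster tier bound: {fast_rate:.1f} tokens/s")
print(f"ratio: {fast_rate / slow_rate:.1f}x")
```

Under this model the bound scales linearly with bandwidth, which is why moving weights closer to compute helps decode in particular. It also shows the capacity side of the tradeoff: the estimate only applies if the weights actually fit in the faster tier.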

Market context

CNBC reported that Microsoft is a d-Matrix investor through its M12 venture arm, and that the company has raised roughly $500 million to date at about a $2 billion valuation. Sheth told CNBC he sees a "$1 trillion market in the making" for inference compute and said he has no intention of selling the company. That backing, along with the shift toward agentic AI workloads that push inference volumes beyond what GPU-only infrastructure was built for, forms the demand backdrop for Corsair's production ramp.

For practitioners

Adopting a Corsair-like device means designing for heterogeneous rack-level orchestration: co-placing GPU-based prefill with Corsair-based decode, and running software that supports speculative decoding, sharding, or operator offload. d-Matrix's own materials describe this disaggregated prefill/decode pattern, which is also being explored elsewhere in inference research. Expect an added system-level testing burden and a need for custom schedulers or runtime integration to realize the advertised latency gains; this is not a drop-in replacement for existing GPU fleets.
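For readers unfamiliar with speculative decoding, the following is a minimal sketch of its greedy variant: a fast draft model proposes several tokens, the target model checks them, and the longest agreeing prefix is kept along with one token from the target. The `toy_draft` and `toy_target` functions are hypothetical stand-ins for real models, and the sketch makes no assumption about which device runs which role. A real system would verify all proposed positions in one batched pass rather than in a Python loop.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target model checks each proposed position in order.
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First disagreement: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target adds one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], draft_next: NextToken,
             target_next: NextToken, max_new_tokens: int,
             k: int = 4) -> List[Token]:
    """Greedy generation using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prompt) + max_new_tokens]


# Hypothetical toy models over the token set 0..9.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except after token 2, where it guesses wrong.
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


result = generate([0], toy_draft, toy_target, max_new_tokens=12)
print(result)
```

By construction, the greedy variant produces the same sequence as running the target model alone; the draft model only changes how many target-verified tokens are produced per round. That is why the achievable latency gain depends on how often the draft agrees with the target, which is workload-specific.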

What to watch

Independent, reproducible benchmarks beyond vendor materials and the Gimlet Labs test remain the key gap: watch for third-party latency and throughput results on representative stacks (for example Llama-70B inference at scale), details on Corsair's SRAM capacity and chiplet sizing, cloud-provider pricing and availability timelines, and whether major model-serving toolchains add native support for disaggregated prefill/decode pipelines.

Key Points

  • d-Matrix's Corsair inference accelerator entered full production on June 9, 2026, shipping to hyperscalers, neoclouds, and frontier AI labs. The company claims Corsair-plus-GPU racks generate tokens about 10x faster than GPU-only setups.
  • Microsoft backs d-Matrix through its M12 venture arm; the startup has raised about $500 million at a roughly $2 billion valuation, per CNBC.
  • Independent Gimlet Labs testing found a 2-10x reduction in end-to-end request latency when pairing Corsair with GPUs for speculative decoding, but broader third-party benchmarks are still needed to confirm vendor performance claims.

Scoring Rationale

A production-stage, memory-centric inference accelerator shipping to hyperscalers is a notable infrastructure development, strengthened by independent CNBC confirmation of Microsoft's M12 backing and by Gimlet Labs' third-party latency testing. Headline 10x/3x/5x performance claims remain vendor-sourced pending broader independent benchmarking, keeping this just below the 'major' tier.
