Inception Labs Launches Mercury 2 Diffusion LLM

Last week Inception Labs launched Mercury 2, a diffusion-based large language model that generates over 1,000 tokens per second and delivers five to ten times lower end-to-end latency than speed-optimized autoregressive models, CEO Stefano Ermon told The New Stack. Mercury 2 is available via an OpenAI-compatible API, with AWS Bedrock integration coming soon, targeting faster, cheaper inference for reasoning workloads.
Key Points
- 1Launches Mercury 2, a diffusion-based LLM producing over 1,000 tokens per second
- 2Highlights five-to-ten-times latency improvement versus optimized autoregressive models, leveraging parallel GPU computation
- 3Enables cheaper, faster inference for reasoning tasks; accessible via OpenAI-compatible API and Bedrock soon
Scoring Rationale
High novelty and usable release, scored high despite being a single-company claim with limited independent benchmarks.
Sources
Public references used for this report.
Practice interview problems based on real data
1,625 SQL & Python problems across 15 industry datasets — the exact type of data you work with.
Try 250 free problems

