MOREH Demonstrates LLM Inference on Tenstorrent Galaxy

According to a PR Newswire release distributed via Manila Times and Yahoo Finance, AI infrastructure vendor Moreh has validated LLM inference on the Tenstorrent Galaxy Wormhole system using its proprietary MoAI Inference Framework. The release reports that Moreh tested Mixture-of-Experts models including GPT-OSS, Qwen, GLM, and DeepSeek, achieving inference performance that matches or surpasses NVIDIA DGX A100-class systems. The company also reported improved cost efficiency from a disaggregated serving architecture that pairs GPUs with Tenstorrent Wormhole processors acting as prefill accelerators, reducing reliance on high-cost HBM. The results were first unveiled at Tenstorrent's TT-Deploy launch event in San Francisco on May 1, 2026. Moreh CEO Gangwon Jo is quoted in the release calling production-grade LLM inference on Tenstorrent-based systems a "significant milestone."
What happened
According to the release, distributed May 1-2, 2026, Moreh validated LLM inference performance on the Tenstorrent Galaxy Wormhole system using its proprietary MoAI Inference Framework. The release states that Moreh tested Mixture-of-Experts models including GPT-OSS, Qwen, GLM, and DeepSeek, and that inference performance on Tenstorrent Galaxy Wormhole matched or surpassed NVIDIA DGX A100-class systems. It also reports that Moreh presented a live demo at Tenstorrent's TT-Deploy event in San Francisco on May 1, 2026. The release quotes Moreh CEO Gangwon Jo directly: "Achieving production-grade LLM inference performance and stability on Tenstorrent-based systems marks a significant milestone," and "We will continue to enhance performance through deeper optimization across heterogeneous architectures and closer integration with Tenstorrent NPUs."
Technical details
The release describes Moreh's MoAI Inference Framework as a disaggregated inference solution that unifies heterogeneous GPUs and NPUs, naming NVIDIA, AMD, and Tenstorrent as supported architectures. It characterizes the Tenstorrent processors as dedicated prefill accelerators: by offloading prefill workloads, the company says, the system reduces reliance on high-bandwidth memory (HBM) and lowers overall infrastructure costs.
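The release does not detail how MoAI implements this split, but the general prefill/decode disaggregation pattern can be sketched. In the minimal Python sketch below, every name (PrefillPool, DecodePool, KVCacheHandle, serve) is hypothetical rather than MoAI's actual API; it shows only the routing idea: the compute-bound prefill pass runs on one hardware pool, the bandwidth-bound decode loop runs on another, and only a reference to the KV cache crosses between them.

    # Hypothetical sketch of disaggregated prefill/decode serving.
    # All names are illustrative; none of this is MoAI's actual API.
    from dataclasses import dataclass

    @dataclass
    class KVCacheHandle:
        """Opaque reference to a KV cache produced by the prefill stage."""
        request_id: str
        location: str  # e.g. "wormhole-0" or "gpu-3"

    class PrefillPool:
        """Stands in for prefill accelerators (e.g. Tenstorrent Wormhole)."""
        def prefill(self, request_id: str, prompt_tokens: list[int]) -> KVCacheHandle:
            # Run the compute-bound pass over the full prompt and
            # materialize the KV cache on this pool's memory tier.
            return KVCacheHandle(request_id=request_id, location="wormhole-0")

    class DecodePool:
        """Stands in for GPU decode workers, which are memory-bandwidth bound."""
        def decode(self, cache: KVCacheHandle, max_new_tokens: int) -> list[int]:
            # Receive (or transfer in) the KV cache, then generate tokens
            # one step at a time, re-reading the cache each step.
            return [0] * max_new_tokens  # placeholder generation

    def serve(request_id: str, prompt_tokens: list[int]) -> list[int]:
        # Only the KV-cache handle moves between the two hardware pools.
        cache = PrefillPool().prefill(request_id, prompt_tokens)
        return DecodePool().decode(cache, max_new_tokens=128)

In a real deployment, the interesting engineering is precisely what the sketch elides: moving or sharing the KV cache across devices fast enough that the split pays off.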
Industry context
Editorial analysis: Companies integrating NPUs with GPUs for inference are following a broader industry pattern of heterogeneous-disaggregated serving to reduce HBM-driven cost and memory bottlenecks. Observers note that using dedicated accelerators for prefill or attention-related stages can reduce peak memory pressure and allow larger working sets on lower-cost memory tiers.
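A back-of-envelope calculation shows why this split is plausible. Prefill processes the whole prompt in one pass, so it performs many FLOPs per byte of weights read (compute-bound); decode generates one token per step, so it re-reads the weights for little compute (bandwidth-bound, hence the pull toward HBM). The figures below are generic transformer estimates, not numbers from the release:

    # Back-of-envelope arithmetic intensity for prefill vs. decode.
    # Assumes ~2 FLOPs per parameter per token and fp16 weights
    # (2 bytes/param) dominating memory traffic; ignores KV-cache I/O.
    def arithmetic_intensity(params_b: float, tokens_per_step: int) -> float:
        flops = 2.0 * params_b * 1e9 * tokens_per_step
        bytes_moved = 2.0 * params_b * 1e9  # weights read once per step
        return flops / bytes_moved

    print(arithmetic_intensity(70, tokens_per_step=2048))  # prefill-like: 2048.0
    print(arithmetic_intensity(70, tokens_per_step=1))     # decode-like:  1.0

Under these assumptions the model size cancels out: intensity is simply tokens per step, which is why prefill can run well on accelerators without HBM while decode still benefits from high memory bandwidth.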
Editorial analysis: Comparisons to DGX A100-class performance that originate in vendor or partner press releases are useful signals, but production procurement decisions require independent benchmarks. Press releases typically highlight favorable test selections; neutral third-party replication remains the gold standard for procurement.
What to watch
Editorial analysis: Practitioners and infrastructure teams will watch for third-party benchmark data, workload reproducibility on Tenstorrent Galaxy Wormhole at scale, and integration details for orchestration and model-serving stacks. Observers should also monitor memory utilization profiles and end-to-end latency under realistic request patterns to validate the claimed cost-efficiency gains.
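As a concrete starting point for that validation, the sketch below measures the two latency components that map onto the prefill/decode split: time-to-first-token (TTFT, prefill-dominated) and inter-token latency (ITL, decode-dominated). It is a single-threaded sketch with Poisson-spaced arrivals; generate_stream and stub_stream are hypothetical stand-ins for a streaming client of whatever serving stack is under test, and a real harness would issue requests concurrently:

    # Minimal latency harness sketch: TTFT and inter-token latency.
    # `generate_stream` is a hypothetical streaming client, not a real API.
    import random
    import time

    def measure(generate_stream, prompts, arrival_rate_rps=2.0):
        results = []
        for prompt in prompts:
            time.sleep(random.expovariate(arrival_rate_rps))  # Poisson arrivals
            start = time.monotonic()
            first = last = None
            n = 0
            for _token in generate_stream(prompt):
                now = time.monotonic()
                if first is None:
                    first = now  # first token marks the end of prefill
                last = now
                n += 1
            if first is None:
                continue  # backend produced no tokens
            results.append({
                "ttft_s": first - start,                  # prefill-dominated
                "itl_s": (last - first) / max(n - 1, 1),  # decode-dominated
            })
        return results

    # Stub backend that yields 8 tokens, 10 ms apart, for a dry run.
    def stub_stream(prompt):
        for _ in range(8):
            time.sleep(0.01)
            yield "tok"

    print(measure(stub_stream, ["hello"] * 3))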
Scoring rationale
A notable infrastructure demonstration showing heterogeneous GPU-NPU serving with claimed parity to DGX A100-class systems. This is relevant to practitioners evaluating alternative inference hardware, but the evidence is from a vendor press release and needs independent benchmarks for higher significance.