DeepSeek releases open-source V4 large models

SiliconANGLE reports that Chinese AI developer DeepSeek released V4, an open-source large language model family, on April 24, 2026. The launch comprises two mixture-of-experts (MoE) models: the flagship V4-Pro, which reportedly contains 1.6 trillion parameters and activates 49 billion per inference, and V4-Flash, which reportedly contains 284 billion parameters and activates 13 billion at a time. The series introduces a hybrid attention mechanism and KV-cache compression that SiliconANGLE says cuts KV-cache memory use by 90% versus DeepSeek's prior-generation models, along with training optimizations such as an mHC data-routing feature and a software module called Muon that optimizes hidden-layer behavior.
What happened
Per SiliconANGLE, Chinese AI developer DeepSeek released an open-source large language model family named V4 on April 24, 2026. The V4 lineup includes two models at launch: V4-Pro, with 1.6 trillion parameters and 49 billion activated when answering prompts, and V4-Flash, with 284 billion parameters and 13 billion activated during inference. The family uses a mixture-of-experts (MoE) architecture and introduces a hybrid attention mechanism and KV-cache compression; SiliconANGLE reports the V4 KV-cache uses 90% less memory during inference than DeepSeek's previous-generation models. The report also says V4 includes training-focused features named mHC and a software module called Muon.
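Taking the reported figures at face value, the active fraction of parameters per query works out to roughly 3% for V4-Pro and under 5% for V4-Flash. A quick sketch of that arithmetic (parameter counts are the reported values, in billions):

```python
# Reported parameter counts in billions, per SiliconANGLE's coverage.
models = {
    "V4-Pro": {"total_b": 1600, "active_b": 49},
    "V4-Flash": {"total_b": 284, "active_b": 13},
}

for name, p in models.items():
    ratio = p["active_b"] / p["total_b"]
    print(f"{name}: {ratio:.1%} of parameters active per query")
# V4-Pro: 3.1% of parameters active per query
# V4-Flash: 4.6% of parameters active per query
```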
Technical details
Industry context: Mixture-of-experts architectures enable very large parameter counts while limiting active compute per token by routing each token to a small subset of experts. The reported activation pattern (very large global parameter counts paired with much smaller per-query active budgets) matches common MoE design trade-offs in recent frontier work. Hybrid attention plus KV-cache compression, as reported by SiliconANGLE, targets a common practitioner pain point: inference memory for long-context workloads.
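The routing idea can be sketched with a generic top-k softmax gate. This is purely illustrative and not DeepSeek's actual router (all dimensions and the gating scheme are assumptions); it shows why only a fraction of experts, and hence parameters, run for each token:

```python
import numpy as np

# Generic top-k MoE gating sketch (illustrative; not DeepSeek's router).
# Each token is scored against every expert, but only the top-k experts
# run, keeping active compute far below the total parameter count.
rng = np.random.default_rng(0)

n_experts, d_model, top_k = 8, 16, 2
W_gate = rng.standard_normal((d_model, n_experts))  # router weights (assumed)

def route(hidden):
    logits = hidden @ W_gate               # one score per expert
    top = np.argsort(logits)[-top_k:]      # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()               # softmax over the selected experts
    return top, weights                    # expert ids + mixing weights

token = rng.standard_normal(d_model)
experts, weights = route(token)
print(experts, weights)  # only 2 of 8 experts are active for this token
```

In a full MoE layer, the token would then be processed by just those two experts and their outputs combined with the mixing weights.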
Context and significance
Editorial analysis: Open-sourcing a family that combines MoE scaling, KV-cache compression, and training-route optimizations is notable for researchers and engineers tracking efficient scaling techniques. MoE releases and KV-compression experiments affect choices around inference cost, deployment footprint, and long-context application design. Observers building long-context agents or retrieval-augmented applications will find the claimed 90% KV-cache reduction particularly relevant for memory-constrained deployments.
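To see why a 90% KV-cache reduction matters at long context, a back-of-envelope sizing helps. All dimensions below are illustrative assumptions, not DeepSeek's actual configuration:

```python
# Back-of-envelope KV-cache sizing (all dimensions are illustrative
# assumptions, not DeepSeek's actual configuration).
n_layers, n_kv_heads, head_dim = 60, 8, 128
bytes_per_elem = 2          # fp16/bf16
context_len = 128_000       # tokens

# K and V each store (context_len, n_kv_heads, head_dim) per layer.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
print(f"baseline KV-cache: {kv_bytes / 1e9:.1f} GB")        # 31.5 GB
print(f"with a 90% reduction: {kv_bytes * 0.1 / 1e9:.1f} GB")  # 3.1 GB
```

At these assumed dimensions, a single long-context session drops from tens of gigabytes of cache to a few, which is the difference between fitting on one accelerator and not.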
What to watch
For practitioners: monitor independent benchmarks and reproduction efforts that validate the reported 1.6 trillion parameter scale, per-query active parameter counts, and the KV-cache memory reduction. Also watch for availability of model weights, licensing terms in the open-source release, published training recipes, and community evaluations of mHC and Muon components.
Scoring Rationale
Open-source release of an MoE model family at the reported trillion-parameter scale with claimed 90% KV-cache savings is notable for practitioners. It is not a paradigm-shifting frontier release but is likely to spur engineering work on efficient inference and long-context applications.