DeepSeek releases open-source V4 large models

SiliconANGLE reports that Chinese AI developer DeepSeek released V4, an open-source large language model family, on April 24, 2026. The launch comprises two mixture-of-experts (MoE) models: the flagship V4-Pro, which reportedly contains 1.6 trillion parameters and activates 49 billion per inference, and V4-Flash, which reportedly contains 284 billion parameters and activates 13 billion at a time. The series introduces a hybrid attention mechanism and KV-cache compression that SiliconANGLE says cuts KV-cache memory use by 90% versus DeepSeek's prior-generation models, along with training optimizations such as an mHC data-routing feature and a software module called Muon that optimizes hidden-layer behavior.
What happened
Per SiliconANGLE, Chinese AI developer DeepSeek released an open-source large language model family named V4 on April 24, 2026. The V4 lineup includes two models at launch: V4-Pro, with 1.6 trillion parameters and 49 billion activated when answering prompts, and V4-Flash, with 284 billion parameters and 13 billion activated during inference. The family uses a mixture-of-experts (MoE) architecture and introduces a hybrid attention mechanism and KV-cache compression; SiliconANGLE reports the V4 KV-cache uses 90% less memory during inference than DeepSeek's previous-generation models. The report also says V4 includes training-focused features named mHC and a software module called Muon.
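Taking the reported figures at face value, the active fraction of parameters per query works out to roughly 3% for V4-Pro and under 5% for V4-Flash. A quick sketch of that arithmetic (parameter counts are the reported values, in billions):

```python
# Reported parameter counts in billions, per SiliconANGLE's coverage.
models = {
    "V4-Pro": {"total_b": 1600, "active_b": 49},
    "V4-Flash": {"total_b": 284, "active_b": 13},
}

for name, p in models.items():
    ratio = p["active_b"] / p["total_b"]
    print(f"{name}: {ratio:.1%} of parameters active per query")
# V4-Pro: 3.1% of parameters active per query
# V4-Flash: 4.6% of parameters active per query
```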
Technical details
Industry context: Mixture-of-experts architectures enable very large parameter counts while limiting active compute per token by routing each token to a small subset of experts. The reported activation pattern (very large global parameter counts paired with much smaller per-query active budgets) matches common MoE design trade-offs in recent frontier work. Hybrid attention plus KV-cache compression, as reported by SiliconANGLE, targets a common practitioner pain point: inference memory for long-context workloads.
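The routing idea can be sketched with a generic top-k softmax gate. This is purely illustrative and not DeepSeek's actual router (all dimensions and the gating scheme are assumptions); it shows why only a fraction of experts, and hence parameters, run for each token:

```python
import numpy as np

# Generic top-k MoE gating sketch (illustrative; not DeepSeek's router).
# Each token is scored against every expert, but only the top-k experts
# run, keeping active compute far below the total parameter count.
rng = np.random.default_rng(0)

n_experts, d_model, top_k = 8, 16, 2
W_gate = rng.standard_normal((d_model, n_experts))  # router weights (assumed)

def route(hidden):
    logits = hidden @ W_gate               # one score per expert
    top = np.argsort(logits)[-top_k:]      # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()               # softmax over the selected experts
    return top, weights                    # expert ids + mixing weights

token = rng.standard_normal(d_model)
experts, weights = route(token)
print(experts, weights)  # only 2 of 8 experts are active for this token
```

In a full MoE layer, the token would then be processed by just those two experts and their outputs combined with the mixing weights.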
Context and significance
Editorial analysis: Open-sourcing a family that combines MoE scaling, KV-cache compression, and training-route optimizations is notable for researchers and engineers tracking efficient scaling techniques. MoE releases and KV-compression experiments affect choices around inference cost, deployment footprint, and long-context application design. Observers building long-context agents or retrieval-augmented applications will find the claimed 90% KV-cache reduction particularly relevant for memory-constrained deployments.
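To see why a 90% KV-cache reduction matters at long context, a back-of-envelope sizing helps. All dimensions below are illustrative assumptions, not DeepSeek's actual configuration:

```python
# Back-of-envelope KV-cache sizing (all dimensions are illustrative
# assumptions, not DeepSeek's actual configuration).
n_layers, n_kv_heads, head_dim = 60, 8, 128
bytes_per_elem = 2          # fp16/bf16
context_len = 128_000       # tokens

# K and V each store (context_len, n_kv_heads, head_dim) per layer.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
print(f"baseline KV-cache: {kv_bytes / 1e9:.1f} GB")        # 31.5 GB
print(f"with a 90% reduction: {kv_bytes * 0.1 / 1e9:.1f} GB")  # 3.1 GB
```

At these assumed dimensions, a single long-context session drops from tens of gigabytes of cache to a few, which is the difference between fitting on one accelerator and not.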
What to watch
For practitioners: monitor independent benchmarks and reproduction efforts that validate the reported 1.6 trillion parameter scale, per-query active parameter counts, and the KV-cache memory reduction. Also watch for availability of model weights, licensing terms in the open-source release, published training recipes, and community evaluations of mHC and Muon components.
Scoring Rationale
Open-source release of an MoE model family at the reported trillion-parameter scale with claimed 90% KV-cache savings is notable for practitioners. It is not a paradigm-shifting frontier release but is likely to spur engineering work on efficient inference and long-context applications.