DeepSeek V4 Delivers 1M-Token Context and 1T Parameters

China-based DeepSeek released DeepSeek V4, a claimed 1-trillion-parameter mixture-of-experts model with a native 1,000,000-token context window, multimodal inputs, and several architectural innovations. The release asserts highly efficient activation (around 32-49B activated parameters via MoE), three new techniques (Engram conditional memory, Manifold-Constrained Hyper-Connections, and the Muon optimizer), and native multimodal training on text, images, video, and audio. DeepSeek positions V4 as open-weight and dramatically cheaper than Western frontier models, with pre-release pricing estimates near $0.14 per 1M input tokens and runnable modes on consumer GPUs. These claims rest on DeepSeek's internal benchmarks and pre-release artifacts; independent verification is pending, so practitioners should treat the performance and cost numbers as directional until third-party evaluations appear.
What happened
DeepSeek released `DeepSeek-V4`, a claimed 1 trillion-parameter Mixture-of-Experts (MoE) model with a native 1,000,000-token context window, multimodal training, and three new architectural techniques. The announced family includes variants like `DeepSeek-V4-Pro` (reported 1.6T total, 49B activated) and `DeepSeek-V4-Flash` (reported 284B total, 13B activated). DeepSeek promotes open-weight licensing and very low API pricing compared with Western frontier models.
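The relationship between total and activated parameters in an MoE model follows from top-k routing: each token is sent to only k of E experts, so per-token compute scales with the routed fraction rather than the full parameter count. A minimal sketch of that arithmetic, using illustrative expert counts and a hypothetical shared-parameter fraction (none of these numbers are V4's published configuration):

```python
# Top-k MoE activation arithmetic: only k of E experts run per token, so the
# activated parameter count is roughly (k / E) of the expert parameters, plus
# whatever is shared (attention, embeddings). All numbers here are illustrative
# assumptions, not DeepSeek-V4's disclosed configuration.

def activated_fraction(total_experts, top_k, shared_params_frac=0.0):
    """Fraction of total parameters touched per token under top-k routing."""
    return shared_params_frac + (1 - shared_params_frac) * top_k / total_experts

# e.g. 256 experts, 8 routed per token, ~1% of parameters shared:
frac = activated_fraction(total_experts=256, top_k=8, shared_params_frac=0.01)
print(f"~{frac * 1e12 / 1e9:.0f}B activated of a 1T-parameter model")  # ~41B
```

Under these assumed settings the activated budget lands inside the 32-49B range the release reports, which is consistent with coarse top-k routing over a few hundred experts.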
Technical details
The V4 family centers on MoE to keep per-token compute tractable, with reported active-parameter budgets around 32-49B per token. Core architectural additions are Engram conditional memory to decouple static facts from reasoning, Manifold-Constrained Hyper-Connections (mHC) to stabilize deep signal propagation, and the Muon optimizer for faster, more stable convergence. The team also describes a hybrid attention design combining Compressed Sparse Attention and Heavily Compressed Attention to lower KV cache and inference FLOPs in long-context regimes.
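The KV-cache pressure that motivates the compressed-attention designs is easy to see with a back-of-envelope estimate. The layer count, head count, and head dimension below are illustrative assumptions (V4's actual configuration is not disclosed in the release), and the 8x compression factor is a hypothetical stand-in for whatever savings the hybrid attention scheme delivers:

```python
# Back-of-envelope KV-cache size for a dense-attention decoder at long context.
# Architecture numbers are illustrative assumptions, not DeepSeek-V4's config.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    # 2x for the separate key and value tensors; fp16/bf16 -> 2 bytes each.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

full = kv_cache_bytes(tokens=1_000_000, layers=60, kv_heads=8, head_dim=128)
print(f"uncompressed KV cache: {full / 1e9:.1f} GB")  # ~245.8 GB at these settings

# A hypothetical 8x KV compression, in the spirit of the compressed-attention
# schemes the release describes, brings that toward multi-GPU-workstation range.
print(f"with 8x compression:   {full / 8 / 1e9:.1f} GB")
```

The point of the sketch is scale, not precision: at a million tokens, an uncompressed cache runs to hundreds of gigabytes, so aggressive KV compression is a prerequisite for the consumer-GPU deployment modes the release claims.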
- Key claimed specs and capabilities: 1,000,000-token context, native multimodality (text/image/video/audio), MoE with ~1T total parameters, 32-49B activated parameters, and pretraining data exceeding 32T tokens.
- Performance claims: ~90% on HumanEval and 80-85% on SWE-bench Verified in DeepSeek's internal tests; DeepSeek asserts parity with or superiority to some closed-source frontier models at a fraction of the cost.
- Operational claims: projected API input pricing near $0.14 per 1M tokens, optimizations targeting Huawei Ascend and Cambricon chips, and runnable modes on consumer GPUs (dual RTX 4090 or single RTX 5090 reported).
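The projected input price translates into workload costs directly. A short sketch using the $0.14-per-1M-token figure from the pre-release estimates above; the workload shapes (a full-context prompt, a batch of agent calls) are hypothetical placeholders for illustration:

```python
# Input-token cost arithmetic at the projected (unverified) pre-release price.
INPUT_PRICE_PER_M = 0.14  # USD per 1M input tokens, per the release estimates

def input_cost(tokens, price_per_million=INPUT_PRICE_PER_M):
    """Cost in USD of feeding `tokens` input tokens at the projected price."""
    return tokens / 1_000_000 * price_per_million

# Feeding one full 1M-token context:
print(f"one full-context prompt: ${input_cost(1_000_000):.2f}")  # $0.14

# A hypothetical agent workload: 10,000 calls averaging 50k input tokens each.
print(f"10k calls @ 50k tokens: ${input_cost(10_000 * 50_000):.2f}")  # $70.00
```

If the pricing holds, even saturated million-token prompts cost cents per call, which is the economic basis for the long-context and agentic workloads discussed below.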
Context and significance
If independent evaluations confirm these claims, `DeepSeek-V4` would reshape the cost-performance trade-offs for long-context and agentic workloads. Large context windows at million-token scale unlock new workflows for whole-codebase reasoning, long-form documents, and multimodal episodic memory without external retrieval augmentation. The openness and low pricing that DeepSeek promises would accelerate adoption in research, industry, and developer communities, and it would increase pressure on Western providers to lower prices or match long-context capabilities.
However, the current evidence is dominated by DeepSeek-controlled benchmarks, pre-release artifacts on model hubs, and promotional documentation. History shows pre-release numbers can be optimistic, especially for cross-benchmark comparisons. The engineering claims around Engram, mHC, and hybrid attention are intriguing and technically plausible, but their real-world robustness, inference latency, and GPU/accelerator memory trade-offs require independent stress tests.
What to watch
Early independent benchmarks on reasoning, coding, and multimodal tasks; real-world latency and memory behavior at million-token contexts; licensing and weight availability; and which hardware stacks get first-class optimization. Also watch Western responses on pricing, long-context support, and architectural countermeasures.
Scoring Rationale
This is a notable model release with potentially large implications for long-context and agentic applications because of the million-token window and claimed low-cost operation. Impact is tempered by reliance on internal benchmarks and pre-release artifacts, so independent validation is required before labeling it industry-shaking.


