DeepSeek V4 cuts inference costs sharply and validates on Huawei's Ascend accelerators

DeepSeek has released V4, a new open-weights large language model available in two variants. The smaller, Flash, is a 284-billion-parameter mixture-of-experts (MoE) model with 13 billion parameters active at any given moment. The larger, Pro, is a 1.6-trillion-parameter model with 49 billion active parameters. Both are in preview, available for download on Hugging Face and via DeepSeek’s API and web service.
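To put the active-parameter figures in perspective, the short sketch below computes what share of each model is used per token. The parameter counts come from the figures above; the labels "V4-Flash" and "V4-Pro" and the function name are chosen here for illustration.

```python
# Parameter counts as reported in the article, in billions.
MODELS = {
    "V4-Flash": {"total_b": 284, "active_b": 13},
    "V4-Pro": {"total_b": 1600, "active_b": 49},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that are active for a given token."""
    return active_b / total_b


for name, p in MODELS.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

Under these figures, Flash activates roughly 4.6 percent of its weights per token and Pro roughly 3.1 percent, which is why per-token compute is far lower than the headline parameter counts suggest.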

The models introduce architectural changes aimed at reducing inference cost and memory requirements. According to The Register’s reporting, V4 also extends validated support to Huawei’s Ascend family of AI accelerators, a notable development given the constraints on Nvidia GPU availability in China.

What changed architecturally

The most significant technical change in V4 is to how the models handle attention. DeepSeek researchers describe a hybrid attention mechanism that combines Compressed Sparse Attention and Heavy Compressed Attention to reduce the compute required during inference and to shrink the key-value (KV) caches used to track model state. KV caches are a major constraint in large-scale inference deployments because they consume substantial memory and are often offloaded to system memory or flash storage.

Combined, these changes allow V4 to support a one million token context window while using between 9.5x and 13.7x less memory than DeepSeek V3.2, according to The Register.
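A rough sense of why this matters can be had from the textbook KV cache formula for plain attention. The sketch below is an illustration only: the layer count, head count, and head dimension are hypothetical, not DeepSeek's, and the reported 9.5x to 13.7x range is measured against V3.2 rather than against this made-up baseline.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Textbook KV cache size for plain attention.

    Each token stores one key and one value vector (hence the factor 2)
    per KV head in every layer.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def after_reduction(baseline: float, factor: float) -> float:
    """Size left after an 'N times less memory' reduction."""
    return baseline / factor


# Hypothetical model dimensions, NOT taken from any DeepSeek model.
baseline_gb = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128,
                             seq_len=1_000_000) / 1e9

print(f"baseline cache at 1M tokens: {baseline_gb:.2f} GB")
for factor in (9.5, 13.7):  # reduction range reported by The Register
    print(f"  {factor}x smaller: {after_reduction(baseline_gb, factor):.2f} GB")
```

Because the cache grows linearly with sequence length, a million-token context multiplies whatever the per-token cost is by a million, so a roughly tenfold reduction can decide whether the cache fits in accelerator memory at all.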

DeepSeek is also continuing its use of lower precision datatypes. V3 was among the first open-weights models trained at FP8. V4 uses a mixture of FP8 and FP4 precision, with quantization-aware training applied specifically to the MoE expert weights. FP4 roughly halves the memory required to store model weights compared to FP8, at the cost of reduced numerical precision.
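The memory arithmetic is simple enough to sketch. The example below computes raw weight storage at 8 and 4 bits per parameter using the parameter counts reported above. It ignores scaling factors and other metadata, and the 90 percent FP4 share in the mixed case is a hypothetical value, since the article does not say what fraction of the weights are MoE experts.

```python
BITS_PER_WEIGHT = {"FP8": 8, "FP4": 4}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring scales and metadata."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8


def mixed_gb(params_billion: float, fp4_share: float) -> float:
    """Storage when a fraction of the weights is FP4 and the rest is FP8."""
    fp4 = weight_gb(params_billion * fp4_share, "FP4")
    fp8 = weight_gb(params_billion * (1 - fp4_share), "FP8")
    return fp4 + fp8


for name, params_b in (("Flash", 284), ("Pro", 1600)):
    print(f"{name}: FP8 {weight_gb(params_b, 'FP8'):.0f} GB, "
          f"FP4 {weight_gb(params_b, 'FP4'):.0f} GB")

# Hypothetical split: 90% of Pro's weights stored as FP4, the rest as FP8.
print(f"Pro, 90% FP4 (assumed): {mixed_gb(1600, 0.9):.0f} GB")
```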

V4 also adopts Muon, an optimizer designed to speed up convergence and improve training stability. V4-Pro was trained on 33 trillion tokens.
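For readers unfamiliar with Muon, the sketch below follows the publicly available reference formulation: accumulate momentum, approximately orthogonalize it with a few Newton-Schulz iterations, then apply it as the update. It is a pure-Python illustration on tiny matrices. The learning rate and momentum defaults are illustrative, and nothing here is confirmed to match the variant or hyperparameters DeepSeek used.

```python
import math

# Quintic iteration coefficients from the public Muon reference implementation.
COEF_A, COEF_B, COEF_C = 3.4445, -4.7750, 2.0315


def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g by pushing its singular values toward 1."""
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    tall = len(x) > len(x[0])
    if tall:  # iterate on the wide orientation so the Gram matrix is smaller
        x = transpose(x)
    n, m = len(x), len(x[0])
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[COEF_B * gram[i][j] + COEF_C * gram2[i][j] for j in range(n)]
                for i in range(n)]
        px = matmul(poly, x)
        x = [[COEF_A * x[i][j] + px[i][j] for j in range(m)] for i in range(n)]
    return transpose(x) if tall else x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One update: accumulate momentum, orthogonalize it, then step."""
    momentum = [[beta * m + g for m, g in zip(mr, gr)]
                for mr, gr in zip(momentum, grad)]
    update = newton_schulz(momentum)
    weight = [[w - lr * u for w, u in zip(wr, ur)]
              for wr, ur in zip(weight, update)]
    return weight, momentum


w, m = muon_step([[1.0, 0.0], [0.0, 1.0]],   # weights
                 [[3.0, 0.0], [0.0, 1.0]],   # gradient
                 [[0.0, 0.0], [0.0, 0.0]])   # momentum buffer
print(w)
```

The orthogonalization step means both directions of the example gradient end up with update magnitudes of similar size even though one gradient entry is three times the other, which is the property usually cited for Muon's faster convergence.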

DeepSeek claims V4-Pro outperforms all open-weight LLMs while matching leading proprietary Western models across its benchmark suite. The Register notes these benchmark claims should be treated with caution: “just because it performs well in canned benchmarks doesn’t mean it’ll hold up in real world applications” and “benchmarks don’t tell the full story.” Novel architectural choices add uncertainty that only deployment at scale will resolve.

Huawei hardware validation

The Huawei angle is the least detailed part of the release. The V4 paper mentions that DeepSeek validated its expert parallelism scheme on both Nvidia GPUs and Huawei Ascend NPU platforms. This does not mean V4 was trained on Huawei hardware — The Register interprets it as validation for inference and possibly reinforcement learning post-training, not necessarily pre-training.

DeepSeek previously attempted to train models on Huawei’s chips but reportedly ran into problems with chip quality, interconnect speed, and an immature software stack, which drove the company back to Nvidia hardware for training.

The FP4 precision used in V4 has led some observers to assume DeepSeek obtained Nvidia Blackwell accelerators, which the US government has prohibited from sale in China. The Register pushes back on this: Hopper GPUs, which DeepSeek does have access to, cannot accelerate FP4 arithmetic in hardware, but they can still hold weights stored in FP4. Storing weights this way reduces memory footprint and bandwidth requirements during training and inference without requiring Blackwell hardware.
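The storage-only idea can be sketched in a few lines: weights are kept as packed 4-bit codes and expanded to a wider format before any arithmetic runs. The example assumes the common E2M1 layout for 4-bit floats and a single shared scale factor; the article does not say which FP4 encoding or scaling scheme DeepSeek uses, and the function names are hypothetical.

```python
# Magnitudes representable in the E2M1 layout (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Codes 0-7 are positive, codes 8-15 are the same magnitudes negated.
FP4_TABLE = E2M1_MAGNITUDES + [-v for v in E2M1_MAGNITUDES]


def quantize(weights, scale):
    """Map each weight to the nearest 4-bit code after dividing by a shared scale."""
    codes = []
    for w in weights:
        target = w / scale
        codes.append(min(range(16), key=lambda c: abs(FP4_TABLE[c] - target)))
    return codes


def pack(codes):
    """Store two 4-bit codes per byte, padding with a zero code if needed."""
    if len(codes) % 2:
        codes = codes + [0]
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def unpack_and_dequantize(packed, scale, count):
    """Expand packed codes to ordinary floats so math runs in a wider format."""
    out = []
    for byte in packed:
        out.append(FP4_TABLE[byte >> 4] * scale)
        out.append(FP4_TABLE[byte & 0x0F] * scale)
    return out[:count]


weights = [0.9, -2.1, 0.0, 5.5]
packed = pack(quantize(weights, scale=1.0))
print(len(packed), "bytes for", len(weights), "weights")
print(unpack_and_dequantize(packed, scale=1.0, count=len(weights)))
# prints [1.0, -2.0, 0.0, 6.0]
```

The compute itself happens on the expanded values, so hardware without native FP4 arithmetic still gets the memory and bandwidth savings, at the cost of the unpacking step.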

Pricing

DeepSeek is pricing API access to the Flash model at $0.14 per million input tokens and $0.28 per million output tokens. The Pro model is $1.74 per million input tokens and $3.48 per million output tokens. The Register notes for comparison that OpenAI charges $5 per million input tokens and $30 per million output tokens for GPT-5.5.
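To make the price gap concrete, the sketch below applies the list prices quoted above to a sample job. The workload of 10 million input tokens and 2 million output tokens is a hypothetical figure chosen for illustration, and the calculation ignores any caching discounts or volume tiers the providers may offer.

```python
# (input, output) prices in US dollars per million tokens, as quoted in the article.
PRICES = {
    "DeepSeek V4 Flash": (0.14, 0.28),
    "DeepSeek V4 Pro": (1.74, 3.48),
    "OpenAI GPT-5.5": (5.00, 30.00),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for a job with the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 10M input tokens, 2M output tokens.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 10_000_000, 2_000_000):.2f}")
```

For this sample workload the totals come to about $1.96 for Flash, $24.36 for Pro, and $110.00 for GPT-5.5. The gap widens on output-heavy jobs, since the quoted output price for GPT-5.5 is six times its input price while DeepSeek's is only double.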

Both base and instruction-tuned versions of both models are available for download.