DeepSeek V4 arrives: near-frontier performance at lower cost

Chinese AI lab DeepSeek has released the first models in its V4 series: DeepSeek-V4-Pro and DeepSeek-V4-Flash, both available under an MIT license on Hugging Face. The pair follow DeepSeek’s V3.2 and V3.2 Speciale. Both new models support 1-million-token context windows and use a Mixture of Experts architecture.

V4-Pro has 1.6 trillion total parameters with 49 billion active. V4-Flash has 284 billion total parameters with 13 billion active. According to Simon Willison’s write-up, V4-Pro is likely the largest open-weights model now available: larger than Kimi K2.6 (1.1 trillion parameters) and GLM-5.1 (754 billion), and more than twice the size of DeepSeek’s own V3.2 (685 billion). The raw file sizes on Hugging Face reflect this: Pro weighs in at 865GB, Flash at 160GB.
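To make those figures easier to compare, here is a minimal Python sketch that derives a few ratios from the numbers quoted above. The dictionary layout and function names are illustrative, and the bytes-per-parameter estimate assumes that 1 GB means 10^9 bytes, which the source does not specify.

```python
# Figures are the ones quoted in the article; names are illustrative.
MODELS = {
    "V4-Pro": {"total_params": 1.6e12, "active_params": 49e9, "file_gb": 865},
    "V4-Flash": {"total_params": 284e9, "active_params": 13e9, "file_gb": 160},
}
V3_2_TOTAL_PARAMS = 685e9


def active_fraction(model: dict) -> float:
    """Share of the model's parameters that are active per token."""
    return model["active_params"] / model["total_params"]


def bytes_per_param(model: dict) -> float:
    """Average stored bytes per parameter, ASSUMING 1 GB = 1e9 bytes."""
    return model["file_gb"] * 1e9 / model["total_params"]


if __name__ == "__main__":
    for name, model in MODELS.items():
        print(f"{name}: {active_fraction(model):.1%} active, "
              f"{bytes_per_param(model):.2f} bytes/param")
    ratio = MODELS["V4-Pro"]["total_params"] / V3_2_TOTAL_PARAMS
    print(f"V4-Pro is {ratio:.2f}x the size of V3.2")
```

On these numbers, about 3% of V4-Pro’s parameters and under 5% of V4-Flash’s are active for any given token, and both downloads work out to a little over half a byte per parameter on average.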

Pricing

Willison calls out the cost as significant. DeepSeek’s pricing for V4-Flash is $0.14 per million input tokens and $0.28 per million output tokens. For V4-Pro, it is $1.74 per million input and $3.48 per million output.

A comparison in the post places those numbers in context. V4-Flash at $0.14 input is cheaper than GPT-5.4 Nano ($0.20). V4-Pro at $1.74 input undercuts Claude Sonnet 4.6 ($3.00). As Willison writes: “DeepSeek-V4-Flash is the cheapest of the small models, beating even OpenAI’s GPT-5.4 Nano. DeepSeek-V4-Pro is the cheapest of the larger frontier models.”
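As a rough illustration of what those per-million-token prices mean for a single request, the sketch below computes the cost from input and output token counts. The dictionary keys are plain labels rather than confirmed API model identifiers, and the function name is hypothetical; only the prices come from the figures above.

```python
# USD per million tokens as (input, output), as quoted in the article.
# Keys are plain labels, NOT confirmed API model identifiers.
PRICES_PER_MILLION = {
    "V4-Flash": (0.14, 0.28),
    "V4-Pro": (1.74, 3.48),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted list prices."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: a full 1M-token prompt with a 100k-token reply.
    for model in PRICES_PER_MILLION:
        cost = request_cost(model, 1_000_000, 100_000)
        print(f"{model}: ${cost:.3f}")
```

For that hypothetical workload the arithmetic gives roughly $0.17 on V4-Flash and $2.09 on V4-Pro, before any caching discounts or provider markups, which the quoted prices do not cover.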

That pricing is possible in part because of DeepSeek’s efficiency work, particularly for long-context workloads. The paper reports that in a 1-million-token context scenario, V4-Pro requires only 27% of the single-token FLOPs and 10% of the KV cache size of V3.2. V4-Flash pushes further: 10% of the single-token FLOPs and 7% of the KV cache compared with V3.2.
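Those percentages can also be read as reduction factors. The short sketch below simply inverts the fractions reported in the paper; the names are illustrative and nothing here goes beyond the quoted figures.

```python
# Fractions of V3.2's cost at 1M-token context, as quoted from the paper.
RELATIVE_TO_V3_2 = {
    "V4-Pro": {"single_token_flops": 0.27, "kv_cache": 0.10},
    "V4-Flash": {"single_token_flops": 0.10, "kv_cache": 0.07},
}


def reduction_factor(fraction: float) -> float:
    """Convert 'uses X of the baseline' into 'baseline is N times larger'."""
    return 1.0 / fraction


if __name__ == "__main__":
    for model, metrics in RELATIVE_TO_V3_2.items():
        for metric, fraction in metrics.items():
            print(f"{model} {metric}: {reduction_factor(fraction):.1f}x smaller")
```

In other words, V4-Pro’s KV cache is about a tenth the size of V3.2’s at this context length, and V4-Flash’s is roughly a fourteenth.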

Where the model sits relative to frontier

DeepSeek’s self-reported benchmarks describe a model that is competitive with frontier offerings, though the lab itself acknowledges a gap. According to the paper, as cited by Willison, V4-Pro-Max “demonstrates superior performance relative to GPT-5.2 and Gemini-3.0-Pro on standard reasoning benchmarks” but “its performance falls marginally short of GPT-5.4 and Gemini-3.1-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months.”

Running locally

Willison tested both models via OpenRouter using his llm-openrouter plugin. He notes that a quantized version of Flash might fit on his 128GB M5 MacBook Pro, and that Pro could potentially run if active experts are streamed from disk, though he describes this as speculative.

He says he is watching Unsloth’s Hugging Face page for quantized versions, which the Unsloth team typically produces quickly for large open-weights releases. The 160GB Flash file is the more plausible starting point for local use; the 865GB Pro would require either substantial RAM or creative streaming approaches.
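A back-of-the-envelope way to see why quantization matters here is to ask how far each download would have to shrink to fit in memory. The sketch below assumes a hypothetical 75% of RAM is usable for weights, with the rest left for the operating system and context; that fraction and the function name are assumptions, not figures from Willison’s post.

```python
def max_size_fraction(file_gb: float, ram_gb: float, usable_fraction: float) -> float:
    """Largest fraction of the released file size that still fits in usable RAM."""
    return (ram_gb * usable_fraction) / file_gb


if __name__ == "__main__":
    RAM_GB = 128            # the machine mentioned in the article
    USABLE_FRACTION = 0.75  # ASSUMPTION: headroom for OS and context
    for name, file_gb in {"V4-Flash": 160, "V4-Pro": 865}.items():
        fraction = max_size_fraction(file_gb, RAM_GB, USABLE_FRACTION)
        print(f"{name}: must shrink to {fraction:.0%} of its released size")
```

Under that assumption, Flash would need to come down to about 60% of its released size, while Pro would need to reach roughly 11%, which is why streaming experts from disk comes up as the only plausible route for the larger model.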

The MIT license means both models can be used without the usage restrictions that apply to some other open-weights releases, which matters for commercial deployment and fine-tuning.