DeepSeek has released its fourth-generation flagship model family, and the Hugging Face writeup frames the motivation plainly: running a frontier open model as an agent breaks in predictable ways. The model stalls mid-task, context budgets overflow, the KV cache fills GPU memory, and tool-call quality degrades over long trajectories. V4 is designed to fix those specific failures, not to set general benchmark records.

Two model variants ship: DeepSeek-V4-Pro at 1.6T total parameters with 49B active, and DeepSeek-V4-Flash at 284B total with 13B active. Both support a 1M-token context window. The architectural work targets the cost of actually using that window rather than just advertising it.

The KV cache problem and what V4 does about it

For an agent running a long tool-use trajectory — a SWE-bench task, a multi-step browsing session, or a terminal session with hundreds of commands — every tool result appends to the context and every subsequent token pays the full attention cost against everything before it. Two numbers determine feasibility: per-token inference FLOPs and KV cache size, both of which grow with sequence length.

At 1M tokens, DeepSeek-V4-Pro requires 27% of DeepSeek-V3.2's single-token inference FLOPs and uses 10% of its KV cache memory. V4-Flash drops further, to 10% of the FLOPs and 7% of the KV cache. Against a reference architecture using grouped query attention with 8 heads stored in bfloat16, the post reports that DeepSeek-V4 requires roughly 2% of the cache size.
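To put the cache comparison in perspective, here is a back-of-envelope sizing of the grouped-query-attention reference in Python. The 61-layer depth mirrors V4-Pro as described below; the 128-dimension heads are an assumption made purely for illustration, so the absolute numbers are indicative rather than official.

```python
# Back-of-envelope KV cache sizing for the GQA reference described in the post
# (8 KV heads stored in bfloat16). Layer count mirrors V4-Pro's 61 layers; the
# 128-dim heads are an assumption for illustration, not a published figure.

def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    # Keys and values are both stored, per layer, per KV head, per token.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

SEQ_LEN, LAYERS, HEAD_DIM = 1_000_000, 61, 128

gqa = kv_cache_bytes(SEQ_LEN, LAYERS, n_kv_heads=8, head_dim=HEAD_DIM, bytes_per_elem=2)
print(f"GQA reference cache at 1M tokens: {gqa / 2**30:.0f} GiB")            # ~233 GiB
print(f"roughly 2% of that (the reported V4 ratio): {0.02 * gqa / 2**30:.0f} GiB")
```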

The efficiency comes from a hybrid attention design that interleaves two mechanisms across layers. Compressed Sparse Attention (CSA) compresses KV entries by 4x along the sequence dimension using softmax-gated pooling with a learned positional bias, then a lightweight indexer running in FP4 selects the top-k compressed blocks per query. Heavily Compressed Attention (HCA) applies 128x compression and replaces sparse selection with dense attention over the compressed stream — the compressed sequence is short enough that dense attention stays cheap. In V4-Pro’s 61-layer stack, layers 0–1 run HCA, layers 2–60 alternate between CSA and HCA, and the final MTP block uses sliding-window only. Both paths use FP8 storage for most KV entries, with BF16 reserved for RoPE dimensions.
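The CSA mechanics are easier to see in code. The sketch below shows the two steps in deliberately simplified form, 4x softmax-gated pooling with a learned within-block positional bias, followed by a per-query top-k over the compressed entries; the gate parameterization, the indexer, and all shapes are guesses for illustration rather than DeepSeek's implementation (which, per the post, runs the indexer in FP4).

```python
# Minimal sketch of the CSA path: 4x sequence compression via softmax-gated
# pooling with a learned positional bias, then per-query top-k selection over
# the compressed blocks. Simplified guesses, not DeepSeek's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedPoolCompress(nn.Module):
    def __init__(self, head_dim: int, block: int = 4):
        super().__init__()
        self.block = block
        self.gate = nn.Linear(head_dim, 1)                 # scores each token in a block
        self.pos_bias = nn.Parameter(torch.zeros(block))   # learned within-block bias

    def forward(self, kv: torch.Tensor) -> torch.Tensor:
        # kv: [batch, heads, seq, head_dim]; seq assumed divisible by block here.
        b, h, t, d = kv.shape
        blocks = kv.view(b, h, t // self.block, self.block, d)
        scores = self.gate(blocks).squeeze(-1) + self.pos_bias   # [b, h, nblk, block]
        weights = F.softmax(scores, dim=-1).unsqueeze(-1)        # softmax gate per block
        return (weights * blocks).sum(dim=-2)                    # [b, h, nblk, d]


def topk_blocks(query: torch.Tensor, compressed_k: torch.Tensor, k: int) -> torch.Tensor:
    # Lightweight indexer: score each compressed block against the query and
    # keep the top-k indices. (The post's indexer runs in FP4; plain FP32 here.)
    scores = torch.einsum("bhd,bhnd->bhn", query, compressed_k)
    return scores.topk(k, dim=-1).indices                        # [b, h, k]


compress = GatedPoolCompress(head_dim=64)
k_cache = torch.randn(1, 8, 4096, 64)
compressed = compress(k_cache)                                   # 4096 -> 1024 entries
selected = topk_blocks(torch.randn(1, 8, 64), compressed, k=128)
```

The HCA layers follow the same pooling idea at 128x, at which point the compressed stream is short enough to attend over densely instead of selecting blocks.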

Agent-specific post-training decisions

Efficient long-context attention is necessary for agent workflows but not sufficient. The post describes three additional choices that directly target agentic use cases.

The first is reasoning preservation across tool-call turns. In V3.2, reasoning traces accumulated across tool-result rounds but were discarded whenever a new user message arrived. For multi-turn agentic workflows where the user sends a follow-up after the agent has already chained several tool calls, this meant throwing away the accumulated reasoning and forcing the model to reconstruct it. V4 preserves reasoning content across user message boundaries when the conversation contains tool calls, maintaining a coherent chain of thought across the full task. For conversational use without tools, the prior behavior is preserved: reasoning flushes at each turn to keep context concise.
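Expressed as a rule over a hypothetical message list (the post does not show the actual chat template), the retention logic amounts to a single check:

```python
# Sketch of the retention rule described above, over a hypothetical message
# format; the actual chat template is not shown in the post.
def keep_reasoning_on_new_user_message(messages: list[dict]) -> bool:
    """True if prior reasoning should be preserved when a new user message
    arrives: yes for agentic conversations (any tool calls so far), no for
    plain conversational use, where reasoning flushes at each turn."""
    return any(m.get("role") == "tool" or m.get("tool_calls") for m in messages)


history = [
    {"role": "user", "content": "Fix the failing test in repo X."},
    {"role": "assistant", "content": "...", "tool_calls": [{"name": "run_tests"}]},
    {"role": "tool", "content": "2 failures in test_parser.py"},
]
assert keep_reasoning_on_new_user_message(history)         # agentic: keep reasoning
assert not keep_reasoning_on_new_user_message(
    [{"role": "user", "content": "hi"}])                   # plain chat: flush per turn
```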

The second is a revised tool-call schema. V4 introduces a dedicated special token and an XML-based format that separates string parameters from structured parameters. The post identifies JSON-in-string tool calls as a common failure mode when models emit nested quoted content, and notes that the new schema removes a class of parsing errors around numbers and booleans that JSON formats routinely produce.
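The post does not reproduce the schema itself, so the snippet below is only an illustration of the distinction it draws: the JSON-in-string form needs nested escaping and tends to surface numbers and booleans as strings, while a hypothetical XML-style layout keeps string parameters as raw bodies and declares structured parameters with their types.

```python
import json

# Failure mode with JSON-in-string tool calls: a code argument containing
# quotes must be escaped inside an outer JSON string, and scalar types are
# easy to get wrong (numbers and booleans emitted as quoted strings).
arguments = ('{"path": "src/app.py", "patch": "print(\\"hello\\")", '
             '"max_retries": "3", "dry_run": "false"}')
parsed = json.loads(arguments)
print(parsed["patch"])                        # print("hello")
print(type(parsed["max_retries"]).__name__)   # str, not int
print(type(parsed["dry_run"]).__name__)       # str, not bool

# Hypothetical XML-style layout in the spirit of the described schema: string
# parameters carry raw text (no escaping of inner quotes), structured
# parameters are declared with types. Not DeepSeek-V4's actual format.
tool_call = """<tool_call name="apply_patch">
  <string name="path">src/app.py</string>
  <string name="patch">print("hello")</string>
  <param name="max_retries" type="int">3</param>
  <param name="dry_run" type="bool">false</param>
</tool_call>"""
```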

The third is the training infrastructure itself. Agent behavior was trained with reinforcement learning against real tool environments using DeepSeek Elastic Compute (DSec), a Rust platform that exposes function calls, containers, microVMs (Firecracker), and full VMs (QEMU) behind a single Python SDK. The post identifies three DSec features as critical for agent training: fast image loading via layered storage so RL rollouts do not wait on container startup, preemption-safe trajectory replay so interrupted training steps resume without re-running tool calls, and a uniform API across substrates.
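The post names the DSec features but does not show its SDK, so the sketch below only illustrates the idea behind preemption-safe trajectory replay: persist each tool result keyed by trajectory and step, so a preempted rollout can resume and replay completed steps instead of re-executing them. All names here are hypothetical.

```python
# Illustration of preemption-safe trajectory replay. Names and storage are
# hypothetical; the post does not show DSec's actual API. Tool results are
# cached by (trajectory_id, step), so a rollout preempted mid-episode can be
# resumed and replay completed steps from the cache instead of re-running tools.
from typing import Callable


class ReplayableTrajectory:
    def __init__(self, trajectory_id: str, store: dict):
        self.trajectory_id = trajectory_id
        self.store = store            # a persistent key-value store in a real system
        self.step = 0

    def call_tool(self, tool: Callable[..., str], *args, **kwargs) -> str:
        key = (self.trajectory_id, self.step)
        if key in self.store:         # replay path after preemption
            result = self.store[key]
        else:                         # first execution: run the tool and record it
            result = tool(*args, **kwargs)
            self.store[key] = result
        self.step += 1
        return result


store: dict = {}
first = ReplayableTrajectory("rollout-42", store)
first.call_tool(lambda cmd: f"ran: {cmd}", "pytest -x")            # executes, records

resumed = ReplayableTrajectory("rollout-42", store)                # after preemption
print(resumed.call_tool(lambda cmd: f"ran: {cmd}", "pytest -x"))   # served from cache
```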

Benchmark results and availability

On agent-specific benchmarks, V4-Pro-Max scores 80.6 on SWE-bench Verified and 73.6 on MCPAtlas Public. On TerminalBench 2.0 it scores 67.9, and on Toolathlon 51.8. Long-context retrieval holds above 0.82 MRCR 8-needle accuracy through 256K tokens and falls to 0.59 at 1M tokens. On an internal R&D coding benchmark covering PyTorch, CUDA, Rust, and C++ tasks, V4-Pro-Max reaches a 67% pass rate. In a survey of 85 DeepSeek developers using V4-Pro as their daily driver, 52% said it was ready to replace their current primary coding model outright, and a further 39% leaned toward yes.

Four checkpoints are on the Hugging Face Hub. Instruct models use FP4 for MoE expert weights and FP8 elsewhere. Base models are FP8 throughout. The architecture is a useful data point for any team running long-horizon agent tasks at scale: the design explicitly prioritizes sustained inference efficiency over peak benchmark scores. Each of the pieces that makes that possible, from compressed attention and reasoning preserved across tool-call turns to reliable tool-call parsing and a sandboxed RL training environment, is addressed as a distinct engineering problem rather than as a single architectural choice.