NVIDIA’s developer blog covers the Blackwell platform’s day-zero support for DeepSeek-V4, including initial throughput numbers and the full range of deployment options, from managed endpoints to self-hosted serving frameworks. The post positions DeepSeek-V4 as the kind of workload Blackwell was designed for: long-context, high-parameter inference where KV cache memory and per-token FLOP cost are the primary constraints.
DeepSeek-V4-Pro is the larger model in the family at 1.6T total parameters with 49B active parameters. DeepSeek-V4-Flash is the smaller at 284B total parameters with 13B active, targeting workloads where speed and cost efficiency matter most. Both support a 1M-token context window, which the NVIDIA post describes as opening new possibilities for long-context coding, document analysis, retrieval, and agentic AI workflows.
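Because only a small fraction of a MoE model’s weights fire on each token, the active-parameter count, not the total, is what drives per-token compute. A rough illustration using the standard ~2 FLOPs per active parameter per token rule of thumb (these are back-of-envelope estimates, not numbers from the NVIDIA post):

```python
# Rough per-token compute for a MoE model: ~2 FLOPs per active parameter
# per generated token (standard dense-forward rule of thumb; ignores
# attention-specific costs, which V4's hybrid attention targets separately).
MODELS = {
    "DeepSeek-V4-Pro":   {"total": 1.6e12, "active": 49e9},
    "DeepSeek-V4-Flash": {"total": 284e9,  "active": 13e9},
}

for name, p in MODELS.items():
    flops_per_token = 2 * p["active"]
    print(f"{name}: {p['active'] / p['total']:.1%} of weights active, "
          f"~{flops_per_token / 1e9:.0f} GFLOPs/token")
```

By this estimate, Flash needs roughly a quarter of Pro’s per-token compute despite being under a fifth of its total size.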
Architecture and why it matters for inference economics
The V4 family builds on the DeepSeek MoE architecture with an increased focus on the attention component. According to the post, the architectural innovations achieve a 73% reduction in per-token inference FLOPs and a 90% reduction in KV cache memory burden compared with DeepSeek-V3.2.
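To put the 90% KV cache figure in concrete terms, a sizing sketch at the full 1M-token window helps. The layer count, KV head count, and head dimension below are assumed placeholders, not published DeepSeek-V4 hyperparameters; only the context length and the 90% reduction come from the post:

```python
# Back-of-envelope KV cache size for one 1M-token sequence.
# All model dims are assumed placeholders, NOT published DeepSeek-V4 values.
context_len    = 1_000_000  # from the post: 1M-token context window
num_layers     = 60         # assumption
num_kv_heads   = 8          # assumption (GQA-style)
head_dim       = 128        # assumption
bytes_per_elem = 2          # FP16/BF16 cache

# 2x for keys and values
kv_bytes = 2 * context_len * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"baseline KV cache: {kv_bytes / 2**30:.0f} GiB per sequence")
print(f"after 90% reduction: {kv_bytes * 0.1 / 2**30:.0f} GiB per sequence")
```

Under these placeholder dimensions, a single full-context sequence drops from roughly 229 GiB of cache to about 23 GiB, the difference between spilling across many GPUs and fitting on one.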
The mechanism is hybrid attention, combining two approaches. Compressed Sparse Attention (CSA) uses dynamic sequence compression to reduce the KV cache memory footprint, then applies DeepSeek Sparse Attention to sparsify the attention matrices and cut computational overhead. Heavily Compressed Attention (HCA) compresses more aggressively, consolidating the KV entries for each set of tokens into a single compressed entry for a significant additional KV cache reduction.
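The post does not publish the compression operator itself, but the HCA idea of consolidating a block of KV entries into one entry can be illustrated with a toy mean-pooling sketch. Everything below is illustrative only; a real implementation would use a learned compressor and fused kernels:

```python
import torch

def compress_kv_blocks(k: torch.Tensor, v: torch.Tensor, block: int = 16):
    """Toy stand-in for HCA-style consolidation: collapse each run of
    `block` token positions into a single KV entry via mean pooling.
    Shapes: [seq, heads, head_dim] -> [ceil(seq / block), heads, head_dim].
    """
    pad = (-k.shape[0]) % block
    if pad:  # zero-pad so the sequence divides evenly into blocks
        k = torch.cat([k, k.new_zeros(pad, *k.shape[1:])])
        v = torch.cat([v, v.new_zeros(pad, *v.shape[1:])])
    k_c = k.view(-1, block, *k.shape[1:]).mean(dim=1)
    v_c = v.view(-1, block, *v.shape[1:]).mean(dim=1)
    return k_c, v_c

k = torch.randn(1000, 8, 128)  # 1000 cached tokens, 8 KV heads, dim 128
v = torch.randn(1000, 8, 128)
ck, cv = compress_kv_blocks(k, v)
print(k.shape, "->", ck.shape)  # 1000 entries -> 63 (about 16x fewer)
```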
The post makes an infrastructure argument alongside the architectural one. As context windows grow, attention and KV cache become major bottlenecks. Agents carry system instructions, tool outputs, retrieved context, code, logs, memory, and multi-step reasoning traces — not just a single prompt and response. Long context is becoming a core requirement, not an edge case. The NVIDIA framing is that this architectural shift changes inference economics and shifts the enterprise focus from model selection to infrastructure strategy.
Blackwell performance numbers
Out-of-the-box testing of DeepSeek-V4-Pro on NVIDIA GB200 NVL72 demonstrated over 150 tokens per second per user, according to the post. NVIDIA also used vLLM’s Day 0 NVIDIA Blackwell B300 recipe to produce a throughput snapshot across the latency-throughput Pareto frontier of serving configurations. The post notes that these numbers are expected to improve as NVIDIA optimizes its stack, specifically citing Dynamo, NVFP4, optimized CUDA kernels, and advanced parallelization techniques.
Both DeepSeek-V4-Pro and DeepSeek-V4-Flash are available through NVIDIA GPU-accelerated endpoints on build.nvidia.com as part of the NVIDIA Developer Program. The post describes these hosted endpoints as a fast path to prototype with the latest models before moving to self-hosted deployment. DeepSeek-V4 is also available through NVIDIA NIM on day zero, enabling deployment with familiar API patterns.
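The hosted endpoints on build.nvidia.com expose an OpenAI-compatible API, so prototyping can be as simple as the sketch below. The model identifier is an assumption for illustration; the model card on build.nvidia.com gives the real one:

```python
from openai import OpenAI

# build.nvidia.com endpoints speak the OpenAI-compatible chat API.
# The model ID below is an assumed placeholder -- take the actual ID
# from the model's page on build.nvidia.com.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-...",  # key from the NVIDIA Developer Program
)

resp = client.chat.completions.create(
    model="deepseek-ai/deepseek-v4-pro",  # assumed identifier
    messages=[{"role": "user",
               "content": "Summarize the tradeoffs of KV cache compression."}],
    max_tokens=300,
)
print(resp.choices[0].message.content)
```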
Serving framework options
Two open-source serving frameworks are covered for teams moving to self-hosted deployment. SGLang offers three primary serving recipes for DeepSeek-V4 on Blackwell and Hopper hardware, each tuned for a different latency-throughput profile — low-latency, balanced, and max-throughput — along with specialized recipes for long-context workloads and for prefill/decode disaggregation. vLLM provides single-node and multinode recipes for Blackwell and Hopper, including multinode prefill/decode disaggregation scaling to 100 or more GPUs, with support for tool calling, reasoning, and speculative decoding.
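As a minimal self-hosted starting point, vLLM’s offline Python API follows the pattern below; the HF repo ID, parallelism, and context cap are illustrative assumptions, and the Day 0 recipes linked from the post pin the flags actually tested on Blackwell:

```python
from vllm import LLM, SamplingParams

# Single-node sketch with vLLM's offline API. The repo ID, parallelism,
# and context cap are placeholders -- use the published Day 0 recipe values.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # assumed HF repo ID
    tensor_parallel_size=8,                 # assumes one 8-GPU node
    max_model_len=131072,                   # cap well below 1M for a first run
)

params = SamplingParams(temperature=0.6, max_tokens=256)
out = llm.generate(["Explain prefill/decode disaggregation in two sentences."],
                   params)
print(out[0].outputs[0].text)
```

The same model can also be launched as an OpenAI-compatible endpoint with vllm serve, which is the shape the agent integrations below assume.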
Agentic workflow integrations
The post identifies three agent harnesses that can be configured to use DeepSeek-V4 as the underlying LLM. NVIDIA NemoClaw allows running OpenClaw in a secure OpenShell environment for tasks like code generation, personal assistance, and autonomous support. The NVIDIA AI-Q Blueprint, based on LangChain Deep Agents, is described as a deep research assistant that can be extended to use DeepSeek-V4 for orchestration and planning. The NVIDIA Data Explorer Agent, which the post notes took first place on the DABstep benchmark, is built with the NeMo Agent Toolkit and supports switching its model to DeepSeek-V4.
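Because these harnesses sit on OpenAI-compatible clients, swapping DeepSeek-V4 in is mostly a base-URL and model-name change. A minimal sketch using LangChain’s ChatOpenAI against a locally served endpoint, with the URL and model ID as placeholders:

```python
from langchain_openai import ChatOpenAI

# Point a LangChain-based agent (e.g., the Deep Agents stack behind AI-Q)
# at a self-hosted DeepSeek-V4 server. URL and model ID are placeholders.
llm = ChatOpenAI(
    model="deepseek-ai/DeepSeek-V4-Pro",  # assumed identifier
    base_url="http://localhost:8000/v1",  # vLLM/SGLang OpenAI-compatible server
    api_key="not-needed-locally",         # local servers typically ignore this
    temperature=0.2,
)

print(llm.invoke("Plan the first three steps of a code migration.").content)
```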
The post closes with a statement about the competitive value of the combination: “The best part of using open agent harnesses and open models is you’re always able to try new models to pick up the bleeding edge.” For teams evaluating inference infrastructure for long-context, agentic workloads, the combination of V4’s architectural efficiency improvements and Blackwell’s hardware capacity represents the current state of what is achievable at the high end.