Reinforcement learning for language models splits into two distinct phases: a generation phase with tight latency requirements and a training phase that needs high throughput. Running both at high efficiency while maintaining training accuracy is harder than it looks. NVIDIA’s post on NeMo RL describes how applying FP8 end-to-end — rather than only during generation or only during training — solves an accuracy problem that partial quantization creates and delivers meaningful throughput gains.
The core result: end-to-end FP8 on linear layers delivers a consistent throughput improvement of more than 15% over BF16. Extending FP8 to the KV cache and attention raises the total speedup to approximately 48% over the BF16 baseline.
The numerical disagreement problem
RL pipelines typically use separate engines: NVIDIA NeMo RL runs rollouts in vLLM and training in Megatron Core, each with its own CUDA kernels. This introduces numerical differences between the two systems, which the post quantifies as token multiplicative probability error: the mean exponentiated absolute difference between the per-token log-probabilities produced by the training framework and by the inference framework.
A perfect score is 1.0; in practice, values below roughly 1.03 to 1.05 are considered acceptable. The problem with partial FP8 (applying it only to generation while keeping training in BF16) is that quantization in the generation engine creates a distribution mismatch the BF16 training engine never sees, which widens the numerical disagreement.
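The post describes the metric in words rather than symbols; read literally, a minimal sketch would be the following (the function name is ours):

```python
import torch

def token_mult_prob_error(logprobs_train: torch.Tensor,
                          logprobs_infer: torch.Tensor) -> torch.Tensor:
    """Mean exponentiated absolute difference between per-token
    log-probabilities from the training and inference engines."""
    # exp(|log p_train - log p_infer|) >= 1 for every token, with
    # equality only when the two engines agree exactly, so the mean
    # is 1.0 for perfect agreement and grows with disagreement.
    return torch.exp((logprobs_train - logprobs_infer).abs()).mean()

# Identical log-probabilities score exactly 1.0.
lp = torch.log(torch.tensor([0.5, 0.3, 0.2]))
assert torch.isclose(token_mult_prob_error(lp, lp), torch.tensor(1.0))
```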
The team tried three configurations:
- Baseline: BF16 in both generation and training. Lowest disagreement.
- Candidate 1: FP8 only in generation, BF16 for training. Disagreement rises.
- Final recipe: FP8 in both generation and training. Disagreement drops compared to candidate 1.
The final recipe does not match the BF16 baseline's agreement level, but importance sampling closes the remaining gap. The post reports that for candidate 1 (mixed precision), importance sampling narrows the accuracy gap but cannot close it; for the full FP8 recipe, it closes the gap to BF16 training entirely. This is the key insight: using FP8 symmetrically in both engines makes the distribution mismatch correctable, while the asymmetric mismatch is not.
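The post does not spell out the estimator, but token-level truncated importance sampling (which it names explicitly for the later KV cache experiments) typically reweights each token's loss by a clipped ratio of training-engine to rollout-engine probabilities. A minimal sketch, with all names ours:

```python
import torch

def truncated_is_loss(logprobs_train: torch.Tensor,
                      logprobs_rollout: torch.Tensor,
                      advantages: torch.Tensor,
                      c: float = 1.0) -> torch.Tensor:
    # Per-token importance ratio between the training-engine policy
    # and the rollout-engine policy that actually generated the data.
    ratio = torch.exp(logprobs_train - logprobs_rollout)
    # Truncate from above so occasional large mismatches cannot blow
    # up the gradient; detach so the ratio acts as a fixed weight.
    weight = torch.clamp(ratio, max=c).detach()
    return -(weight * advantages * logprobs_train).mean()
```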
Results on dense and MoE models
On Llama 3.1 8B Instruct trained with GRPO on a math dataset for 4,000 steps, the FP8 recipe delivers a consistent throughput improvement of more than 15% over BF16, with accuracy matching the baseline once importance sampling is applied.
The post explains why the speedup falls short of the theoretical 2x. FP8 doubles peak throughput only for linear layers; attention, normalization, non-linear functions, and output projections remain in BF16, and the quantization kernels inserted before each linear layer add overhead of their own. The observed 15-25% speedup is described as matching standalone vLLM test results, and the post notes that further optimizations, such as fusing the quantization kernels in vLLM, could push the speedup to approximately 1.25x.
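As a back-of-the-envelope check, an Amdahl's-law estimate reproduces that band. The time fractions below are illustrative assumptions, not measurements from the post:

```python
def expected_speedup(linear_frac: float, quant_overhead: float = 0.02) -> float:
    """Estimate end-to-end speedup when only linear layers run 2x
    faster and quantization kernels add a small fixed overhead."""
    fp8_time = (1 - linear_frac) + linear_frac / 2 + quant_overhead
    return 1 / fp8_time

print(expected_speedup(0.30))  # ~1.15x when 30% of time is linear layers
print(expected_speedup(0.45))  # ~1.26x when 45% of time is linear layers
```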
For Qwen3-30B, a mixture-of-experts model, similar experiments show matching accuracy curves between FP8 and BF16; the post notes that the speedup for the MoE case is still being investigated.
FP8 for KV cache and attention
Linear layers are not the only bottleneck in RL workflows with long output sequences. KV cache growth and attention computation often dominate rollout time, saturating memory bandwidth. This motivated extending FP8 to the KV cache and attention operations using per-tensor scaling.
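Per-tensor scaling means one scale factor for the whole tensor, chosen so the observed absolute maximum maps onto the edge of the FP8 range. A minimal sketch of the idea (not NeMo RL's implementation):

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def per_tensor_fp8_quantize(x: torch.Tensor):
    # One scale for the whole tensor: the absolute maximum lands
    # exactly at the edge of the representable FP8 range.
    scale = x.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale  # dequantize later as x_fp8.float() * scale
```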
The challenge specific to RL is that policy weights change at every training step. Static inference can calibrate quantization scales once. RL requires the scales to track the current policy state across rollouts and training updates.
NeMo RL handles this with a three-step approach: at the end of each training step, the trainer recalibrates the Query, Key, and Value scales against the updated policy weights, using the step's training data (prompts and responses) as the calibration distribution; the fresh scales are then synchronized to vLLM; and the next rollout phase runs with scales that match the current policy. The calibration overhead is reported as approximately 2-3% of total step time.
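Put together, the per-step flow looks roughly like the sketch below. The helper names are hypothetical; NeMo RL's actual APIs differ:

```python
def rl_step(trainer, vllm_engine, batch):
    # 1) Ordinary training update on the latest rollout data.
    trainer.train_step(batch)

    # 2) Recalibrate Q/K/V scales against the *updated* weights,
    #    using the step's prompts and responses as calibration data.
    amax = compute_qkv_amax(trainer.model, batch)        # hypothetical
    scales = {name: a / 448.0 for name, a in amax.items()}

    # 3) Push the fresh scales to vLLM before the next rollout so
    #    FP8 attention runs with scales that match the new policy.
    sync_scales_to_vllm(vllm_engine, scales)             # hypothetical
```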
On Qwen3-8B-Base trained with GRPO, FP8 for the KV cache and attention, combined with linear-layer FP8, matches the validation accuracy of both the BF16 baseline and the linear-only FP8 configuration once token-level truncated importance sampling is applied. Adding KV cache and attention quantization on top of linear FP8 speeds up the rollout stage by approximately 30%, bringing the total improvement to approximately 48% over the BF16 baseline. The gains are described as particularly pronounced at longer response lengths, where attention accounts for a larger fraction of total work.
Implementation notes
The FP8 recipe used in this work is derived from the block-wise quantized FP8 approach described in the DeepSeek-V3 Technical Report. Linear layers use FP8 math; all other modules remain in BF16. Configuration is handled through NeMo RL’s config map, with kv_cache_dtype in the vLLM configuration triggering automatic QKV recalibration and synchronization.
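For reference, the corresponding knobs in standalone vLLM look like this; the model name is illustrative, and the NeMo RL config keys beyond kv_cache_dtype are not shown in the post:

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B-Base",
    quantization="fp8",    # FP8 linear layers at inference time
    kv_cache_dtype="fp8",  # FP8 KV cache instead of the default "auto"
)
```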
The 48% total speedup represents a meaningful reduction in RL compute cost. For workloads where the generation phase is memory-bandwidth-bound, the KV cache quantization provides a direct path to higher token throughput without sacrificing the accuracy that the training phase requires.