End-to-end FP8 in RL training: NeMo RL achieves 48% speedup over BF16 baseline
NVIDIA NeMo RL applies FP8 precision across both the generation and training phases of reinforcement learning, closing the accuracy gap with an importance-sampling correction and reaching a 48% end-to-end speedup over the BF16 baseline when the KV cache and attention are also quantized.
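The importance-sampling correction reweights each sampled token by the ratio of the training policy's probability to the generation policy's probability, so that gradients computed in higher precision remain unbiased with respect to samples drawn from the FP8 generation policy. A minimal sketch of a per-token truncated variant follows; the function name, signature, and cap value are illustrative assumptions, not NeMo RL's actual API.

```python
import math

def tis_weights(train_logprobs, gen_logprobs, cap=2.0):
    """Per-token truncated importance-sampling weights (illustrative).

    Each weight is exp(logp_train - logp_gen), truncated at `cap` to
    bound the variance introduced by precision mismatch between the
    FP8 generation policy and the training policy.
    """
    return [min(math.exp(lt - lg), cap)
            for lt, lg in zip(train_logprobs, gen_logprobs)]

# Tokens where the two policies disagree slightly are gently
# re-weighted; large ratios are truncated at the cap.
w = tis_weights([-1.0, -2.0, -0.5], [-1.1, -1.9, -2.0], cap=2.0)
```

Truncation (rather than full clipping to zero) keeps every sample contributing to the gradient while bounding the variance that small FP8-induced probability mismatches would otherwise amplify.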