hardware

4 stories

End-to-end FP8 in RL training: NeMo RL achieves 48% speedup over BF16 baseline

NVIDIA NeMo RL applies FP8 precision across both generation and training phases of reinforcement learning, closing accuracy gaps via importance sampling and adding 48% total speedup when KV cache and attention are also quantized.

Apr 25, 2026

1 source · primary

hardware Official

NVIDIA adds Muon optimizer support to Megatron Core, closes gap with AdamW at scale

NVIDIA has integrated the Muon higher-order optimizer into Megatron Core and NeMo, showing minimal throughput loss versus AdamW on GB300 hardware while enabling training at thousands of GPUs.

Apr 25, 2026

1 source · primary

hardware Official

NVIDIA FLARE reduces federated learning migration to ~5 lines of code and an environment swap

The latest NVIDIA FLARE API splits federated learning adoption into two steps: a minimal client API that adds federation to existing training scripts without restructuring them, and portable job recipes that run unchanged from simulation to production.

Apr 25, 2026

1 source · primary

hardware Official

NVIDIA Blackwell delivers 150+ tokens/sec/user on DeepSeek-V4-Pro out of the box

NVIDIA outlines how its Blackwell platform and NIM microservices support DeepSeek-V4's million-token context requirements, with initial GB200 NVL72 benchmarks and deployment paths via SGLang, vLLM, and hosted endpoints at build.nvidia.com.

Apr 25, 2026

1 source · primary