NVIDIA’s developer blog details how the company has built infrastructure to run the Muon optimizer and similar emerging optimizers at large scale inside Megatron Core. Muon — short for MomentUm Orthogonalized by Newton-Schulz — was used to train Kimi K2 and GLM-5, two open-source models. The post describes NVIDIA’s approach as balanced across generality, throughput, and implementation complexity, with the intent that the same infrastructure supports Muon, SOAP, and other complex optimizers.

What the benchmarks show

NVIDIA ran throughput comparisons on two models on GB300 NVL72 hardware. The Kimi K2 run used 256 GB300 GPUs with PP4DP64EP64 parallelism (pipeline-parallel size 4, data-parallel size 64, expert-parallel size 64). The Qwen3 30B-A3B run used eight GPUs with DP8EP8. Measurements were taken with NeMo Megatron Bridge 26.02.

The throughput results, in TFLOPs/s/GPU: Kimi K2 reached 1,080 with Muon versus 1,051 with AdamW (about 2.8% higher), and Qwen3 30B-A3B reached 721 with Muon versus 713 with AdamW (about 1.1% higher). The post notes that model FLOPs utilization (MFU) comes out higher with Muon once the FLOPs from the Newton-Schulz matrix multiplications, which are part of the optimizer’s orthogonalization step, are counted.

Why scaling Muon is hard

The preconditioning step, which uses Newton-Schulz iteration in Muon and eigendecomposition in optimizers such as SOAP, is what sets these optimizers apart from element-wise optimizers like AdamW. That step also creates scaling challenges.
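To make the orthogonalization step concrete, here is a minimal NumPy sketch of a Newton-Schulz iteration. It is an illustration under assumptions, not NVIDIA's implementation: the function name, step count, and test matrix are hypothetical, and it uses the classic cubic iteration X ← 1.5·X − 0.5·(X·Xᵀ)·X in float64 because that form converges to an exactly orthogonal result that is easy to check. Public Muon implementations typically use a tuned higher-order polynomial with only a few steps at lower precision.

```python
import numpy as np

def newton_schulz_orthogonalize(m, steps=30):
    """Approximate the nearest semi-orthogonal matrix to m (cubic Newton-Schulz)."""
    x = np.asarray(m, dtype=np.float64)
    transposed = x.shape[0] > x.shape[1]
    if transposed:                      # work on the wide orientation so x @ x.T is small
        x = x.T
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # inside the convergence region of the cubic iteration.
    x = x / (np.linalg.norm(x) + 1e-12)
    for _ in range(steps):
        a = x @ x.T                     # first matmul: Gram matrix
        x = 1.5 * x - 0.5 * (a @ x)     # pushes each singular value toward 1
    return x.T if transposed else x

rng = np.random.default_rng(0)
momentum = rng.standard_normal((4, 6))  # hypothetical momentum buffer for one layer
update = newton_schulz_orthogonalize(momentum)
residual = np.abs(update @ update.T - np.eye(4)).max()
```

Every step consists of matrix multiplications over the whole matrix, which is why the iteration needs the full layer rather than a slice of it.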

The preconditioning cost increases both computational load and memory consumption. Mixed-precision training and gradient accumulation introduce numerical instability at lower precision. Distributing orthogonalized updates across thousands of GPUs can create communication bottlenecks.

The key infrastructure innovation, according to the post, is a layer-wise distributed optimizer. Traditional element-wise distributed optimizers work for AdamW because they can partition optimizer states and gradients evenly across GPUs. Muon cannot use this approach: it requires gradients for an entire layer to compute the weight update for that layer. If weights and optimizer states are sliced across data-parallel ranks, each GPU has only a shard and cannot independently compute the preconditioner.
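The difference between the two kinds of optimizer can be seen in a small NumPy experiment. This is a simplified sketch with hypothetical names: the gradient is split by rows across two imaginary ranks (real distributed optimizers shard flattened buffers), the element-wise rule is a stand-in rather than AdamW itself, and an SVD stands in for Newton-Schulz to give an exact orthogonalization.

```python
import numpy as np

def orthogonalize(m):
    """Exact orthogonalization via SVD (stand-in for Newton-Schulz)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def elementwise_update(g):
    """Stand-in element-wise rule: each output entry depends only on its own input."""
    return g / (np.abs(g) + 1.0)

rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 4))
shards = np.split(grad, 2, axis=0)      # two hypothetical data-parallel ranks

# Element-wise: updating each shard separately gives the same result as the full update.
elementwise_matches = np.allclose(
    np.concatenate([elementwise_update(s) for s in shards]),
    elementwise_update(grad),
)

# Orthogonalization: per-shard results do not reassemble into the full-matrix result.
orthogonal_matches = np.allclose(
    np.concatenate([orthogonalize(s) for s in shards]),
    orthogonalize(grad),
)
```

Here `elementwise_matches` is true and `orthogonal_matches` is false, which is the property that rules out even element-wise partitioning for Muon.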

The layer-wise optimizer instead assigns entire layers to specific data-parallel ranks. Each GPU owns full layers, so it has everything needed for the Newton-Schulz computation on the layers it owns. The tradeoff is variable-size communication: because layers differ in size, the parameter all-gathers use all_gatherv rather than a fixed-size all_gather. The layer-wise distributed optimizer is now integrated into Megatron Core in layer_wise_optimizer.py.
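A toy version of layer ownership shows where the variable-size communication comes from. The assignment policy (largest layers first, dealt round-robin), the layer names, and the shapes below are all hypothetical. The sketch does not reproduce the logic in layer_wise_optimizer.py.

```python
def assign_layers(layer_sizes, num_ranks):
    """Deal whole layers to ranks round-robin, largest first (hypothetical policy)."""
    order = sorted(layer_sizes, key=layer_sizes.get, reverse=True)
    owners = {rank: [] for rank in range(num_ranks)}
    for i, name in enumerate(order):
        owners[i % num_ranks].append(name)
    return owners

# Made-up parameter counts for two transformer blocks.
layer_sizes = {
    "l0.qkv": 1024 * 3072,
    "l0.proj": 1024 * 1024,
    "l0.up": 1024 * 4096,
    "l0.down": 4096 * 1024,
    "l1.qkv": 1024 * 3072,
    "l1.proj": 1024 * 1024,
}
num_ranks = 4
owners = assign_layers(layer_sizes, num_ranks)

# Elements each rank would contribute when updated parameters are gathered.
per_rank_elems = [sum(layer_sizes[n] for n in owners[r]) for r in range(num_ranks)]
```

With these numbers two ranks contribute 5,242,880 elements and two contribute 3,145,728, so a fixed-size all-gather would need padding, while a variable-size gather can send each rank's actual count.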

Handling tensor parallelism

Tensor parallelism (TP) splits individual weight matrices across multiple GPUs. This creates a specific problem for Muon: the momentum buffer for a weight matrix is sharded, but the Newton-Schulz step needs the full matrix to compute the orthogonalization.

NVIDIA implemented TensorParallelMuon with three modes. Duplicated mode all-gathers the momentum buffers across the TP domain so that each GPU can run the full Newton-Schulz iteration; this costs one all-gather per update, regardless of the iteration count. Distributed mode spreads the Newton-Schulz computation across GPUs and requires an all-reduce after the first matrix multiplication of each iteration. Blockwise mode skips cross-GPU communication entirely by orthogonalizing only the local shard.
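The three modes can be imitated on a single process with NumPy. The sketch below rests on several assumptions that do not come from the post: the matrix is sharded by columns (the dimension contracted in X·Xᵀ), the iteration is the classic cubic form X ← 1.5·X − 0.5·(X·Xᵀ)·X, a Python `sum` over shards stands in for the all-reduce, and all function names are hypothetical.

```python
import numpy as np

def ns_step(x):
    """One cubic Newton-Schulz step on a matrix held in full."""
    return 1.5 * x - 0.5 * (x @ x.T) @ x

def ns_step_distributed(shards):
    """Same step on column shards; the sum stands in for an all-reduce."""
    gram = sum(s @ s.T for s in shards)   # local X_i X_i^T, then all-reduce
    return [1.5 * s - 0.5 * gram @ s for s in shards]

def ns_step_blockwise(shards):
    """Each shard iterates on its own block with no communication."""
    return [ns_step(s) for s in shards]

rng = np.random.default_rng(0)
full = rng.standard_normal((4, 8))
full /= np.linalg.norm(full)              # normalize once, before sharding
shards = np.split(full, 2, axis=1)        # two hypothetical TP ranks

ref, dist, blk = full, shards, shards
for _ in range(5):
    ref = ns_step(ref)                    # duplicated mode: every rank holds the gathered matrix
    dist = ns_step_distributed(dist)
    blk = ns_step_blockwise(blk)

distributed_matches = np.allclose(np.concatenate(dist, axis=1), ref)
blockwise_matches = np.allclose(np.concatenate(blk, axis=1), ref)
```

Under these assumptions the distributed result equals the full-matrix result, because the Gram matrix is a sum of per-shard products. The blockwise result differs: it orthogonalizes each block on its own rather than the matrix as a whole.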

Additional optimizations described in the post include communication hiding (delaying parameter all-gathers to overlap with the next forward pass), round-robin load balancing across layer sizes, and a SYRK optimization. On the last of these, the post states that “close to half of the floating point operations can be saved” by mapping two of the three matrix multiplications in a Newton-Schulz iteration to symmetric rank-k updates (SYRK), which compute only one triangle of a symmetric result. Diagonal tiles still require full computation.
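The saving from exploiting symmetry can be sketched with a tiled Gram-matrix computation in NumPy. This illustrates the idea only, with a hypothetical function name and tile size. A real SYRK kernel runs on the GPU and would not mirror tiles in Python.

```python
import numpy as np

def gram_upper_tiles(x, tile):
    """Compute x @ x.T by evaluating only tiles on or above the diagonal."""
    n = x.shape[0]
    out = np.zeros((n, n), dtype=x.dtype)
    computed = 0
    for i in range(0, n, tile):
        for j in range(i, n, tile):       # skip tiles below the diagonal
            block = x[i:i + tile] @ x[j:j + tile].T
            out[i:i + tile, j:j + tile] = block
            if j != i:
                out[j:j + tile, i:i + tile] = block.T   # mirror, no extra matmul
            computed += 1                 # diagonal tiles are computed in full
    return out, computed

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
gram, tiles_computed = gram_upper_tiles(x, tile=2)
tiles_total = (8 // 2) ** 2
```

With four tiles per side, 10 of 16 tiles are computed. In general the count is t(t+1)/2 out of t² for t tiles per side, which approaches one half as t grows. The shortfall from exactly half comes from the diagonal tiles.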

The layer-wise distributed optimizer is available in the open-source Megatron Core codebase. The post describes the integration as making the tooling available to anyone running NeMo-based training at scale.