The Qwen team at Alibaba has published Group Sequence Policy Optimization (GSPO), a new reinforcement learning algorithm designed to replace GRPO as the training backbone for large language models. The post describes GRPO as exhibiting “severe instability issues during long training” that lead to “irreversible model collapse,” preventing further performance gains with increased compute. GSPO was developed to address this directly and is credited as the training algorithm behind the latest Qwen3 models (Instruct, Coder, and Thinking variants).

The core distinction is where the importance ratio is computed. GRPO computes importance ratios at the token level; GSPO computes them at the sequence level. This single algorithmic change, according to the post, cascades into substantial practical benefits in stability, efficiency, and infrastructure simplicity.

The sequence-level optimization objective

GSPO defines its importance ratio $s_i(\theta)$ as the sequence-level likelihood ratio raised to the power $1/|y_i|$, which is equivalently the geometric mean of the per-token probability ratios. This length normalization in the exponent reduces variance and keeps the ratio in a consistent numerical range across sequences of different lengths, a property that token-level approaches lack.
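In the paper's notation, for a query $x$ and a sampled response $y_i$ of length $|y_i|$, the ratio is

$$
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_\text{old}}(y_i \mid x)}\right)^{\frac{1}{|y_i|}} = \exp\!\left(\frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_\text{old}}(y_{i,t} \mid x, y_{i,<t})}\right),
$$

so a single scalar summarizes how far the whole response's likelihood has moved between the old and current policy.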

The objective clips this sequence-level ratio within $[1-\varepsilon, 1+\varepsilon]$, applying a single clip per response rather than per token. The post notes an empirically striking observation: the fraction of tokens clipped in GSPO is “two orders of magnitude higher” than in GRPO, yet GSPO still achieves higher training efficiency. The team interprets this as evidence that “GRPO’s token-level optimization objective is noisy and inefficient, while GSPO’s sequence-level approach provides a more reliable and effective learning signal.”
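Putting the pieces together, the paper's objective is the familiar clipped surrogate applied once per response:

$$
J_\text{GSPO}(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \min\Big(s_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i\Big)\right],
$$

where $\hat{A}_i$ is the group-normalized advantage computed from the rewards of the $G$ responses sampled for the same query. To make the structure concrete, here is a minimal PyTorch-style sketch of the loss; the tensor names and masking convention are illustrative assumptions, not the Qwen team's implementation:

```python
import torch

def gspo_loss(logp_new, logp_old, mask, advantages, eps=0.2):
    """Sequence-level clipped surrogate (GSPO sketch).

    logp_new:   (B, T) per-token log-probs under the current policy
    logp_old:   (B, T) per-token log-probs under the old (sampling) policy
    mask:       (B, T) 1.0 on response tokens, 0.0 on padding
    advantages: (B,)   group-normalized advantage per response
    """
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    # Length-normalized sequence log-ratio: the log of the geometric mean
    # of the per-token probability ratios.
    log_ratio = ((logp_new - logp_old.detach()) * mask).sum(dim=-1) / lengths
    s = torch.exp(log_ratio)  # s_i(theta), one scalar per response
    # A single clip per response, rather than per token as in GRPO.
    unclipped = s * advantages
    clipped = torch.clamp(s, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negate to maximize
```

Because the ratio is one scalar per response, clipping a response keeps or discards all of its tokens together, which helps explain why GSPO's token-level clipping fraction looks so much higher than GRPO's.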

Experiments used a cold-start model fine-tuned from Qwen3-30B-A3B-Base. Performance curves on AIME’24, LiveCodeBench, and CodeForces are shown against GRPO as the baseline. The post reports that GSPO “demonstrates significantly higher training efficiency than GRPO, achieving better performance under the same training cost” and can “deliver continuous performance improvement through increasing the training compute, regularly updating the query set, and extending the generation length.”

Fixing Mixture-of-Experts RL

The post identifies a specific failure mode that blocked GRPO from training large MoE models: expert-activation volatility. In MoE architectures, different experts are activated for different inputs, and after each gradient update the same token can be routed to different experts than it was under the old policy. Its likelihood then shifts for reasons unrelated to the learning signal, and token-level importance ratios inherit this volatility as noise that prevents convergence.
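GSPO's sequence-level ratio, by contrast, averages over the whole response, and a toy simulation shows how much that damps token-level noise (the noise model and its scale below are illustrative assumptions, not the paper's analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
T, trials = 512, 1000  # response length, number of sampled responses

# Toy stand-in for expert-activation volatility: after an update, each
# token's log-prob ratio is perturbed by i.i.d. noise of scale 0.5.
delta = rng.normal(0.0, 0.5, size=(trials, T))

token_ratios = np.exp(delta)             # GRPO-style per-token ratios
seq_ratios = np.exp(delta.mean(axis=1))  # GSPO-style length-normalized ratios

print(f"token-level ratio std:    {token_ratios.std():.3f}")  # ~0.6
print(f"sequence-level ratio std: {seq_ratios.std():.3f}")    # ~0.02
```

The per-token ratios swing widely while the sequence-level ratio stays close to 1; averaging over $|y_i|$ tokens shrinks the noise by roughly a factor of $\sqrt{|y_i|}$.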

The Qwen team’s previous workaround was a technique called Routing Replay: caching the expert activations from the old policy and replaying them during importance ratio computation for the current policy. While functional, this required “additional memory and communication overhead and may limit the actual capacity of MoE models.”
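For intuition, here is a toy top-1 MoE layer that supports the replay mechanism described above; the class and its `replay_ids` argument are illustrative inventions, not Qwen's actual training code:

```python
import torch

class ToyMoELayer(torch.nn.Module):
    """Minimal top-1 MoE layer illustrating the Routing Replay idea."""

    def __init__(self, dim=16, n_experts=4):
        super().__init__()
        self.router = torch.nn.Linear(dim, n_experts)
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(dim, dim) for _ in range(n_experts)
        )

    def forward(self, x, replay_ids=None):
        # Routing Replay: if expert choices cached from the old policy are
        # supplied, bypass the current router and reuse them, so old and
        # new likelihoods flow through the same experts.
        if replay_ids is None:
            replay_ids = self.router(x).argmax(dim=-1)  # (B, T)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = replay_ids == e
            out[sel] = expert(x[sel])
        return out, replay_ids

# Record routing while scoring with the old policy...
old_layer, new_layer = ToyMoELayer(), ToyMoELayer()
x = torch.randn(2, 8, 16)
_, cached_ids = old_layer(x)
# ...then force the same routing when scoring with the current policy.
out, _ = new_layer(x, replay_ids=cached_ids)
```

Caching and shipping these indices for every token of every MoE layer is exactly the memory and communication overhead the post refers to.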

GSPO eliminates the need for Routing Replay entirely. Because GSPO only uses sequence-level likelihoods — not token-level ones — it is insensitive to the specific routing patterns of individual tokens. The post describes this as “completely eliminating the dependency on Routing Replay,” which both simplifies training infrastructure and allows MoE models to use their full routing capacity.

Infrastructure implications

The sequence-level design also has a secondary infrastructure benefit. Token-level importance ratios are sensitive to the small numerical discrepancies between training and inference engines, so in practice the per-token likelihoods must be recomputed with the training engine before optimization. GSPO's sequence-level objective is far less sensitive to these discrepancies and can directly consume the likelihoods returned by inference engines, potentially eliminating the recomputation step.
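In code terms, the ratio then needs only two scalars per response. In the sketch below, the idea that the old log-likelihood comes straight from the sampling engine is an assumption about how a pipeline might wire this up, not a documented API:

```python
import math

def gspo_ratio(seq_logp_new: float, seq_logp_old: float, length: int) -> float:
    """GSPO importance ratio from whole-sequence log-likelihoods.

    seq_logp_old can be the cumulative log-prob reported by the inference
    engine that sampled the response; only seq_logp_new needs a forward
    pass through the training engine.
    """
    return math.exp((seq_logp_new - seq_logp_old) / max(length, 1))
```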

The post highlights this as “particularly beneficial in scenarios such as partial rollout, multi-turn RL, and training-inference disaggregated frameworks” — all of which are active areas in scaled RL infrastructure. The team describes GSPO as “fundamentally more tolerant to precision discrepancies” as a result.

Connection to Qwen3

The post states that GSPO was “successfully applied to the large-scale RL training of the latest Qwen3 models,” and attributes part of those models’ reasoning improvements to the algorithm. The paper is published as arXiv:2507.18071, authored by a team of twelve researchers at Alibaba.

The central claim is that the shift from token-level to sequence-level optimization is not a minor tuning choice but a foundational design decision with compounding benefits: better convergence, no special-case infrastructure for MoE, and clear scalability as compute increases. For teams running RL post-training at scale on MoE architectures, GSPO represents a meaningful simplification of both the algorithm and the engineering stack underneath it.