Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often struggle in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each update into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism that controls off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces the number of rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average across math reasoning benchmarks, and it reaches GRPO's best accuracy with up to 4.93× fewer rollouts and a 4.19× reduction in wall-clock time.
In standard on-policy policy-gradient methods such as GRPO, each step applies a single stochastic gradient update:
\[ \theta^{s+1} = \theta^{s} - \eta \nabla_{\theta} \mathcal{L}(\theta^{s}), \]
where $\nabla_{\theta} \mathcal{L}(\theta^{s})$ is estimated from one batch of rollouts. Such one-shot updates suffer from high variance and often drive the policy in unstable directions, especially during early training. SFPO mitigates this by performing multiple inner updates on the same batch of rollouts.
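For concreteness, the single-gradient baseline above can be sketched as one plain SGD step on a batch loss. In the snippet below (and the sketches that follow), the list-of-tensors parameter representation, the helper name `sgd_step`, and the `loss_fn(params, batch)` interface standing in for the GRPO surrogate loss are illustrative assumptions, not the authors' implementation.

```python
import torch

def sgd_step(params, loss_fn, batch, lr):
    """One plain SGD step: theta <- theta - lr * grad L(theta), estimated on `batch`.

    `params` is assumed to be a list of tensors with requires_grad=True, and
    `loss_fn(params, batch)` is a placeholder for the GRPO surrogate loss.
    """
    loss = loss_fn(params, batch)              # loss at the current parameters
    grads = torch.autograd.grad(loss, params)  # one stochastic gradient estimate
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g                        # theta^{s+1} = theta^{s} - eta * grad
    return params
```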
Formally, starting from parameters $\theta^{s,0}$ at the beginning of step $s$, we execute a short fast trajectory of $K$ inner updates:
\[ \theta^{s,k+1} = \theta^{s,k} - \eta \nabla_{\theta} \mathcal{L}(\theta^{s,k}), \qquad k=0,\ldots,K-1. \]
This produces a sequence $\theta^{s,0} \to \theta^{s,1} \to \cdots \to \theta^{s,K}$, where each step refines the gradient direction using the same rollout data.
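A minimal sketch of Stage I, reusing the hypothetical `sgd_step` helper above: the same rollout batch drives $K$ consecutive inner updates, and the starting point $\theta^{s,0}$ is saved for the later reposition step.

```python
def fast_trajectory(params, loss_fn, batch, lr, K):
    """Stage I (sketch): K inner updates theta^{s,0} -> ... -> theta^{s,K},
    all computed on the *same* rollout batch."""
    theta_0 = [p.detach().clone() for p in params]  # remember theta^{s,0} for the reposition step
    for _ in range(K):
        sgd_step(params, loss_fn, batch, lr)        # reuse the same batch for every inner step
    return theta_0, params                          # (starting point, endpoint theta^{s,K})
```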
While the fast trajectory of Stage I improves stability, it also changes the nature of the update from on-policy to off-policy. Since all inner steps $\theta^{s,1},\dots,\theta^{s,K}$ reuse the same rollouts generated at $\theta^{s,0}$, the endpoint $\theta^{s,K}$ no longer corresponds to the distribution that produced those samples. This distribution mismatch is a fundamental drawback of off-policy learning, as it biases gradient estimates and can destabilize training.
Inspired by the Lookahead optimizer, SFPO introduces a reposition step that interpolates the fast-trajectory endpoint back toward its starting point:
\[ \tilde{\theta}^{s,K} = \theta^{s,0} + \alpha(\theta^{s,K} - \theta^{s,0}), \qquad \alpha \in [0,1]. \]
Here $\alpha$ regulates the degree of off-policy drift: smaller values keep the update close to the original on-policy iterate, while larger values rely more on the fast trajectory at the risk of greater mismatch.
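The reposition step itself is a plain element-wise interpolation between the saved starting point and the fast-trajectory endpoint, as in the sketch below (same illustrative tensor-list setup as above; `reposition` is a hypothetical name).

```python
def reposition(theta_0, theta_K, alpha):
    """Stage II (sketch): tilde_theta = theta_0 + alpha * (theta_K - theta_0),
    applied tensor-wise, with alpha in [0, 1]."""
    with torch.no_grad():
        for p0, pK in zip(theta_0, theta_K):
            pK.copy_(p0 + alpha * (pK - p0))  # pull the endpoint back toward theta^{s,0} in place
    return theta_K
```

In this form, $\alpha = 1$ keeps the raw fast-trajectory endpoint, while $\alpha = 0$ discards the fast trajectory and returns to $\theta^{s,0}$.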
After repositioning, SFPO applies one more (slow) correction step at the interpolated point:
\[ \theta^{s+1} = \tilde{\theta}^{s,K} - \eta \nabla_{\theta} \mathcal{L} (\tilde{\theta}^{s,K}). \]
This yields a predictor-corrector structure: Stage I produces a stabilized fast trajectory, Stage II tempers off-policy drift via repositioning, and Stage III applies a slow correction aligned with the local curvature at the update point.
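Composing the three sketches above, one SFPO step might look as follows. The composition assumes the same rollout batch is reused by the final slow correction and uses plain SGD throughout (no optimizer state); this is an illustrative simplification, not the authors' exact training pipeline.

```python
def sfpo_step(params, loss_fn, batch, lr, K, alpha):
    """One SFPO step (sketch): fast trajectory -> reposition -> slow correction,
    all reusing the single rollout batch generated at theta^{s,0}."""
    theta_0, theta_K = fast_trajectory(params, loss_fn, batch, lr, K)  # Stage I
    theta_tilde = reposition(theta_0, theta_K, alpha)                  # Stage II
    return sgd_step(theta_tilde, loss_fn, batch, lr)                   # Stage III: theta^{s+1}
```

In a full training loop, the next batch of rollouts would then be generated from $\theta^{s+1}$, exactly as in GRPO, since SFPO leaves the objective and rollout process unchanged.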
We evaluate the efficiency gains of SFPO over GRPO by comparing the total number of rollouts and the wall-clock time required to reach the same benchmark accuracy. Specifically, SFPO requires 3.21×, 3.50×, and 4.93× fewer rollouts than GRPO for DS-Qwen-1.5B, Qwen3-4B-Base, and DS-Qwen-7B, respectively, to reach GRPO's best accuracy. This advantage translates directly into reduced training time: SFPO achieves 2.62×, 2.65×, and 4.19× wall-clock speedups over GRPO for the same models, significantly lowering training cost. Note that SFPO introduces no extra GPU memory overhead, as it does not require storing additional heavy optimizer state. These efficiency gains align with our expectations, since the primary bottleneck in training lies in rollout generation, which accounts for more than 70% of the overall training time. By substantially reducing the number of rollouts required and harnessing the reposition mechanism, SFPO alleviates this bottleneck and trains faster.
SFPO outperforms vanilla GRPO on math reasoning benchmarks. For small-scale models such as Qwen2.5-Math-1.5B and DS-distilled-Qwen-1.5B, SFPO raises the average accuracy from 38.35 to 40.19, an absolute gain of +1.84, and from 47.73 to 50.53, a gain of +2.80, respectively. The improvements are particularly pronounced on challenging tasks such as AIME24 and AIME25, where DS-distilled-Qwen-1.5B achieves an absolute gain of +7.5 on AIME25. Larger models exhibit similar gains. For Qwen2.5-Math-7B, SFPO raises the average accuracy from 48.36 to 49.19 (+0.83), and for DS-distilled-Qwen-7B it boosts the average accuracy from 60.47 to 63.04 (+2.57). For the Qwen3-4B-Base model, SFPO improves average accuracy from 43.99 to 45.59 (+1.60), highlighting its robustness across models.
@misc{wang2025slowfastpolicyoptimizationrepositionbeforeupdate,
title={Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning},
author={Ziyan Wang and Zheng Wang and Jie Fu and Xingwei Qu and Qi Cheng and Shengpu Tang and Minjia Zhang and Xiaoming Huo},
year={2025},
eprint={2510.04072},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.04072},
}