SRT: Accelerating Reinforcement Learning via Speculative Rollout with Tree-Structured Cache

Published 14 Jan 2026 in cs.LG | (2601.09083v1)

Abstract: We present Speculative Rollout with Tree-Structured Cache (SRT), a simple, model-free approach to accelerate on-policy reinforcement learning (RL) for LLMs without sacrificing distributional correctness. SRT exploits the empirical similarity of rollouts for the same prompt across training steps by storing previously generated continuations in a per-prompt tree-structured cache. During generation, the current policy uses this tree as the draft model for performing speculative decoding. To keep the cache fresh and improve draft model quality, SRT updates trees online from ongoing rollouts and proactively performs run-ahead generation during idle GPU bubbles. Integrated into standard RL pipelines (e.g., PPO, GRPO and DAPO) and multi-turn settings, SRT consistently reduces generation and step latency and lowers per-token inference cost, achieving up to 2.08x wall-clock time speedup during rollout.


Summary

The paper introduces SRT, a tree-structured cache that accelerates on-policy RL by speculative rollout, achieving up to a 2.08× speedup in rollout time.
It integrates cache-based speculative decoding into various RL algorithms without deviating from the on-policy distribution, ensuring stable convergence.
The method leverages both online and run-ahead cache updates to reduce redundant computation and lower per-token inference costs in LLM training.

SRT: Accelerating On-Policy RL in LLMs via Speculative Rollout and Tree-Structured Caching

Introduction and Motivation

This paper presents Speculative Rollout with Tree-Structured Cache (SRT), a framework for accelerating on-policy reinforcement learning (RL) in LLMs through an efficient, purely model-free speculative decoding procedure built on a per-prompt, tree-structured cache. The work addresses a major bottleneck in RL training for LLMs: the compute cost and latency of autoregressive rollout, where generating responses dominates wall-clock time, often accounting for more than 65% of the total across the RL algorithms measured. The inefficiency stems from the inherently memory-bound, sequential token generation of transformer models and is exacerbated by long-tailed response-length distributions in batched decoding, which leave GPUs substantially underutilized.

Further empirical analysis reveals that, for the same prompt, responses generated by RL-trained LLMs across different policy checkpoints tend to have high overlap, especially when sampled in large numbers (e.g., in DAPO or GRPO). This statistical redundancy signals a largely untapped opportunity: much of the expensive rollout computation is recapitulating previously generated, near-identical subsequences.

Figure 1: (a) Time breakdown across RL algorithms, emphasizing rollout cost; (b) Long-tailed output length distribution; (c) Substantial N-gram overlap in prompt responses across steps.
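
To make the notion of overlap concrete, the sketch below computes one simple n-gram overlap measure between two token sequences for the same prompt. The exact metric behind Figure 1(c) is not specified in this summary, so the function names and the choice of "fraction of distinct n-grams in the newer response that already occurred in the older one" are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(previous, current, n=4):
    """Fraction of distinct n-grams in `current` that also occur in `previous`.

    Hypothetical metric: returns 0.0 when `current` is shorter than n tokens.
    """
    cur = ngrams(current, n)
    if not cur:
        return 0.0
    return len(cur & ngrams(previous, n)) / len(cur)


# Two toy "responses" to the same prompt from different training steps.
step_10 = [1, 2, 3, 4]
step_11 = [2, 3, 4, 5]
print(ngram_overlap(step_10, step_11, n=2))
```

A high value of such a measure means that most of a new rollout could, in principle, have been proposed from earlier rollouts, which is the redundancy SRT targets.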

Algorithmic Framework

SRT integrates into the standard on-policy RL loop for LLMs, which alternates between (1) generation ("rollout") of $K$ samples per prompt from the current policy $\pi_\theta$ and (2) parameter updates using policy gradient objectives based on task-specific rewards. The core innovation is the introduction of a per-prompt rollout cache $\mathcal{T}_p$, maintained as a tree-shaped prefix trie storing all previously generated token substrings for prompt $p$.

During rollout, SRT exploits this cache for speculative decoding, akin to retrieval-augmented inference but operating purely model-free, without an explicit draft model. Generation proceeds as follows:

For a partial continuation $y_{1:t}$, locate the longest suffix of $y_{1:t}$ that matches a path in $\mathcal{T}_p$.
From the corresponding node, greedily assemble a maximal chain of likely tokens by following the most frequent child at each depth, as captured by empirical conditional probabilities.
Propose the resulting subsequence as "draft" tokens, which are then batch-verified by the current policy's logits using the standard speculative decoding verification phase. Accepted tokens are emitted up to the first policy mismatch.
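
The first two steps above can be sketched as a small per-prompt cache. This is a minimal illustration, not the paper's implementation: the class name `RolloutCache`, the choice to index every suffix of each rollout in a depth-limited trie, and the parameters `max_depth`, `max_match`, and `max_draft` are all assumptions. Verification by the policy (the third step) is not shown here.

```python
class Node:
    """Trie node: children keyed by token id, plus how often this path was seen."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


class RolloutCache:
    """Hypothetical per-prompt cache: a trie over suffixes of past rollouts."""

    def __init__(self, max_depth=16):
        self.root = Node()
        self.max_depth = max_depth

    def insert(self, tokens):
        # Index every suffix (truncated to max_depth) so that any tail of a
        # partial continuation can later be matched starting from the root.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node.children.setdefault(tok, Node())
                node.count += 1

    def _walk(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                return None
        return node

    def draft(self, context, max_match=8, max_draft=8):
        # Step 1: longest suffix of the context that is a path in the trie.
        for k in range(min(max_match, len(context)), 0, -1):
            node = self._walk(context[-k:])
            if node is not None and node.children:
                break
        else:
            return []  # nothing matched: fall back to ordinary decoding
        # Step 2: greedily follow the most frequent child at each depth.
        out = []
        while node.children and len(out) < max_draft:
            tok, node = max(node.children.items(), key=lambda kv: kv[1].count)
            out.append(tok)
        return out


cache = RolloutCache()
cache.insert([1, 2, 3, 4, 5])
cache.insert([1, 2, 3, 4, 6])
cache.insert([7, 2, 3, 4, 5])
print(cache.draft([9, 2, 3]))  # [4, 5]
```

Because the draft comes from a lookup rather than a second network, proposing tokens costs no accelerator time; an empty draft simply means the step decodes one token as usual.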

This modular cache-based acceleration can be seamlessly integrated with policy gradient variants, including PPO, GRPO, and DAPO.

The cache is dynamically and proactively enriched by two mechanisms:

Online insertion of every new token produced during rollout, immediately leveraging fresh outputs for subsequent samples of the same prompt.
Run-ahead generation during idle compute bubbles—when GPUs become available due to early batch completion—by speculatively generating partial rollouts for prompts likely to appear in the near future, further mitigating cold-start and maximizing accepted speculative tokens.
Figure 2: Schematic of cache maintenance strategy: online updates from in-batch rollouts and run-ahead speculative generation enrich the cache across training steps.
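
As a rough illustration of where run-ahead capacity comes from, the sketch below assumes static batching: a slot whose response finishes early sits idle until the longest response in the batch completes. The function `plan_run_ahead`, its inputs, and the policy of serving prompts without a cache first are hypothetical; the summary does not describe the paper's actual scheduling rule.

```python
def plan_run_ahead(lengths, upcoming_prompts, cached_prompts):
    """Assign upcoming prompts to slots that finish before the batch ends.

    lengths: response length (in decode steps) of each request in the batch.
    upcoming_prompts: prompt ids expected in later steps (assumed known,
        e.g. from the data loader order).
    cached_prompts: set of prompt ids that already have a cache.
    Returns a list of (prompt_id, token_budget) pairs.
    """
    longest = max(lengths)
    # Idle decode steps per early-finishing slot, largest bubble first.
    budgets = sorted((longest - n for n in lengths if n < longest), reverse=True)
    # Assumption: cold prompts (no cache yet) benefit most, so serve them first.
    # sorted() is stable, so the original order is kept within each group.
    queue = sorted(upcoming_prompts, key=lambda p: p in cached_prompts)
    return list(zip(queue, budgets))


plan = plan_run_ahead(
    lengths=[100, 400, 1000, 1000],
    upcoming_prompts=["p5", "p6", "p7"],
    cached_prompts={"p5"},
)
print(plan)  # [('p6', 900), ('p7', 600)]
```

Tokens generated this way are only drafts for a future step and are still verified by the policy at that step, so partial or stale run-ahead output can lower the acceptance rate but not change what is sampled.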

Experimental Results and Empirical Analysis

SRT is evaluated on the Qwen2.5-1.5B model using four RL algorithms (PPO, GRPO, DAPO, and ReTool), with experiments conducted on canonical mathematics and reasoning datasets. The main outcome measures are rollout (generation) latency, RL step latency, and per-token inference cost.

Key results:

SRT achieves up to 2.08× speedup in wall-clock rollout time, outperforming both conventional decoding and existing speculative decoding schemes such as SuffixDecoding (Oliaro et al., 2024).
Latency improvements are robust to the choice of RL algorithm and hold for both single-turn and multi-turn settings.
Figure 3: SRT achieves substantial speedup in rollout time across PPO, GRPO, DAPO, and ReTool algorithms on Qwen2.5-1.5B.

Detailed analysis correlates the degree of speedup with the similarity of rollouts: algorithms that generate more responses per prompt (e.g., DAPO, GRPO) see larger gains due to higher cache hit rates and greater n-gram overlap (see Figure 1c). Run-ahead speculative generation further increases the mean number of accepted tokens per verification, reducing end-to-end cost.
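
To make "accepted tokens per verification" concrete, here is a minimal sketch of one verification call, assuming the standard speculative-sampling rule applied to a deterministic (cache-provided) draft: each draft token is accepted with the probability the policy assigns to it, and on the first rejection a replacement is sampled from the policy distribution with the rejected token removed. The function name and the dict-based representation of distributions are illustrative assumptions; in a real system the distributions come from a single batched forward pass over the draft.

```python
import random


def verify_draft(draft, policy_probs, rng):
    """Verify draft tokens against the policy's next-token distributions.

    draft: list of n proposed token ids.
    policy_probs: list of n + 1 dicts (token id -> probability). Entry i is the
        policy distribution at the position of draft[i]; the last entry is the
        distribution after the full draft, used for the bonus token.
    Returns (emitted_tokens, num_accepted).
    """
    emitted = []
    for i, tok in enumerate(draft):
        probs = policy_probs[i]
        if rng.random() < probs.get(tok, 0.0):
            emitted.append(tok)  # accepted with probability pi(tok)
            continue
        # Rejected: resample from the policy with the draft token excluded.
        residual = {t: q for t, q in probs.items() if t != tok and q > 0.0}
        tokens = list(residual)
        weights = [residual[t] for t in tokens]
        emitted.append(rng.choices(tokens, weights=weights, k=1)[0])
        return emitted, i
    # Every draft token was accepted: sample one bonus token from the policy.
    probs = policy_probs[len(draft)]
    tokens = list(probs)
    emitted.append(rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0])
    return emitted, len(draft)


rng = random.Random(0)
# Toy one-hot distributions so the outcome is deterministic.
probs = [{1: 1.0}, {2: 1.0}, {9: 1.0}, {4: 1.0}]
print(verify_draft([1, 2, 3], probs, rng))  # ([1, 2, 9], 2)
```

Each call stands for one policy forward pass and emits `num_accepted + 1` tokens, so a higher mean acceptance count translates directly into fewer forward passes per response.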

SRT contrasts with recent asynchronous rollout methods (e.g., AReaL (Fu et al., 2025), RLHF stage fusion (Zhong et al., 2024)) that boost hardware throughput by relaxing strict on-policy constraints, often at the potential expense of training convergence or sample quality. SRT, by leveraging only previous on-policy rollouts as speculative drafts, maintains rigorous policy correctness with no distributional shift.

A closely related line is RhymeRL (He et al., 2025), which asynchronously caches completed responses from older epochs but does not perform online or run-ahead cache enrichment, leading to cold-start and lower cache utility for new prompts. SRT directly addresses this limitation with on-the-fly cache updates and anticipatory draft generation during idle compute.

Speculative decoding has been successfully deployed for inference acceleration in LLMs using smaller draft models or retrieval-based candidates (Miao et al., 2023, Oliaro et al., 2024), but SRT is the first to adapt purely model-free, rollout-based speculative decoding for RL training, preserving the on-policy property vital to stable convergence.

Theoretical and Practical Implications

SRT highlights the high degree of redundant computation in RL-based LLM training and offers a general, lossless acceleration mechanism that trades modest additional memory for substantial speedup. The scheme maintains strict consistency with the on-policy distribution, ensuring the statistical properties essential for RL training are unchanged.
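
The losslessness claim can be checked in a few lines for a single position, assuming verification uses the standard speculative-sampling acceptance rule (the summary says only "standard speculative decoding verification"). Here $\pi_t$ abbreviates the policy's next-token distribution, $x$ is the token proposed by the cache, and $q = \delta_x$ is the point-mass draft distribution; $\pi_t$, $q$, and $r$ are notation introduced for this sketch.

```latex
\begin{align*}
\pi_t(\cdot) &= \pi_\theta(\cdot \mid p,\, y_{1:t}), \qquad q(\cdot) = \delta_x(\cdot) \\
\Pr[\text{accept } x] &= \min\!\Big(1,\ \frac{\pi_t(x)}{q(x)}\Big) = \pi_t(x) \\
r(z) &= \frac{\max\big(0,\ \pi_t(z) - q(z)\big)}{\sum_{z'} \max\big(0,\ \pi_t(z') - q(z')\big)}
      = \frac{\pi_t(z)\,\mathbb{1}[z \neq x]}{1 - \pi_t(x)} \\
\Pr[\text{emit } x] &= \pi_t(x) \\
\Pr[\text{emit } z \neq x] &= \big(1 - \pi_t(x)\big)\, r(z) = \pi_t(z)
\end{align*}
```

The emitted token is therefore distributed exactly as $\pi_t$ regardless of which token the cache proposed; the quality of the cache affects only how often drafts are accepted, not what is sampled.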

Practically, SRT can be adopted in existing RLHF and policy optimization pipelines with minimal changes, leveraging commodity CPU-side cache and native speculative decoding primitives on LLM accelerators. Memory and compute costs scale gracefully with the number of prompts and cache depth, well within the regime of current large-scale LLM infrastructure.

Theoretically, SRT sets a precedent for model-free, experience-based draft strategies for accelerating token-level operations, suggesting future work in hierarchical caching, adaptive run-ahead, and dynamic draft selection as RL algorithms and LLMs continue to scale.

Conclusion

SRT provides an effective, scalable solution to a major RL efficiency bottleneck for LLMs by leveraging per-prompt, tree-structured caches for speculative rollout. With empirically validated gains of up to 2.08× in rollout wall-clock time for on-policy RL, the method offers practical utility and a straightforward path toward more efficient LLM RL training. Its design principles and cache-based speculative execution are likely to inform future research in efficient RL and adaptive inference for next-generation LLMs.
