Pre-Rollout Filtering (GRESO)

Updated 21 February 2026
  • Pre-rollout filtering is a technique that preemptively discards uninformative samples before expensive policy rollouts in reinforcement learning.
  • GRESO and related methods leverage historical reward dynamics and latent trajectory statistics to achieve up to 3.4× fewer rollouts and 2.4× speedup in wall-clock time.
  • Instantiations in LLM reasoning, diffusion models, and adaptive security demonstrate enhanced sample efficiency and robust policy performance with minimal computational overhead.

Pre-rollout filtering refers to algorithmic strategies for identifying and discarding uninformative or low-utility samples prior to performing expensive policy rollouts in reinforcement learning and sequential decision optimization. This family of techniques aims to reduce the computational overhead of rollout-based training pipelines—especially for large models—by predicting low-impact samples or by leveraging proxies (such as historical reward dynamics or latent trajectory statistics) to skip redundant or ineffective rollouts, without sacrificing downstream policy quality. Variously instantiated in contexts as diverse as LLM reasoning, generative diffusion models, and adaptive security policy synthesis, pre-rollout filtering is increasingly central to scalable, efficient, and robust RL-with-rollout frameworks (Zheng et al., 2 Jun 2025, Ge et al., 17 Dec 2025, Hammar et al., 21 Jul 2025).

1. Motivation and Empirical Foundations

The primary empirical insight motivating pre-rollout filtering is the prevalence of “uninformative” samples during rollout-heavy RL—samples for which repeated generation yields little to no beneficial learning signal. Empirical studies in group-based RL for LLM reasoning reveal that, late in training, over 80% of sampled prompts yield identical rewards across repeated generations (so-called “zero-variance prompts”), which do not contribute to the policy improvement step (Zheng et al., 2 Jun 2025). Retaining these uninformative prompts imposes unnecessary inference cost and inflates wall-clock training time.

Similar patterns are observed in generative RL for diffusion or flow models: a large fraction of trajectory samples cluster around group-mean reward, contributing minimally to gradient signal. Pruning or filtering these trajectories post-hoc (as in “dynamic sampling” or post-rollout optimal variance filtering) improves sample efficiency, but incurs full rollout cost for ultimately discarded samples (Ge et al., 17 Dec 2025).

Empirical reward dynamics further show that informativeness (or lack thereof) is highly temporally persistent. For LLM prompts, the event of being zero-variance in one epoch almost always predicts the same for future epochs (conditional probability exceeding 90% across epochs). This strong temporal autocorrelation motivates simple, reward-history–based skip heuristics that preclude most uninformative rollouts a priori (Zheng et al., 2 Jun 2025).

2. Formalism and Instantiations

Pre-rollout filtering is instantiated differently depending on problem domain, RL framework, and reward structure. In chain-of-thought LLM reasoning with Group Relative Policy Optimization (GRPO), the standard pipeline samples a batch of prompts, generates $G$ responses for each via the current policy $\pi_\theta$, computes normalized intra-group advantages

$$A_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)},$$

and only prompts with $\mathrm{Var}(r^{(k)}) > 0$ contribute to policy gradients (Zheng et al., 2 Jun 2025).
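
The advantage computation and zero-variance check above can be sketched as follows; `group_advantages` is an illustrative helper, not the paper's implementation:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalized intra-group advantages for one prompt's G sampled responses.

    Returns None for a zero-variance group: identical rewards carry no
    learning signal, so the prompt is uninformative for this update."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std < eps:  # zero-variance prompt: all G rewards identical
        return None
    return (r - r.mean()) / (std + eps)

# A prompt solved (or failed) by every sample contributes no gradient signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # None
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # roughly [1, -1, 1, -1]
```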

The pre-rollout filtering objective is to replace naïve uniform sampling with a filter that probabilistically skips prompts historically likely to be zero-variance (see Section 3).

In generative models, particularly diffusion or flow-based architectures, pre-rollout filtering relies on intermediate latent representations. Here, early-in-trajectory latents and cheap “one-step ODE previews” are used to predict final reward distribution (proxy reward) at partial trajectory checkpoints. Trajectories whose previewed rewards cluster with low variance are pruned before full rollout completion (Ge et al., 17 Dec 2025).

For adaptive network security framed as POMDPs, belief-state particle filtering (“GRESO” in this context) produces a filtered belief over system states, which is then used to compute rollout-based policy adaptations leveraging aggregated policy approximations (Hammar et al., 21 Jul 2025). While not strictly “pre-rollout” in the sampling sense, filtering noise and uncertainty from observed dynamics before expensive planning simulations serves an analogous purpose.

3. Algorithmic Details and Scoring Criteria

GRESO eschews auxiliary classifiers in favor of simple trace statistics. For each prompt $x_i$, the trace $T_i = \{(e_{i,j}, R_{i,j})\}$ records the epoch and reward clustering of previous rollouts. The current “consecutive zero-variance count” $z_i$ is maintained:

$$z_i = \max\left\{k \leq n : \prod_{j=n-k+1}^{n} I_{i,j} = 1\right\}$$

where $I_{i,j} = 1$ iff $\mathrm{Var}(R_{i,j}) = 0$.

Given a base exploration parameter $p_e \in (0,1)$, the skip probability is

$$p_f(x_i) = 1 - p_e^{z_i}.$$

Prompts are skipped stochastically according to $p_f$, with minimum exploration guaranteed because $p_e > 0$. Two exploration rates are tracked and adapted for “easy” and “hard” prompts, with hyperparameters annealed to achieve a target effective prompt rate (fraction of informative prompts post-rollout).

The overall algorithm alternates between populating a candidate batch via pre-rollout filtering, performing rollouts on the surviving prompts, further filtering post hoc, and adjusting $p_e$ to maintain effective rate targets. Fast $O(1)$ lookups per prompt and negligible per-batch overhead are characteristic features.
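
A minimal sketch of this skip rule, assuming a simple in-memory trace keyed by prompt id (the class and method names are illustrative, not the paper's code):

```python
import random
from collections import defaultdict

class PreRolloutFilter:
    """Trace-based skip rule: track each prompt's consecutive zero-variance
    count z_i and skip with probability p_f = 1 - p_e ** z_i."""

    def __init__(self, p_e=0.5):
        self.p_e = p_e                           # base exploration rate in (0, 1)
        self.zero_var_streak = defaultdict(int)  # z_i per prompt id, O(1) lookup

    def should_rollout(self, prompt_id, rng=random.random):
        z = self.zero_var_streak[prompt_id]
        p_skip = 1.0 - self.p_e ** z             # z = 0 -> never skipped
        return rng() >= p_skip

    def record(self, prompt_id, rewards):
        # After a rollout, update the streak from the observed group rewards
        # (rewards are discrete here, e.g. 0/1 correctness, so exact equality
        # detects zero variance).
        if len(set(rewards)) == 1:
            self.zero_var_streak[prompt_id] += 1
        else:
            self.zero_var_streak[prompt_id] = 0

f = PreRolloutFilter(p_e=0.5)
assert f.should_rollout("p1")             # fresh prompts always roll out
f.record("p1", [1.0, 1.0, 1.0, 1.0])      # streak z = 1 -> skip prob 0.5
f.record("p1", [1.0, 1.0, 1.0, 1.0])      # streak z = 2 -> skip prob 0.75
```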

Pro-GRPO performs “expand-and-prune” sampling with latent-based early pruning:

  • At selected checkpoint steps $t_i$, each group trajectory’s latent is projected forward one ODE step to predict its terminal state (i.e., $\hat{x}_{T,\mathrm{ODE}}^{(g)} \approx z_i^{(g)} + (T - t_i)\, b_\theta^{\mathrm{ODE}}(z_i^{(g)}, t_i)$).
  • The VAE-decoded latent is scored by the reward model $R(\cdot, c)$.
  • Optimal Variance Filtering (OVF) selects the size-$K_{i+1}$ subset of surviving trajectories that maximizes within-group variance over the preview rewards.
  • Pruned trajectories are terminated early; survivors proceed to the next checkpoint or to final reward calculation.

Pre-rollout pruning is performed at multiple scheduled steps, with the survivor cardinality and pruning schedule empirically optimized for performance.
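
Since the surviving group at each checkpoint is small, the variance-maximizing OVF selection can be sketched by brute force over subsets (`optimal_variance_filter` is an illustrative name, not the paper's code):

```python
from itertools import combinations
import numpy as np

def optimal_variance_filter(preview_rewards, k):
    """Select the size-k subset of trajectory indices whose previewed
    rewards have maximal within-group variance (brute force; fine for
    the small group sizes used in group-based RL)."""
    best_idx, best_var = None, -1.0
    for idx in combinations(range(len(preview_rewards)), k):
        var = np.var([preview_rewards[i] for i in idx])
        if var > best_var:
            best_idx, best_var = idx, var
    return list(best_idx)

# Trajectories whose preview rewards cluster near the group mean are
# pruned; the reward extremes survive to the next checkpoint.
survivors = optimal_variance_filter([0.45, 0.50, 0.52, 0.10, 0.90], k=3)
print(survivors)  # [0, 3, 4]
```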

In the adaptive-security POMDP setting, the filtering phase is a particle filter that approximates the evolving belief-state distribution via $M$ weighted particles, updated by system transition and observation kernels. At each step, low-weight particles may be resampled to avoid degeneracy. The filtered belief directly feeds into rollout-based policy synthesis, with both algorithmic steps tightly coupled in the sequential decision pipeline.
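
The belief update above can be sketched as a standard bootstrap particle filter; the two-state intrusion model and all constants below are illustrative toys, not the paper's system:

```python
import random

def particle_filter_step(particles, weights, action, obs,
                         transition, obs_likelihood, rng=random):
    """One bootstrap-filter update of a belief over hidden states:
    propagate each particle through the transition kernel, reweight by
    the observation likelihood, and resample on weight degeneracy."""
    particles = [transition(s, action, rng) for s in particles]
    weights = [w * obs_likelihood(obs, s) for w, s in zip(weights, particles)]
    total = sum(weights) or 1e-12
    weights = [w / total for w in weights]
    # Resample if the effective sample size collapses.
    ess = 1.0 / sum(w * w for w in weights)
    if ess < 0.5 * len(particles):
        particles = rng.choices(particles, weights=weights, k=len(particles))
        weights = [1.0 / len(particles)] * len(particles)
    return particles, weights

# Toy 2-state model: state 1 = compromised (absorbing), noisy alert observations.
rng = random.Random(0)
transition = lambda s, a, r: 1 if s == 1 or r.random() < 0.05 else 0
obs_like = lambda o, s: 0.8 if o == s else 0.2
particles, weights = [0] * 100, [0.01] * 100
for _ in range(5):  # repeated alerts push belief toward "compromised"
    particles, weights = particle_filter_step(
        particles, weights, None, 1, transition, obs_like, rng)
belief_compromised = sum(w for w, s in zip(weights, particles) if s == 1)
```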

4. Computational Efficiency and Scaling Behavior

Pre-rollout filtering algorithms achieve computational savings by minimizing full-policy rollout on low-signal samples:

  • In LLM reasoning, GRESO achieves up to $3.4\times$ reductions in rollout count and up to $2.4\times$ wall-clock speedup in rollout time, with total training wall-time reduced by up to $2.0\times$, on models such as Qwen2.5-Math-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, and Qwen2.5-Math-7B (Zheng et al., 2 Jun 2025).
  • In diffusion/flow models, Pro-GRPO achieves $1.26\times$ to $1.41\times$ speedup in overall training FLOPs depending on pruning aggressiveness, while increasing useful sample diversity (within-group variance) and final alignment rewards (Ge et al., 17 Dec 2025).
  • For adaptive security, belief filtering operates in $O(M)$ per time step; empirical adaptivity is realized with execution times ranging from $0.01$ to $0.95$ seconds per decision, compared to tens of seconds or minutes for some alternatives (Hammar et al., 21 Jul 2025).

The cost of filtering—whether latent-based proxy reward evaluation or trace lookup and probabilistic skipping—is consistently negligible compared to the savings from avoiding full trajectory inference and model evaluation.

5. Empirical Results and Practical Impact

Pre-rollout filtering maintains or improves benchmark performance relative to baseline or naive post-hoc filtering strategies:

  • In LLM math reasoning, GRESO yields equal or modestly superior final accuracy (up to $0.3$ percentage points) on six aggregate benchmarks compared to dynamic sampling, while producing $2\times$ more effective samples per unit compute (Zheng et al., 2 Jun 2025).
  • In generative diffusion and flow-based models, Pro-GRPO consistently improves in-domain and out-of-domain reward-based metrics (e.g., PickScore, Aesthetic, HPSv2.1) and achieves higher reward variance among survivors, even as compute is reduced (Ge et al., 17 Dec 2025).
  • For network security, filtered-belief rollout adapts to non-stationary shifts orders of magnitude faster than PPO and outperforms expert-planning-based baselines (e.g., C-POMCP, cardiff) in both synthetic and testbed settings (Hammar et al., 21 Jul 2025).

Adaptive mechanisms for exploration (annealed $p_e$ in LLMs; multi-step pruning in generative models) allow the methods to dynamically track effectiveness without hand-tuning, stabilizing effective sample ratios throughout training.
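
The annealing of $p_e$ toward a target effective prompt rate can be sketched as a simple threshold controller; the target, step size, and bounds below are placeholder values, not the paper's:

```python
def anneal_exploration(p_e, effective_rate, target=0.8, step=0.05,
                       bounds=(0.05, 0.95)):
    """Nudge the base exploration rate p_e so that the fraction of
    informative (non-zero-variance) prompts surviving rollout tracks a
    target. Lower p_e means a higher skip probability p_f = 1 - p_e**z."""
    if effective_rate < target:   # too many uninformative rollouts: skip more
        p_e = max(bounds[0], p_e - step)
    else:                         # filtering too aggressively: explore more
        p_e = min(bounds[1], p_e + step)
    return p_e
```

Calling this once per batch with the observed post-rollout informative fraction keeps the effective sample ratio near the target without per-task hand-tuning.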

6. Limitations and Directions for Extension

Pre-rollout filtering as currently formulated in GRESO and related methods is primarily binary or groupwise with respect to informativeness: zero-variance or low-variance prompts/trajectories are filtered, but no graded utility or soft ranking is present. As a result, some low-signal but nonzero variance samples may escape filtering, and potential for further efficiency gains exists.

Current evaluations focus on specific model classes (GRPO for LLMs, flow models, or particle-belief in POMDPs); generalization to PPO, RLHF, preference-based reward, and multi-task or off-policy settings may require adapting the informativeness criterion or the underlying scoring function.

Potential future extensions include:

  • Replacing binary detection by utility regression (e.g., learning a lightweight predictor for advantage magnitude).
  • Incorporating prompt/trajectory features beyond reward trace or latents.
  • Applying meta-optimization to learn the exploration/skip policy rather than heuristic $p_e$ adjustment.
  • Extending latent-based pruning to deeper recursive pruning schedules or batch-level diversity maximization.

7. Connections to Broader Research

Pre-rollout filtering is closely related to dynamic data pruning, adaptive sampling, and intelligent computational budgeting in RL and inference. Empirical results demonstrate its compatibility not just with policy optimization in LLMs and generative models, but also with belief-state filtering, rollout planning, and approximate dynamic programming. The unifying theme is the statistical redundancy exposed by reward or utility clustering—and the general efficiency achieved by statistical estimation or proxy inference in guiding selective computation (Zheng et al., 2 Jun 2025, Ge et al., 17 Dec 2025, Hammar et al., 21 Jul 2025).
