DW-GRPO: Dynamic Weighting in RL Policy Optimization
- DW-GRPO is a reinforcement learning framework that dynamically adjusts reward contributions to correct static biases in policy optimization.
- It integrates learnable token preferences and adaptive difficulty weighting through mechanisms like λ-GRPO, DARO, and HA-DW to balance gradient flows.
- Empirical evaluations show faster convergence and improved accuracy in LLMs, with reduced computational overhead and enhanced exploration-exploitation balance.
Dynamic Weighting Reward Group Relative Policy Optimization (DW-GRPO) is a family of frameworks in Reinforcement Learning with Verifiable Rewards (RLVR) that adaptively modulate the contribution of sampled trajectories, tokens, difficulty groups, and objectives during policy optimization. By integrating dynamic weighting mechanisms into Group Relative Policy Optimization (GRPO), DW-GRPO counteracts static biases in token and group-level advantage assignment, enhances exploration-exploitation balance, and consistently improves convergence and final accuracy in LLMs and generative models. DW-GRPO encompasses learnable token preference methods (e.g., λ-GRPO), automatic difficulty-scale balancing (DARO), diversity-aware reward assignment (MMR-GRPO), entropy-guided credit shaping (GTPO), history-aware bias correction (HA-DW), and multi-objective dynamic scalarization regimes.
1. Unified GRPO Foundation and Static Bias Pathology
GRPO formulates policy optimization in LLMs as an episodic Markov Decision Process, assigning each sampled response a verifiable reward and a group-relative advantage computed from normalized group statistics (mean $\mu$ and standard deviation $\sigma$) across multiple completions per prompt. The GRPO objective distributes the same advantage uniformly to all tokens and typically uses uniform or fixed weights for credit assignment. This static scheme exhibits a pronounced length bias: longer completions dilute the reward across more tokens, which can induce verbosity and unbalance gradient magnitudes. Moreover, certain difficulty groups (e.g., medium pass-rate prompts) come to dominate the loss scale, leading to over-exploration or premature exploitation.
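For concreteness, the group-relative advantage that GRPO broadcasts to every token of a response takes the standard normalized form (written here for a group of $G$ completions with verifiable rewards $r_1, \dots, r_G$):

$$
\hat{A}_i \;=\; \frac{r_i - \mu}{\sigma}, \qquad \mu = \frac{1}{G}\sum_{j=1}^{G} r_j, \quad \sigma = \operatorname{std}(r_1, \dots, r_G),
$$

so every token of response $o_i$ inherits the same $\hat{A}_i$ under the static scheme described above.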
DW-GRPO remedies these pathologies through dynamic and adaptive weighting—for tokens, groups, and reward types—so the optimization trajectory self-balances according to model capability, sample difficulty, trajectory diversity, and user goals (Wang et al., 8 Oct 2025, Zhou et al., 10 Oct 2025, Wei et al., 14 Jan 2026, Yang et al., 13 Jan 2026).
2. Learnable Token Preferences: The λ-GRPO Architecture
λ-GRPO (Wang et al., 8 Oct 2025) introduces a single learnable scalar $\lambda$ that parameterizes the per-token weighting in the GRPO objective, enabling the model to optimize its own token-preference profile. For each response $o_i$ in a sampled group of size $G$, the framework computes a standardized response length $\hat{\ell}_i$, applies a length reducer $g(\cdot)$, and exponentiates via $\lambda$ to obtain $v_i = g(\hat{\ell}_i)^{\lambda}$. Softmax normalization yields relative weights $w_i = \operatorname{softmax}(v_1, \dots, v_G)_i$, which are rescaled by the group size, so every token of response $o_i$ receives the weight $G\, w_i$ in place of GRPO's fixed credit assignment.
The joint objective over the policy parameters $\theta$ and the preference $\lambda$ is differentiated with respect to both; the gradients flowing to $\lambda$ are computed via the chain rule through the softmax. This mechanism directly learns the optimal reward allocation across response lengths, automatically moving toward length-neutral, verbosity-suppressing, or detail-promoting regimes as dictated by the empirical loss.
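A minimal sketch of this weighting pathway follows, written in PyTorch-style Python using the notation above; the z-score standardization, the softplus length reducer, and all function names are illustrative assumptions rather than the reference implementation.

```python
import torch

def lambda_token_weights(resp_lengths: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
    """Illustrative lambda-GRPO-style length-based response weights.

    resp_lengths: (G,) response lengths for one sampled group.
    lam:          learnable scalar preference parameter (requires_grad=True).
    Returns a (G,) tensor of weights; every token of response i is credited
    with weights[i] in place of GRPO's fixed per-token weight.
    """
    G = resp_lengths.shape[0]
    lens = resp_lengths.float()
    z = (lens - lens.mean()) / (lens.std() + 1e-6)       # standardized lengths
    v = torch.nn.functional.softplus(z) ** lam           # assumed reducer g, then ^lambda
    return torch.softmax(v, dim=0) * G                   # relative weights, rescaled by G

# Usage: the weights multiply per-response surrogate losses; gradients flow to
# lam through the softmax, so the model learns its own length preference.
lam = torch.tensor(1.0, requires_grad=True)
weights = lambda_token_weights(torch.tensor([120, 340, 85, 210]), lam)
fake_losses = torch.tensor([0.7, 1.3, 0.4, 0.9])         # stand-in surrogate losses
(weights * fake_losses).sum().backward()                  # populates lam.grad
```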
Empirical evaluation on Qwen2.5 models establishes consistent accuracy gains (+1.0–1.9% absolute on mathematical reasoning benchmarks) with no increase in computational or data cost (Wang et al., 8 Oct 2025).
3. Dynamic Difficulty Group Weighting: DARO and HA-DW Paradigms
Difficulty-Aware Reweighting Policy Optimization (DARO, also denoted DW-GRPO) (Zhou et al., 10 Oct 2025) generalizes static difficulty weights to trainable group weights $w_k$, one per empirical pass-rate bucket $k$. The regularized multitask objective seeks to equalize group-wise surrogate loss magnitudes, ensuring that each difficulty level receives balanced gradient flow and that no bucket dominates due to scaling irregularities. Group losses are approximated from the GRPO surrogate losses of the rollouts falling in each bucket; the weights $w_k$ are updated alongside the policy and kept well-behaved by a regularization term. This maintains exploration in hard buckets as the model improves, avoiding catastrophic forgetting and gradient collapse.
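The sketch below illustrates the loss-scale balancing idea in a simplified form. In DARO the bucket weights are themselves trainable; here, as a stand-in, the weights are set analytically from detached bucket-loss magnitudes so that every bucket contributes a comparable gradient scale. The bucketing rule, the normalization, and all names are illustrative assumptions, not the paper's exact update.

```python
import torch

def difficulty_balanced_loss(losses: torch.Tensor,
                             pass_rates: torch.Tensor,
                             n_buckets: int = 4,
                             eps: float = 1e-6) -> torch.Tensor:
    """Illustrative difficulty-bucket reweighting in the spirit of DARO.

    losses:     (B,) per-prompt GRPO surrogate losses.
    pass_rates: (B,) empirical pass rates in [0, 1], used to bucket prompts.
    """
    bucket = torch.clamp((pass_rates * n_buckets).long(), max=n_buckets - 1)
    bucket_losses, bucket_weights = [], []
    for k in range(n_buckets):
        mask = bucket == k
        if mask.any():
            L_k = losses[mask].mean()
            bucket_losses.append(L_k)
            # Weight inversely proportional to detached loss magnitude, so that
            # w_k * L_k has comparable scale across difficulty buckets.
            bucket_weights.append(1.0 / (L_k.abs().detach() + eps))
    w = torch.stack(bucket_weights)
    w = w / w.sum() * len(bucket_weights)      # normalize weights to average 1
    return (w * torch.stack(bucket_losses)).sum()
```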
History-Aware Adaptive Difficulty Weighting (HA-DW) (Yang et al., 13 Jan 2026) exposes the analytical bias of group-relative advantage estimation, especially under non-degenerate Bernoulli reward distributions. HA-DW dynamically tracks a running anchor of per-prompt accuracy via a Kalman-type update and reweights each rollout's advantage by a multiplicative factor determined by two terms: the prompt's deviation from its historical anchor and the sign of the advantage. This corrects GRPO's tendency to underestimate the advantage of hard prompts and overestimate that of easy ones, yielding a more robust trade-off between exploration and exploitation.
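The following minimal sketch shows how a Kalman-style running anchor and a sign-dependent correction could be wired together; the gain schedule, the multiplicative reweighting rule, and all names are assumptions made for illustration, not HA-DW's exact formulas.

```python
class HistoryAnchor:
    """Illustrative Kalman-style running estimate of a prompt's pass rate."""

    def __init__(self, init_acc: float = 0.5, init_var: float = 1.0,
                 obs_var: float = 0.25):
        self.acc = init_acc      # anchor: estimated historical accuracy
        self.var = init_var      # estimate variance
        self.obs_var = obs_var   # assumed noise of a single batch's pass rate

    def update(self, observed_acc: float) -> float:
        gain = self.var / (self.var + self.obs_var)      # Kalman gain
        self.acc += gain * (observed_acc - self.acc)
        self.var *= (1.0 - gain)
        return self.acc


def reweight_advantage(adv: float, observed_acc: float, anchor_acc: float,
                       strength: float = 1.0) -> float:
    """Scale an advantage by the prompt's deviation from its anchor and the sign of adv.

    Prompts currently harder than their history (observed < anchor) get positive
    advantages amplified; easier-than-history prompts get them damped.
    """
    deviation = anchor_acc - observed_acc                # > 0: harder than usual
    sign = 1.0 if adv >= 0 else -1.0
    return adv * (1.0 + strength * sign * deviation)
```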
4. Diversity-, Entropy-, and Hybrid Reward Weighting
Diversity-aware reward reweighting via Maximal Marginal Relevance (MMR-GRPO) (Wei et al., 14 Jan 2026) reorders completion rewards by their informativeness, balancing relevance against semantic novelty: each completion's score trades off a relevance term against its maximum embedding similarity to the completions already selected, where the $e_i$ are embedding representations of the completions. The reweighted rewards become normalized weights for advantage computation. This upweights novel, high-reward completions and downweights redundant ones, sharply reducing training steps and wall-clock time (~47.9% and ~70.2% reductions, respectively). Adaptive trade-off values $\lambda$, set from the group's reward variance, tune the relevance-diversity balance automatically.
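The trade-off can be made concrete with the classical maximal-marginal-relevance criterion; instantiating relevance with the completion reward $r_i$ and redundancy with cosine similarity between completion embeddings (an assumed instantiation, with symbols introduced here for illustration), a completion $o_i$ not yet in the selected set $S$ would score

$$
\operatorname{MMR}(o_i) \;=\; \lambda\, r_i \;-\; (1-\lambda)\, \max_{o_j \in S} \cos(e_i, e_j),
$$

and greedily adding the highest-scoring completion to $S$ yields the informativeness ordering that drives the reward reweighting described above.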
Entropy-weighted credit assignment (GTPO and GRPO-S) (Tan et al., 6 Aug 2025) leverages token-level policy entropy as an indicator of decision uncertainty and of pivotal reasoning steps. For successful sequences, GTPO shapes the per-token reward in proportion to the token's policy entropy $H_{i,t}$, so tokens generated under high uncertainty receive a larger share of the credit; sequence-level GRPO-S analogously shapes the whole-sequence reward. These mechanisms concentrate updates on high-entropy decision points, deepening exploration, extending chain-of-thought, and improving performance on long-chain reasoning tasks.
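A sketch of entropy-proportional credit assignment under stated assumptions follows; the sharing rule, the normalization, and the function name are illustrative, and the papers' exact shaping functions may differ.

```python
import torch

def entropy_shaped_token_rewards(seq_reward: float,
                                 token_entropies: torch.Tensor) -> torch.Tensor:
    """Distribute a sequence-level reward across tokens in proportion to entropy.

    seq_reward:      scalar verifiable reward of a successful completion.
    token_entropies: (T,) per-token policy entropies H_t at generation time.
    Returns a (T,) tensor of shaped per-token rewards summing to seq_reward,
    concentrating credit on high-uncertainty (pivotal) decision points.
    """
    share = token_entropies / (token_entropies.sum() + 1e-8)
    return seq_reward * share
```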
Dynamic Hybrid Policy Optimization (DHPO) (Min et al., 9 Jan 2026) performs weighted mixing between token-level and sequence-level importance ratios, with per-token mixing weights guided by entropy or set to constants. Branch-specific clipping further stabilizes learning.
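One way to write the mixing, with symbols introduced here for illustration ($\rho^{\mathrm{tok}}_{i,t}$ the token-level importance ratio, $\rho^{\mathrm{seq}}_{i}$ the sequence-level ratio, and $\alpha_{i,t}$ the entropy-guided or constant mixing weight), is

$$
\rho^{\mathrm{mix}}_{i,t} \;=\; \alpha_{i,t}\, \rho^{\mathrm{tok}}_{i,t} \;+\; \bigl(1 - \alpha_{i,t}\bigr)\, \rho^{\mathrm{seq}}_{i},
$$

with clipping applied separately to each branch, per the branch-specific clipping noted above.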
5. Multi-Objective Dynamic Weighting and Pareto Alignment
DW-GRPO extends beyond single-reward RLVR to multi-objective scenarios (Lu et al., 14 Sep 2025). Here, every trajectory incurs a vector of reward components, one per objective, and dynamic weights modulate their scalarization online. Hypervolume-guided adaptation increases the scalar weights of trajectories that expand the empirical Pareto front, while gradient-based mirror descent updates the objective weights directly from per-objective policy-gradient influences, governed by meta-learning rates and regularizers. Both mechanisms provably overcome the convexity limitations of fixed linear scalarization, yielding broader, higher-coverage Pareto fronts; efficiency gains are evidenced by up to 6K fewer training steps and strict Pareto dominance in accuracy, conciseness, and clarity metrics.
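A standard mirror-descent (exponentiated-gradient) step on the weight simplex makes this concrete; with symbols introduced here for illustration ($w_j$ the weight of objective $j$, $g_j$ its measured policy-gradient influence, and $\eta$ a meta-learning rate), one update reads

$$
w_j \;\leftarrow\; \frac{w_j \exp(\eta\, g_j)}{\sum_{k} w_k \exp(\eta\, g_k)},
$$

which keeps the weights on the probability simplex while shifting mass toward the currently most influential objectives; a regularizer pulling toward uniform weighting can be folded into $g_j$.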
6. Training Protocols, Empirical Performance, and Implementation Considerations
DW-GRPO variants share a high degree of engineering and computational efficiency. Most dynamic weighting mechanisms (e.g., λ-GRPO's token preference, difficulty-group weights, diversity weights) add negligible FLOPs (often <5% per-step overhead) relative to the policy gradient computation. AdamW or SGD with decoupled learning rates for policy and weights is commonly used. Per-group difficulty weights require only lightweight per-batch updates, the token preference is a single scalar parameter, and semantic embedding computations scale quadratically in the group size but remain tractable for typical group sizes.
Empirical evaluations across Qwen2.5, Qwen3, LLaMA, DeepSeek, and Wan2.1 backbones consistently show superior convergence speed, final performance, and chain-of-thought entropy relative to static GRPO variants. Notable gains include +1.0–2.6% accuracy for λ-GRPO and HA-DW, and reductions of 47.9% in training steps and 70.2% in wall-clock time for MMR-GRPO.
7. Extensions, Limitations, and Open Research Directions
DW-GRPO is amenable to further extension. Its trainable token-preference mechanism could be generalized to multi-dimensional preference vectors (e.g., for style, factuality, and clarity) with corresponding multi-head softmax weighting. Difficulty weighting may leverage continuous regression-based metrics or per-layer and meta-learned schedules. Combining dynamic weighting with KL penalties or consistency rewards enables fine-grained alignment for dialog agents and verification-intensive tasks. However, DW-GRPO does not directly address reward-model noise, bucket sparsity, or consistency biases; the quality of semantic embeddings and effective sample sizes remain critical for robust implementation. Cross-family applicability (beyond Qwen models) and scalability to extreme model sizes remain open for empirical validation.
In summary, DW-GRPO unifies the dynamic modulation of reward assignment, trajectory weighting, and multi-objective scalarization within RLVR. By directly learning and adapting these weights, DW-GRPO eliminates static loss-scale pathologies, corrects inherent estimator biases, and facilitates a more robust, effective, and efficient path to high-accuracy, interpretable model alignment (Wang et al., 8 Oct 2025, Zhou et al., 10 Oct 2025, Wei et al., 14 Jan 2026, Yang et al., 13 Jan 2026, Lu et al., 14 Sep 2025, Tan et al., 6 Aug 2025, Min et al., 9 Jan 2026).