
Sparse Gradient Optimization for Long-Horizon RL

Updated 4 January 2026
  • The paper demonstrates that heavy-tailed policy distributions and dense surrogate signals effectively overcome vanishing gradients in long-horizon, sparse-reward settings.
  • It details methods such as momentum-based tracking, reward dithering, and hierarchical policy structures to reduce high variance and stabilize learning.
  • Empirical benchmarks show that these techniques lead to 2–5× faster convergence and higher success rates in complex tasks like robotic manipulation and LLM-based navigation.

Sparse Gradient for Long-Horizon Reward Optimization

Sparse gradient propagation under long-horizon, sparse-reward regimes is a central problem in deep reinforcement learning (RL) and policy optimization for high-dimensional control, natural language agents, and robotic manipulation. In these contexts, rewards are infrequent, and feedback is obtained only at the terminal step or at a few critical points within potentially thousands of timesteps or generation steps. This section details the algorithmic foundations, theoretical principles, and experimental findings for methods explicitly designed to overcome vanishing or unstable gradients in such settings.

1. Problem Setting and Challenge of Sparse Gradients

In standard RL, policy-gradient algorithms estimate the gradient of the expected return with respect to the policy parameters $\theta$ using the identity

$$\nabla_\theta J(\theta) = \mathbb{E}_{\rho_\pi}\bigl[\nabla_\theta \log \pi_\theta(a \mid s)\, A(s, a)\bigr]$$

where $A$ is an advantage function and $\rho_\pi$ is the state-action visitation distribution. In long-horizon problems with sparse rewards—$r(s_t, a_t) = 0$ for most $(s_t, a_t)$ except for rare "goal" states or terminal steps—$A(s, a)$ is zero almost everywhere, leading to extremely high-variance or vanishing policy gradients. As a result, the agent receives little actionable feedback except for rare, often noisy, reward events, resulting in slow, unstable convergence or entrapment in local optima.
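To make the vanishing-gradient problem concrete, here is a toy REINFORCE estimator for a two-action softmax policy (an illustrative sketch, not code from any cited paper): when every reward along a long trajectory is zero, every return $G_t$ is zero, and the gradient estimate is identically zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad(theta, trajectory):
    """REINFORCE gradient for a 2-action softmax policy with scalar logit theta.

    grad = sum_t  d/dtheta log pi(a_t) * G_t,  where G_t is the return from step t.
    """
    grad = 0.0
    rewards = [r for (_, r) in trajectory]
    for t, (a, _) in enumerate(trajectory):
        G_t = sum(rewards[t:])                      # undiscounted return from step t
        p1 = 1.0 / (1.0 + np.exp(-theta))           # pi(a = 1)
        grad_log = (1.0 - p1) if a == 1 else -p1    # d/dtheta log pi(a)
        grad += grad_log * G_t
    return grad

# A long trajectory where every reward is zero -- the sparse-reward regime:
sparse_traj = [(int(rng.integers(0, 2)), 0.0) for _ in range(1000)]
print(reinforce_grad(0.3, sparse_traj))   # 0.0 -- no learning signal at all
```

Until the agent stumbles on a rewarding state, every update is exactly zero; once it does, that single trajectory dominates the estimator, which is the high-variance half of the problem.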

Challenges include:

  • Vanishing gradients: with $A(s, a) \approx 0$ almost everywhere, most policy-gradient updates carry no learning signal.
  • High variance: rare nonzero returns dominate the gradient estimator, producing noisy and unstable updates.
  • Exploration failure: without intermediate feedback, the agent seldom revisits distant rewarding states and can stall in local optima.

2. Policy Distribution Design and Persistent Exploration

Heavy-tailed policy parameterizations, such as the multivariate Cauchy distribution in the Heavy-Tailed Stochastic Policy Gradient (HT-SPG) algorithm (Chakraborty et al., 2022), are designed to counteract the myopic exploration of standard Gaussian policies. HT-SPG replaces the usual normal density

$$\pi_\theta(a \mid s) = \mathcal{N}\bigl(a;\, \mu(s;\theta),\, \sigma^2 I\bigr)$$

with a heavy-tailed Cauchy-type distribution

$$\pi_\theta(a \mid s) = \prod_{i=1}^{p} \frac{1}{\sigma \pi \bigl[1 + ((a_i - \mu_i)/\sigma)^2\bigr]}$$

so that action draws with large magnitude relative to the current mean have polynomially decaying probability.

By assigning substantial probability mass far from the mean, heavy-tailed policies generate persistent, long-range exploratory behavior necessary to encounter distant rewarding regions, which would otherwise be missed due to the exponentially decaying tails of a Gaussian. In sparse-reward tasks such as the Pathological Mountain Car and Sparse MuJoCo Hopper-v2, this structure enables the agent to repeatedly discover rewarding states even late in training (Chakraborty et al., 2022).
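The tail contrast can be checked numerically from closed-form tail probabilities (a standalone illustration, not HT-SPG code): the Gaussian tail decays exponentially in the deviation, while the Cauchy tail decays only polynomially, so far-field actions remain plausible throughout training.

```python
import math

def gauss_tail(k):
    """P(|X| > k) for a standard Normal action distribution."""
    return math.erfc(k / math.sqrt(2.0))

def cauchy_tail(k):
    """P(|X| > k) for a standard Cauchy action distribution."""
    return 1.0 - (2.0 / math.pi) * math.atan(k)

# Probability of an action landing k standard-scales away from the mean:
for k in (2, 5, 10):
    print(f"|a - mu| > {k}: Gaussian {gauss_tail(k):.2e}   Cauchy {cauchy_tail(k):.2e}")
```

At ten scale units the Gaussian tail mass is astronomically small while the Cauchy policy still assigns several percent of its mass there, which is exactly the persistent long-range exploration the text describes.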

3. Proxy Rewards and Shaping via Surrogate Gradients

Dense surrogate signal construction is a dominant approach to mitigate sparse reward propagation. Key techniques include:

  • Predictive-Coding Reward Shaping: A learned state embedding $E_\phi$ defines the shaped reward

$$r'(s_t, a_t) = r(s_t, a_t) - \beta\, \| E_\phi(s_t) - E_\phi(s_g) \|_2^2$$

which produces reward gradients that guide the agent toward the latent representation of the goal state $s_g$, compressing the "horizon" and providing dense signal throughout an episode (Lu et al., 2019).

  • Goal-Distance Gradient (GDG): The environment is augmented with a learned, differentiable distance function $D(s, g)$ estimating the minimum number of transitions to reach goal $g$ from state $s$. The surrogate objective is to minimize $\mathbb{E}_s[D(f(s, \mu_\theta(s, g)), g)]$, generating a dense gradient even with no environmental rewards (Jiang et al., 2020). Bridge-point planning further decomposes tasks into bridgeable subgoals for improved credit assignment.
  • Trajectory MMD Shaping: Trajectory-Oriented Policy Optimization (TOPO) leverages the maximum mean discrepancy between on-policy and demonstration trajectory distributions,

$$D_{\mathrm{MMD}}(p_\theta, p_{\mathrm{demo}})$$

as an intrinsic penalty, so deviations from the demo manifold incur a dense cost at every timestep. The shaped reward is

$$\tilde{r}(s, a) = r^e(s, a) - \sigma\, r^i(s, a)$$

where $r^i(s, a) = \max\{D_{\mathrm{MMD}}((s, a), M) - \delta,\, 0\}$, yielding a distributed intrinsic gradient irrespective of sparsity in $r^e$ (Wang et al., 2024).

  • Randomized Return Decomposition (RRD): Reward redistribution by Monte Carlo decomposition of episodic returns learns a per-step proxy reward $\hat{R}_\theta(s, a)$ to replace sparse external rewards, dramatically improving gradient propagation on long-horizon domains (Ren et al., 2021).
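The common thread above is replacing a sparse environmental reward with a dense surrogate. A minimal sketch of the embedding-distance variant follows; the linear encoder `W` stands in for a learned $E_\phi$ and is purely illustrative.

```python
import numpy as np

def shaped_reward(r_env, s, s_goal, embed, beta=0.1):
    """Dense surrogate reward penalizing latent distance to the goal:

        r'(s_t, a_t) = r(s_t, a_t) - beta * ||E(s_t) - E(s_g)||^2
    """
    return r_env - beta * float(np.sum((embed(s) - embed(s_goal)) ** 2))

# Toy linear embedding standing in for a learned encoder E_phi (assumption):
W = np.array([[1.0, 0.0],
              [0.0, 0.5]])
embed = lambda s: W @ s

goal = np.array([0.0, 0.0])
far  = shaped_reward(0.0, np.array([4.0, 4.0]), goal, embed)  # distant state
near = shaped_reward(0.0, np.array([1.0, 1.0]), goal, embed)  # closer state
print(far, near)
```

Even with `r_env == 0` everywhere, the shaped reward increases monotonically as the state approaches the goal in latent space, so every step carries gradient signal.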

4. Gradient Stabilization and Variance Reduction

Even when dense signals are constructed, long-horizon tasks can present unstable or high-variance gradients due to the noisy nature of credit assignment. Several algorithmic innovations address this:

  • Momentum-based Gradient Tracking: HT-SPG applies a momentum-tracking scheme inspired by STORM, combining moving averages of past gradients with difference-based correction for parameter shifts (Chakraborty et al., 2022). This reduces variance and accelerates convergence by geometrically decaying the mean-squared tracking error.
  • Reward Dithering: In discrete, outcome-based settings (e.g., LLM policy optimization), ReDit perturbs the observed reward with independent, zero-mean noise $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ or $\mathrm{Unif}(-a, a)$, so that the advantage computation always provides a nonzero learning signal (Wei et al., 23 Jun 2025). This mitigates vanishing or spiking gradients and accelerates convergence by boosting the effective variance of reward signals while retaining unbiasedness.
  • Value-Based Sampling and Clipping: Value-based Sampling Policy Optimization (VSPO) discards batch samples with low reward variance and resamples according to a value metric balancing difficulty and uncertainty; repeated sampling is clipped to cap the influence of individual tasks, thereby stabilizing gradient estimates in GRPO settings (Zhuang et al., 8 Dec 2025).
  • Entropy-Modulated Gradients: Entropy-Modulated Policy Gradients (EMPG) rescale per-step gradients by a function of step-wise entropy and final trajectory outcome, amplifying confident, correct updates and attenuating noisy, uncertain ones (Wang et al., 11 Sep 2025).
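The dithering idea can be sketched on group-relative advantages (a hypothetical toy, not the ReDit implementation): when every rollout in a batch receives the same binary reward, the normalized advantage degenerates, and zero-mean noise restores a usable signal without biasing the expected reward.

```python
import numpy as np

rng = np.random.default_rng(0)

def dithered_advantages(rewards, sigma=0.05):
    """Group-relative advantages with zero-mean Gaussian reward dithering.

    With identical rewards, the plain advantage (r - mean)/std is all-zero
    (or undefined); dithering keeps E[r] unchanged but yields nonzero signal.
    """
    r = np.asarray(rewards, dtype=float)
    r = r + rng.normal(0.0, sigma, size=r.shape)   # zero-mean perturbation
    return (r - r.mean()) / (r.std() + 1e-8)

# A batch where every rollout failed: the raw advantage would vanish.
adv = dithered_advantages([0.0, 0.0, 0.0, 0.0])
print(adv)   # nonzero, mean-centered entries -- the policy still gets a gradient
```

The noise scale `sigma` trades signal strength against added variance; the entry with the largest positive perturbation is (arbitrarily) reinforced, which is harmless in expectation because the noise is independent of the actions.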
| Method | Surrogate dense signal | Variance reduction / stabilization |
|--------|------------------------|------------------------------------|
| Heavy-tailed PG (Chakraborty et al., 2022) | Cauchy policy induces exploration | Momentum tracking |
| Predictive coding (Lu et al., 2019) | Embedding-distance reward shaping | N/A |
| TOPO (Wang et al., 2024) | MMD intrinsic reward from demos | N/A |
| RRD (Ren et al., 2021) | Per-step proxy reward (from return) | Subsequence Monte Carlo; smoothing |
| ReDit (Wei et al., 23 Jun 2025) | Dithered noisy reward | Noise-injected gradient signal |
| VSPO (Zhuang et al., 8 Dec 2025) | Progressive, shaped reward | Value-based resampling & advantage clipping |
| EMPG (Wang et al., 11 Sep 2025) | Intrinsic "clarity" bonus | Entropy-based scaling |

5. Hierarchical and Chunked Policy Structures for Long-Horizon Credit Assignment

Hierarchical and chunk-based policy decompositions can further reduce sparse-gradient effects by delivering temporally abstracted or semi-dense signal:

  • Option-based Hierarchical PG: The hierarchical average-reward policy gradient theorem (Dharmavaram et al., 2019) expresses gradients w.r.t. intra-option policies and termination criteria in terms of steady-state distributions and advantage/value functions for each level. By propagating expected return and advantage through multiple abstraction layers, the agent mitigates the effect of extreme reward sparsity and delayed credit assignment, as shown in navigation and pickup-and-deliver gridworlds.
  • Chunked Policy Architectures: AC3 (Actor-Critic for Continuous Chunks) produces high-dimensional action bundles, using intra-chunk $n$-step returns for the critic and learning exclusively from "successful" trajectories for the actor (Yang et al., 15 Aug 2025). A self-supervised intrinsic reward at chunk-aligned anchors further densifies feedback. The result is effective, stable optimization of long-horizon robotic manipulation tasks with sparse extrinsic rewards.
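The intra-chunk $n$-step return used by chunked actor-critic schemes can be sketched as follows (an illustrative helper, not AC3's code): in-chunk rewards are discounted and summed, and the critic's value at the chunk boundary is bootstrapped in, so credit only has to flow across chunks rather than across every timestep.

```python
def chunk_n_step_return(rewards, bootstrap_value, gamma=0.99):
    """n-step return over one action chunk:

        G = r_0 + gamma * r_1 + ... + gamma^(n-1) * r_{n-1} + gamma^n * V(s_n)

    computed backwards from the bootstrap value at the chunk boundary.
    """
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Sparse case: all in-chunk rewards are zero, so the signal arrives entirely
# through the critic's discounted value at the chunk boundary.
print(chunk_n_step_return([0.0, 0.0, 0.0], bootstrap_value=1.0, gamma=0.9))  # 0.729
```

With a chunk length of $n$, the effective credit-assignment horizon shrinks by a factor of $n$, which is the mechanism behind the stability gains claimed above.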

6. Empirical Benchmarks and Performance Implications

Sparse-gradient optimization schemes have demonstrated significant performance improvements over conventional baselines:

  • HT-SPG outperforms Gaussian PG and variance-reduction baselines, attaining 2–5× faster convergence and higher asymptotic returns in 1D Mario, Pathological Mountain Car, and Sparse MuJoCo Hopper-v2 (Chakraborty et al., 2022).
  • Predictive-coding reward shaping approaches approach or surpass hand-shaped potentials in discrete mazes and continuous control, with CPC embedding-based shaping nearly saturating success rates and reducing variance (Lu et al., 2019).
  • TOPO consistently outperforms PPO, SIL, and other RL baselines on discrete (Key-Door-Treasure) and continuous MuJoCo (SparseHalfCheetah/Hopper) tasks (Wang et al., 2024).
  • AC3 achieves top success rates on 22/25 robotic manipulation tasks, converges within 10–20k steps, and runs at triple the inference speed of comparable approaches (Yang et al., 15 Aug 2025).
  • In LLM-based tasks, ReDit achieves target performance in 10% of the training steps required by vanilla GRPO, and EMPG yields substantial improvements (+3–8 points absolute) on WebShop, ALFWorld, and Deep Search, with greater sample efficiency and stability (Wei et al., 23 Jun 2025, Wang et al., 11 Sep 2025).

7. Algorithmic and Theoretical Insights

The collective body of work on sparse-gradient optimization for long-horizon reward maximization yields several theoretical and methodological advances:

  • Heavy-tailed distributions provide an "automatic exploration prior," guaranteeing nonzero probability of far-field actions at every step (Chakraborty et al., 2022).
  • Dense guidance via latent-shaping, MMD penalties, or learned proxy rewards transforms a sparse-terminal or episodic-reward problem into a continuous surrogate optimization, eliminating local flatness and accelerating learning (Lu et al., 2019, Wang et al., 2024, Ren et al., 2021).
  • Variance reduction and gradient stabilization techniques—from hierarchical option critics and Monte Carlo decomposition to dithering and entropy scaling—are essential for taming the inherently high-variance updates that arise in sparse-reward, long-horizon tasks (Chakraborty et al., 2022, Wei et al., 23 Jun 2025, Dharmavaram et al., 2019, Wang et al., 11 Sep 2025).
  • The efficacy of progressive or curriculum-shaped rewards is demonstrated empirically, with significant gains in convergence speed and stability in LLM and agentic RL domains (Zhuang et al., 8 Dec 2025).

These findings establish the centrality of surrogate dense signals, robust exploration strategies, and algorithmic stabilization mechanisms for effective optimization of long-horizon, sparse-reward objectives in modern reinforcement learning and sequence-level policy optimization.
