SLOPE: Shaping Landscapes in Sparse MBRL
- The paper introduces SLOPE, which replaces scalar reward regression with potential-based reward shaping to create a gradient for exploration in sparse reward settings.
- It employs optimistic distributional regression with a quantile-weighted cross-entropy loss to counteract value underestimation and flattening effects in sparse data.
- Empirical benchmarks demonstrate that SLOPE significantly enhances success rates and sample efficiency across varied reward regimes including sparse, semi-sparse, and dense tasks.
Shaping Landscapes with Optimistic Potential Estimates (SLOPE) is a framework designed to address the challenges of model-based reinforcement learning (MBRL) in environments where reward signals are sparse, greatly impeding effective planning and exploration. SLOPE replaces traditional scalar reward regression with the construction of informative potential landscapes using potential-based reward shaping (PBRS) combined with optimistic distributional value estimation. This synergy ensures that learning agents acquire sufficient “gradient” for exploration even when external rewards are nearly absent, thus enabling reliable performance across sparse, semi-sparse, and dense reward regimes (Li et al., 3 Feb 2026).
1. Problem Motivation and Sparse Reward Limitations
In MBRL, the agent operates within a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $r$ is a sparse success signal, typically emitted only upon meeting a terminal goal condition. Standard MBRL frameworks learn a world model parametrized by an encoder $h_\theta$, latent dynamics $d_\theta$, reward head $R_\theta$, and value head $V_\theta$, and use planning (e.g., with MPPI) to optimize the policy $\pi_\theta$. However, with sparse rewards, regressors trained on observed data yield near-zero predictions almost everywhere, producing a flat landscape in which the return of almost any imagined trajectory collapses to near zero. This absence of informative gradients results in arbitrary and inefficient action selection during planning, as illustrated in Figure 1 of (Li et al., 3 Feb 2026).
2. Potential-Based Reward Shaping Foundation
Drawing on PBRS, SLOPE introduces a shaped reward signal to address the “flat landscape” issue. For any bounded potential function $\Phi: \mathcal{S} \to \mathbb{R}$, the shaped reward is defined by
$$\tilde{r}(s, a, s') = r(s, a) + \gamma \Phi(s') - \Phi(s),$$
where $\gamma$ is the discount factor. As shown by Ng et al. (1999), this shaping leaves the optimal policy invariant in the transformed MDP. SLOPE instantiates $\Phi$ as the agent’s own scaled value estimate,
$$\Phi(s) = \alpha V_\theta(s),$$
with $\alpha > 0$ as a hyperparameter. This creates a smooth scalar field increasing toward regions of anticipated task success. The difference $\gamma \Phi(s') - \Phi(s)$ generates a continuous, “uphill” guide for planning even in regions far from any observed reward, in stark contrast to the conventional flat reward landscape [(Li et al., 3 Feb 2026), Sec. 4.1].
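The shaping computation can be sketched in a few lines of Python. The toy value function, the goal location, and the particular values of $\gamma$ and $\alpha$ below are illustrative stand-ins, not the paper’s settings:

```python
import numpy as np

GAMMA = 0.99   # discount factor gamma (illustrative value)
ALPHA = 1.0    # shaping weight alpha (illustrative value)

def value_estimate(s):
    """Stand-in for the learned value network V_theta: a toy
    negative squared distance to a fixed goal."""
    goal = np.array([1.0, 1.0])
    return -np.sum((s - goal) ** 2)

def potential(s):
    # SLOPE sets Phi(s) = alpha * V_theta(s)
    return ALPHA * value_estimate(s)

def shaped_reward(r, s, s_next):
    """Potential-based shaping (Ng et al., 1999):
    r~(s, a, s') = r(s, a) + gamma * Phi(s') - Phi(s)."""
    return r + GAMMA * potential(s_next) - potential(s)

# Even with zero extrinsic reward, a transition that moves toward
# the goal receives a positive shaped reward: an "uphill" gradient.
s, s_next = np.array([0.0, 0.0]), np.array([0.5, 0.5])
uphill = shaped_reward(0.0, s, s_next)   # positive
```

Because the shaping term telescopes along trajectories, it changes per-step incentives without changing which policy is optimal.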
3. Optimistic Distributional Regression
Mean regression on sparse rewards underestimates the potential for high returns, leaving the shaped landscape nearly flat in practice. SLOPE introduces optimistic distributional regression to push value estimates toward a quantile-based upper bound:
- Value is represented as a categorical distribution over $K$ fixed bin values $b_1 < \dots < b_K$ (as in TD-MPC2), with the expected value $V_\theta(s) = \sum_k p_\theta(k \mid s)\, b_k$ as the scalar estimate.
- For training, the target is set as
$$y = \tilde{r} + \gamma \bar{V}(s'),$$
where $\bar{V}$ is a slow-moving target network.
- $y$ is projected into a “soft two-hot” distribution $q$ over the bins.
- The Quantile-weighted Cross-Entropy (QCE) loss is defined as
$$\mathcal{L}_{\mathrm{QCE}} = -\sum_{k=1}^{K} w_k\, q_k \log p_\theta(k \mid s),$$
with
$$w_k = \begin{cases} \tau, & b_k > V_\theta(s), \\ 1 - \tau, & b_k \le V_\theta(s), \end{cases}$$
where $\tau \in (0.5, 1)$ is the optimism quantile coefficient.
The QCE loss thus weights underestimation errors ($b_k > V_\theta(s)$) more heavily, causing $V_\theta$ to approach the $\tau$-th quantile of the return distribution rather than its mean. For convergence, provided $\gamma < 1$ and the potential $\Phi$ is bounded, the Bellman operator for the shaped MDP is a contraction, ensuring stability of the value iteration process [(Li et al., 3 Feb 2026), Sec. 4.2; Prop. 4.2].
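A minimal NumPy sketch of the two-hot projection and the quantile-weighted cross-entropy described above; the bin layout, the small log floor, and $\tau = 0.75$ are illustrative choices, not taken from the paper:

```python
import numpy as np

def two_hot(y, bins):
    """Project a scalar target y onto a 'two-hot' categorical
    distribution over fixed, sorted bin values (as in TD-MPC2)."""
    y = np.clip(y, bins[0], bins[-1])
    k = np.searchsorted(bins, y, side="right") - 1
    k = min(k, len(bins) - 2)              # keep k+1 in range
    p = np.zeros(len(bins))
    w = (y - bins[k]) / (bins[k + 1] - bins[k])
    p[k], p[k + 1] = 1.0 - w, w            # mass split between two bins
    return p

def qce_loss(logits, y, bins, tau=0.75):
    """Quantile-weighted cross-entropy: bins above the current mean
    estimate (the underestimation side) get weight tau, bins at or
    below it get 1 - tau; tau > 0.5 yields optimistic targets."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    v_hat = probs @ bins                   # current mean value estimate
    w = np.where(bins > v_hat, tau, 1.0 - tau)
    target = two_hot(y, bins)
    return -np.sum(w * target * np.log(probs + 1e-8))
```

At $\tau = 0.5$ the weights are symmetric and the loss reduces to ordinary (mean-seeking) two-hot cross-entropy; for $\tau > 0.5$, missing a high target is penalized more than missing a low one.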
4. Planning and Training Workflow
SLOPE’s algorithm integrates PBRS and optimistic regression into every stage of model-based planning:
- Warm-Start and Pretraining: Policy pretraining is conducted via behavior cloning from demonstrations, followed by data collection with the cloned policy to seed the world model (encoder $h_\theta$, dynamics $d_\theta$, reward $R_\theta$, value $V_\theta$).
- Interactive MBRL Loop: In each iteration:
- Rollouts are generated using MPPI planning, scored by the shaped reward
$$\tilde{r}_t = R_\theta(z_t, a_t) + \gamma \Phi(z_{t+1}) - \Phi(z_t).$$
MPPI’s sampling aggressively steers rollouts along “uphill” directions on the potential landscape.
- Transitions are accumulated in the replay buffer, and joint updates of the world model components (dynamics, reward, value, policy) are performed using the loss terms $\mathcal{L}_{\mathrm{dyn}}$, $\mathcal{L}_{\mathrm{rew}}$, $\mathcal{L}_{\mathrm{val}}$, and $\mathcal{L}_{\pi}$, with demonstration and on-policy data sampled at a 1:1 ratio.
In regions remote from goal states, the potential shaping signal remains positive only when transitions move “up” the landscape—MPPI thus learns to select actions aligned with these gradients, providing exploration incentives not present in unshaped regimes [(Li et al., 3 Feb 2026), Sec. 4.3; Alg. 1 & 2].
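The shaped scoring of imagined rollouts can be sketched as follows. The function name and the use of plain NumPy arrays in place of latent tensors from a learned world model are simplifications for illustration:

```python
import numpy as np

GAMMA, ALPHA = 0.99, 1.0   # illustrative discount and shaping weight

def score_rollout(rewards, values):
    """Score an imagined rollout for MPPI with shaped rewards
    r~_t = r_t + gamma * Phi(z_{t+1}) - Phi(z_t), where Phi = alpha * V.
    `rewards[t]` is the predicted reward at step t (t = 0..H-1);
    `values[t]` is V_theta(z_t) for t = 0..H (one extra final entry)."""
    phi = ALPHA * np.asarray(values, dtype=float)
    shaped = np.asarray(rewards, dtype=float) + GAMMA * phi[1:] - phi[:-1]
    discounts = GAMMA ** np.arange(len(shaped))
    return float(np.sum(discounts * shaped))

# With all predicted rewards zero (fully sparse regime), a rollout
# whose value estimates rise over the horizon still outscores a flat
# one -- the shaping term alone supplies the planning gradient.
rising = score_rollout(np.zeros(3), [0.0, 0.3, 0.6, 0.9])
flat = score_rollout(np.zeros(3), [0.0, 0.0, 0.0, 0.0])
```

MPPI would evaluate this score for each sampled action sequence and reweight its sampling distribution toward the highest-scoring rollouts.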
5. Implementation and Hyperparameter Considerations
Key architectural and hyperparameter choices include a latent dimension of 512 and two-hot binning over $K$ bins for the categorical value distribution. The shaping weight $\alpha$ is set to its empirically best value (larger than the range suggested by theory), and the quantile coefficient $\tau$ governs optimism in the QCE loss. To reduce overestimation bias, the current value used for shaping, $\Phi(z_t)$, is computed as the ensemble average across Q-networks, while the next-state term $\Phi(z_{t+1})$ uses the ensemble minimum.
All components of the model are updated synchronously under a joint loss combining the dynamics, reward, value, and policy terms, $\mathcal{L} = \mathcal{L}_{\mathrm{dyn}} + \mathcal{L}_{\mathrm{rew}} + \mathcal{L}_{\mathrm{val}} + \mathcal{L}_{\pi}$.
Demonstration and on-policy samples are mixed with a demo-sampling ratio of 50% [(Li et al., 3 Feb 2026), Tab. 5 & 6; Sec. 4.3].
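The 50% demo-sampling ratio can be sketched as a simple mixed-batch sampler; the buffer representation (lists of transition dicts) and all names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed_batch(demo_buffer, online_buffer, batch_size, demo_ratio=0.5):
    """Draw a training batch mixing demonstration and on-policy
    transitions at the given ratio (0.5 = the 1:1 mix in the text)."""
    n_demo = int(batch_size * demo_ratio)
    demo_idx = rng.integers(len(demo_buffer), size=n_demo)
    online_idx = rng.integers(len(online_buffer), size=batch_size - n_demo)
    batch = [demo_buffer[i] for i in demo_idx] + \
            [online_buffer[i] for i in online_idx]
    rng.shuffle(batch)   # interleave demo and on-policy samples
    return batch
```

Keeping demonstrations in every batch prevents the sparse on-policy data from washing out the rare successful transitions during joint updates.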
6. Empirical Benchmarks and Findings
SLOPE was systematically evaluated on over 30 tasks spanning five benchmarks:
- Sparse-only: ManiSkill3 (5 tasks), Meta-World sparse subset (10 tasks), Robosuite (3 tasks), Adroit (2 tasks).
- Semi-sparse: Stage-decomposed task variants (for DEMO³ ablation).
- Dense: DeepMind Control Suite (8 tasks) to test generality.
Key empirical outcomes include:
- SLOPE, combined with TD-MPC2 or DreamerV3 backbones, substantially surpasses these baselines (and prior sparse-MBRL approaches such as MoDem and DEMO³) in both success rate and sample efficiency, especially in fully sparse regimes (Figs. 5–7).
- In real-world robot manipulation (Press Button, Push Cube, Grasp Cube), SLOPE attains up to 65% success, outperforming prior methods that fail or require heavy intervention (Tab. 2).
- Ablation studies demonstrate that removing shaping (“w/o shaping”) collapses performance to flat returns, and reverting to standard mean-based distributional regression (“w/o ODLS”, i.e., symmetric weighting with $\tau = 0.5$) leads to flat potentials and delayed learning (Fig. 8).
- Sensitivity analysis confirms that moderate optimism (in the choice of $\tau$) and strong shaping (a sufficiently large $\alpha$) are practically vital, and that SLOPE’s task-aligned landscape consistently outperforms generic shaping baselines on challenging sparse tasks (Figs. 10, 11) [(Li et al., 3 Feb 2026), Sec. 5].
7. Significance and Theoretical Guarantees
SLOPE offers a principled solution to the key limitations of MBRL in sparse reward environments by systematically transforming flat scalar rewards into informative potential landscapes that retain the original optimal policy (Ng et al., 1999). Optimistic distributional regression mitigates value underestimation, allowing potential landscapes to reflect the possibility of rare future success. The formalism of PBRS ensures policy invariance, while proper calibration of optimism and shaping strength grounds empirical efficiency and stability.
A plausible implication is that SLOPE’s integration of planning-oriented shaping and distributional optimism could generalize to other model-based and model-free RL domains where exploration bottlenecks arise from uninformative reward landscapes. By leveraging the joint optimization of learned components and asymptotically optimistic value distributions, SLOPE consolidates exploration and exploitation, providing robust sample efficiency and policy improvement across a range of practical and simulated settings (Li et al., 3 Feb 2026).