SLOPE: Shaping Landscapes in Sparse MBRL
- The paper introduces SLOPE, which replaces scalar reward regression with potential-based reward shaping to create a gradient for exploration in sparse reward settings.
- It employs optimistic distributional regression with a quantile-weighted cross-entropy loss to counteract value underestimation and flattening effects in sparse data.
- Empirical benchmarks demonstrate that SLOPE significantly enhances success rates and sample efficiency across varied reward regimes including sparse, semi-sparse, and dense tasks.
Shaping Landscapes with Optimistic Potential Estimates (SLOPE) is a framework designed to address the challenges of model-based reinforcement learning (MBRL) in environments where reward signals are sparse, greatly impeding effective planning and exploration. SLOPE replaces traditional scalar reward regression with the construction of informative potential landscapes using potential-based reward shaping (PBRS) combined with optimistic distributional value estimation. This synergy ensures that learning agents acquire sufficient “gradient” for exploration even when external rewards are nearly absent, thus enabling reliable performance across sparse, semi-sparse, and dense reward regimes (Li et al., 3 Feb 2026).
1. Problem Motivation and Sparse Reward Limitations
In MBRL, the agent operates within a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $r$ is a sparse success signal, typically emitted only upon meeting a terminal goal condition. Standard MBRL frameworks learn a world model parametrized by an encoder $h_\theta$, latent dynamics $d_\theta$, reward head $R_\theta$, and value head $V_\theta$, and use planning (e.g., with MPPI) to optimize the policy $\pi_\theta$. However, with sparse rewards, regressors trained on observed data yield near-zero predictions almost everywhere, producing a flat landscape in which the return of almost any imagined trajectory collapses to near zero. This absence of informative gradients results in arbitrary and inefficient action selection during planning, as illustrated in Figure 1 of (Li et al., 3 Feb 2026).
2. Potential-Based Reward Shaping Foundation
Drawing on PBRS, SLOPE introduces a shaped reward signal to address the “flat landscape” issue. For any bounded potential function $\Phi: \mathcal{S} \to \mathbb{R}$, the shaped reward is defined by
$$\tilde{r}(s, a, s') = r(s, a) + \gamma \Phi(s') - \Phi(s),$$
where $\gamma$ is the discount factor. As shown by Ng et al. (1999), this shaping leaves the optimal policy invariant in the transformed MDP. SLOPE instantiates $\Phi$ as the agent’s own scaled value estimate,
$$\Phi(s) = \alpha V_\theta(s),$$
with $\alpha > 0$ as a hyperparameter. This creates a smooth scalar field increasing toward regions of anticipated task success. The difference $\gamma \Phi(s') - \Phi(s)$ generates a continuous, “uphill” guide for planning even in regions far from any observed reward, in stark contrast to the conventional flat reward landscape [(Li et al., 3 Feb 2026), Sec. 4.1].
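The shaping computation can be sketched in a few lines of Python. The toy value function, the goal location, and the particular values of $\gamma$ and $\alpha$ below are illustrative stand-ins, not the paper’s settings:

```python
import numpy as np

GAMMA = 0.99   # discount factor gamma (illustrative value)
ALPHA = 1.0    # shaping weight alpha (illustrative value)

def value_estimate(s):
    """Stand-in for the learned value network V_theta: a toy
    negative squared distance to a fixed goal."""
    goal = np.array([1.0, 1.0])
    return -np.sum((s - goal) ** 2)

def potential(s):
    # SLOPE sets Phi(s) = alpha * V_theta(s)
    return ALPHA * value_estimate(s)

def shaped_reward(r, s, s_next):
    """Potential-based shaping (Ng et al., 1999):
    r~(s, a, s') = r(s, a) + gamma * Phi(s') - Phi(s)."""
    return r + GAMMA * potential(s_next) - potential(s)

# Even with zero extrinsic reward, a transition that moves toward
# the goal receives a positive shaped reward: an "uphill" gradient.
s, s_next = np.array([0.0, 0.0]), np.array([0.5, 0.5])
uphill = shaped_reward(0.0, s, s_next)   # positive
```

Because the shaping term telescopes along trajectories, it changes per-step incentives without changing which policy is optimal.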
3. Optimistic Distributional Regression
Mean regression on sparse rewards underestimates the potential for high returns, leaving the shaped landscape nearly flat in practice. SLOPE introduces optimistic distributional regression to push value estimates toward a quantile-based upper bound:
- Value is represented as a categorical distribution over $K$ fixed bin values $b_1 < \dots < b_K$ (as in TD-MPC2), with the expected value $V_\theta(s) = \sum_k p_\theta(k \mid s)\, b_k$ as the scalar estimate.
- For training, the target is set as
$$y = \tilde{r} + \gamma \bar{V}(s'),$$
where $\bar{V}$ is a slow-moving target network.
- $y$ is projected into a “soft two-hot” distribution $q$ over the bins.
- The Quantile-weighted Cross-Entropy (QCE) loss is defined as
$$\mathcal{L}_{\mathrm{QCE}} = -\sum_{k=1}^{K} w_k\, q_k \log p_\theta(k \mid s),$$
with
$$w_k = \begin{cases} \tau, & b_k > V_\theta(s), \\ 1 - \tau, & b_k \le V_\theta(s), \end{cases}$$
where $\tau \in (0.5, 1)$ is the optimism quantile coefficient.
The QCE loss thus weights underestimation errors ($b_k > V_\theta(s)$) more heavily, causing $V_\theta$ to approach the $\tau$-th quantile of the return distribution rather than its mean. For convergence, provided $\gamma < 1$ and the potential $\Phi$ is bounded, the Bellman operator for the shaped MDP is a contraction, ensuring stability of the value iteration process [(Li et al., 3 Feb 2026), Sec. 4.2; Prop. 4.2].
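A minimal NumPy sketch of the two-hot projection and the quantile-weighted cross-entropy described above; the bin layout, the small log floor, and $\tau = 0.75$ are illustrative choices, not taken from the paper:

```python
import numpy as np

def two_hot(y, bins):
    """Project a scalar target y onto a 'two-hot' categorical
    distribution over fixed, sorted bin values (as in TD-MPC2)."""
    y = np.clip(y, bins[0], bins[-1])
    k = np.searchsorted(bins, y, side="right") - 1
    k = min(k, len(bins) - 2)              # keep k+1 in range
    p = np.zeros(len(bins))
    w = (y - bins[k]) / (bins[k + 1] - bins[k])
    p[k], p[k + 1] = 1.0 - w, w            # mass split between two bins
    return p

def qce_loss(logits, y, bins, tau=0.75):
    """Quantile-weighted cross-entropy: bins above the current mean
    estimate (the underestimation side) get weight tau, bins at or
    below it get 1 - tau; tau > 0.5 yields optimistic targets."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    v_hat = probs @ bins                   # current mean value estimate
    w = np.where(bins > v_hat, tau, 1.0 - tau)
    target = two_hot(y, bins)
    return -np.sum(w * target * np.log(probs + 1e-8))
```

At $\tau = 0.5$ the weights are symmetric and the loss reduces to ordinary (mean-seeking) two-hot cross-entropy; for $\tau > 0.5$, missing a high target is penalized more than missing a low one.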
4. Planning and Training Workflow
SLOPE’s algorithm integrates PBRS and optimistic regression into every stage of model-based planning:
- Warm-Start and Pretraining: Policy pretraining is conducted via behavior cloning from demonstrations, followed by data collection with the cloned policy to seed the world model (encoder $h_\theta$, dynamics $d_\theta$, reward $R_\theta$, value $V_\theta$).
- Interactive MBRL Loop: In each iteration:
- Rollouts are generated using MPPI planning, scored by the shaped reward
$$\tilde{r}_t = R_\theta(z_t, a_t) + \gamma \Phi(z_{t+1}) - \Phi(z_t).$$
MPPI’s sampling aggressively steers rollouts along “uphill” directions on the potential landscape.
- Transitions are accumulated in the replay buffer, and joint updates of the world model components (dynamics, reward, value, policy) are performed using the loss terms $\mathcal{L}_{\mathrm{dyn}}$, $\mathcal{L}_{\mathrm{rew}}$, $\mathcal{L}_{\mathrm{val}}$, and $\mathcal{L}_{\pi}$, with demonstration and on-policy data sampled at a 1:1 ratio.
In regions remote from goal states, the potential shaping signal remains positive only when transitions move “up” the landscape—MPPI thus learns to select actions aligned with these gradients, providing exploration incentives not present in unshaped regimes [(Li et al., 3 Feb 2026), Sec. 4.3; Alg. 1 & 2].
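The shaped scoring of imagined rollouts can be sketched as follows. The function name and the use of plain NumPy arrays in place of latent tensors from a learned world model are simplifications for illustration:

```python
import numpy as np

GAMMA, ALPHA = 0.99, 1.0   # illustrative discount and shaping weight

def score_rollout(rewards, values):
    """Score an imagined rollout for MPPI with shaped rewards
    r~_t = r_t + gamma * Phi(z_{t+1}) - Phi(z_t), where Phi = alpha * V.
    `rewards[t]` is the predicted reward at step t (t = 0..H-1);
    `values[t]` is V_theta(z_t) for t = 0..H (one extra final entry)."""
    phi = ALPHA * np.asarray(values, dtype=float)
    shaped = np.asarray(rewards, dtype=float) + GAMMA * phi[1:] - phi[:-1]
    discounts = GAMMA ** np.arange(len(shaped))
    return float(np.sum(discounts * shaped))

# With all predicted rewards zero (fully sparse regime), a rollout
# whose value estimates rise over the horizon still outscores a flat
# one -- the shaping term alone supplies the planning gradient.
rising = score_rollout(np.zeros(3), [0.0, 0.3, 0.6, 0.9])
flat = score_rollout(np.zeros(3), [0.0, 0.0, 0.0, 0.0])
```

MPPI would evaluate this score for each sampled action sequence and reweight its sampling distribution toward the highest-scoring rollouts.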
5. Implementation and Hyperparameter Considerations
Key architectural and hyperparameter choices include a latent dimension of 512 and two-hot binning over $K$ bins for the categorical value distribution. The shaping weight $\alpha$ is set to its empirically best value (larger than the range suggested by theory), and the quantile coefficient $\tau$ governs optimism in the QCE loss. To reduce overestimation bias, the current value used for shaping, $\Phi(z_t)$, is computed as the ensemble average across Q-networks, while the next-state term $\Phi(z_{t+1})$ uses the ensemble minimum.
All components of the model are updated synchronously under a joint loss combining the dynamics, reward, value, and policy terms, $\mathcal{L} = \mathcal{L}_{\mathrm{dyn}} + \mathcal{L}_{\mathrm{rew}} + \mathcal{L}_{\mathrm{val}} + \mathcal{L}_{\pi}$.
Demonstration and on-policy samples are mixed with a demo-sampling ratio of 50% [(Li et al., 3 Feb 2026), Tab. 5 & 6; Sec. 4.3].
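The 50% demo-sampling ratio can be sketched as a simple mixed-batch sampler; the buffer representation (lists of transition dicts) and all names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed_batch(demo_buffer, online_buffer, batch_size, demo_ratio=0.5):
    """Draw a training batch mixing demonstration and on-policy
    transitions at the given ratio (0.5 = the 1:1 mix in the text)."""
    n_demo = int(batch_size * demo_ratio)
    demo_idx = rng.integers(len(demo_buffer), size=n_demo)
    online_idx = rng.integers(len(online_buffer), size=batch_size - n_demo)
    batch = [demo_buffer[i] for i in demo_idx] + \
            [online_buffer[i] for i in online_idx]
    rng.shuffle(batch)   # interleave demo and on-policy samples
    return batch
```

Keeping demonstrations in every batch prevents the sparse on-policy data from washing out the rare successful transitions during joint updates.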
6. Empirical Benchmarks and Findings
SLOPE was systematically evaluated on over 30 tasks spanning five benchmarks:
- Sparse-only: ManiSkill3 (5 tasks), Meta-World sparse subset (10 tasks), Robosuite (3 tasks), Adroit (2 tasks).
- Semi-sparse: Stage-decomposed task variants (for DEMO³ ablation).
- Dense: DeepMind Control Suite (8 tasks) to test generality.
Key empirical outcomes include:
- SLOPE, combined with TD-MPC2 or DreamerV3 backbones, substantially surpasses these baselines (and prior sparse-MBRL approaches such as MoDem and DEMO³) in both success rate and sample efficiency, especially in fully sparse regimes (Figs. 5–7).
- In real-world robot manipulation (Press Button, Push Cube, Grasp Cube), SLOPE attains up to 65% success, outperforming prior methods that fail or require heavy intervention (Tab. 2).
- Ablation studies demonstrate that removing shaping (“w/o shaping”) collapses performance to flat returns, and reverting to standard mean-based distributional regression (“w/o ODLS”, i.e., symmetric weighting with $\tau = 0.5$) leads to flat potentials and delayed learning (Fig. 8).
- Sensitivity analysis confirms that moderate optimism (in the choice of $\tau$) and strong shaping (a sufficiently large $\alpha$) are practically vital, and that SLOPE’s task-aligned landscape consistently outperforms generic shaping baselines on challenging sparse tasks (Figs. 10, 11) [(Li et al., 3 Feb 2026), Sec. 5].
7. Significance and Theoretical Guarantees
SLOPE offers a principled solution to the key limitations of MBRL in sparse reward environments by systematically transforming flat scalar rewards into informative potential landscapes that retain the original optimal policy (Ng et al., 1999). Optimistic distributional regression mitigates value underestimation, allowing potential landscapes to reflect the possibility of rare future success. The formalism of PBRS ensures policy invariance, while proper calibration of optimism and shaping strength grounds empirical efficiency and stability.
A plausible implication is that SLOPE’s integration of planning-oriented shaping and distributional optimism could generalize to other model-based and model-free RL domains where exploration bottlenecks arise from uninformative reward landscapes. By leveraging the joint optimization of learned components and asymptotically optimistic value distributions, SLOPE consolidates exploration and exploitation, providing robust sample efficiency and policy improvement across a range of practical and simulated settings (Li et al., 3 Feb 2026).