- The paper introduces IRPO, an algorithm that integrates multiple intrinsic rewards into policy optimization to enhance exploration in sparse-reward environments.
- IRPO uses an actor-critic framework with exploratory policies, Jacobian accumulation, and trust-region updates to propagate informative learning signals.
- Empirical results show that IRPO outperforms hierarchical RL and reward-augmentation strategies in both performance and sample efficiency across various benchmarks.
Intrinsic Reward Policy Optimization for Sparse-Reward Environments
Motivation and Context
Reinforcement learning (RL) in sparse-reward environments poses critical challenges for effective policy learning due to inefficient exploration and poor credit assignment. Standard exploration mechanisms such as action or parameter noise injection have fundamental limitations: they fail to diversify agent behaviors sufficiently, especially in high-dimensional or continuous tasks. Augmenting the reward signal with intrinsic motivation improves exploration but introduces instability in attributing reward to true task accomplishment, while hierarchical RL—although mitigating credit assignment issues and scaling exploration—suffers from sample inefficiency and sub-optimal coarse temporal abstraction.
The paper introduces the Intrinsic Reward Policy Optimization (IRPO) algorithm to address these limitations by directly integrating multiple intrinsic rewards into policy optimization, circumventing the need for subpolicy pretraining and enabling more informative learning signal propagation in sparse-reward domains.
Algorithmic Framework
IRPO is built on an actor-critic setup featuring multiple exploratory policies, each associated with both an intrinsic and an extrinsic critic. Core operations of IRPO can be outlined as:
- Exploratory Policy Updates: Starting from the current base policy, K exploratory policies are instantiated, each optimized using distinct intrinsic reward functions via repeated gradient updates. Intrinsic critics guide these updates, while extrinsic critics accumulate information about task performance.
- Jacobian Accumulation: The algorithm tracks parameter transitions across exploratory updates, storing Jacobians that facilitate effective backpropagation of learning signals.
- Base Policy Update via IRPO Gradient: Gradients of exploratory policies with respect to extrinsic rewards are transported back to update the base policy using the chain rule, producing the IRPO gradient. This surrogate gradient is weighted according to exploratory performance, modulated by temperature hyperparameter T, and applied in a trust-region update to control policy divergence.
The bi-level optimization paradigm allows information from policies exploring under different intrinsic motivations to augment the base policy update, even when true gradients are uninformative due to extreme reward sparsity.
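The outer loop described above can be sketched numerically. The sketch below is an illustrative reconstruction, not the paper's implementation: the names (`irpo_step`, `numerical_jacobian`), the finite-difference Jacobians, the softmax performance weighting, and the norm-clipping stand-in for the trust region are all assumptions made for clarity.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-5):
    """Central finite-difference Jacobian of f at x (illustrative only)."""
    n = x.size
    J = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        J[:, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

def irpo_step(theta, intrinsic_grads, extrinsic_grad_fn, extrinsic_return_fn,
              alpha=0.1, n_updates=3, temperature=1.0, trust_radius=0.5):
    """One hedged sketch of an IRPO base-policy update.

    theta               : base policy parameters
    intrinsic_grads     : K gradient functions, stand-ins for intrinsic critics
    extrinsic_grad_fn   : gradient of the extrinsic objective (extrinsic critic)
    extrinsic_return_fn : scalar extrinsic return used to weight policies
    """
    candidates, jac_chains = [], []
    for g_int in intrinsic_grads:
        phi = theta.copy()
        chain = np.eye(theta.size)  # accumulated Jacobian d(phi)/d(theta)
        for _ in range(n_updates):
            # Jacobian of the update map phi -> phi + alpha * g_int(phi)
            J = np.eye(theta.size) + alpha * numerical_jacobian(g_int, phi)
            phi = phi + alpha * g_int(phi)
            chain = J @ chain
        candidates.append(phi)
        jac_chains.append(chain)

    # Weight exploratory policies by extrinsic performance (softmax, temperature T)
    returns = np.array([extrinsic_return_fn(p) for p in candidates])
    w = np.exp(returns / temperature)
    w /= w.sum()

    # Transport extrinsic gradients back through the accumulated Jacobians
    surrogate = sum(wk * (chain.T @ extrinsic_grad_fn(p))
                    for wk, chain, p in zip(w, jac_chains, candidates))

    # Crude trust-region stand-in: clip the update norm
    step = alpha * surrogate
    norm = np.linalg.norm(step)
    if norm > trust_radius:
        step *= trust_radius / norm
    return theta + step
```

On a toy quadratic extrinsic objective, one such step moves the base policy toward the exploratory policy with the higher extrinsic return, even though the base policy itself received no gradient from the task reward.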
Theoretical Analyses
The authors prove that the standard policy gradient vanishes as reward sparsity increases. Under defined assumptions (bounded reward probability, bounded log-gradient), the l2-norm of the policy gradient approaches zero, impeding learning progress.
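The vanishing-gradient effect can be checked exactly on a toy one-step softmax bandit; this minimal sketch (my construction, not the paper's setting) computes the exact REINFORCE gradient when only one action yields reward 1, and only with probability `p_reward`:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_grad_norm(theta, rewarded_action, p_reward):
    """Exact policy gradient norm for a one-step softmax bandit.

    J(theta) = E[R] = p_reward * pi(a*), so
    grad J   = p_reward * grad pi(a*)
             = p_reward * pi(a*) * (e_{a*} - pi),
    using the softmax derivative d pi_i / d theta_j = pi_i (delta_ij - pi_j).
    """
    pi = softmax(theta)
    e = np.zeros_like(theta)
    e[rewarded_action] = 1.0
    grad = p_reward * pi[rewarded_action] * (e - pi)
    return np.linalg.norm(grad)
```

The gradient norm scales linearly with the reward probability, so as sparsity grows the learning signal shrinks toward zero regardless of the policy parameters, which is the degeneracy IRPO is designed to sidestep.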
IRPO addresses this degeneracy by constructing a set of "reachable" policies resulting from N exploratory updates under various intrinsic rewards. If the set of reachable exploratory policies includes the optimal policy, IRPO is theoretically assured of optimality. However, practical realizability depends on the richness of the intrinsic reward set, the number of exploratory updates, and the capacity of the policy class.
Empirical Results
Across a suite of benchmark discrete and continuous navigation environments (Maze-v1/v2, FourRooms, PointMaze-v1/v2, FetchReach, and AntMaze variants), IRPO achieves the highest converged performance and lowest evaluation variance relative to key baselines: Hierarchical RL with Laplacian-based intrinsic rewards, Distributional Random Network Distillation (DRND), Parameter Space Noise Exploration (PSNE), PPO, and TRPO.
- Performance: IRPO attains optimality in discrete environments and near-optimality in continuous settings, while maintaining robustness to increased maze complexity. Hierarchical RL underperforms due to temporal abstraction limitations; reward-augmentation methods like DRND and direct optimization methods (PSNE, PPO, TRPO) lag in both sample complexity and converged return, especially as task difficulty scales.
- Sample Efficiency: IRPO outperforms HRL and DRND in sample complexity, requiring fewer environment interactions to converge. Baselines like PSNE and PPO sometimes exhibit lower sample complexity but show dramatically lower asymptotic performance than IRPO.
- Ablations: Trust-region updates stabilize training and reduce output variance. Performance is sensitive to the number of exploratory updates N—too low impedes exploration, while too high increases sample complexity without marginal gain. Importance sampling as an alternative to backpropagation yields poor results due to high variance, corroborating previous findings on the instability of IS-based policy gradient corrections.
- Intrinsic Reward Robustness: IRPO demonstrates moderate resilience to random intrinsic reward functions, maintaining superior performance over HRL with random rewards even as variance increases.
Practical and Computational Implications
IRPO's additional computational demands (multiple critic learning, Jacobian calculations, trust-region updates) are shown to be manageable, with mean wall-clock time differing only modestly from standard PPO implementations. The algorithm leverages vector-Jacobian product acceleration via automatic differentiation frameworks, mitigating overhead growth.
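The vector-Jacobian product trick mentioned above can be illustrated on a single exploratory update step. The sketch below is an assumption-laden stand-in: it pulls a vector back through the update map U(theta) = theta + alpha * grad_f(theta) using a finite-difference Hessian-vector product, whereas autodiff frameworks supply the same primitive exactly and without materializing any matrix:

```python
import numpy as np

def vjp_through_update(grad_f, theta, v, alpha=0.1, eps=1e-5):
    """Vector-Jacobian product of U(theta) = theta + alpha * grad_f(theta),
    i.e. (I + alpha * H)^T v, without forming the d x d Hessian H.

    Hv is approximated by a central finite difference on grad_f; since the
    Hessian of a scalar objective is symmetric, H^T v = H v.
    """
    Hv = (grad_f(theta + eps * v) - grad_f(theta - eps * v)) / (2 * eps)
    return v + alpha * Hv
```

This is why the Jacobian accumulation in IRPO costs roughly one extra gradient evaluation per transported vector rather than a full d-by-d matrix per update, keeping the overhead growth modest.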
Related Work
IRPO sits at the intersection of several major RL themes:
- Parameter space and action space noise have been extensively studied for exploration but falter in sparse, high-dimensional spaces and introduce unwanted variance [Plappert et al., 2018].
- Uncertainty-driven exploration (e.g., count-based, prediction error) enhances behavioral diversity but suffers from intrinsic-extrinsic reward scaling and nonstationarity [Bellemare et al., 2016; Burda et al., 2018]. Credit assignment becomes more difficult as endogenous reward signals dominate policy updates.
- Hierarchical RL leveraging Laplacian and successor representations effectively decomposes environments but sacrifices fine-grained control and worsens sample efficiency [Machado et al., 2017a; Gomez et al., 2023].
Implications and Future Directions
The proposed IRPO framework substantially advances direct optimization of the extrinsic task reward in sparse settings, avoiding both the abstraction bottlenecks of hierarchical RL and the exploration/credit-assignment tradeoff typical of reward-augmentation strategies. Practically, IRPO is well positioned for adaptation to increasingly complex RL tasks, including continuous control, dexterous manipulation, and tasks with mixed or temporally delayed reward signals.
Theoretically, further analysis may be required to tighten guarantees on recoverable optimality in broader function classes, possibly by expanding the class of intrinsic rewards or dynamically adjusting the number of exploratory updates. There is considerable scope for integrating alternative intrinsic motivation functions (e.g., empowerment, novelty search, information gain) and deploying IRPO in non-navigation settings.
Conclusion
IRPO delivers a policy optimization methodology leveraging intrinsic rewards to unearth informative gradients in sparse-reward domains, bypassing sample inefficiency and sub-optimality of hierarchical abstraction while maintaining robust performance and computational tractability. It establishes a foundation for refined RL algorithmic development and further exploration of intrinsic-extrinsic learning signal interplay in complex task environments.
(2601.21391)