- The paper introduces IRPO, an algorithm that integrates multiple intrinsic rewards into policy optimization to enhance exploration in sparse-reward environments.
- IRPO uses an actor-critic framework with exploratory policies, Jacobian accumulation, and trust-region updates to propagate informative learning signals.
- Empirical results show that IRPO outperforms hierarchical RL and reward-augmentation strategies in both performance and sample efficiency across various benchmarks.
Intrinsic Reward Policy Optimization for Sparse-Reward Environments
Motivation and Context
Reinforcement learning (RL) in sparse-reward environments poses critical challenges for effective policy learning due to inefficient exploration and poor credit assignment. Standard exploration mechanisms such as action or parameter noise injection have fundamental limitations: they fail to diversify agent behaviors sufficiently, especially in high-dimensional or continuous tasks. Augmenting the reward signal with intrinsic motivation improves exploration but introduces instability in attributing reward to true task accomplishment, while hierarchical RL—although mitigating credit assignment issues and scaling exploration—suffers from sample inefficiency and sub-optimal coarse temporal abstraction.
The paper introduces the Intrinsic Reward Policy Optimization (IRPO) algorithm to address these limitations by directly integrating multiple intrinsic rewards into policy optimization, circumventing the need for subpolicy pretraining and enabling more informative learning signal propagation in sparse-reward domains.
Algorithmic Framework
IRPO is built on an actor-critic setup featuring multiple exploratory policies, each associated with both an intrinsic and an extrinsic critic. Core operations of IRPO can be outlined as:
- Exploratory Policy Updates: Starting from the current base policy, K exploratory policies are instantiated, each optimized using distinct intrinsic reward functions via repeated gradient updates. Intrinsic critics guide these updates, while extrinsic critics accumulate information about task performance.
- Jacobian Accumulation: The algorithm tracks parameter transitions across exploratory updates, storing Jacobians that facilitate effective backpropagation of learning signals.
- Base Policy Update via IRPO Gradient: Gradients of exploratory policies with respect to extrinsic rewards are transported back to update the base policy using the chain rule, producing the IRPO gradient. This surrogate gradient is weighted according to exploratory performance, modulated by temperature hyperparameter T, and applied in a trust-region update to control policy divergence.
The bi-level optimization paradigm allows information from policies exploring under different intrinsic motivations to augment the base policy update, even when true gradients are uninformative due to extreme reward sparsity.
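The outer loop described above can be sketched numerically. The sketch below is an illustrative reconstruction, not the paper's implementation: the names (`irpo_step`, `numerical_jacobian`), the finite-difference Jacobians, the softmax performance weighting, and the norm-clipping stand-in for the trust region are all assumptions made for clarity.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-5):
    """Central finite-difference Jacobian of f at x (illustrative only)."""
    n = x.size
    J = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        J[:, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

def irpo_step(theta, intrinsic_grads, extrinsic_grad_fn, extrinsic_return_fn,
              alpha=0.1, n_updates=3, temperature=1.0, trust_radius=0.5):
    """One hedged sketch of an IRPO base-policy update.

    theta               : base policy parameters
    intrinsic_grads     : K gradient functions, stand-ins for intrinsic critics
    extrinsic_grad_fn   : gradient of the extrinsic objective (extrinsic critic)
    extrinsic_return_fn : scalar extrinsic return used to weight policies
    """
    candidates, jac_chains = [], []
    for g_int in intrinsic_grads:
        phi = theta.copy()
        chain = np.eye(theta.size)  # accumulated Jacobian d(phi)/d(theta)
        for _ in range(n_updates):
            # Jacobian of the update map phi -> phi + alpha * g_int(phi)
            J = np.eye(theta.size) + alpha * numerical_jacobian(g_int, phi)
            phi = phi + alpha * g_int(phi)
            chain = J @ chain
        candidates.append(phi)
        jac_chains.append(chain)

    # Weight exploratory policies by extrinsic performance (softmax, temperature T)
    returns = np.array([extrinsic_return_fn(p) for p in candidates])
    w = np.exp(returns / temperature)
    w /= w.sum()

    # Transport extrinsic gradients back through the accumulated Jacobians
    surrogate = sum(wk * (chain.T @ extrinsic_grad_fn(p))
                    for wk, chain, p in zip(w, jac_chains, candidates))

    # Crude trust-region stand-in: clip the update norm
    step = alpha * surrogate
    norm = np.linalg.norm(step)
    if norm > trust_radius:
        step *= trust_radius / norm
    return theta + step
```

On a toy quadratic extrinsic objective, one such step moves the base policy toward the exploratory policy with the higher extrinsic return, even though the base policy itself received no gradient from the task reward.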
Theoretical Analyses
The authors prove that the standard policy gradient vanishes as reward sparsity increases. Under defined assumptions (bounded reward probability, bounded log-gradient), the l2-norm of the policy gradient approaches zero, impeding learning progress.
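The vanishing-gradient effect can be checked exactly on a toy one-step softmax bandit; this minimal sketch (my construction, not the paper's setting) computes the exact REINFORCE gradient when only one action yields reward 1, and only with probability `p_reward`:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_grad_norm(theta, rewarded_action, p_reward):
    """Exact policy gradient norm for a one-step softmax bandit.

    J(theta) = E[R] = p_reward * pi(a*), so
    grad J   = p_reward * grad pi(a*)
             = p_reward * pi(a*) * (e_{a*} - pi),
    using the softmax derivative d pi_i / d theta_j = pi_i (delta_ij - pi_j).
    """
    pi = softmax(theta)
    e = np.zeros_like(theta)
    e[rewarded_action] = 1.0
    grad = p_reward * pi[rewarded_action] * (e - pi)
    return np.linalg.norm(grad)
```

The gradient norm scales linearly with the reward probability, so as sparsity grows the learning signal shrinks toward zero regardless of the policy parameters, which is the degeneracy IRPO is designed to sidestep.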
IRPO addresses this degeneracy by constructing a set of "reachable" policies resulting from N exploratory updates under various intrinsic rewards. If the set of reachable exploratory policies includes the optimal policy, IRPO is theoretically assured of optimality. However, practical realizability depends on the richness of the intrinsic reward set, the number of exploratory updates, and the capacity of the policy class.
Empirical Results
Across a suite of benchmark discrete and continuous navigation environments (Maze-v1/v2, FourRooms, PointMaze-v1/v2, FetchReach, and AntMaze variants), IRPO achieves the highest converged performance and lowest evaluation variance relative to key baselines: Hierarchical RL with Laplacian-based intrinsic rewards, Distributional Random Network Distillation (DRND), Parameter Space Noise Exploration (PSNE), PPO, and TRPO.
- Performance: IRPO attains optimality in discrete environments and near-optimality in continuous settings, while maintaining robustness to increased maze complexity. Hierarchical RL underperforms due to temporal abstraction limitations; reward-augmentation methods like DRND and direct optimization methods (PSNE, PPO, TRPO) lag in both sample complexity and converged return, especially as task difficulty scales.
- Sample Efficiency: IRPO outperforms HRL and DRND in sample complexity, requiring fewer environment interactions to converge. Baselines like PSNE and PPO sometimes exhibit lower sample complexity but show dramatically lower asymptotic performance than IRPO.
- Ablations: Trust-region updates stabilize training and reduce output variance. Performance is sensitive to the number of exploratory updates N—too low impedes exploration, while too high increases sample complexity without marginal gain. Importance sampling as an alternative to backpropagation yields poor results due to high variance, corroborating previous findings on the instability of IS-based policy gradient corrections.
- Intrinsic Reward Robustness: IRPO demonstrates moderate resilience to random intrinsic reward functions, maintaining superior performance over HRL with random rewards even as variance increases.
Practical and Computational Implications
IRPO's additional computational demands (multiple critic learning, Jacobian calculations, trust-region updates) are shown to be manageable, with mean wall-clock time differing only modestly from standard PPO implementations. The algorithm leverages vector-Jacobian product acceleration via automatic differentiation frameworks, mitigating overhead growth.
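The vector-Jacobian product trick mentioned above can be illustrated on a single exploratory update step. The sketch below is an assumption-laden stand-in: it pulls a vector back through the update map U(theta) = theta + alpha * grad_f(theta) using a finite-difference Hessian-vector product, whereas autodiff frameworks supply the same primitive exactly and without materializing any matrix:

```python
import numpy as np

def vjp_through_update(grad_f, theta, v, alpha=0.1, eps=1e-5):
    """Vector-Jacobian product of U(theta) = theta + alpha * grad_f(theta),
    i.e. (I + alpha * H)^T v, without forming the d x d Hessian H.

    Hv is approximated by a central finite difference on grad_f; since the
    Hessian of a scalar objective is symmetric, H^T v = H v.
    """
    Hv = (grad_f(theta + eps * v) - grad_f(theta - eps * v)) / (2 * eps)
    return v + alpha * Hv
```

This is why the Jacobian accumulation in IRPO costs roughly one extra gradient evaluation per transported vector rather than a full d-by-d matrix per update, keeping the overhead growth modest.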
Related Work
IRPO sits at the intersection of several major RL themes:
- Parameter space and action space noise have been extensively studied for exploration but falter in sparse, high-dimensional spaces and introduce unwanted variance [Plappert et al., 2018].
- Uncertainty-driven exploration (e.g., count-based, prediction error) enhances behavioral diversity but suffers from intrinsic-extrinsic reward scaling and nonstationarity [Bellemare et al., 2016; Burda et al., 2018]. Credit assignment becomes more difficult as endogenous reward signals dominate policy updates.
- Hierarchical RL leveraging Laplacian and successor representations effectively decomposes environments but sacrifices fine-grained control and worsens sample efficiency [Machado et al., 2017a; Gomez et al., 2023].
Implications and Future Directions
The proposed IRPO framework substantially advances direct optimization of the extrinsic task reward in sparse settings, avoiding both the abstraction bottlenecks of hierarchical RL and the exploration/credit-assignment tradeoff typical of reward-augmentation strategies. Practically, IRPO is well positioned for adaptation to increasingly complex RL tasks, including continuous control, dexterous manipulation, and tasks with mixed or temporally delayed reward signals.
Theoretically, further analysis may be required to tighten guarantees on recoverable optimality in broader function classes, possibly by expanding the class of intrinsic rewards or dynamically adjusting the number of exploratory updates. There is considerable scope for integrating alternative intrinsic motivation functions (e.g., empowerment, novelty search, information gain) and deploying IRPO in non-navigation settings.
Conclusion
IRPO delivers a policy optimization methodology leveraging intrinsic rewards to unearth informative gradients in sparse-reward domains, bypassing sample inefficiency and sub-optimality of hierarchical abstraction while maintaining robust performance and computational tractability. It establishes a foundation for refined RL algorithmic development and further exploration of intrinsic-extrinsic learning signal interplay in complex task environments.
(2601.21391)