- The paper demonstrates that instability in hyperbolic deep RL arises from gradient pathologies and norm explosion near the manifold boundaries.
- It introduces HYPER++, which employs a categorical value loss and RMSNorm with learnable scaling to achieve up to 30% reward improvements and faster training.
- The study underscores the importance of aligning loss functions with geometric regularization to stabilize actor-critic training in hyperbolic spaces.
A Rigorous Analysis and Practical Stabilization of Hyperbolic Deep Reinforcement Learning
Introduction
The application of hyperbolic geometry to deep reinforcement learning (RL) promises inductive biases that naturally encode hierarchies and relational structures inherent in sequential decision problems. While hyperbolic representations, particularly in the Poincaré Ball and Hyperboloid models, have demonstrated theoretical and empirical superiority for tree-structured data, their deployment in deep RL has been severely hindered by optimization difficulties and instability during actor-critic training. This paper provides a systematic analysis of the sources of instability in hyperbolic RL agents and introduces HYPER++, an RL agent architecture that achieves stable, efficient, and performant training in hyperbolic latent spaces (2512.14202).
Many RL environments, such as chess-like games or environments with irreversible state transitions, exhibit hierarchical, exponentially expanding state spaces. Euclidean embeddings capture these properties poorly because Euclidean volume grows only polynomially with radius; hyperbolic volume grows exponentially, matching the branching structure of such problems, and prior work has shown corresponding empirical promise. However, attempts to combine hyperbolic representation learning with state-of-the-art RL algorithms (e.g., PPO, DDQN) have encountered degenerate policies, gradient instability, and severe trust-region violations. Existing remedies, such as spectral normalization or ad-hoc norm clipping, restrict model expressivity and incur computational overhead.
Gradient Pathologies in Hyperbolic Architectures
The authors present an in-depth differential analysis of the operations mapping Euclidean features into hyperbolic manifolds, focusing on the exponential map and multinomial logistic regression (MLR) layers. Both the Poincaré Ball and Hyperboloid models are shown to be destabilized by large-norm embeddings. In the Poincaré Ball, the gradient of the conformal factor explodes near the boundary, producing vanishing or exploding gradients during backpropagation, with magnitudes growing as the inverse square of the distance to the boundary. The Hyperboloid avoids this specific pathology, since it has no conformal factor, but suffers ill-conditioning from the exponential growth of sinh and cosh in its exponential map. The authors emphasize that regularizing norm growth in the final Euclidean layers, before the mapping to hyperbolic space, is essential for stability; prior approaches that apply spectral normalization only to the last linear layer are shown to be mathematically and empirically insufficient.
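The boundary pathology is easy to see numerically. Below is a minimal NumPy sketch (not code from the paper; the function names are illustrative) that evaluates the Poincaré-ball conformal factor λ_x = 2/(1 − ||x||²) and the norm of its Euclidean gradient, 4||x||/(1 − ||x||²)², as a point approaches the unit boundary:

```python
import numpy as np

def conformal_factor(x):
    # Poincare-ball conformal factor: lambda_x = 2 / (1 - ||x||^2)
    return 2.0 / (1.0 - np.dot(x, x))

def conformal_factor_grad_norm(x):
    # ||d lambda / dx|| = 4 ||x|| / (1 - ||x||^2)^2
    sq = np.dot(x, x)
    return 4.0 * np.sqrt(sq) / (1.0 - sq) ** 2

# Gradient magnitude blows up as the embedding nears the boundary ||x|| = 1.
for r in (0.5, 0.9, 0.99, 0.999):
    x = np.array([r, 0.0])
    print(f"r={r}: lambda={conformal_factor(x):.2f}, "
          f"grad_norm={conformal_factor_grad_norm(x):.2e}")
```

Moving from radius 0.9 to 0.99 already multiplies the gradient magnitude by roughly two orders of magnitude, which is the backpropagation blow-up the analysis attributes to unregularized feature norms.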
The HYPER++ Architecture
HYPER++ addresses RL nonstationarity, norm explosion, and geometric instability via three critical modifications:
- Categorical Value Loss: The critic is trained using a classification-oriented categorical distributional loss (HL-Gauss) instead of standard regression. This aligns the loss landscape to the geometric properties of hyperbolic MLR and robustly smooths critic updates.
- RMSNorm with Feature Scaling: A RMSNorm layer is applied to the Euclidean encoder’s outputs, followed by a learnable multiplicative feature scale. This preserves model capacity while guaranteeing analytic bounds on feature norms that hold regardless of the latent dimension. Because RMSNorm does not mean-center, it avoids disrupting the hierarchical structure encoded in the hyperbolic embedding.
- Preferential Use of the Hyperboloid Model: The hyperboloid formulation is used for the terminal network layers, avoiding conformal factor instabilities and further smoothing gradients when combined with the above regularization.
These interventions are complementary and collectively ensure that both actor and critic exhibit stable training dynamics (i.e., entropy retention, low update KL, minimal trust-region boundary clipping, and bounded gradient norms).
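The second and third modifications can be sketched together. The snippet below (a hypothetical implementation with illustrative names, assuming curvature −1 and the exponential map taken at the hyperboloid origin) composes RMSNorm, a learnable scale, and the hyperboloid lift; the resulting point satisfies the hyperboloid constraint −z₀² + Σᵢ zᵢ² = −1 even when the raw encoder output has a very large norm:

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-8):
    # RMSNorm: rescale by the root-mean-square, with no mean-centering
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return gain * x / rms

def exp_map_hyperboloid(u, scale):
    # Exponential map at the hyperboloid origin (1, 0, ..., 0), curvature -1.
    v = scale * u                      # learnable scalar scale bounds ||v||
    n = np.linalg.norm(v)
    if n < 1e-9:
        return np.concatenate(([1.0], np.zeros_like(v)))
    return np.concatenate(([np.cosh(n)], np.sinh(n) * v / n))

rng = np.random.default_rng(0)
feats = rng.normal(size=16) * 50.0     # a large-norm encoder output
gain = np.ones(16)                     # learnable per-dimension gain
z = exp_map_hyperboloid(rmsnorm(feats, gain), scale=1.0)
# The lifted point obeys the manifold constraint: -z0^2 + sum(zi^2) = -1
print(-z[0] ** 2 + np.sum(z[1:] ** 2))
```

With the gain fixed at one, RMSNorm maps any input to norm √d, so the argument of cosh/sinh is bounded and the scalar scale alone controls how far embeddings sit from the origin; this is one plausible reading of why the combination preempts the sinh/cosh ill-conditioning described above.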
Experimental Results
On the ProcGen suite and five challenging Atari games under PPO and DDQN, HYPER++ achieves up to 30% higher normalized test rewards and 30% reductions in wall-clock training time compared to state-of-the-art baselines. It dominates both standard Euclidean and prior hyperbolic approaches (notably Hyper+S-RYM) in aggregate metrics (mean, median, IQM, and optimality gap), demonstrating the effectiveness of the proposed stabilizations. Ablation studies establish:
- Removing RMSNorm or learnable scaling leads to catastrophic performance degradation and gradient vanishing.
- The categorical loss is particularly beneficial for hyperbolic critics but not for Euclidean agents.
- Substituting spectral normalization for RMSNorm fails to guarantee stability unless it is applied aggressively to every encoder layer, which is computationally costly.
Robustness to target network update schemes and latent dimensionality is established, and the architecture shows transferability beyond PPO.
Theoretical and Practical Implications
The analytic derivations clarify that hyperbolic instability arises from the interplay between RL-induced nonstationarity and the geometric sensitivity of the exponential map's Jacobian. This compels a rethinking of regularization strategy in geometric deep RL: norm control must be applied before manifold projection, in a manner cognizant of curvature and dimension. The success of the categorical value objective likewise argues for aligning RL loss functions with the geometry of the model's output.
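To make the loss-geometry alignment concrete, here is a minimal sketch of an HL-Gauss-style target: a scalar return is smoothed by a Gaussian and discretized into histogram-bin probabilities, which a categorical critic then fits with cross-entropy instead of MSE regression. The bin edges, sigma, and function names are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from math import erf, sqrt

def hl_gauss_target(y, edges, sigma):
    # Probability mass of N(y, sigma^2) falling in each histogram bin,
    # computed from Gaussian CDF differences across the bin edges.
    cdf = np.array([0.5 * (1.0 + erf((e - y) / (sigma * sqrt(2.0))))
                    for e in edges])
    p = np.diff(cdf)
    return p / p.sum()   # renormalize mass clipped at the support edges

edges = np.linspace(-10.0, 10.0, 51)   # 50 bins over the value range
p = hl_gauss_target(y=3.2, edges=edges, sigma=0.75)
centers = 0.5 * (edges[:-1] + edges[1:])
print(p.sum(), (p * centers).sum())    # total mass ~ 1.0, mean ~ 3.2
```

The smoothed categorical target spreads each scalar return across several bins, which is the mechanism credited with smoothing critic updates for the hyperbolic MLR head.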
Practically, the results demonstrate that hyperbolic representations can offer robust, scalable, and high-performing RL agents—provided that geometric, architectural, and loss-level details are tuned in concert.
Future Directions
This work confines itself to the optimization-centric aspects of hyperbolic RL and does not address questions about representation interpretability, task-specific suitability of hyperbolic spaces, or downstream effects on generalization and transfer. It is plausible, given HYPER++’s improved scalability and stability, that hyperbolic RL will see stronger adoption in domains with explicit or implicit state hierarchies (e.g., natural language, hierarchical robotics, program synthesis). Future work should also analyze the interaction between geometric regularization and advances in credit assignment, exploration, and multi-agent RL.
Conclusion
This paper provides a comprehensive account of the failure modes in hyperbolic deep RL, analytically pinpoints the sources of instability, and offers a principled, computationally efficient solution in HYPER++. The demonstrated gains in sample efficiency, numerical stability, and empirical performance position this work as a foundation for further advances at the intersection of non-Euclidean geometry and RL (2512.14202).