- The paper demonstrates that instability in hyperbolic deep RL arises from gradient pathologies and norm explosion near the manifold boundaries.
- It introduces HYPER++, which employs a categorical value loss and RMSNorm with learnable scaling to achieve up to 30% reward improvements and faster training.
- The study underscores the importance of aligning loss functions with geometric regularization to stabilize actor-critic training in hyperbolic spaces.
A Rigorous Analysis and Practical Stabilization of Hyperbolic Deep Reinforcement Learning
Introduction
The application of hyperbolic geometry to deep reinforcement learning (RL) promises inductive biases that naturally encode hierarchies and relational structures inherent in sequential decision problems. While hyperbolic representations, particularly in the Poincaré Ball and Hyperboloid models, have demonstrated theoretical and empirical superiority for tree-structured data, their deployment in deep RL has been severely hindered by optimization difficulties and instability during actor-critic training. This paper provides a systematic analysis of the sources of instability in hyperbolic RL agents and introduces HYPER++, an RL agent architecture that achieves stable, efficient, and performant training in hyperbolic latent spaces (2512.14202).
Many RL environments, such as chess-like games or environments with irreversible state transitions, exhibit hierarchical, exponentially expanding state spaces. Euclidean embeddings capture these properties poorly because Euclidean volume grows only polynomially with radius; hyperbolic volume grows exponentially, matching the branching structure of such problems, and prior work has shown corresponding empirical promise. However, attempts to combine hyperbolic representation learning with state-of-the-art RL algorithms (e.g., PPO, DDQN) have encountered degenerate policies, gradient instability, and severe trust-region violations. Existing remedies, such as spectral normalization or ad-hoc norm clipping, restrict model expressivity and incur computational overhead.
Gradient Pathologies in Hyperbolic Architectures
The authors present an in-depth differential analysis of the operations mapping Euclidean features into hyperbolic manifolds, focusing on the exponential map and multinomial logistic regression (MLR) layers. Both the Poincaré Ball and Hyperboloid models are shown to be destabilized by large-norm embeddings. In the Poincaré Ball, the gradient of the conformal factor explodes near the boundary, producing vanishing or exploding gradients during backpropagation, with magnitudes growing as the inverse square of the distance to the boundary. The Hyperboloid avoids this specific pathology, since it has no conformal factor, but suffers ill-conditioning from the exponential growth of sinh and cosh in its exponential map. The authors emphasize that regularizing norm growth in the final Euclidean layers, before the mapping to hyperbolic space, is essential for stability; prior approaches that apply spectral normalization only to the last linear layer are shown to be mathematically and empirically insufficient.
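The boundary pathology is easy to see numerically. Below is a minimal NumPy sketch (not code from the paper; the function names are illustrative) that evaluates the Poincaré-ball conformal factor λ_x = 2/(1 − ||x||²) and the norm of its Euclidean gradient, 4||x||/(1 − ||x||²)², as a point approaches the unit boundary:

```python
import numpy as np

def conformal_factor(x):
    # Poincare-ball conformal factor: lambda_x = 2 / (1 - ||x||^2)
    return 2.0 / (1.0 - np.dot(x, x))

def conformal_factor_grad_norm(x):
    # ||d lambda / dx|| = 4 ||x|| / (1 - ||x||^2)^2
    sq = np.dot(x, x)
    return 4.0 * np.sqrt(sq) / (1.0 - sq) ** 2

# Gradient magnitude blows up as the embedding nears the boundary ||x|| = 1.
for r in (0.5, 0.9, 0.99, 0.999):
    x = np.array([r, 0.0])
    print(f"r={r}: lambda={conformal_factor(x):.2f}, "
          f"grad_norm={conformal_factor_grad_norm(x):.2e}")
```

Moving from radius 0.9 to 0.99 already multiplies the gradient magnitude by roughly two orders of magnitude, which is the backpropagation blow-up the analysis attributes to unregularized feature norms.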
The HYPER++ Architecture
HYPER++ addresses RL nonstationarity, norm explosion, and geometric instability via three critical modifications:
- Categorical Value Loss: The critic is trained using a classification-oriented categorical distributional loss (HL-Gauss) instead of standard regression. This aligns the loss landscape to the geometric properties of hyperbolic MLR and robustly smooths critic updates.
- RMSNorm with Feature Scaling: A RMSNorm layer is applied to the Euclidean encoder’s outputs, followed by a learnable multiplicative feature scale. This preserves model capacity while guaranteeing analytic bounds on feature norms that hold regardless of the latent dimension. Because RMSNorm does not mean-center, it avoids disrupting the hierarchical structure encoded in the hyperbolic embedding.
- Preferential Use of the Hyperboloid Model: The hyperboloid formulation is used for the terminal network layers, avoiding conformal factor instabilities and further smoothing gradients when combined with the above regularization.
These interventions are complementary and collectively ensure that both actor and critic exhibit stable training dynamics (i.e., entropy retention, low update KL, minimal trust-region boundary clipping, and bounded gradient norms).
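The second and third modifications can be sketched together. The snippet below (a hypothetical implementation with illustrative names, assuming curvature −1 and the exponential map taken at the hyperboloid origin) composes RMSNorm, a learnable scale, and the hyperboloid lift; the resulting point satisfies the hyperboloid constraint −z₀² + Σᵢ zᵢ² = −1 even when the raw encoder output has a very large norm:

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-8):
    # RMSNorm: rescale by the root-mean-square, with no mean-centering
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return gain * x / rms

def exp_map_hyperboloid(u, scale):
    # Exponential map at the hyperboloid origin (1, 0, ..., 0), curvature -1.
    v = scale * u                      # learnable scalar scale bounds ||v||
    n = np.linalg.norm(v)
    if n < 1e-9:
        return np.concatenate(([1.0], np.zeros_like(v)))
    return np.concatenate(([np.cosh(n)], np.sinh(n) * v / n))

rng = np.random.default_rng(0)
feats = rng.normal(size=16) * 50.0     # a large-norm encoder output
gain = np.ones(16)                     # learnable per-dimension gain
z = exp_map_hyperboloid(rmsnorm(feats, gain), scale=1.0)
# The lifted point obeys the manifold constraint: -z0^2 + sum(zi^2) = -1
print(-z[0] ** 2 + np.sum(z[1:] ** 2))
```

With the gain fixed at one, RMSNorm maps any input to norm √d, so the argument of cosh/sinh is bounded and the scalar scale alone controls how far embeddings sit from the origin; this is one plausible reading of why the combination preempts the sinh/cosh ill-conditioning described above.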
Experimental Results
On the ProcGen suite and five challenging Atari games under PPO and DDQN, HYPER++ achieves up to 30% higher normalized test rewards and 30% reductions in wall-clock training time compared to state-of-the-art baselines. It dominates both standard Euclidean and prior hyperbolic approaches (notably Hyper+S-RYM) in aggregate metrics (mean, median, IQM, and optimality gap), demonstrating the effectiveness of the proposed stabilizations. Ablation studies establish:
- Removing RMSNorm or learnable scaling leads to catastrophic performance degradation and gradient vanishing.
- The categorical loss is particularly beneficial for hyperbolic critics but not for Euclidean agents.
- Substituting spectral normalization for RMSNorm fails to guarantee stability unless it is applied aggressively to every encoder layer, which is computationally costly.
Robustness to target network update schemes and latent dimensionality is established, and the architecture shows transferability beyond PPO.
Theoretical and Practical Implications
The analytic derivations clarify that hyperbolic instability arises from the interplay between RL-induced nonstationarity and the geometric sensitivity of the exponential map's Jacobian. This compels a rethinking of regularization strategy in geometric deep RL: norm control must be applied before manifold projection, in a manner cognizant of curvature and dimension. The success of the categorical value objective likewise argues for aligning RL loss functions with the geometry of the model's output.
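To make the loss-geometry alignment concrete, here is a minimal sketch of an HL-Gauss-style target: a scalar return is smoothed by a Gaussian and discretized into histogram-bin probabilities, which a categorical critic then fits with cross-entropy instead of MSE regression. The bin edges, sigma, and function names are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from math import erf, sqrt

def hl_gauss_target(y, edges, sigma):
    # Probability mass of N(y, sigma^2) falling in each histogram bin,
    # computed from Gaussian CDF differences across the bin edges.
    cdf = np.array([0.5 * (1.0 + erf((e - y) / (sigma * sqrt(2.0))))
                    for e in edges])
    p = np.diff(cdf)
    return p / p.sum()   # renormalize mass clipped at the support edges

edges = np.linspace(-10.0, 10.0, 51)   # 50 bins over the value range
p = hl_gauss_target(y=3.2, edges=edges, sigma=0.75)
centers = 0.5 * (edges[:-1] + edges[1:])
print(p.sum(), (p * centers).sum())    # total mass ~ 1.0, mean ~ 3.2
```

The smoothed categorical target spreads each scalar return across several bins, which is the mechanism credited with smoothing critic updates for the hyperbolic MLR head.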
Practically, the results demonstrate that hyperbolic representations can offer robust, scalable, and high-performing RL agents—provided that geometric, architectural, and loss-level details are tuned in concert.
Future Directions
This work confines itself to the optimization-centric aspects of hyperbolic RL and does not address questions about representation interpretability, task-specific suitability of hyperbolic spaces, or downstream effects on generalization and transfer. It is plausible, given HYPER++’s improved scalability and stability, that hyperbolic RL will see stronger adoption in domains with explicit or implicit state hierarchies (e.g., natural language, hierarchical robotics, program synthesis). Future work should also analyze the interaction between geometric regularization and advances in credit assignment, exploration, and multi-agent RL.
Conclusion
This paper provides a comprehensive account of the failure modes in hyperbolic deep RL, analytically pinpoints the sources of instability, and offers a principled, computationally efficient solution in HYPER++. The demonstrated gains in sample efficiency, numerical stability, and empirical performance position this work as a foundation for further advances at the intersection of non-Euclidean geometry and RL (2512.14202).