Geometry of Neural Reinforcement Learning in Continuous State and Action Spaces

Published 28 Jul 2025 in cs.LG and cs.AI | (2507.20853v1)

Abstract: Advances in reinforcement learning (RL) have led to its successful application in complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces. We propose building a theoretical understanding of continuous state and action spaces by employing a geometric lens to understand the locally attained set of states. The set of all parametrised policies learnt through a semi-gradient based approach induces a set of attainable states in RL. We show that the training dynamics of a two-layer neural policy, trained using an actor-critic algorithm, induce a low-dimensional manifold of attainable states embedded in the high-dimensional nominal state space. We prove that, under certain conditions, the dimensionality of this manifold is of the order of the dimensionality of the action space. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound for four MuJoCo environments and also demonstrate the results in a toy environment with varying dimensionality. We also show the applicability of this theoretical result by introducing a local manifold learning layer to the policy and value function networks to improve the performance in control environments with very high degrees of freedom by changing one layer of the neural network to learn sparse representations.


Authors (3)

Summary

The paper establishes that neural RL agents with a two-layer NN policy yield attainable state sets that lie on a low-dimensional manifold whose dimension is bounded by a linear function of the action-space dimension.
It leverages a linearized policy model under the NTK regime to theoretically predict and empirically confirm the reduced intrinsic dimensionality in continuous control environments.
Empirical results indicate that incorporating sparse representations in RL architectures enhances sample efficiency and performance in high-dimensional control tasks.

Geometry and Manifold Structure in Neural Reinforcement Learning with Continuous States and Actions

Introduction and Motivation

This work addresses a fundamental gap in the theoretical understanding of reinforcement learning (RL) in continuous state and action spaces. While empirical advances have enabled RL agents to solve high-dimensional control tasks, most theoretical analyses remain restricted to finite or discrete domains. The paper develops a geometric framework to analyze the set of states attainable by neural network (NN) policies trained via policy gradients in continuous, deterministic environments. The central claim is that, under certain conditions, the set of attainable states forms a low-dimensional manifold whose dimension is upper bounded by a linear function of the action space dimension, independent of the ambient state space dimension. This result is both theoretically novel and empirically validated, with implications for the design and analysis of RL algorithms in high-dimensional domains.

Theoretical Framework

The analysis is grounded in continuous-time Markov decision processes (MDPs) with deterministic transitions. The state $s_t \in \mathbb{R}^{d_s}$ evolves under control-affine dynamics driven by the action $a_t \in \mathbb{R}^{d_a}$: $\dot{s}_t = g(s_t) + \sum_{i=1}^{d_a} h_i(s_t)\, a_{t,i},$ where $g$ and $h_1, \dots, h_{d_a}$ are smooth vector fields on $\mathbb{R}^{d_s}$, $d_s$ is the state dimension, and $d_a$ is the action dimension. The policy is parameterized by a wide two-layer neural network with GeLU activation, and training is performed via semi-gradient policy updates.
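The control-affine dynamics above can be simulated with a forward-Euler discretization. The sketch below is a minimal illustration that rests on assumptions not taken from the paper:

- a fixed step size `dt`;
- plain Python lists for vectors;
- a torque-driven pendulum as the example system.

The helper names (`control_affine_step`, `rollout`, `pendulum_drift`, `torque_field`) are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
VectorField = Callable[[Sequence[float]], Vector]


def control_affine_step(
    s: Sequence[float],
    a: Sequence[float],
    g: VectorField,
    h: Sequence[VectorField],
    dt: float,
) -> Vector:
    """One forward-Euler step of ds/dt = g(s) + sum_i h_i(s) * a_i."""
    velocity = list(g(s))  # drift term g(s)
    for h_i, a_i in zip(h, a):
        column = h_i(s)  # control vector field h_i evaluated at s
        for k in range(len(velocity)):
            velocity[k] += column[k] * a_i
    return [s_k + dt * v_k for s_k, v_k in zip(s, velocity)]


def rollout(
    s0: Sequence[float],
    policy: Callable[[Sequence[float]], Sequence[float]],
    g: VectorField,
    h: Sequence[VectorField],
    dt: float,
    steps: int,
) -> List[Vector]:
    """Simulate the closed loop a_t = policy(s_t); returns steps + 1 states."""
    states = [list(s0)]
    for _ in range(steps):
        s = states[-1]
        states.append(control_affine_step(s, policy(s), g, h, dt))
    return states


# Hypothetical example system: torque-driven pendulum, s = (theta, omega),
# so d_s = 2 and d_a = 1.
def pendulum_drift(s: Sequence[float]) -> Vector:
    theta, omega = s
    return [omega, -math.sin(theta)]


def torque_field(s: Sequence[float]) -> Vector:
    return [0.0, 1.0]
```

For example, `rollout([0.5, 0.0], lambda s: [-2.0 * s[1]], pendulum_drift, [torque_field], dt=0.01, steps=1000)` simulates a hypothetical damping controller. Forward Euler is used only for brevity; a higher-order integrator would be more accurate.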

A key technical device is a linearized approximation of the NN policy. It is a first-order Taylor expansion in the parameters around initialization, which becomes accurate in the infinite-width limit of the neural tangent kernel (NTK) regime. The policy is approximated as $f^{\text{lin}}(s; W) = f(s; W^0) + \nabla_W f(s; W)\big|_{W=W^0} (W - W^0),$ where $W^0$ is the random initialization. This linearization makes the policy's effect on the state space analytically tractable.
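A minimal sketch of the linearization follows. It assumes a scalar-output network $f(s; W) = n^{-1/2}\sum_j v_j\,\mathrm{GeLU}(w_j^\top s)$ in which only the first-layer weights $W$ are trained and the second-layer signs $v_j$ stay fixed at initialization. This parameterization and the function names are illustrative assumptions, not the paper's exact setup. The gradient with respect to row $w_j$ is $n^{-1/2} v_j\,\mathrm{GeLU}'(w_j^\top s)\, s$, so the linearized model is an explicit sum over neurons.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def gelu(x: float) -> float:
    # Exact GeLU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    # Derivative of x * Phi(x): Phi(x) + x * phi(x), with phi the normal PDF.
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def policy(s: List[float], W: Matrix, v: List[float]) -> float:
    """Scalar two-layer policy f(s; W) = n^{-1/2} * sum_j v_j * gelu(w_j . s)."""
    return sum(v_j * gelu(dot(w_j, s)) for w_j, v_j in zip(W, v)) / math.sqrt(len(W))


def linearized_policy(s: List[float], W: Matrix, W0: Matrix, v: List[float]) -> float:
    """f_lin(s; W) = f(s; W0) + <grad_W f(s; W0), W - W0> (first-order Taylor in W)."""
    scale = 1.0 / math.sqrt(len(W0))
    out = policy(s, W0, v)
    for w_j, w0_j, v_j in zip(W, W0, v):
        # d f / d w_j at W0 is scale * v_j * gelu'(w0_j . s) * s.
        out += scale * v_j * gelu_grad(dot(w0_j, s)) * (dot(w_j, s) - dot(w0_j, s))
    return out


def init_params(n: int, d_s: int, seed: int = 0) -> Tuple[Matrix, List[float]]:
    """Gaussian first layer W0 (n x d_s) and a fixed +/-1 second layer v."""
    rng = random.Random(seed)
    W0 = [[rng.gauss(0.0, 1.0) for _ in range(d_s)] for _ in range(n)]
    v = [rng.choice([-1.0, 1.0]) for _ in range(n)]
    return W0, v
```

At $W = W^0$ the two functions coincide exactly. For a perturbation of size $\varepsilon$ the gap shrinks like $\varepsilon^2$, as expected of a first-order Taylor expansion.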

The main theoretical result is that, for small time intervals and under mild regularity assumptions, the set of states attainable by such policies is concentrated around a manifold of dimension at most $2d_a + 1$, regardless of the ambient state dimension $d_s$. The proof leverages Lie series expansions of the system's flow, stochastic process convergence in the infinite-width limit, and concentration of measure arguments.
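The following sketch gives only the first-order intuition behind a bound that depends on $d_a$ rather than $d_s$. It does not reproduce the paper's Lie-series argument or its $2d_a + 1$ constant.

Let the columns of the matrix $H(s)$ be the vector fields $h_i(s)$. From a fixed state $s$, one Euler step $s + \Delta t\,(g(s) + H(s)\,a)$ lands in an affine set of dimension at most $d_a$, however many actions are tried. The random $g(s)$ and $H(s)$ and all helper names are hypothetical.

```python
import random
from typing import List, Sequence

Matrix = List[List[float]]


def numerical_rank(rows: Matrix, tol: float = 1e-9) -> int:
    """Rank of a matrix via Gaussian elimination with partial pivoting."""
    A = [list(r) for r in rows]
    n_rows, n_cols = len(A), len(A[0]) if A else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the unused row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < tol:
            continue  # column is (numerically) dependent on earlier pivots
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(rank + 1, n_rows):
            factor = A[r][col] / A[rank][col]
            for c in range(col, n_cols):
                A[r][c] -= factor * A[rank][c]
        rank += 1
    return rank


def one_step_reachable(
    s: Sequence[float], g_s: Sequence[float], H_s: Matrix, dt: float, actions: Matrix
) -> Matrix:
    """Points s + dt * (g(s) + H(s) a) for each action a (H_s is d_s x d_a)."""
    return [
        [s_k + dt * (g_k + sum(h * a_i for h, a_i in zip(H_row, a)))
         for s_k, g_k, H_row in zip(s, g_s, H_s)]
        for a in actions
    ]


def affine_dimension(points: Matrix) -> int:
    """Dimension of the affine hull: rank of the differences to the first point."""
    base = points[0]
    return numerical_rank([[p_k - b_k for p_k, b_k in zip(p, base)] for p in points[1:]])
```

For instance, take random $g(s)$ and $H(s)$ with $d_s = 20$ and $d_a = 3$. Then `affine_dimension` on the one-step images of generic actions returns 3: the reachable points span a 3-dimensional affine set inside $\mathbb{R}^{20}$.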

Empirical Validation

Manifold Dimensionality of Attainable States

The theoretical upper bound on the manifold dimension is empirically validated in several MuJoCo continuous control environments. The intrinsic dimension of the set of states visited by trained agents is estimated using the method of Facco et al. (2017), which is robust to non-uniform density and curvature.

Figure 1: The estimated dimensionality of the attainable states (blue), computed with the method of Facco et al., is far below $d_s$ (green line) and also below $2d_a + 1$ (red line) for all four tasks.

Across all tested environments, the estimated intrinsic dimension is consistently below the theoretical upper bound, and significantly lower than the ambient state dimension. This provides strong empirical support for the main claim.
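The estimator of Facco et al. (TwoNN) uses only the ratio $\mu_i = r_2/r_1$ of each point's second- and first-nearest-neighbor distances. On a $d$-dimensional manifold with locally uniform density, $\mu$ follows a Pareto law with exponent $d$. The sketch below departs from the original in three ways:

- it uses the closed-form maximum-likelihood estimate $\hat d = N / \sum_i \log \mu_i$ rather than the line fit described by Facco et al.;
- it uses brute-force $O(N^2)$ neighbor search;
- it runs on synthetic data (a flat 2-torus zero-padded into a higher-dimensional space), not on MuJoCo states.

The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def two_nn_dimension(points: Sequence[Sequence[float]]) -> float:
    """TwoNN intrinsic dimension, maximum-likelihood form: N / sum_i log(r2_i / r1_i)."""
    n = len(points)
    total = 0.0
    for i in range(n):
        r1 = r2 = math.inf  # first- and second-nearest-neighbor distances
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        total += math.log(r2 / r1)
    return n / total


def sample_flat_torus(n: int, ambient_dim: int, seed: int = 0) -> List[List[float]]:
    """Uniform samples on a flat 2-torus in R^4, zero-padded to ambient_dim coordinates."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        u = rng.uniform(0.0, 2.0 * math.pi)
        v = rng.uniform(0.0, 2.0 * math.pi)
        embedded = [math.cos(u), math.sin(u), math.cos(v), math.sin(v)]
        points.append(embedded + [0.0] * (ambient_dim - 4))
    return points
```

On RL data, `points` would be the visited state vectors. Duplicate states (zero first-neighbor distance) would have to be removed first, since the ratio is undefined for them.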

Validity of the Linearized Policy Model

The paper also examines the fidelity of the linearized policy model as an approximation to canonical two-layer NNs. By comparing the returns achieved by DDPG agents using canonical versus linearized policies at varying network widths, it is shown that the difference in returns vanishes as the width increases.

Figure 2: The returns of the canonical policy (red) track those of the linearized policy (blue) at larger widths ($\log_2 n > 15$).

This justifies the use of the linearized model for theoretical analysis in the overparameterized regime.
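The vanishing gap can be illustrated at the level of the policy output. The assumptions here are ours rather than the paper's:

- a scalar two-layer GeLU network with $n^{-1/2}$ output scaling;
- a fixed random-sign second layer;
- a single gradient step on the first layer toward increasing $f(s_0)$.

After such a step each hidden unit moves by $O(n^{-1/2})$, so the second-order remainder that the linearized model ignores shrinks as the width $n$ grows. This measures the output gap at one state, not the returns compared in the DDPG experiment. `linearization_gap` and `mean_gap` are hypothetical helpers.

```python
import math
import random
from typing import List, Sequence


def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return cdf + x * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def linearization_gap(n: int, s0: List[float], s: List[float], lr: float, seed: int) -> float:
    """|f(s; W1) - f_lin(s; W1)| after one step W1 = W0 + lr * grad_W f(s0; W0)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n)  # NTK-style output scaling
    f_true = f_lin = 0.0
    for _ in range(n):
        w0 = [rng.gauss(0.0, 1.0) for _ in s0]
        v = rng.choice([-1.0, 1.0])
        # Gradient of f(s0; W) w.r.t. this unit's weights: scale * v * gelu'(w0 . s0) * s0.
        coeff = lr * scale * v * gelu_grad(dot(w0, s0))
        w1 = [w + coeff * x for w, x in zip(w0, s0)]
        z0, z1 = dot(w0, s), dot(w1, s)
        f_true += scale * v * gelu(z1)
        f_lin += scale * v * (gelu(z0) + gelu_grad(z0) * (z1 - z0))
    return abs(f_true - f_lin)


def mean_gap(n: int, s0: List[float], s: List[float], lr: float = 1.0, n_seeds: int = 5) -> float:
    """Average the gap over several random initializations."""
    return sum(linearization_gap(n, s0, s, lr, seed) for seed in range(n_seeds)) / n_seeds
```

Averaging over a few seeds smooths out the random sign of the second-order remainder. With these settings the gap at width 8192 should come out far smaller than at width 32.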

Architectural Comparisons

The performance of single hidden layer GeLU networks is compared to standard multi-layer ReLU architectures in DDPG across several environments.

Figure 3: Comparison of a single-hidden-layer architecture with GeLU activation (blue) and a multi-hidden-layer architecture with ReLU activation (red).

The results indicate that the simplified architecture used for theoretical tractability does not significantly degrade empirical performance.

Practical Implications: Sparse Representations and RL Performance

Building on the manifold hypothesis, the paper explores the practical benefits of explicitly encouraging sparse, low-dimensional representations in policy and value networks. By replacing a fully connected layer with a sparsification layer (as in the CRATE framework), the authors demonstrate improved performance in high-dimensional control tasks using the Soft Actor-Critic (SAC) algorithm.
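CRATE's sparsification block is, to our understanding, a single non-negative ISTA (iterative shrinkage-thresholding) step:

$z \mapsto \mathrm{ReLU}\big(z + \eta D^\top (z - Dz) - \eta\lambda\mathbf{1}\big)$,

with a learned dictionary $D$, step size $\eta$ and sparsity weight $\lambda$. The sketch below implements that step for one vector. How the paper parameterizes and places this layer inside the actor and critic is not specified here, so the function names and the square dictionary are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def matvec(A: Matrix, x: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A: Matrix) -> Matrix:
    return [list(col) for col in zip(*A)]


def ista_sparsify(z: List[float], D: Matrix, step: float, lam: float) -> List[float]:
    """One non-negative ISTA step: ReLU(z + step * D^T (z - D z) - step * lam).

    The D^T (z - D z) term is a gradient step on 0.5 * ||z - D a||^2 taken at
    a = z; subtracting step * lam and applying ReLU zeroes out small
    coordinates, giving a sparse, non-negative vector of the same length as z.
    """
    residual = [zi - dzi for zi, dzi in zip(z, matvec(D, z))]
    correction = matvec(transpose(D), residual)
    return [max(0.0, zi + step * ci - step * lam) for zi, ci in zip(z, correction)]
```

With $D$ equal to the identity, the residual $z - Dz$ vanishes and the step reduces to non-negative soft-thresholding, $\max(0, z - \eta\lambda)$. This makes the sparsifying effect of $\lambda$ easy to see.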

Figure 4: Discounted returns of SAC (blue) and sparse SAC (red) $\alpha_{\pi}$ .

The sparse variant achieves higher returns, especially in environments where the standard SAC agent fails to learn effectively. This supports the claim that exploiting the emergent low-dimensional structure can yield practical gains in sample efficiency and final performance.

The computational overhead of the sparsification layer is also quantified.

Figure 5: Steps per second for SAC (blue) and sparse SAC (red) as training progresses, showing a moderate decrease in throughput for the sparse variant.

While the sparse implementation reduces steps per second, the wall-clock cost is not prohibitive given the performance improvements.

Discussion and Theoretical Implications

The main theoretical contribution is the explicit connection between the geometry of the attainable state space and the action space dimension in neural RL. This result provides a rigorous foundation for the manifold hypothesis in RL, which has previously been assumed but not proven. The analysis also clarifies the role of overparameterization and the NTK regime in shaping the learning dynamics and the structure of the data generated by RL agents.

The findings have several implications:

Sample Complexity: Since the effective state space is low-dimensional, the sample complexity of RL algorithms may depend more on $d_a$ than on $d_s$, suggesting new directions for theory and algorithm design.
Representation Learning: Explicitly learning or exploiting low-dimensional representations can improve learning efficiency and generalization, as demonstrated empirically.
Extension to Stochastic and High-Dimensional Settings: The current analysis assumes deterministic transitions and a fixed $d_s$. Extending the theory to stochastic environments and to the regime $d_s \to \infty$ remains an open challenge.
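The following is a back-of-the-envelope illustration only, not an argument made in the paper. Covering a region of intrinsic dimension $d$ at resolution $\varepsilon$ takes on the order of $\varepsilon^{-d}$ cells. With hypothetical values $d_s = 17$, $d_a = 6$ and $\varepsilon = 0.1$:

$$\frac{\varepsilon^{-d_s}}{\varepsilon^{-(2d_a+1)}} = \varepsilon^{-(d_s - 2d_a - 1)} = 10^{17-13} = 10^{4}.$$

Under this crude counting, a method whose sample requirements scale with the manifold dimension rather than the ambient dimension would need roughly $10^{4}$ times fewer cells.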

Future Directions

Potential avenues for further research include:

Extending the geometric analysis to deeper networks, alternative activation functions, and stochastic environments.
Developing RL algorithms that adaptively exploit the emergent manifold structure for improved exploration and credit assignment.
Investigating the interplay between the geometry of the attainable state manifold and the expressivity or generalization properties of RL agents in more complex domains.

Conclusion

This work establishes a rigorous geometric perspective on neural RL in continuous domains, demonstrating that the set of attainable states under wide, two-layer NN policies is confined to a low-dimensional manifold whose dimension is controlled by the action space. Theoretical results are corroborated by empirical evidence across standard benchmarks, and practical benefits are realized by incorporating sparse representation learning into RL architectures. These insights advance the theoretical understanding of RL in high-dimensional settings and suggest concrete strategies for improving algorithmic performance by leveraging the intrinsic geometry of the problem.
