Scaling Effects and Uncertainty Quantification in Neural Actor Critic Algorithms

Published 25 Jan 2026 in cs.LG and math.PR | (2601.17954v1)

Abstract: We investigate the neural Actor Critic algorithm using shallow neural networks for both the Actor and Critic models. The focus of this work is twofold: first, to compare the convergence properties of the network outputs under various scaling schemes as the network width and the number of training steps tend to infinity; and second, to provide precise control of the approximation error associated with each scaling regime. Previous work has shown convergence to ordinary differential equations with random initial conditions under inverse square root scaling in the network width. In this work, we shift the focus from convergence speed alone to a more comprehensive statistical characterization of the algorithm's output, with the goal of quantifying uncertainty in neural Actor Critic methods. Specifically, we study a general inverse polynomial scaling in the network width, with an exponent treated as a tunable hyperparameter taking values strictly between one half and one. We derive an asymptotic expansion of the network outputs, interpreted as statistical estimators, in order to clarify their structure. To leading order, we show that the variance decays as a power of the network width, with an exponent equal to one half minus the scaling parameter, implying improved statistical robustness as the scaling parameter approaches one. Numerical experiments support this behavior and further suggest faster convergence for this choice of scaling. Finally, our analysis yields concrete guidelines for selecting algorithmic hyperparameters, including learning rates and exploration rates, as functions of the network width and the scaling parameter, ensuring provably favorable statistical behavior.


Summary

The paper derives statistical bias and variance decompositions for actor and critic outputs using asymptotic expansions with respect to network width and scaling exponent beta.
It presents deterministic and stochastic contributions that guide hyperparameter tuning, ensuring rapid convergence and reduced output variance.
Empirical results confirm that tuning beta near one minimizes initialization effects, leading to stable and efficient policy convergence.


Overview of Objectives and Contributions

The paper "Scaling Effects and Uncertainty Quantification in Neural Actor Critic Algorithms" (2601.17954) provides an analytical and empirical study of the asymptotic scaling regimes for neural actor-critic reinforcement learning algorithms, focusing on shallow networks with width $N \to \infty$. The primary contributions are:

The derivation and rigorous characterization of the statistical bias and variance for actor and critic neural outputs across a continuum of scaling exponents $\beta \in (\frac{1}{2}, 1)$.
Construction of asymptotic expansions for the network outputs, which detail both deterministic and stochastic contributions, thus enabling a fine-grained analysis of estimator uncertainty.
Identification of concrete, scaling-law-based prescriptions for algorithmic hyperparameters (learning rates, exploration rates) that guarantee controlled statistical behavior.
Empirical demonstration of the impact of the scaling parameter $\beta$ on the speed of convergence (bias decay) and estimator robustness (variance reduction).

Theoretical Analysis: Asymptotic Expansions and Bias-Variance Decomposition

The paper employs stochastic process and measure-theoretic analysis to demonstrate that, under specific scaling constraints on network parameters and learning rates, the outputs of shallow actor and critic networks can be expanded as:

$Q_t^N \approx Q_t^{(0)} + N^{\beta-1} Q_t^{(1)} + \cdots + N^{\frac{1}{2}-\beta} Q_t^{(n)}$

$P_t^N \approx P_t^{(0)} + N^{\beta-1} P_t^{(1)} + \cdots + N^{\frac{1}{2}-\beta} P_t^{(n)}$

with each $Q_t^{(j)}$, $P_t^{(j)}$ satisfying a deterministic evolution equation, except for the highest-order term, which satisfies a stochastic one. The scaling exponent $\beta$ controls the order at which random fluctuations enter. Crucially, this reveals:

Bias Reduction: The non-random terms decay with training time and network width, and the bias corrections scale as $N^{\max\{\beta-1, \frac{1}{2}-\beta\}}$. This exponent is smallest at $\beta = 3/4$, where the two terms coincide at $-\frac{1}{4}$.
Variance Reduction: The variance term scales as $N^{\frac{1}{2}-\beta}$, which decreases strictly as $\beta \to 1$. Choosing $\beta$ near $1$ (the mean-field regime) therefore gives the strongest suppression of random initialization effects.
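The exponents quoted above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not code from the paper: the function names and the grid of $\beta$ values are illustrative assumptions, and the exponents are taken directly from the expressions stated in this summary.

```python
def deterministic_exponent(beta: float) -> float:
    """Exponent of N on the first correction term, N^(beta - 1)."""
    return beta - 1.0


def fluctuation_exponent(beta: float) -> float:
    """Exponent of N on the stochastic term, N^(1/2 - beta)."""
    return 0.5 - beta


def dominant_exponent(beta: float) -> float:
    """Exponent of the slowest-decaying correction, max{beta - 1, 1/2 - beta}."""
    if not 0.5 < beta < 1.0:
        raise ValueError("beta is assumed to lie strictly between 1/2 and 1")
    return max(deterministic_exponent(beta), fluctuation_exponent(beta))


# Hypothetical grid 0.51, 0.52, ..., 0.99 (an illustration, not the paper's grid).
betas = [0.5 + k / 100 for k in range(1, 50)]

# The beta whose slowest-decaying correction vanishes fastest as N grows.
best_beta = min(betas, key=dominant_exponent)
```

On this grid `best_beta` is 0.75, where both exponents equal -1/4. The fluctuation exponent alone keeps decreasing past that point, which is why the variance argument favors $\beta$ near $1$ while the combined correction term is balanced at $\beta = 3/4$.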

Implications for algorithm design include the setting of learning rate and exploration schedules as explicit functions of $N$ and $\beta$, validated by proofs and error bounds.

Large-Time Convergence and Control of Estimator Uncertainty

The convergence of the critic network to the true state-action value function and the actor to a stationary optimal policy is established under the prescribed scaling and ergodicity assumptions. The rate of convergence can be directly controlled via the exploration parameter schedule, tying hyperparameter choices to provably favorable statistical behavior. Moreover, the explicit bias-variance decomposition allows the practitioner to select $\beta$ to minimize the variance of policy/reward estimators in practical deployments.

Empirical Validation

The authors present numerical experiments on a forest management MDP, varying $\beta$ over $[\frac{1}{2},1]$ at network width $N = 10{,}000$, with Monte Carlo repetitions used to estimate convergence metrics and output variance.

Figure 1: The Reward and Actor MSE Loss as a function of training time.

The first set of experiments demonstrates that convergence to the optimal policy and maximum reward is fastest for $\beta$ close to $1$, reinforcing the theoretical prediction for the decay of the bias term. MSE loss curves and reward trajectories are consistent across repeated trials and decay faster as $\beta$ increases.

Figure 2: Standard deviation Monte Carlo estimates for the Actor, the Critic, and the Rewards as a function of training progress for different values of the scaling parameter $\beta$.

Variance analysis shows that output variance for actor, critic, and reward estimators decreases sharply as $\beta$ moves toward $1$, consistent with the asymptotic scaling $N^{\frac{1}{2}-\beta}$. This supports the claim that tuning $\beta$ improves statistical robustness, making the estimates less sensitive to initialization stochasticity and more stable in deployment.
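The $N^{\frac{1}{2}-\beta}$ scaling can be illustrated at initialization with a small Monte Carlo experiment. The sketch below is not the paper's experiment. It assumes a shallow network of the form $g(x) = N^{-\beta} \sum_{i=1}^{N} c_i \tanh(w_i x)$ with i.i.d. mean-zero $c_i$ (uniform on $[-1, 1]$) and standard Gaussian $w_i$. The activation, the distributions, the scalar input, and all function names are assumptions made for illustration, and no training takes place.

```python
import numpy as np


def init_output_std(width, beta, n_trials=1000, x=1.0, seed=0):
    """Monte Carlo estimate of the std of the scaled output at initialization.

    Assumed (hypothetical) parametrization:
        g(x) = width**(-beta) * sum_i c_i * tanh(w_i * x)
    """
    rng = np.random.default_rng(seed)
    c = rng.uniform(-1.0, 1.0, size=(n_trials, width))  # mean-zero outer weights
    w = rng.normal(size=(n_trials, width))               # Gaussian inner weights
    outputs = width ** (-beta) * np.sum(c * np.tanh(w * x), axis=1)
    return outputs.std()


def fitted_exponent(beta, widths=(100, 400, 1600)):
    """Slope of log(std) against log(width), estimated by least squares."""
    stds = [init_output_std(n, beta, seed=i) for i, n in enumerate(widths)]
    slope, _intercept = np.polyfit(np.log(widths), np.log(stds), 1)
    return slope
```

Under these assumptions the sum of $N$ independent mean-zero terms has standard deviation of order $N^{1/2}$, so the scaled output has standard deviation of order $N^{\frac{1}{2}-\beta}$, and `fitted_exponent(beta)` should land close to `0.5 - beta` up to Monte Carlo error. This only checks the effect of random initialization; it says nothing about how the fluctuations evolve during actor-critic training.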

Mathematical and Algorithmic Implications

From a mathematical perspective, the work generalizes shallow network scaling analyses from the regression domain to the RL setting, where Markov chain sampling in SGD training introduces additional technical complications. Notably, the non-homogeneity and slow time variation of the chains required the development of new martingale bounds and stationary distribution fluctuation characterizations.

Practically, the findings indicate that scaling the output of both networks with a large $\beta$ (the mean-field or feature-learning regime), combined with learning and exploration rates matched to the derived scaling laws, enables actor-critic systems to achieve both rapid convergence and minimal estimator uncertainty for large $N$. The prescriptions cover network initialization, learning rates $\sim N^{2\beta-2}$, and exploration decay.
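As a rough illustration of the learning-rate prescription, the helper below evaluates $N^{2\beta-2}$ for a given width and exponent. It is a sketch under assumptions: the function name and the `base_rate` prefactor are hypothetical, and the summary gives only the scaling in $N$, not the constant or any time dependence.

```python
def scaled_learning_rate(width: int, beta: float, base_rate: float = 1.0) -> float:
    """Learning rate proportional to width**(2*beta - 2).

    `base_rate` is a hypothetical constant prefactor; only the power of
    `width` comes from the scaling law quoted in the text.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if not 0.5 <= beta <= 1.0:
        raise ValueError("beta is assumed to lie in [1/2, 1]")
    return base_rate * width ** (2.0 * beta - 2.0)
```

For example, at width 10,000 the factor is about 0.01 for $\beta = 3/4$ and exactly 1 for $\beta = 1$, so smaller $\beta$ calls for a much smaller step size at the same width.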

Future Directions

Potential avenues for further research include:

Extension to multi-layer (deep) architectures, investigating the layer-wise sensitivity to the scaling parameter as documented in deep regression settings.
Analysis of other optimization algorithms (e.g., Adam, RMSProp) within the scaling framework developed here.
Investigation of more complex, non-stationary environments and richer RL domains.

Conclusion

This paper provides a rigorous, quantitative framework for understanding and controlling the scaling-induced bias and variance in neural actor-critic RL algorithms. By linking asymptotic behavior to algorithmic hyperparameters, it enables practitioners to optimally tune shallow neural network estimators for statistical efficiency and robustness, and sets the stage for future studies in large-scale, mean-field RL modeling and uncertainty quantification.