Origin of the persistent bias in neural actor–critic algorithms

Establish whether the distinct bias–variance trade-off observed for the shallow neural actor–critic reinforcement-learning algorithm trained via stochastic gradient descent under 1/N^β scaling (with β ∈ (1/2, 1)) arises from the algorithm’s online learning nature and/or from the coupled dynamics between the actor and critic networks, by proving that these factors induce a more persistent bias component than in neural network regression.
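Throughout, 1/N^β scaling refers to normalizing the network output by N^{-β}. A standard width-N parameterization under this scaling (an assumed generic form; the paper's exact architecture may differ, e.g. in bias terms or activation) is

\[
f^{N}(x;\theta) \;=\; \frac{1}{N^{\beta}} \sum_{i=1}^{N} c^{i}\,\sigma\!\left(w^{i}\cdot x\right), \qquad \beta \in \left(\tfrac{1}{2},\,1\right),
\]

with trainable parameters θ = (c^i, w^i)_{i=1}^N and activation σ; the range β ∈ (1/2, 1) interpolates between the usual 1/√N (CLT-type) scaling and the mean-field 1/N scaling.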

Background

The paper derives asymptotic expansions for the actor and critic outputs in a shallow neural actor–critic algorithm trained via stochastic gradient descent, providing a bias–variance decomposition under general 1/N^β scaling with β ∈ (1/2, 1). The results show that the variance decreases as β approaches 1 and identify the leading-order bias and variance terms.
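Schematically, such an expansion writes the finite-width output h^N_t (actor or critic) at training time t as its infinite-width limit plus two correction terms,

\[
h^{N}_{t}(x) \;=\; h_{t}(x) \;+\; \varepsilon^{\mathrm{bias}}_{N}\, b_{t}(x) \;+\; \varepsilon^{\mathrm{var}}_{N}\, G_{t}(x) \;+\; \text{higher-order terms},
\]

where b_t is a deterministic bias term, G_t is a Gaussian fluctuation, and the rates ε^{bias}_N and ε^{var}_N depend on β (this is a generic template; the paper's exact exponents are not reproduced here). The observation that variance decreases as β approaches 1 corresponds to ε^{var}_N shrinking with β, while the open question concerns the behavior of the bias term b_t as training time grows.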

Unlike neural network regression, where the bias diminishes rapidly with training time, the authors observe a distinct and more persistent bias component in the actor–critic setting. They posit that this difference may stem from the online nature of learning in reinforcement learning and the coupled evolution of the actor and critic parameters, but this causal explanation is not established within the paper.
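To make the two suspected mechanisms concrete, here is a minimal NumPy sketch of an online actor–critic loop with 1/N^β-scaled shallow networks. Everything in it is an illustrative assumption rather than the paper's construction: the toy linear state dynamics, the quadratic cost, the Gaussian policy, and the step sizes. The point is structural: each transition is seen exactly once (online learning), and a single TD error simultaneously drives the critic and the actor updates (coupled dynamics).

```python
import numpy as np

rng = np.random.default_rng(0)

N, beta, d = 512, 0.75, 4            # width, scaling exponent, state dim
scale = N ** (-beta)                 # 1/N^beta output scaling

# Shallow critic V(x) and actor mean mu(x): inner weights, outer coefficients.
Wc, cc = rng.normal(size=(N, d)), rng.normal(size=N)
Wa, ca = rng.normal(size=(N, d)), rng.normal(size=N)

def value(x):                        # critic output under 1/N^beta scaling
    return scale * cc @ np.tanh(Wc @ x)

def policy_mean(x):                  # actor output under the same scaling
    return scale * ca @ np.tanh(Wa @ x)

gamma, lr, sigma = 0.95, 0.05, 0.5   # illustrative constants
x = rng.normal(size=d)               # initial state of a toy linear system

for t in range(10_000):              # online: one fresh transition per step
    mu = policy_mean(x)
    a = mu + sigma * rng.normal()    # Gaussian policy
    r = -x @ x                       # illustrative quadratic reward
    x_next = 0.9 * x + 0.1 * a + 0.1 * rng.normal(size=d)

    # One TD error drives BOTH networks: this is the actor-critic coupling.
    delta = r + gamma * value(x_next) - value(x)

    # Semi-gradient TD(0) critic update.
    hc = np.tanh(Wc @ x)
    grad_cc = delta * scale * hc
    grad_Wc = delta * scale * np.outer(cc * (1 - hc ** 2), x)
    cc += lr * grad_cc
    Wc += lr * grad_Wc

    # Policy-gradient actor update with delta as the advantage estimate;
    # score = d log pi(a|x) / d mu for the Gaussian policy.
    ha = np.tanh(Wa @ x)
    score = (a - mu) / sigma ** 2
    grad_ca = delta * score * scale * ha
    grad_Wa = delta * score * scale * np.outer(ca * (1 - ha ** 2), x)
    ca += lr * grad_ca
    Wa += lr * grad_Wa

    x = x_next
```

By contrast, replacing delta with a supervised residual against fixed i.i.d. targets recovers the regression setting, where the bias is observed to decay quickly with training time; the sketch is only meant to exhibit the "online plus coupled" structure behind the conjecture, not the paper's algorithm or step-size schedule.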

References

Unlike the neural network regression problem, our results reveal a distinct bias–variance trade-off in the actor–critic framework. We conjecture that this behavior arises from the algorithm’s online learning nature and/or the coupled dynamics of the actor and critic networks, which induce a more persistent bias component.

Scaling Effects and Uncertainty Quantification in Neural Actor Critic Algorithms (2601.17954 - Georgoudios et al., 25 Jan 2026) in Conclusion (before Appendices)