Origin of the persistent bias in neural actor–critic algorithms
Establish whether the distinct bias–variance trade-off observed for the shallow neural actor–critic reinforcement-learning algorithm, trained via stochastic gradient descent under 1/N^β output scaling with β ∈ (1/2, 1), arises from the algorithm's online-learning nature and/or from the coupled dynamics of the actor and critic networks, by proving that these factors induce a bias component more persistent than the one arising in neural network regression.
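To make the scaling regime concrete, here is a minimal sketch of a one-hidden-layer (shallow) network with 1/N^β output scaling, the parametrization referenced above. All hyperparameters (width N, input dimension, ReLU activation) are illustrative assumptions, not specifics from the source; β = 1/2 corresponds to the NTK-style scaling, and β ∈ (1/2, 1) interpolates toward the mean-field scaling at β = 1.

```python
import numpy as np

def shallow_net(x, a, W, beta):
    """One-hidden-layer ReLU network with 1/N^beta output scaling:
    f(x) = N^(-beta) * sum_i a_i * relu(w_i . x)."""
    N = W.shape[0]
    return float((N ** -beta) * (a @ np.maximum(W @ x, 0.0)))

rng = np.random.default_rng(0)
N, d = 512, 4                      # hidden width and input dim (illustrative)
a = rng.standard_normal(N)        # output-layer weights
W = rng.standard_normal((N, d))   # hidden-layer weights
x = rng.standard_normal(d)

# Increasing beta shrinks the output by a factor N^-(beta - 1/2)
# relative to the beta = 1/2 (NTK-style) scaling.
out_ntk = shallow_net(x, a, W, 0.5)
out_mid = shallow_net(x, a, W, 0.75)
print(out_ntk, out_mid)
```

In an actor–critic instantiation, both the actor (policy) and critic (value) networks would use this parametrization, with SGD updates applied online as transitions arrive; the conjecture concerns how that coupling affects the bias of the resulting estimates.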
References
Unlike the neural network regression problem, our results reveal a distinct bias–variance trade-off in the actor–critic framework. We conjecture that this behavior arises from the algorithm's online-learning nature and/or the coupled dynamics of the actor and critic networks, which induce a more persistent bias component.