
Neural Actor–Critic Algorithms

Updated 27 January 2026
  • Neural actor–critic algorithms are reinforcement learning methods that combine a neural network-based actor for action selection with a critic for estimating value functions.
  • They utilize temporal-difference errors and policy gradients for iterative policy improvement, supported by rigorous convergence and scaling analyses.
  • These methods are applied in high-dimensional control, PDE solving, and state-of-the-art continuous control benchmarks in both simulated and real-world settings.

Neural actor–critic algorithms constitute a class of reinforcement learning (RL) methods in which both the policy (“actor”) and the value function (“critic”) are parameterized by neural networks. These algorithms employ the actor to select actions, while the critic estimates value functions to provide learning signals for the actor, typically in the form of temporal-difference (TD) errors or policy gradients. They have become foundational in modern deep RL, continuous control, constrained RL, and as frameworks for solving high-dimensional stochastic control and PDE problems. This article surveys neural actor–critic algorithms, emphasizing methodologies, convergence theory, key algorithmic variants, advanced applications, and ongoing challenges.

1. Core Principles and Algorithmic Framework

Neural actor–critic algorithms operate in Markov Decision Processes (MDPs), parameterizing the policy $\pi_\theta(a|s)$ and value function (or action-value function) $V_\phi(s)$ / $Q_\phi(s,a)$ via deep neural networks. The typical learning loop alternates between policy improvement (actor step) and policy (or value) evaluation (critic step), with information transfer mediated through gradients or TD errors.

  • Actor update: The policy network parameters $\theta$ are adjusted to maximize expected return, often via the policy gradient theorem, e.g., using

$$\nabla_\theta J(\theta) = \mathbb{E}_{s,a}\left[ \nabla_\theta \log \pi_\theta(a|s)\, A^\pi(s,a) \right]$$

where $A^\pi(s,a)$ is the advantage function, estimated by the critic.

  • Critic update: The value or Q-network $\phi$ is trained to fit $V^\pi$ or $Q^\pi$ via TD learning, e.g., minimizing

$$L(\phi) = \mathbb{E}\left[\left(r + \gamma V_\phi(s') - V_\phi(s)\right)^2\right]$$

or the equivalent for Q-functions.

  • Function approximation: Both actor and critic employ deep NNs. Architectures range from multi-layer perceptrons (MLPs) to convolutional nets (for images) and LSTMs (for sequential data), depending on observations.
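The alternating actor and critic updates above can be sketched in a few lines. The following is a minimal illustration, with a tabular softmax "network" standing in for the actor and critic; the toy MDP, step sizes, and iteration count are illustrative assumptions, not taken from any of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP: action 1 yields reward 1 in either state,
# action 0 yields reward 0; transitions are uniform. (Illustrative only.)
n_states, n_actions, gamma = 2, 2, 0.9

theta = np.zeros((n_states, n_actions))  # actor: tabular softmax policy parameters
v = np.zeros(n_states)                   # critic: tabular value estimates

def policy(s):
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

s = 0
for _ in range(2000):
    p = policy(s)
    a = rng.choice(n_actions, p=p)
    r = float(a == 1)
    s_next = rng.integers(n_states)

    # Critic step: TD(0) update toward the target r + gamma * V(s')
    td = r + gamma * v[s_next] - v[s]
    v[s] += 0.1 * td

    # Actor step: policy-gradient update using the TD error as an
    # advantage estimate; grad log pi = one_hot(a) - pi(.|s)
    grad_log_pi = -p
    grad_log_pi[a] += 1.0
    theta[s] += 0.05 * td * grad_log_pi

    s = s_next

# The learned policy should strongly prefer the rewarding action 1.
print(policy(0)[1], policy(1)[1])
```

Replacing the tabular parameters with neural networks (and the manual gradient with automatic differentiation) yields the canonical deep actor–critic loop.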

This canonical structure underlies off-policy and on-policy actor–critic methods, as well as extensions to constrained and high-dimensional settings (Zahavy et al., 2020, Wang et al., 2023, Cohen et al., 8 Jul 2025, Zhou et al., 2021, Banerjee et al., 2022).

2. Convergence Theory and Scaling Limits

The convergence behavior of neural actor–critic algorithms has been the subject of intensive theoretical studies. Analyses can be categorized by regime:

  • Single-layer (shallow) and two-layer scaling: In the infinite-width limit, parameter updates can be shown to converge to deterministic ordinary differential equations (ODEs) in function space, with the limiting ODEs describing the evolution of the entire function approximator (Lam et al., 2024, Georgoudios et al., 25 Jan 2026). For actor–critic with shallow NNs and scaling of output weights by $N^{-\beta}$, the variance of the estimator behaves as $O(N^{1/2-\beta})$, with $\beta \in (1/2, 1)$ controlling the bias–variance tradeoff (Georgoudios et al., 25 Jan 2026).
  • Mean-field and Wasserstein-flow perspectives: For over-parameterized two-layer NNs, the training dynamics admit a mean-field limit: the critic evolves under a Wasserstein semigradient flow (continuity equation in parameter space) and the actor evolves via ODEs resembling replicator dynamics or softmax policy-improvement flows. Global convergence, sublinear regret, and persistence-of-exploration properties can be established under mild assumptions (Zhang et al., 2021, Lam et al., 2024).
  • Finite-time and sample complexity bounds: Recent works with neural natural actor–critic (NAC) variants prove explicit non-asymptotic bounds: sample complexity $\tilde{O}(1/\epsilon^5)$ for actor–critic with entropy regularization and two-layer ReLU networks, with precise dependencies on discount, regularization, and network width (Cayci et al., 2022). Single-timescale neural actor–critic with deep ReLU nets achieves global $O(K^{-1/2})$ convergence in the number of updates (Fu et al., 2020).
  • Representation learning and stability: In neural actor–critic, critic feature representations may evolve during training. For overparametrized regimes, it is proven that the critic features remain in a bounded neighborhood of their random initialization, preserving the favorable properties of the so-called "lazy training" regime (Zhang et al., 2021).
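The effect of the $N^{-\beta}$ output scaling can be checked numerically. The sketch below (all parameters hypothetical) samples random initializations of a width-$N$ shallow network and verifies that the output variance shrinks roughly like $N^{1-2\beta}$, i.e., fluctuations of order $N^{1/2-\beta}$, consistent with the scaling analysis above:

```python
import numpy as np

rng = np.random.default_rng(0)

def shallow_net(x, n, beta):
    """One random initialization of a width-n shallow net with N^{-beta} output scaling."""
    w = rng.normal(size=n)   # hidden-layer weights
    c = rng.normal(size=n)   # output weights
    return n ** (-beta) * np.sum(c * np.tanh(w * x))

x, beta, trials = 1.0, 0.75, 2000
variances = {}
for n in (10, 100, 1000):
    outs = [shallow_net(x, n, beta) for _ in range(trials)]
    variances[n] = np.var(outs)
    # variance shrinks roughly like n^(1 - 2*beta) = n^(-1/2) for beta = 3/4
    print(n, variances[n])
```

For $\beta = 1/2$ the variance would stay $O(1)$ (the classical mean-field normalization), while $\beta \to 1$ trades variance for bias, matching the tradeoff described above.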

These analyses establish both global optimality and explicit convergence rates under suitable scaling, stepsize, and regularization choices.

3. Algorithmic Variants and Extensions

Numerous advanced neural actor–critic frameworks have been proposed, extending the original schema in various directions:

  • Natural Actor–Critic (NAC) and Compatible Gradients: NAC algorithms exploit the natural gradient $F(\theta)^{-1}\nabla_\theta J(\theta)$, with the Fisher information matrix $F(\theta)$ and compatible function approximation, showing improved stability and sample efficiency, especially under off-policy sampling (Diddigi et al., 2021, Cayci et al., 2022, Gaur et al., 2023).
  • Single-Timescale and Two-Timescale Architectures: In single-timescale algorithms, actor and critic are updated synchronously using identical step-size and shared TD-error signals, with convergence to a ball around the local optimum (0909.2934, Fu et al., 2020). Two-timescale methods, including mean-field and Wasserstein-flow frameworks, separate fast critic and slow actor updates, yielding better tracking and convergence properties in the infinite-width limit (Zhang et al., 2021).
  • Self-Tuning and Meta-Learning Actor–Critic: Algorithms like Self-Tuning Actor-Critic (STAC) embed hyperparameters in the computational graph and adapt them online via meta-gradient descent, enabling inclusion of discount, trace, loss weights, and importance sampling factors as differentiable meta-parameters and yielding large sample-efficiency improvements (Zahavy et al., 2020).
  • Dual Actor–Critic: Dual-AC formulates the actor–critic problem as a saddle-point optimization derived from the Lagrangian dual of the Bellman equation, employing path regularization, trust-region updates, and prioritized reweighting for improved stability and sample-efficiency (Dai et al., 2017).
  • Constrained and Mean-Field Control Actor–Critic: Extensions to constrained RL employ single-loop algorithms with stochastic SCA for the actor and observation reuse for the critic, establishing almost-sure convergence to KKT points in CMDPs (Wang et al., 2023). For mean-field control (MFC), moment neural networks on the Wasserstein space are used in both actor and critic, with direct distributional trajectory sampling and accurate numerical performance in high dimensions (Pham et al., 2023).
  • Value-Improved Actor–Critic: In Value-Improved Actor-Critic (VI-AC), a non-parametric greedification operator is applied to the policy's value estimate, providing a second value-improvement step that enables more aggressive updates while maintaining stability (Oren et al., 2024).
  • Exploration Enhancement: Actor–critic methods augmented with intrinsic motivation such as plausible novelty–based reward shaping (IPNS) improve exploration in high-dimensional continuous control tasks, accelerating convergence and reducing variance (Banerjee et al., 2022).
  • Hamilton–Jacobi–Bellman PDEs and High-Dimensional Control: Neural actor–critic algorithms enable the direct solution of high-dimensional HJB PDEs by parameterizing value and control as neural networks, using variance-reduced least-squares TD for the critic and policy gradients for the actor, with proven empirical accuracy up to hundreds of dimensions (Zhou et al., 2021, Cohen et al., 8 Jul 2025).
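As a concrete illustration of the natural-gradient step used by NAC methods, the following sketch computes $F(\theta)^{-1}\nabla_\theta J(\theta)$ exactly for a single-state softmax policy with hypothetical advantage values; the ridge term `eps` is an assumption added to keep the (singular) Fisher matrix invertible:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def natural_gradient(theta, grad, eps=1e-3):
    """F(theta)^{-1} grad for a single-state softmax policy.

    With grad log pi(a) = e_a - pi, the Fisher matrix is
    F = diag(pi) - pi pi^T; a small ridge eps makes it invertible
    (F is singular along the all-ones direction).
    """
    pi = softmax(theta)
    fisher = np.diag(pi) - np.outer(pi, pi)
    return np.linalg.solve(fisher + eps * np.eye(len(theta)), grad)

theta = np.array([0.0, 1.0, 2.0])
adv = np.array([1.0, 0.0, -1.0])   # hypothetical advantages A(a)

# Vanilla policy gradient: g_j = pi_j * (A_j - E_pi[A])
pi = softmax(theta)
g = pi * (adv - pi @ adv)

ng = natural_gradient(theta, g)
# For softmax policies g = F A, so the natural gradient recovers the
# advantages (up to a constant shift), independent of how skewed pi is.
print(g, ng)
```

This preconditioning is what makes the natural-gradient update invariant to how the policy is parameterized, in contrast to the vanilla gradient, which is damped for low-probability actions.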

4. Applications and Empirical Performance

Neural actor–critic algorithms underpin a substantial fraction of state-of-the-art methods in contemporary RL and control:

  • Benchmarks: Architectures such as A3C, DDPG, TD3, SAC, STAC, and Dual-AC deliver strong performance on standard continuous control tasks (MuJoCo, DeepMind Control Suite) and diverse large-scale settings (Atari-57, real-world RL challenges) (Zahavy et al., 2020, Dai et al., 2017, Banerjee et al., 2022, Oren et al., 2024).
  • Empirical sample efficiency: Meta-gradient self-tuning, auxiliary heads, and off-policy corrections improve median or mean normalized scores on challenging benchmarks by factors of up to $\sim 1.5\times$ or more (Zahavy et al., 2020). Intrinsic exploration bonuses further accelerate learning and control variance in continuous domains (Banerjee et al., 2022).
  • PDE and control: Neural actor–critic methods with hard boundary constraints, Q-PDE gradients, and Hamiltonian minimization solve fully nonlinear stochastic control problems, linear-quadratic regulators, and mean-field games with sub-percent relative errors in high dimension $d$ (e.g., $d = 50, 100, 200$) (Cohen et al., 8 Jul 2025, Zhou et al., 2021, Pham et al., 2023).
  • Sequence modeling: Actor–critic models adapted to sequence generation tasks achieve improved task-specific metrics (e.g., BLEU), aligning training and test modes and surpassing teacher-forcing baselines (Bahdanau et al., 2016).

A summary table of selected variants and their distinctive features follows:

| Algorithm/Variant | Key Innovations | Rigorous Result(s) |
|---|---|---|
| Single-Timescale DNN AC | Simultaneous updates, deep NTK regime | Global opt., $O(K^{-1/2})$ rate (Fu et al., 2020) |
| Self-Tuning AC (STAC) | Meta-gradient hyperparameter adaptation, leaky V-trace | Increased sample efficiency (Zahavy et al., 2020) |
| Dual Actor–Critic (Dual-AC) | Bellman duality, saddle-point, multi-step, path-reg. | Stability, SOTA on control (Dai et al., 2017) |
| Natural AC w/ Two-layer Critic | Convex ReLU critic fitting + natural gradient | First sample-comp. for nonlinear critic (Gaur et al., 2023) |
| Neural AC for HJB PDEs | Hard boundary, Q-PDE update, Hamiltonian minimization | NTK-limit convergence; >200-D PDEs (Cohen et al., 8 Jul 2025) |

5. Implementation, Optimization, and Practical Guidelines

State-of-the-art neural actor–critic methods entail several implementation considerations:

  • Architectures: Deep convolutional and recurrent networks (Atari), multi-layer perceptrons (control, tabular), or specialized networks (moment nets for Wasserstein space control) are standard.
  • Optimization and scaling: Empirically successful implementations employ Adam or RMSProp, gradient clipping, entropy regularization, and batch normalization. Step-sizes are chosen with consideration for the scaling regime (e.g., $O(1/N)$ for the critic in the infinite-width NTK regime, and diminishing actor step-sizes $\sim 1/\log^2 t$ for sustained exploration) (Georgoudios et al., 25 Jan 2026, Lam et al., 2024).
  • Regularization: Entropy regularization ensures persistent exploration, avoiding collapse to deterministic policies, and is fundamental for sample complexity improvements and finite-time bounds (Cayci et al., 2022).
  • Exploration/exploitation tradeoff: Intrinsic rewards (state novelty, benefit functions) are added for plausible exploration. Clipped importance sampling corrects for off-policy data (Zahavy et al., 2020, Banerjee et al., 2022).
  • Greedification operators: Greedy/value-improved critics enable more aggressive learning, but must be balanced against variance; double critics and top-$k$ averaging schemes are commonly used for stability (Oren et al., 2024).
  • Distributional shift and function approximation: Uniform approximation bounds and regularization are essential for stability under dynamic distribution shifts across policies.
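The role of entropy regularization noted above can be made concrete with a minimal sketch: for a single-state (bandit) policy with hypothetical action values `q` and temperature `alpha`, gradient ascent on the entropy-regularized objective converges to softmax$(q/\alpha)$ rather than collapsing to a deterministic policy. The step size and clipping threshold are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical action values and entropy temperature.
q = np.array([1.0, 0.8, 0.0])
alpha = 0.5
theta = np.zeros(3)

for _ in range(5000):
    pi = softmax(theta)
    # Exact gradient of E_pi[q] + alpha * H(pi) w.r.t. the logits:
    # grad_j = pi_j * ((q_j - alpha*log pi_j) - E_pi[q - alpha*log pi])
    adv = q - alpha * np.log(pi)
    grad = pi * (adv - pi @ adv)
    # Gradient clipping, as is standard practice.
    norm = np.linalg.norm(grad)
    if norm > 1.0:
        grad /= norm
    theta += 0.1 * grad

# The entropy-regularized optimum is softmax(q / alpha): the policy
# stays stochastic, preserving exploration.
print(softmax(theta), softmax(q / alpha))
```

As $\alpha \to 0$ the optimum approaches the greedy deterministic policy, which is why the temperature (fixed or annealed) directly controls the exploration/exploitation tradeoff.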

6. Open Problems, Limitations, and Future Directions

Despite significant advances, several fundamental challenges persist:

  • Function Approximation Barriers: Accurate convergence guarantees exist only in certain scaling regimes (e.g., infinite width, NTK), and practical deep nets exhibit nontrivial nonconvexity and generalization error (Fu et al., 2020, Gaur et al., 2023).
  • Variance–Bias Tradeoff: Choices of the scaling exponent $\beta$ in network output normalization, stepsizes, and meta-parameters determine statistical robustness and learning speed; their quantification and optimal scheduling remain active areas of research (Georgoudios et al., 25 Jan 2026).
  • Representation Learning: While lazy training controls representation drift, fully leveraging end-to-end feature learning in actor–critic frameworks requires further theoretical and empirical exploration (Zhang et al., 2021).
  • Sample Complexity and Exploration: Achieving optimal or near-optimal sample complexity without strong concentrability assumptions, and efficient exploration in high dimensions, remain open.
  • Safe/Constrained Environments: Extensions to complex constrained MDPs and safety-critical domains are nascent, with algorithms like SLDAC providing almost-sure convergence under reasonable assumptions (Wang et al., 2023).
  • Generalization to Nonlinear Critics and Deep Architectures: Global optimality, not just local optimality, is established only in mean-field or NTK regimes, but real-world RL operates outside these regimes. Extending theory and scalable algorithms to deeper or more data-efficient critic parameterizations is an ongoing target (Gaur et al., 2023).

A plausible implication is that continued expansion of meta-learning, uncertainty quantification, and function-approximation-informed optimization will drive new advances in neural actor–critic methodology, as will application to unresolved domains such as uncertainty-aware control, deep mean-field RL, and PDEs at scale (Cohen et al., 8 Jul 2025, Georgoudios et al., 25 Jan 2026, Pham et al., 2023).
