
Optimistic Actor-Critic in Deep RL

Updated 19 January 2026
  • Optimistic Actor-Critic (OAC) is a continuous control algorithm that leverages upper confidence bounds and dual critics for uncertainty estimation, enhancing exploration efficiency.
  • It modifies standard actor-critic methods like SAC and TD3 by incorporating optimistic policy updates under KL constraints to mitigate overestimation bias.
  • Empirical evaluations on benchmarks and extensions like USAC and SOAR demonstrate improved sample efficiency and robustness in exploration and imitation learning.

Optimistic Actor-Critic (OAC) is a class of continuous control algorithms that employ upper confidence bounds on state-action value functions to address exploration inefficiencies in off-policy deep reinforcement learning. OAC modifies standard actor-critic architectures such as Soft Actor-Critic (SAC) and TD3 by coupling multiple critic networks to produce uncertainty estimates and then applying the principle of optimism in the face of uncertainty to generate exploratory behavior while mitigating Q-overestimation bias. The method has led to improved sample efficiency in benchmark continuous control environments and has inspired a series of extensions, including USAC and its application to imitation learning in SOAR.

1. Motivation and Background

Traditional off-policy actor-critic algorithms, such as SAC and TD3, employ ensembles of Q-functions to combat overestimation bias, using pessimistic targets like $Q_L(s,a) = \min\big(Q^1(s,a), Q^2(s,a)\big)$ (Ciosek et al., 2019). However, this pessimism combined with greedy actor updates leads to pessimistic under-exploration: agents inefficiently avoid actions outside regions of high critic confidence. Moreover, naive Gaussian policies sample actions symmetrically around the mean, disregarding directional informativeness and yielding directionally uninformed exploration. These deficiencies result in high sample requirements and suboptimal policy refinement.

OAC directly addresses these exploration bottlenecks by estimating both lower and upper confidence bounds on Q-values, driving data collection with optimism while preserving the stability benefits of minimization-based critics.

2. Construction of Optimistic Confidence Bounds

The core of OAC is a two-critic architecture yielding empirical mean and uncertainty estimates at each state-action pair:

$$\mu_Q(s, a) = \frac{1}{2}\big(Q^1(s, a) + Q^2(s, a)\big), \qquad \sigma_Q(s, a) = \frac{1}{2}\big|Q^1(s, a) - Q^2(s, a)\big|$$

Composite Q-functions are then defined:

$$Q_L(s, a) = \mu_Q(s, a) - \beta_{\text{LB}}\,\sigma_Q(s, a), \qquad Q_U(s, a) = \mu_Q(s, a) + \beta_{\text{UB}}\,\sigma_Q(s, a),$$

with $\beta_{\text{LB}} > 0$ for conservatism and $\beta_{\text{UB}} > 0$ for optimism (Ciosek et al., 2019). In practice, $\beta_{\text{UB}}$ values of 4–5 are typical for exploration, while $\beta_{\text{LB}} = 1$ recovers standard SAC/TD3 targets.
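The bound construction is straightforward to compute from a pair of critic outputs; the sketch below uses illustrative values and is not the paper's reference code:

```python
import numpy as np

def confidence_bounds(q1, q2, beta_lb=1.0, beta_ub=4.0):
    """OAC's lower/upper confidence bounds from two critic estimates.

    mu = (q1 + q2) / 2, sigma = |q1 - q2| / 2,
    Q_L = mu - beta_lb * sigma, Q_U = mu + beta_ub * sigma.
    """
    q1, q2 = np.asarray(q1, dtype=float), np.asarray(q2, dtype=float)
    mu = 0.5 * (q1 + q2)
    sigma = 0.5 * np.abs(q1 - q2)
    return mu - beta_lb * sigma, mu + beta_ub * sigma

# With beta_lb = 1, Q_L reduces to the pessimistic min(Q1, Q2) used by SAC/TD3.
q_l, q_u = confidence_bounds(10.0, 8.0, beta_lb=1.0, beta_ub=4.0)
```

Note that when the critics agree ($\sigma_Q = 0$), both bounds collapse to the shared estimate, so optimism only acts where the ensemble is uncertain.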

As generalized in Utility Soft Actor-Critic (USAC), the utility function over the critic ensemble $\mathcal Q$ is parameterized as

$$U^{\mathcal Q}_\lambda(s,a) = \frac{1}{\lambda} \log \mathbb{E}_{Q \sim \mathcal Q}\big[e^{\lambda Q(s,a)}\big]$$

and, under a Laplace approximation, simplified as

$$U^{\mathcal Q}_\kappa(s,a) = \mu_{\mathcal Q}(s,a) + g(\kappa)\,\sigma_{\mathcal Q}(s,a),$$

where $g(\kappa)$ is a monotonic function and $\kappa$ determines the optimism-pessimism bias. This formulation recovers OAC as the special case $\kappa_{\text{critic}} \approx -0.8316$, $g(\kappa_{\text{actor}}) = \beta$ (Tasdighi et al., 2024).
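The exponential-utility aggregation can be sketched as a numerically stable log-mean-exp over a finite critic ensemble; the function below is an illustrative implementation, not USAC's reference code:

```python
import numpy as np

def ensemble_utility(q_values, lam):
    """U_lambda = (1/lambda) * log E_{Q ~ ensemble}[exp(lambda * Q)].

    lam < 0 biases toward pessimism (-> min as lam -> -inf),
    lam > 0 toward optimism (-> max as lam -> +inf),
    lam -> 0 recovers the plain ensemble mean.
    """
    q = np.asarray(q_values, dtype=float)
    if abs(lam) < 1e-8:
        return q.mean()
    # Stable log-mean-exp via the usual max-shift trick.
    z = lam * q
    m = z.max()
    return (m + np.log(np.mean(np.exp(z - m)))) / lam
```

The single scalar $\lambda$ thus interpolates smoothly between the pessimistic minimum target and an optimistic maximum, which is what lets USAC choose $\lambda_{\text{critic}}$ and $\lambda_{\text{actor}}$ independently.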

3. Optimistic Actor Update and Policy Improvement

OAC implements a split actor architecture: a target policy $\pi_T$ for learning and an exploration policy $\pi_E$ for environment interaction. At each state, $\pi_E$ maximizes the upper bound $Q_U$ subject to a Kullback-Leibler divergence constraint from $\pi_T$:

$$(\mu_E, \Sigma_E) = \arg\max_{\mu, \Sigma}\; \mathbb{E}_{a\sim\mathcal{N}(\mu, \Sigma)}\big[Q_U(s, a)\big] \quad \text{s.t.} \quad \mathrm{KL}\big(\mathcal{N}(\mu, \Sigma)\,\|\,\mathcal{N}(\mu_T, \Sigma_T)\big) \le \delta$$

Because $Q_U$ is approximately linear in $a$ near $\mu_T$, the closed-form solution shifts $\mu_E$ in the direction of the optimistic gradient:

$$\Sigma_E = \Sigma_T, \qquad \mu_E = \mu_T + \sqrt{2\delta}\cdot\frac{\Sigma_T\, \nabla_a Q_U(s, a)|_{a=\mu_T}}{\big\| \nabla_a Q_U(s, a)|_{a=\mu_T} \big\|_{\Sigma_T}},$$

where $\|v\|_{\Sigma_T} = \sqrt{v^\top \Sigma_T\, v}$ (Ciosek et al., 2019). The target policy $\pi_T$ is updated to maximize the conservative lower bound $Q_L$ with entropy regularization.
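A minimal sketch of this closed-form shift, assuming a diagonal covariance $\Sigma_T$ (represented by its vector of variances) and a precomputed gradient of $Q_U$:

```python
import numpy as np

def optimistic_shift(mu_t, var_t, grad_qu, delta=0.1):
    """KL-constrained shift of the exploration mean along grad Q_U.

    Implements mu_E = mu_T + sqrt(2*delta) * Sigma_T g / ||g||_{Sigma_T}
    with Sigma_T = diag(var_t) and g the gradient of Q_U at mu_T; the
    resulting Gaussian sits exactly on the KL ball of radius delta.
    """
    mu_t = np.asarray(mu_t, dtype=float)
    var = np.asarray(var_t, dtype=float)          # diagonal of Sigma_T
    g = np.asarray(grad_qu, dtype=float)
    norm = np.sqrt(g @ (var * g))                 # ||g||_{Sigma_T}
    if norm < 1e-12:                              # no optimistic direction
        return mu_t
    return mu_t + np.sqrt(2.0 * delta) * (var * g) / norm
```

Since $\Sigma_E = \Sigma_T$, the KL divergence between $\pi_E$ and $\pi_T$ reduces to $\tfrac{1}{2}(\mu_E - \mu_T)^\top \Sigma_T^{-1} (\mu_E - \mu_T)$, and the shift above makes it exactly $\delta$.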

The actor update in USAC maximizes the entropy-regularized utility:

$$\max_{\pi}\; \mathbb{E}_{s\sim D,\, a\sim\pi}\big[U^{\tilde{\mathcal Q}}_{\lambda_{\text{actor}}}(s,a) - \alpha \log \pi(a \mid s)\big],$$

where $\tilde{\mathcal Q}$ denotes the current critic ensemble (Tasdighi et al., 2024).

4. Critic Update and Stability

Both OAC and USAC update the critic networks using a bootstrapped Bellman backup. OAC employs the pessimistic minimum over target critics:

$$Q^i_w(s_t, a_t) \leftarrow r_t + \gamma \min_{j=1,2} Q^j_{\bar w}(s_{t+1}, a'), \quad a' \sim \pi_T(\cdot \mid s_{t+1}),$$

and applies Polyak averaging for target networks (Ciosek et al., 2019). USAC generalizes this by employing the utility function with an independently chosen $\lambda_{\text{critic}}$ for the critic's bootstrapped target:

$$y(s,a,r,s') = r + \gamma \big[U^{\bar{\mathcal Q}}_{\lambda_{\text{critic}}}(s', a') - \alpha \log \pi(a' \mid s')\big]$$

Empirical ensemble statistics ($\mu_{\mathcal Q}, \sigma_{\mathcal Q}$) are computed at each backup step.
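A scalar sketch of the pessimistic target and Polyak averaging; the scalar reward, list-of-arrays parameter layout, and a termination flag are simplifying assumptions, not the reference implementation:

```python
import numpy as np

def oac_critic_target(r, gamma, q_targets_next, done=False):
    """Pessimistic TD target: y = r + gamma * (1 - done) * min_j Qbar_j(s', a')."""
    q_min = float(np.min(q_targets_next))
    return r + gamma * (1.0 - float(done)) * q_min

def polyak_update(target_params, online_params, tau=0.005):
    """Slow target tracking: theta_bar <- (1 - tau) * theta_bar + tau * theta."""
    return [(1.0 - tau) * tp + tau * op
            for tp, op in zip(target_params, online_params)]
```

Swapping `np.min` for a utility such as the log-mean-exp aggregation would yield the USAC-style target with its own $\lambda_{\text{critic}}$.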

This approach stabilizes value estimation by suppressing the overestimation bias that arises when function approximation, bootstrapping, and off-policy learning combine (the "deadly triad"), while the decoupling of actor and critic optimism allows for separate control over exploitation and exploration.

5. Algorithmic and Implementation Details

A typical OAC/SOAR update loop comprises:

  • Initializing the actor $\pi_\phi$, $L$ critic networks $\{Q_{\theta_\ell}\}$, and their targets.
  • Iteratively (a) collecting environment rollouts with the exploration policy, (b) updating critics using the optimistic utility-based targets, and (c) updating the actor via policy gradients on the optimistic Q-value (Viel et al., 2025).
  • Standard implementation choices include actor/critic MLPs (hidden layer size 256), batch size 256, replay buffer size $10^6$, Polyak factor $\tau$ (e.g., 0.005), and entropy weight $\alpha$ (fixed or tuned by dual gradient descent).
  • Ensemble sizes are two (OAC/USAC) or four (SOAR); in SOAR, standard-deviation estimates are clipped for stability.
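The loop above can be illustrated end to end on a toy one-dimensional bandit; the fixed disagreeing "critics", finite-difference gradients, and all constants below are assumptions for demonstration, with the critic update step omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two fixed, disagreeing scalar "critics" stand in for learned networks.
q1 = lambda a: -(a - 2.3) ** 2
q2 = lambda a: -(a - 1.7) ** 2
q_l = lambda a: min(q1(a), q2(a))                  # pessimistic bound Q_L
q_u = lambda a: (0.5 * (q1(a) + q2(a))
                 + 4.0 * 0.5 * abs(q1(a) - q2(a)))  # optimistic Q_U, beta_ub = 4

def fd_grad(f, x, eps=1e-4):
    """Finite-difference gradient, standing in for autodiff in this sketch."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

mu_t, std, lr, delta = 0.0, 0.5, 0.05, 0.2
actions = []
for step in range(400):
    # (a) rollout: sample from the KL-shifted exploration policy pi_E
    g = fd_grad(q_u, mu_t)
    shift = np.sqrt(2 * delta) * std * np.sign(g) if abs(g) > 1e-12 else 0.0
    actions.append(rng.normal(mu_t + shift, std))
    # (b) critic update would refit q1, q2 to TD targets here (omitted)
    # (c) actor update: gradient ascent on the pessimistic bound Q_L
    mu_t += lr * fd_grad(q_l, mu_t)
```

The target mean settles near $\arg\max_a Q_L(a) = 2.0$ (the midpoint of the two critic peaks), while exploration actions are drawn from a mean shifted toward high $Q_U$, illustrating the actor/explorer split.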

OAC incurs almost no extra computation relative to SAC/TD3 since the UCB gradient is computed via standard autodiff and the ensemble does not substantially increase overhead (Ciosek et al., 2019).

6. Theoretical Guarantees and Insights

USAC rigorously extends the proof techniques of SAC, guaranteeing soft-policy improvement, $J_{\alpha}(\pi_{\text{new}}) \geq J_{\alpha}(\pi_{\text{old}})$, when the critic ensemble collapses to a Dirac measure at $Q^\pi$ (Tasdighi et al., 2024). The OAC analysis shows that maximizing optimistic upper bounds under KL constraints prevents premature collapse of the policy variance, sustaining exploration in directions favored by uncertainty.

In SOAR, tabular instantiations yield regret bounds and a sample complexity of $\tilde{O}\!\left(\frac{S^4 A}{(1-\gamma)^5 \epsilon^2}\right)$ per environment, matching the best-known imitation learning guarantees (Viel et al., 2025).

7. Empirical Evaluation and Impact

OAC has been benchmarked on MuJoCo environments including Ant, Hopper, HalfCheetah, Humanoid, and Walker2d. Notable results include:

| Environment | SAC Avg Return | USAC Avg Return | Sample Efficiency (M steps for return 5000) |
|---|---|---|---|
| Ant-v2/v4 | 4756 ± 1411 | 5139 ± 978 | — |
| HalfCheetah-v4 | 10763 ± 895 | 11024 ± 849 | — |
| Hopper-v4 | 3185 ± 537 | 3194 ± 810 (3442) | — |
| Humanoid-v2/v4 | 5503 ± 373 | 5602 ± 505 | SAC: ~3M; OAC: ~2.2M |
| Walker2d-v4 | 3757 ± 1282 | 4525 ± 534 | — |

OAC yields a 10–20% reduction in the number of sample steps needed to reach benchmark returns on Humanoid, with performance robust to the KL-threshold hyperparameter and run-to-run variance across random seeds comparable to SAC (Ciosek et al., 2019, Tasdighi et al., 2024). Ablation studies confirm the necessity of optimism: replacing the upper bound with the lower bound during exploration degrades performance.

SOAR enhances imitation learning algorithms (f-IRL, ML-IRL, CSIL), consistently halving the number of episodes needed to reach baseline performance by using an ensemble-based optimistic critic (Viel et al., 2025).

8. Extensions and Generalizations

The USAC framework generalizes OAC by decoupling pessimism and optimism via interpretable parameters (λcritic,λactor)(\lambda_{\text{critic}}, \lambda_{\text{actor}}), enabling independent adjustment of bootstrapped value estimates and exploration incentives. This “two-parameter utility framework” allows simultaneous bias reduction and directed exploration, overcoming limitations of SAC, TD3, and OAC’s unidirectional optimism (Tasdighi et al., 2024).

In imitation learning, the SOAR template demonstrates transferability of OAC principles, boosting algorithmic efficiency and providing provable sample complexity in tabular settings. SOAR’s ensemble critic construction and optimistic aggregation elevate exploration in regions of high uncertainty while leveraging standard SAC mechanisms.

A plausible implication is that further scaling of actor-critic utility parameterization, with structured ensemble diversity and uncertainty quantification, will continue to enhance the exploration-exploitation trade-off in high-dimensional RL.

9. Practical Considerations and Limitations

OAC and its variants maintain plug-and-play compatibility with SAC/TD3 architectures, with computational costs comparable to dual-critic approaches. Hyperparameter sensitivity (e.g., $\sqrt{2\delta}$ in the KL constraint, $\beta_{\text{UB}}$ for optimism) is moderate, with performance robust over wide ranges.

Over-optimism can induce risk-prone behavior and poor convergence if not properly balanced: strong optimism aids exploration in poorly explored regions but must be controlled to preserve stability and consistent expected returns. The decoupling introduced in USAC mitigates this instability by allowing risk-neutral, pessimistic, or optimistic configurations for the actor and critic independently.

No known controversies regarding OAC's effectiveness have been raised in the cited literature, though adaptability to high-dimensional or partially observable environments remains an ongoing research direction.

