
Optimistic Actor-Critic in Deep RL

Updated 19 January 2026
  • Optimistic Actor-Critic (OAC) is a continuous control algorithm that leverages upper confidence bounds and dual critics for uncertainty estimation, enhancing exploration efficiency.
  • It modifies standard actor-critic methods like SAC and TD3 by incorporating optimistic policy updates under KL constraints to mitigate overestimation bias.
  • Empirical evaluations on benchmarks and extensions like USAC and SOAR demonstrate improved sample efficiency and robustness in exploration and imitation learning.

Optimistic Actor-Critic (OAC) is a class of continuous control algorithms that employ upper confidence bounds on state-action value functions to address exploration inefficiencies in off-policy deep reinforcement learning. OAC modifies standard actor-critic architectures such as Soft Actor-Critic (SAC) and TD3 by coupling multiple critic networks to produce uncertainty estimates and then applying the principle of optimism in the face of uncertainty to generate exploratory behavior while mitigating Q-overestimation bias. The method has led to improved sample efficiency in benchmark continuous control environments and has inspired a series of extensions, including USAC and its application to imitation learning in SOAR.

1. Motivation and Background

Traditional off-policy actor-critic algorithms, such as SAC and TD3, employ ensembles of Q-functions to combat overestimation bias, using pessimistic targets like $Q_L(s,a) = \min\big(Q^1(s,a), Q^2(s,a)\big)$ (Ciosek et al., 2019). However, this pessimism combined with greedy actor updates leads to pessimistic under-exploration: agents inefficiently avoid actions outside regions of high critic confidence. Moreover, naive Gaussian policies sample actions symmetrically around the mean, disregarding directional informativeness and yielding directionally uninformed exploration. These deficiencies result in high sample requirements and suboptimal policy refinement.

OAC directly addresses these exploration bottlenecks by estimating both lower and upper confidence bounds on Q-values, driving data collection with optimism while preserving the stability benefits of minimization-based critics.

2. Construction of Optimistic Confidence Bounds

The core of OAC is a two-critic architecture yielding empirical mean and uncertainty estimates at each state-action pair:

$$\mu_Q(s, a) = \frac{1}{2}\big(Q^1(s, a) + Q^2(s, a)\big), \qquad \sigma_Q(s, a) = \frac{1}{2}\big|Q^1(s, a) - Q^2(s, a)\big|$$

Composite Q-functions are then defined:

$$Q_L(s, a) = \mu_Q(s, a) - \beta_{\text{LB}}\,\sigma_Q(s, a), \qquad Q_U(s, a) = \mu_Q(s, a) + \beta_{\text{UB}}\,\sigma_Q(s, a),$$

with $\beta_{\text{LB}} > 0$ for conservatism and $\beta_{\text{UB}} > 0$ for optimism (Ciosek et al., 2019). In practice, $\beta_{\text{UB}}$ values of 4–5 are typical for exploration, while $\beta_{\text{LB}} = 1$ recovers standard SAC/TD3 targets.
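The bound construction is straightforward to compute from a pair of critic outputs; the sketch below uses illustrative values and is not the paper's reference code:

```python
import numpy as np

def confidence_bounds(q1, q2, beta_lb=1.0, beta_ub=4.0):
    """OAC's lower/upper confidence bounds from two critic estimates.

    mu = (q1 + q2) / 2, sigma = |q1 - q2| / 2,
    Q_L = mu - beta_lb * sigma, Q_U = mu + beta_ub * sigma.
    """
    q1, q2 = np.asarray(q1, dtype=float), np.asarray(q2, dtype=float)
    mu = 0.5 * (q1 + q2)
    sigma = 0.5 * np.abs(q1 - q2)
    return mu - beta_lb * sigma, mu + beta_ub * sigma

# With beta_lb = 1, Q_L reduces to the pessimistic min(Q1, Q2) used by SAC/TD3.
q_l, q_u = confidence_bounds(10.0, 8.0, beta_lb=1.0, beta_ub=4.0)
```

Note that when the critics agree ($\sigma_Q = 0$), both bounds collapse to the shared estimate, so optimism only acts where the ensemble is uncertain.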

As generalized in Utility Soft Actor-Critic (USAC), the utility function over the critic ensemble $\mathcal Q$ is parameterized as

$$U^{\mathcal Q}_\lambda(s,a) = \frac{1}{\lambda} \log \mathbb{E}_{Q \sim \mathcal Q}\big[e^{\lambda Q(s,a)}\big]$$

and, under a Laplace approximation, simplified as

$$U^{\mathcal Q}_\kappa(s,a) = \mu_{\mathcal Q}(s,a) + g(\kappa)\,\sigma_{\mathcal Q}(s,a),$$

where $g(\kappa)$ is a monotonic function and $\kappa$ determines the optimism-pessimism bias. This formulation recovers OAC as the special case $\kappa_{\text{critic}} \approx -0.8316$, $g(\kappa_{\text{actor}}) = \beta$ (Tasdighi et al., 2024).
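The exponential-utility aggregation can be sketched as a numerically stable log-mean-exp over a finite critic ensemble; the function below is an illustrative implementation, not USAC's reference code:

```python
import numpy as np

def ensemble_utility(q_values, lam):
    """U_lambda = (1/lambda) * log E_{Q ~ ensemble}[exp(lambda * Q)].

    lam < 0 biases toward pessimism (-> min as lam -> -inf),
    lam > 0 toward optimism (-> max as lam -> +inf),
    lam -> 0 recovers the plain ensemble mean.
    """
    q = np.asarray(q_values, dtype=float)
    if abs(lam) < 1e-8:
        return q.mean()
    # Stable log-mean-exp via the usual max-shift trick.
    z = lam * q
    m = z.max()
    return (m + np.log(np.mean(np.exp(z - m)))) / lam
```

The single scalar $\lambda$ thus interpolates smoothly between the pessimistic minimum target and an optimistic maximum, which is what lets USAC choose $\lambda_{\text{critic}}$ and $\lambda_{\text{actor}}$ independently.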

3. Optimistic Actor Update and Policy Improvement

OAC implements a split actor architecture: a target policy $\pi_T$ for learning and an exploration policy $\pi_E$ for environment interaction. At each state, $\pi_E$ maximizes the upper bound $Q_U$ subject to a Kullback-Leibler divergence constraint from $\pi_T$:

$$(\mu_E, \Sigma_E) = \arg\max_{\mu, \Sigma}\; \mathbb{E}_{a\sim\mathcal{N}(\mu, \Sigma)}\big[Q_U(s, a)\big] \quad \text{s.t.} \quad \mathrm{KL}\big(\mathcal{N}(\mu, \Sigma)\,\|\,\mathcal{N}(\mu_T, \Sigma_T)\big) \le \delta$$

Because $Q_U$ is approximately linear in $a$ near $\mu_T$, the closed-form solution shifts $\mu_E$ in the direction of the optimistic gradient:

$$\Sigma_E = \Sigma_T, \qquad \mu_E = \mu_T + \sqrt{2\delta}\cdot\frac{\Sigma_T\, \nabla_a Q_U(s, a)|_{a=\mu_T}}{\big\| \nabla_a Q_U(s, a)|_{a=\mu_T} \big\|_{\Sigma_T}},$$

where $\|v\|_{\Sigma_T} = \sqrt{v^\top \Sigma_T\, v}$ (Ciosek et al., 2019). The target policy $\pi_T$ is updated to maximize the conservative lower bound $Q_L$ with entropy regularization.
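A minimal sketch of this closed-form shift, assuming a diagonal covariance $\Sigma_T$ (represented by its vector of variances) and a precomputed gradient of $Q_U$:

```python
import numpy as np

def optimistic_shift(mu_t, var_t, grad_qu, delta=0.1):
    """KL-constrained shift of the exploration mean along grad Q_U.

    Implements mu_E = mu_T + sqrt(2*delta) * Sigma_T g / ||g||_{Sigma_T}
    with Sigma_T = diag(var_t) and g the gradient of Q_U at mu_T; the
    resulting Gaussian sits exactly on the KL ball of radius delta.
    """
    mu_t = np.asarray(mu_t, dtype=float)
    var = np.asarray(var_t, dtype=float)          # diagonal of Sigma_T
    g = np.asarray(grad_qu, dtype=float)
    norm = np.sqrt(g @ (var * g))                 # ||g||_{Sigma_T}
    if norm < 1e-12:                              # no optimistic direction
        return mu_t
    return mu_t + np.sqrt(2.0 * delta) * (var * g) / norm
```

Since $\Sigma_E = \Sigma_T$, the KL divergence between $\pi_E$ and $\pi_T$ reduces to $\tfrac{1}{2}(\mu_E - \mu_T)^\top \Sigma_T^{-1} (\mu_E - \mu_T)$, and the shift above makes it exactly $\delta$.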

The actor update in USAC maximizes the entropy-regularized utility:

$$\max_{\pi}\; \mathbb{E}_{s\sim D,\, a\sim\pi}\big[U^{\tilde{\mathcal Q}}_{\lambda_{\text{actor}}}(s,a) - \alpha \log \pi(a \mid s)\big],$$

where $\tilde{\mathcal Q}$ denotes the current critic ensemble (Tasdighi et al., 2024).

4. Critic Update and Stability

Both OAC and USAC update the critic networks using a bootstrapped Bellman backup. OAC employs the pessimistic minimum over target critics:

$$Q^i_w(s_t, a_t) \leftarrow r_t + \gamma \min_{j=1,2} Q^j_{\bar w}(s_{t+1}, a'), \quad a' \sim \pi_T(\cdot \mid s_{t+1}),$$

and applies Polyak averaging for target networks (Ciosek et al., 2019). USAC generalizes this by employing the utility function with an independently chosen $\lambda_{\text{critic}}$ for the critic's bootstrapped target:

$$y(s,a,r,s') = r + \gamma \big[U^{\bar{\mathcal Q}}_{\lambda_{\text{critic}}}(s', a') - \alpha \log \pi(a' \mid s')\big]$$

Empirical ensemble statistics ($\mu_{\mathcal Q}, \sigma_{\mathcal Q}$) are computed at each backup step.
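A scalar sketch of the pessimistic target and Polyak averaging; the scalar reward, list-of-arrays parameter layout, and a termination flag are simplifying assumptions, not the reference implementation:

```python
import numpy as np

def oac_critic_target(r, gamma, q_targets_next, done=False):
    """Pessimistic TD target: y = r + gamma * (1 - done) * min_j Qbar_j(s', a')."""
    q_min = float(np.min(q_targets_next))
    return r + gamma * (1.0 - float(done)) * q_min

def polyak_update(target_params, online_params, tau=0.005):
    """Slow target tracking: theta_bar <- (1 - tau) * theta_bar + tau * theta."""
    return [(1.0 - tau) * tp + tau * op
            for tp, op in zip(target_params, online_params)]
```

Swapping `np.min` for a utility such as the log-mean-exp aggregation would yield the USAC-style target with its own $\lambda_{\text{critic}}$.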

This approach stabilizes value estimation by suppressing the overestimation bias that arises when function approximation, bootstrapping, and off-policy learning combine (the "deadly triad"), while the decoupling of actor and critic optimism allows for separate control over exploitation and exploration.

5. Algorithmic and Implementation Details

A typical OAC/SOAR update loop comprises:

  • Initializing the actor $\pi_\phi$, $L$ critic networks $\{Q_{\theta_\ell}\}$, and their targets.
  • Iteratively (a) collecting environment rollouts with the exploration policy, (b) updating critics using the optimistic utility-based targets, and (c) updating the actor via policy gradients on the optimistic Q-value (Viel et al., 2025).
  • Standard implementation choices include actor/critic MLPs (hidden layer size 256), batch size 256, replay buffer size $10^6$, Polyak factor $\tau$ (e.g., 0.005), and entropy weight $\alpha$ (fixed or tuned by dual gradient descent).
  • Ensemble sizes are two (OAC/USAC) or four (SOAR); in SOAR, standard-deviation estimates are clipped for stability.
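The loop above can be illustrated end to end on a toy one-dimensional bandit; the fixed disagreeing "critics", finite-difference gradients, and all constants below are assumptions for demonstration, with the critic update step omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two fixed, disagreeing scalar "critics" stand in for learned networks.
q1 = lambda a: -(a - 2.3) ** 2
q2 = lambda a: -(a - 1.7) ** 2
q_l = lambda a: min(q1(a), q2(a))                  # pessimistic bound Q_L
q_u = lambda a: (0.5 * (q1(a) + q2(a))
                 + 4.0 * 0.5 * abs(q1(a) - q2(a)))  # optimistic Q_U, beta_ub = 4

def fd_grad(f, x, eps=1e-4):
    """Finite-difference gradient, standing in for autodiff in this sketch."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

mu_t, std, lr, delta = 0.0, 0.5, 0.05, 0.2
actions = []
for step in range(400):
    # (a) rollout: sample from the KL-shifted exploration policy pi_E
    g = fd_grad(q_u, mu_t)
    shift = np.sqrt(2 * delta) * std * np.sign(g) if abs(g) > 1e-12 else 0.0
    actions.append(rng.normal(mu_t + shift, std))
    # (b) critic update would refit q1, q2 to TD targets here (omitted)
    # (c) actor update: gradient ascent on the pessimistic bound Q_L
    mu_t += lr * fd_grad(q_l, mu_t)
```

The target mean settles near $\arg\max_a Q_L(a) = 2.0$ (the midpoint of the two critic peaks), while exploration actions are drawn from a mean shifted toward high $Q_U$, illustrating the actor/explorer split.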

OAC incurs almost no extra computation relative to SAC/TD3 since the UCB gradient is computed via standard autodiff and the ensemble does not substantially increase overhead (Ciosek et al., 2019).

6. Theoretical Guarantees and Insights

USAC rigorously extends the proof techniques of SAC, guaranteeing soft-policy improvement, $J_{\alpha}(\pi_{\text{new}}) \geq J_{\alpha}(\pi_{\text{old}})$, when the critic ensemble collapses to a Dirac measure at $Q^\pi$ (Tasdighi et al., 2024). The OAC analysis shows that maximizing optimistic upper bounds under KL constraints prevents premature collapse of the policy variance, sustaining exploration in directions favored by uncertainty.

In SOAR, tabular instantiations yield regret bounds and a sample complexity of $\tilde{O}\!\left(\frac{S^4 A}{(1-\gamma)^5 \epsilon^2}\right)$ per environment, matching the best-known imitation learning guarantees (Viel et al., 2025).

7. Empirical Evaluation and Impact

OAC has been benchmarked on MuJoCo environments including Ant, Hopper, HalfCheetah, Humanoid, and Walker2d. Notable results include:

| Environment | SAC Avg Return | USAC Avg Return | Sample Efficiency (M steps for return 5000) |
|---|---|---|---|
| Ant-v2/v4 | 4756 ± 1411 | 5139 ± 978 | — |
| HalfCheetah-v4 | 10763 ± 895 | 11024 ± 849 | — |
| Hopper-v4 | 3185 ± 537 | 3194 ± 810 (3442) | — |
| Humanoid-v2/v4 | 5503 ± 373 | 5602 ± 505 | SAC: ~3M; OAC: ~2.2M |
| Walker2d-v4 | 3757 ± 1282 | 4525 ± 534 | — |

OAC yields a 10–20% reduction in the number of sample steps needed to reach benchmark returns on Humanoid, with performance robust to the KL-threshold hyperparameter and run-to-run variance across random seeds comparable to SAC (Ciosek et al., 2019, Tasdighi et al., 2024). Ablation studies confirm the necessity of optimism: replacing the upper bound with the lower bound during exploration degrades performance.

SOAR enhances imitation learning algorithms (f-IRL, ML-IRL, CSIL), consistently halving the number of episodes needed to reach baseline performance by using an ensemble-based optimistic critic (Viel et al., 2025).

8. Extensions and Generalizations

The USAC framework generalizes OAC by decoupling pessimism and optimism via interpretable parameters (λcritic,λactor)(\lambda_{\text{critic}}, \lambda_{\text{actor}}), enabling independent adjustment of bootstrapped value estimates and exploration incentives. This “two-parameter utility framework” allows simultaneous bias reduction and directed exploration, overcoming limitations of SAC, TD3, and OAC’s unidirectional optimism (Tasdighi et al., 2024).

In imitation learning, the SOAR template demonstrates transferability of OAC principles, boosting algorithmic efficiency and providing provable sample complexity in tabular settings. SOAR’s ensemble critic construction and optimistic aggregation elevate exploration in regions of high uncertainty while leveraging standard SAC mechanisms.

A plausible implication is that further scaling of actor-critic utility parameterization, with structured ensemble diversity and uncertainty quantification, will continue to enhance the exploration-exploitation trade-off in high-dimensional RL.

9. Practical Considerations and Limitations

OAC and its variants maintain plug-and-play compatibility with SAC/TD3 architectures, with computational costs comparable to dual-critic approaches. Hyperparameter sensitivity (e.g., $\sqrt{2\delta}$ in the KL constraint, $\beta_{\text{UB}}$ for optimism) is moderate, with performance robust over wide ranges.

Over-optimism can induce risk-prone behavior and poor convergence if not properly balanced: strong optimism aids exploration in poorly explored regions but must be controlled to preserve stability and consistent expected returns. The decoupling introduced in USAC mitigates this instability by allowing risk-neutral, pessimistic, or optimistic configurations for the actor and critic independently.

No known controversies regarding OAC's effectiveness have been raised in the cited literature, though adaptability to high-dimensional or partially observable environments remains an ongoing research direction.

