
Stackelberg Actor-Critic in Bilevel RL

Updated 21 January 2026
  • Stackelberg Actor-Critic is a reinforcement learning approach that models actor and critic as a leader-follower pair in a bi-level game framework.
  • It introduces hypergradient updates that account for the critic's best-response, improving convergence and refining policy-gradient estimates.
  • Empirical results demonstrate enhanced sample efficiency, robust safety, and superior performance across multi-agent, single-agent, and offline RL benchmarks.

Stackelberg Actor-Critic is a game-theoretic formulation of reinforcement learning, especially actor-critic methods, that explicitly treats the actor and critic as a hierarchical leader-follower pair in a bilevel optimization framework. This approach generalizes standard actor-critic algorithms by optimizing the actor with respect to the critic's best response, and is motivated by the Stackelberg equilibrium from game theory, in which the leader explicitly anticipates the follower's reactions when making decisions. Stackelberg actor-critic algorithms provide refined policy-gradient updates, provable convergence properties under suitable conditions, and empirical gains in multi-agent, single-agent, and offline RL settings (Zhang et al., 2019, Wen et al., 2021, Zheng et al., 2021, Wei et al., 2024, Prakash et al., 16 May 2025, Cheng et al., 2022, Zeng et al., 18 Sep 2025).

1. Stackelberg Actor-Critic: Formal Bilevel Frameworks

Stackelberg Actor-Critic methods are grounded in bilevel optimization, formally modeling reinforcement learning as a two-stage game. In the canonical single-agent setting, the actor selects policy parameters $\theta$ (leader), anticipating that the critic, parameterized by $\phi$, will respond by minimizing its loss (follower):

  • Actor objective: $F(\theta, \phi^*(\theta))$, where $\phi^*(\theta) = \arg\min_\phi G(\theta, \phi)$; e.g., $G$ is a Bellman-residual or value-fitting loss.
  • Critic objective: $G(\theta, \phi)$, typically quadratic in $\phi$ and strongly convex in tabular on-policy cases.
  • Stackelberg gradient: $\nabla_\theta F(\theta, \phi^*(\theta))$ is the total derivative, accounting for how changes in the actor's parameters move the critic's best response.
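
Under the quadratic, strongly convex critic assumption above, the follower's best response $\phi^*(\theta)$ is available in closed form. The following minimal NumPy sketch illustrates the bilevel objects; the matrices `A`, `b`, `Q` are hypothetical illustration data, not taken from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_theta, d_phi = 3, 4
A = rng.standard_normal((d_phi, d_theta))  # couples actor parameters into the critic loss
b = rng.standard_normal(d_phi)
Q = 2.0 * np.eye(d_phi)                    # critic Hessian: strongly convex case

def G(theta, phi):
    """Critic (follower) loss: quadratic in phi, coupled to theta."""
    return 0.5 * phi @ Q @ phi - phi @ (A @ theta + b)

def phi_star(theta):
    """Follower best response: argmin_phi G(theta, phi), in closed form."""
    return np.linalg.solve(Q, A @ theta + b)

theta = rng.standard_normal(d_theta)
phi = phi_star(theta)
# The follower's first-order condition holds at the best response:
assert np.allclose(Q @ phi, A @ theta + b)
# ...and it is indeed a minimizer of the strongly convex critic loss:
assert G(theta, phi) < G(theta, phi + 1.0)
```

In the actual RL setting, $G$ would be a Bellman-residual loss and $\phi^*$ would have to be tracked by gradient steps rather than solved exactly; the closed form here only serves to make the leader-follower structure concrete.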

In multi-agent domains, Stackelberg equilibrium is used to model asymmetric interactions: the designated leader chooses a strategy (“moves first” in the RL hierarchy), the follower selects a best response, and equilibrium is defined recursively (Zhang et al., 2019).

2. Policy Gradient Theorems and Update Rules

The key distinction from standard actor-critic methods lies in the actor's update rule: while classical actor-critic gradients ignore the dependency of the critic's best response on the actor, Stackelberg actor-critic algorithms compute the hypergradient via the chain rule and the implicit function theorem. For parameters $(\theta, \phi)$,

$$\nabla_\theta F(\theta, \phi^*(\theta)) = \partial_\theta F - \partial_\phi F \cdot [\nabla^2_{\phi\phi} G]^{-1} \nabla^2_{\theta\phi} G$$

This total derivative ensures the actor accounts not only for the immediate effect of its update but also for the critic adaptation it induces, which is the principal insight behind the Stackelberg paradigm (Wen et al., 2021, Zheng et al., 2021, Prakash et al., 16 May 2025). The classical update omits the second term; Stackelberg actor-critic recovers the true policy gradient in settings satisfying strong-convexity and realizability assumptions for the critic.
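
The gap between the classical partial gradient and the Stackelberg total derivative can be checked numerically on a toy quadratic critic, where $\phi^*(\theta)$ is available in closed form. In the sketch below all problem data is made up for illustration; the total derivative is verified against a finite difference of $\theta \mapsto F(\theta, \phi^*(\theta))$.

```python
import numpy as np

rng = np.random.default_rng(1)
d_t, d_p = 3, 4
A = rng.standard_normal((d_p, d_t))
b = rng.standard_normal(d_p)
Q = 2.0 * np.eye(d_p) + 0.1 * np.ones((d_p, d_p))  # symmetric, strongly convex

def phi_star(th):
    # Best response for G(theta, phi) = 0.5 phi^T Q phi - phi^T (A theta + b)
    return np.linalg.solve(Q, A @ th + b)

def F(th, ph):
    # Hypothetical actor objective, chosen only for illustration
    return -0.5 * th @ th + ph @ (A @ th)

def stackelberg_grad(th):
    ph = phi_star(th)
    dF_dtheta = -th + A.T @ ph           # partial_theta F
    dF_dphi = A @ th                     # partial_phi F
    # Implicit function theorem gives d phi*/d theta = Q^{-1} A here,
    # so the correction term is A^T Q^{-1} dF_dphi (Q is symmetric).
    return dF_dtheta + A.T @ np.linalg.solve(Q, dF_dphi)

th = rng.standard_normal(d_t)
naive = -th + A.T @ phi_star(th)         # classical update: drops the correction

# Finite-difference check of d/dtheta F(theta, phi*(theta)) in coordinate i:
eps, i = 1e-6, 0
e = np.zeros(d_t); e[i] = eps
fd = (F(th + e, phi_star(th + e)) - F(th - e, phi_star(th - e))) / (2 * eps)
assert abs(stackelberg_grad(th)[i] - fd) < 1e-4   # total derivative matches
```

The `naive` vector corresponds to the classical actor-critic update; only the hypergradient agrees with the finite difference of the composed objective.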

3. Algorithmic Instantiations

Multiple instantiations of the Stackelberg actor-critic architecture exist across domains. A common theme is the two-timescale structure: critics (followers) are trained on a faster schedule, tracking the leader's policy, while the actor (leader) updates on the slower timescale using the total derivative or approximations thereof.
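
The two-timescale schedule can be sketched on a toy quadratic bilevel problem: several fast critic steps track the best response after each slow actor step. Dimensions, step sizes, and the inner-step count below are illustrative choices only, not values from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(2)
d_t, d_p = 3, 4
A = 0.3 * rng.standard_normal((d_p, d_t))  # weak coupling keeps the toy dynamics stable
b = rng.standard_normal(d_p)
Q = 2.0 * np.eye(d_p)

grad_G_phi = lambda th, ph: Q @ ph - (A @ th + b)  # follower (critic) gradient
grad_F_theta = lambda th, ph: -th + A.T @ ph       # leader's partial gradient

theta, phi = rng.standard_normal(d_t), np.zeros(d_p)
lr_slow, lr_fast, inner_steps = 0.05, 0.3, 20      # leader slow, follower fast

for _ in range(200):
    theta += lr_slow * grad_F_theta(theta, phi)    # slow timescale: actor ascent
    for _ in range(inner_steps):                   # fast timescale: critic descent
        phi -= lr_fast * grad_G_phi(theta, phi)

# The fast critic closely tracks its best response to the current policy:
assert np.allclose(phi, np.linalg.solve(Q, A @ theta + b), atol=1e-6)
```

The inner loop plays the role of the follower's fast timescale; in stochastic actor-critic practice the separation is achieved by step-size ratios rather than an explicit inner loop.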

4. Theoretical Properties and Convergence Guarantees

Stackelberg actor-critic algorithms enjoy several key theoretical guarantees—contingent on assumptions such as strong convexity, Lipschitz smoothness, sufficient data coverage, and suitable realizability/completeness conditions:

  • Convergence to Stackelberg equilibrium: under suitable assumptions, the sequence $(\theta_k, \phi^*(\theta_k))$ converges locally to a Stackelberg stationary point (Zheng et al., 2021, Prakash et al., 16 May 2025).
  • Policy improvement and safety: in safe offline RL, the weighted Stackelberg actor-critic yields policies that provably outperform arbitrary reference policies without sacrificing safety, with convergence rate $O(N^{-1/2})$ to baseline performance (Wei et al., 2024, Cheng et al., 2022).
  • Sample efficiency: the Stackelberg hypergradient improves gradient conditioning and reduces limit cycles, accelerating convergence over vanilla actor-critic (Zheng et al., 2021, Wen et al., 2021).
  • Mean-field RL: AC-SMFG attains the first non-asymptotic sample-complexity result for Stackelberg mean field games, with an $O(\varepsilon^{-2})$ rate to stationarity (Zeng et al., 18 Sep 2025).
  • Assumptions: Realizability, completeness, concentrability, and gradient-alignment conditions are crucial for these guarantees; deviations hamper convergence and policy guarantees.

5. Empirical Validation and Applications

Extensive experiments substantiate the empirical advantages of Stackelberg actor-critic frameworks across domains:

  • Multi-agent matrix games and coordination: Bi-AC reliably finds asymmetric, Pareto-superior Stackelberg equilibria in games with multiple Nash equilibria, outperforming Nash-based methods (Escape/Maintain games, highway merging) (Zhang et al., 2019).
  • Single-agent and continuous control: Stackelberg AC consistently mitigates cycling and improves sample efficiency relative to vanilla AC, often matching or exceeding state-of-the-art baselines like PPO or Soft Actor-Critic (SAC) on OpenAI Gym and Mujoco benchmarks (Wen et al., 2021, Zheng et al., 2021, Prakash et al., 16 May 2025).
  • Safe/offline RL: WSAC and ATAC variants strictly improve reference policies while maintaining constraints, outperforming baselines in Safety-Gymnasium and D4RL tasks (Wei et al., 2024, Cheng et al., 2022).
  • Stackelberg mean field games: AC-SMFG achieves faster and higher-quality leader rewards than nested or simultaneous best-response schemes in economics-motivated tasks (Zeng et al., 18 Sep 2025).

Empirical success hinges on accurate critic tracking, reliable estimation of (pseudo-)inverse Hessians for hypergradients, careful timescale separation, and (where appropriate) adversarial critic training.

6. Connections, Limitations, and Extensions

Stackelberg actor-critic architectures conceptually unify reinforcement learning and bi-level games, connecting RL with hierarchical game theory and optimization. They extend to decentralized execution, adversarial safety (CMDPs), mean-field games, and bilevel policy optimization.

Limitations include sensitivity to critic realizability, the need for function-approximation guarantees, practical overhead of Hessian inversion (alleviated, e.g., by Nyström method (Prakash et al., 16 May 2025)), and potential failure of gradient-alignment in complex or heavily entangled objectives (Zeng et al., 18 Sep 2025). Most theoretical analyses pertain to regularized, strongly-convex critic settings; adaptive regularization and Hessian-approximation remain open areas, as do extensions beyond two-tier games (e.g., multi-leader hierarchies).
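
To make the Hessian-inversion overhead concrete: a randomized Nyström sketch can approximate the inverse-Hessian-vector product needed by the hypergradient using only Hessian-vector products, avoiding an explicit $O(d^3)$ solve. The sketch below is a generic Nyström-plus-Woodbury recipe with a damping term `rho`, offered as an illustration of the idea rather than the specific construction of Prakash et al.; all problem data is hypothetical.

```python
import numpy as np

def nystrom_ihvp(hvp, d, v, k, rho=0.1, seed=0):
    """Approximate (H + rho I)^{-1} v given only Hessian-vector products hvp."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((d, k))                        # random test matrix
    Y = np.column_stack([hvp(omega[:, j]) for j in range(k)])  # Y = H @ omega
    core = omega.T @ Y                                         # k x k core matrix
    # Nystrom approximation: H_hat = Y core^{-1} Y^T.
    # Invert (H_hat + rho I) via the Woodbury identity:
    inner = np.linalg.solve(rho * core + Y.T @ Y, Y.T @ v)
    return (v - Y @ inner) / rho

# Hypothetical PSD "critic Hessian" for illustration:
rng = np.random.default_rng(3)
d = 30
B = rng.standard_normal((d, d))
H = B @ B.T / d + 0.5 * np.eye(d)
v = rng.standard_normal(d)

approx = nystrom_ihvp(lambda x: H @ x, d, v, k=d)
exact = np.linalg.solve(H + 0.1 * np.eye(d), v)
assert np.allclose(approx, exact, atol=1e-4)  # exact (up to roundoff) when k = d
```

With sketch size `k` much smaller than `d`, the approximation is no longer exact but the per-update cost drops to `k` Hessian-vector products plus a `k x k` solve, which is the practical appeal of such methods.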

A plausible implication is that Stackelberg actor-critic methodologies, via refined hypergradients and adversarial critic training, will underpin the next generation of coordination and safety-critical RL algorithms, particularly when equilibria selection and robustness are paramount (Zhang et al., 2019, Wei et al., 2024, Zeng et al., 18 Sep 2025).
