Bi-level Actor-Critic: Hierarchical RL Framework
- Bi-level Actor-Critic (Bi-AC) is a hierarchical reinforcement learning framework that models the actor and critic as a bi-level optimization problem, where the critic best responds to the actor's policy.
- It employs a two-time-scale update mechanism that ensures rapid critic convergence and provides rigorous sample complexity and convergence analyses.
- A Stackelberg game-theoretic approach facilitates hypergradient correction, resulting in enhanced sample efficiency and improved policy performance.
A bi-level Actor-Critic (Bi-AC) framework characterizes reinforcement learning algorithms, particularly the actor-critic and natural actor-critic (NAC) classes, as hierarchical stochastic optimization problems or Stackelberg games. In this formulation, the actor (outer level) seeks optimal policy parameters given the critic (inner level), which models value estimation as a best response to the current policy. Modern developments formalize the two-time-scale separation, sample complexity, and convergence properties for both standard and natural actor-critic methods under realistic Markovian sampling and general function approximation (Xu et al., 2020, Hong et al., 2020, Prakash et al., 16 May 2025, Zheng et al., 2021, Wen et al., 2021).
1. Bi-level Structure: Markov Decision Process and Parametrization
Bi-AC formalization begins with a Markov Decision Process (MDP) (S, A, P, r, γ). The actor parameterizes the policy π_w (e.g., softmax or Gaussian families), and the critic typically uses a linear function approximator with parameter ω for value estimation. The advantage is expressed via a compatible feature mapping ψ_w(s, a) = ∇_w log π_w(a|s), leading to the critic’s regression target: min_ω E[(ψ_w(s, a)^⊤ ω − Q_{π_w}(s, a))²] (Xu et al., 2020).
The bi-level optimization then seeks:
- Outer (Actor): maximize the expected discounted or average reward over the policy parameter w.
- Inner (Critic): minimize the TD error or Bellman residual in ω for fixed policy parameter w.
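As a concrete illustration, the inner (critic) problem for a single-state softmax policy can be sketched in Python. This is a hypothetical minimal example, not code from the cited papers: the ridge coefficient `lam` and the single-state setting are illustrative assumptions.

```python
import numpy as np

def softmax_policy(w):
    """Single-state softmax policy: pi_w(a) ∝ exp(w_a)."""
    z = np.exp(w - w.max())
    return z / z.sum()

def compatible_features(w, a):
    """psi_w(a) = ∇_w log pi_w(a) = e_a − pi_w (the softmax score function)."""
    psi = -softmax_policy(w)
    psi[a] += 1.0
    return psi

def critic_best_response(w, Q, lam=1e-3):
    """Inner problem: ridge regression of Q onto compatible features,
       argmin_ω E_{a~pi_w}[(psi_w(a)^⊤ ω − Q(a))²] + lam·‖ω‖²."""
    pi = softmax_policy(w)
    Psi = np.stack([compatible_features(w, a) for a in range(len(Q))])
    A = Psi.T @ (pi[:, None] * Psi) + lam * np.eye(len(Q))
    b = Psi.T @ (pi * Q)
    return np.linalg.solve(A, b)
```

A useful sanity check: the compatible features have zero mean under the policy, since E_{a~pi}[e_a − pi] = pi − pi = 0.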
2. Two Time-Scale and Nested Algorithms
Most Bi-AC implementations adopt a two time-scale learning schedule: the critic update uses a larger learning rate and converges more rapidly than the actor update, ensuring the critic rapidly tracks policy changes. This is mathematically encoded as α_t/β_t → 0, where β_t is the critic step size and α_t the actor's (Xu et al., 2020, Wu et al., 2020, Hong et al., 2020).
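A typical polynomial schedule satisfying this separation can be sketched as follows; the constants and exponents here are illustrative assumptions, not values from the cited analyses:

```python
def step_sizes(t, a0=0.1, b0=0.5, sig_actor=0.6, sig_critic=0.4):
    """Polynomial decay with sig_actor > sig_critic, so alpha_t/beta_t → 0."""
    alpha = a0 / (1 + t) ** sig_actor    # actor step (slow time scale)
    beta = b0 / (1 + t) ** sig_critic    # critic step (fast time scale)
    return alpha, beta

# With these exponents the ratio alpha_t/beta_t decays like t^{-0.2}:
# the critic moves on the faster time scale and tracks the slowly
# changing policy.
```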
Generic two-time-scale pseudocode:

```
For t = 0 .. T-1:
    Sample (s_t, a_t) under current policy π_{w_t}
    Estimate Q_t ≈ Q_{π_{w_t}}(s_t, a_t)
    Critic gradient: g_t(ω_t) = [Q_t − ψ_{w_t}(s_t, a_t)^⊤ ω_t] ψ_{w_t}(s_t, a_t) − λ ω_t
    Critic update:   ω_{t+1} = Proj_{‖·‖≤R_ω}(ω_t + β_t g_t(ω_t))
    Actor update:
        AC:  w_{t+1} = w_t + α_t A_{ω_t}(s_t, a_t) ψ_{w_t}(s_t, a_t)   (with A_{ω_t} = ψ_{w_t}^⊤ ω_t)
        NAC: w_{t+1} = w_t + α_t ω_t
```
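The loop above can be instantiated end-to-end on a toy two-armed bandit. This is a hypothetical example, not an implementation from the cited papers: the reward vector, step-size schedules, ridge coefficient, and projection radius are illustrative choices, and Q_t is simply the sampled reward.

```python
import numpy as np

rng = np.random.default_rng(0)
true_r = np.array([1.0, 0.0])    # assumed bandit rewards (arm 0 is better)
w = np.zeros(2)                  # actor: softmax logits
omega = np.zeros(2)              # critic: weights on compatible features
lam, R_omega = 1e-2, 10.0        # ridge coefficient, projection radius

def pi_of(w):
    z = np.exp(w - w.max())
    return z / z.sum()

for t in range(5000):
    alpha = 0.1 / (1 + t) ** 0.6                    # actor step (slow)
    beta = 0.5 / (1 + t) ** 0.4                     # critic step (fast)
    pi = pi_of(w)
    a = rng.choice(2, p=pi)
    Q_t = true_r[a] + 0.1 * rng.standard_normal()   # noisy return sample
    psi = -pi                                       # compatible features:
    psi[a] += 1.0                                   #   ψ = e_a − π = ∇_w log π_w(a)
    g = (Q_t - psi @ omega) * psi - lam * omega     # critic semi-gradient
    omega = omega + beta * g
    n = np.linalg.norm(omega)                       # projection onto ‖ω‖ ≤ R_ω
    if n > R_omega:
        omega *= R_omega / n
    w = w + alpha * omega                           # NAC actor update: w += α ω
```

After training, pi_of(w) concentrates on the better arm; the AC variant would instead update with w += α (ψ^⊤ω) ψ.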
Alternative schemes reverse the time scales (Critic-Actor), yielding legitimate algorithms that emulate greedy value iteration (Bhatnagar et al., 2022).
3. Convergence Analysis and Sample Complexity
Finite-time, non-asymptotic convergence analyses establish sample complexity bounds for Bi-AC. In the classical two-time-scale actor-critic:
- AC with general policy classes achieves Õ(ε^{-2.5}) sample complexity for ε-stationarity (in expected squared gradient norm),
- NAC with compatible features attains Õ(ε^{-4}) complexity for ε-optimality (Xu et al., 2020, Wu et al., 2020).
A general two-timescale stochastic approximation (TTSA) analysis for bilevel stochastic problems shows:
- Strongly convex outer problem: last-iterate error O(K^{-2/3}),
- Weakly convex or smooth nonconvex: O(K^{-2/5}) near-stationarity,
- For natural actor-critic with KL-regularized proximal policy optimization, the expected reward gap is O(K^{-1/4}), i.e., sample complexity Õ(ε^{-4}) (Hong et al., 2020).
These rates derive from:
- Critic tracking error decay via linear stochastic approximation analysis,
- Control of actor bias from critic error and Markov chain mixing,
- Standard descent lemmas in non-convex optimization (Xu et al., 2020, Wu et al., 2020, Hong et al., 2020).
4. Stackelberg (Game-Theoretic) Perspective and Hypergradient Correction
Recent work models actor-critic as a two-player Stackelberg game: the actor is the leader, the critic the follower. The actor’s correct gradient is the total derivative, which accounts for how the critic’s best-response parameters ω*(w) depend on the actor: ∇_w J_total(w) = ∇_w J(w, ω*(w)) + (dω*(w)/dw)^⊤ ∇_ω J(w, ω*(w)), where the hypergradient term is computed efficiently (e.g., via a Nyström approximation of the high-dimensional inner Hessian) (Prakash et al., 16 May 2025, Zheng et al., 2021). This approach eliminates the “gap” between naive AC and the true policy gradient, yielding Residual Actor-Critic and Stackelberg Actor-Critic variants that empirically improve sample efficiency and final policy return (Wen et al., 2021).
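The total-derivative (hypergradient) computation can be illustrated on a hypothetical quadratic bilevel problem, where the inner best response has closed form and the implicit-function-theorem correction can be checked against the analytic gradient. All matrices and objectives here are made up for illustration; they are not from the cited papers.

```python
import numpy as np

# Toy bilevel problem (illustrative only):
#   inner:  omega*(theta) = argmin_omega 0.5·‖omega − A·theta‖²  ⇒  omega* = A·theta
#   outer:  L(theta, omega) = 0.5·‖omega‖² + c·theta
A = np.array([[2.0, 0.0], [1.0, 3.0]])
c = np.array([0.5, -1.0])

def naive_gradient(theta):
    """Partial derivative only: ignores how omega* depends on theta."""
    return c

def hypergradient(theta):
    """Total derivative dL/dtheta = ∂L/∂theta + (d omega*/d theta)^⊤ ∂L/∂omega.
       Implicit function theorem: d omega*/d theta = −H⁻¹ ∇²_{omega,theta} g.
       Here ∇_omega g = omega − A·theta, so H = I, ∇²_{omega,theta} g = −A,
       and d omega*/d theta = A."""
    omega_star = A @ theta
    H = np.eye(2)
    domega_dtheta = np.linalg.solve(H, A)
    return c + domega_dtheta.T @ omega_star
```

The correction term is exactly the "gap" between the naive partial gradient and the true gradient of theta ↦ L(theta, omega*(theta)).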
5. Extensions: Natural, Multi-Agent, Nonlinear Critics
Bi-AC methodology extends directly to:
- Natural actor-critic, where the critic’s compatible features allow the actor update to approximate the natural gradient (Xu et al., 2020, Hong et al., 2020),
- Multi-agent settings, where the bi-level schema generalizes to Stackelberg equilibria and can outperform Nash-based baselines in coordination environments (Zhang et al., 2019),
- Nonlinear value-function approximators (deep neural critics), for which new nonlinear stochastic approximation theory is required (Xu et al., 2020, Prakash et al., 16 May 2025, Wen et al., 2021).
Empirical studies indicate that bi-level actor-critic algorithms converge efficiently in a variety of domains: tabular grids, linear quadratic regulators, continuous control, and multi-agent games (Yang et al., 2019, Zhang et al., 2019, Prakash et al., 16 May 2025).
6. Practical Implementation, Limitations, and Future Directions
Implementation requirements include:
- Careful step-size separation to ensure critic tracking,
- Projected or regularized updates to maintain bounded parameter iterates,
- Efficient solvers for inner-loop critic minimization (TD(0), GTD, or best-response gradient descent).
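For example, the projected critic update in the pseudocode above amounts to a Euclidean-ball projection, which has a one-line closed form (a minimal sketch):

```python
import numpy as np

def proj_ball(omega, radius):
    """Euclidean projection onto {v : ‖v‖₂ ≤ radius}: rescale if outside."""
    n = np.linalg.norm(omega)
    return omega if n <= radius else omega * (radius / n)
```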
Current limitations:
- Polynomial dependence on ε^{-1} remains significant; reducing it is an open problem,
- Nonlinear critic theory and variance reduction remain underdeveloped,
- Assumptions (e.g., strong convexity, ergodicity, global optimal action selection) may not hold universally.
Extensions under active investigation comprise variance reduction for on-policy methods, trust-region bi-level extensions (e.g., PPO), and efficient hypergradient estimation for deep RL (Hong et al., 2020, Prakash et al., 16 May 2025).
7. Summary of Results and Comparative Performance
Bi-AC achieves state-of-the-art theoretical guarantees for sample efficiency, finite-sample stationarity, and optimality in two-time-scale actor-critic frameworks. It closes a fundamental gap between standard actor-critic and policy gradient methods (Wen et al., 2021). Empirical performance matches or exceeds that of standard baselines (including PPO and SAC) in both discrete and continuous control environments, and reliably finds Pareto-superior equilibria in multi-agent settings (Prakash et al., 16 May 2025, Zhang et al., 2019). The Stackelberg-bi-level perspective now underpins next-generation actor-critic and RL policy optimization algorithms.
Table: Key Sample Complexity Results for Bi-level Actor-Critic
| Algorithm | Stationarity/Optimality Criterion | Sample Complexity |
|---|---|---|
| Two-time-scale AC | ε-stationary (grad norm) | Õ(ε^{-2.5}) (Xu et al., 2020, Wu et al., 2020) |
| Two-time-scale NAC | ε-optimal (expected reward) | Õ(ε^{-4}) (Xu et al., 2020) |
| Stackelberg AC | Local Stackelberg equilibrium | Polynomial time (Prakash et al., 16 May 2025, Zheng et al., 2021) |
| TTSA (PPO case) | Expected reward gap O(K^{-1/4}) | Õ(ε^{-4}) (Hong et al., 2020) |
All detailed analytical claims and empirical metrics are supported in the referenced works.