
Stage-Based Q-Learning: Ref-Advantage Decomposition

Updated 21 January 2026
  • The paper presents a stage-based Q-learning algorithm that decomposes Q-values into reference and advantage components to achieve significant variance reduction and gap-dependent regret bounds.
  • It employs dual estimates—with standard UCB and reference-advantage updates—to stabilize learning and enable nearly linear speedup in federated and multi-agent settings.
  • The method offers rigorous guarantees including improved policy switching cost bounds and better dependence on problem-specific gaps compared to classical Q-learning approaches.

Stage-based Q-Learning with Reference-Advantage Decomposition is a family of model-free reinforcement learning algorithms designed for finite-horizon episodic Markov Decision Processes (MDPs) that exploit structured error decomposition and variance reduction. These approaches, notably UCB-Advantage and Q-EarlySettled-Advantage, partition state-action-horizon visits into exponentially growing stages and use reference-advantage decomposed confidence intervals to obtain near-optimal and gap-dependent regret bounds. Extensions to federated and multi-agent settings have further demonstrated nearly linear speedup and log-scale communication with analogous theoretical guarantees.

1. Mathematical Foundation and Problem Setting

In these algorithms, the environment is a finite-horizon, episodic MDP $M = (\mathcal{S}, \mathcal{A}, H, \{\mathbb{P}_h\}_{h=1}^H, \{r_h\}_{h=1}^H)$, with $\mathcal{S}$ and $\mathcal{A}$ the state and action sets of cardinalities $S$ and $A$, and horizon $H$. Transitions are governed by $\mathbb{P}_h(\cdot \mid s, a)$, and rewards $r_h(s, a) \in [0, 1]$ may be deterministic. For each episode $k$ and step $h$, the policy $\pi$ yields $V_h^\pi(s)$ (expected cumulative future reward) and $Q_h^\pi(s, a) = r_h(s, a) + \mathbb{P}_h V_{h+1}^\pi(s, a)$.

Optimal values satisfy Bellman equations:

$$V_h^\star(s) = \max_{a} Q_h^\star(s, a), \qquad Q_h^\star(s, a) = r_h(s, a) + \mathbb{P}_h V_{h+1}^\star(s, a)$$

Suboptimality gaps are denoted $\Delta_h(s, a) = V_h^\star(s) - Q_h^\star(s, a)$, and the minimum nonzero gap is $\Delta_{\min}$. The maximum conditional variance of value propagation is $\mathbb{Q}^\star = \max_{s,a,h} \mathrm{Var}_{s' \sim \mathbb{P}_h(\cdot \mid s,a)}[V_{h+1}^\star(s')]$. The cumulative regret over $K$ episodes (total steps $T = KH$) is $\mathrm{Regret}(T) = \sum_{k=1}^K [V_1^\star(s_1^k) - V_1^{\pi^k}(s_1^k)]$. Additionally, the policy switching cost (the number of per-step policy changes) is

$$N_\text{switch} = \sum_{k=1}^{K-1} \sum_{h=1}^H \sum_{s \in \mathcal{S}} \mathbb{I}[\pi_h^{k+1}(s) \neq \pi_h^k(s)]$$

(Zheng et al., 2024).
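For small MDPs, the quantities above can be computed exactly by backward induction on the Bellman optimality equations. The sketch below (hypothetical helper names, assuming NumPy) evaluates $Q^\star$, $V^\star$, and the gaps $\Delta_h(s,a)$ on a randomly generated toy instance:

```python
import numpy as np

def optimal_values(P, r):
    """Backward induction for a finite-horizon MDP.

    P: transitions, shape (H, S, A, S); r: rewards, shape (H, S, A).
    Returns Q* (H, S, A), V* (H+1, S), and gaps Delta_h(s,a) = V*_h(s) - Q*_h(s,a).
    """
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))           # V*_{H+1} = 0 by convention
    for h in range(H - 1, -1, -1):     # Bellman optimality, backwards in h
        Q[h] = r[h] + P[h] @ V[h + 1]  # Q*_h = r_h + P_h V*_{h+1}
        V[h] = Q[h].max(axis=1)        # V*_h(s) = max_a Q*_h(s, a)
    gaps = V[:H, :, None] - Q          # Delta_h(s, a) >= 0, zero at optimal actions
    return Q, V, gaps

# Toy MDP: H=3, S=2, A=2 with a random (normalized) transition kernel.
rng = np.random.default_rng(0)
H, S, A = 3, 2, 2
P = rng.random((H, S, A, S))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((H, S, A))
Q, V, gaps = optimal_values(P, r)
```

Every row of `gaps` has minimum exactly zero, attained at the greedy action, which is the sense in which $\Delta_{\min}$ is the smallest *nonzero* gap.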

2. Reference-Advantage Decomposition and Algorithmic Structure

The reference-advantage methodology decomposes $Q$ as $Q_h(s, a) = V_h^\text{ref}(s) + A_h(s, a)$, with $V_h^\text{ref}(s)$ a stage-settled baseline and $A_h(s, a)$ the local advantage. By fixing $V_h^\text{ref}$ within a stage, variance in $\mathbb{P}_h V_h^\text{ref}$ estimates is minimized, and estimation focuses on the residual $\mathbb{P}_h(V_{h+1} - V_{h+1}^\text{ref})$.

Stage-based updates divide visits to each $(s, a, h)$ into stages of length $e_1 = H$, $e_{i+1} = \lfloor (1 + 1/H)\, e_i \rfloor$. Two Q-estimates are maintained:

  • Standard UCB estimate: $Q_h^{1,\text{new}} = r + \frac{1}{N} \sum_{i=1}^N V_{h+1}(s_{h+1}^{(i)}) + 2\sqrt{H^2 \iota / N}$,
  • Reference-advantage estimate: $Q_h^{2,\text{new}} = r + \mu_\text{ref} + \mu_\text{adv} + b_h(s,a)$,

where $\mu_\text{ref}$ and $\mu_\text{adv}$ are running means of $V_{h+1}^\text{ref}$ and $V_{h+1} - V_{h+1}^\text{ref}$, and $b_h(s,a)$ is a variance-based bonus:

$$b_h(s,a) = 2\sqrt{\frac{(\sigma_\text{ref} - \mu_\text{ref}^2)\,\iota}{N}} + 2\sqrt{\frac{(\sigma_\text{adv} - \mu_\text{adv}^2)\,\iota}{N}} + \mathcal{O}\!\left(\frac{H\iota}{N} + \frac{H\iota}{N^{3/4}}\right)$$

with $\iota = \log(SAT/\delta)$ (Zheng et al., 2024).
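As an illustration, the bonus's leading terms can be computed from running first and second moments. This is a hedged sketch (names are illustrative, not from the papers); $\sigma$ denotes the running mean of squares, so $\sigma - \mu^2$ is the empirical variance:

```python
import numpy as np

def ra_bonus(ref_samples, adv_samples, iota, H):
    """Leading terms of the reference-advantage bonus b_h(s,a) (constants in
    the O(.) remainder dropped).

    ref_samples: observed V_{h+1}^ref values at sampled next states;
    adv_samples: observed (V_{h+1} - V_{h+1}^ref) values;
    iota = log(SAT/delta).
    """
    N = len(ref_samples)
    # sigma - mu^2: empirical variance from running mean-of-squares and mean
    var_ref = np.mean(np.square(ref_samples)) - np.mean(ref_samples) ** 2
    var_adv = np.mean(np.square(adv_samples)) - np.mean(adv_samples) ** 2
    leading = 2 * np.sqrt(var_ref * iota / N) + 2 * np.sqrt(var_adv * iota / N)
    lower_order = H * iota / N + H * iota / N ** 0.75  # the O(.) remainder
    return leading + lower_order
```

With constant samples both variance terms vanish and only the $O(H\iota/N + H\iota/N^{3/4})$ remainder survives, which is why a well-settled reference with a small advantage range yields a much tighter bonus than a Hoeffding term of order $\sqrt{H^2\iota/N}$.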

Each stage update replaces $Q_h(s, a)$ with $\min\{Q_h(s, a), Q_h^{1,\text{new}}, Q_h^{2,\text{new}}\}$. Reference values $V_h^\text{ref}(s)$ are "settled" once total visits to $(s, h)$ surpass a threshold $N_0 = \Theta(SAH^5 \iota / \beta^2)$, ensuring the reference is close to optimal. Q-EarlySettled-Advantage uses similar updates but fixes $V_h^\text{ref}(s)$ once upper and lower confidence bounds converge within $\beta$.
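A minimal sketch of the stage schedule and the monotone min-update (illustrative names; the bonus computation and per-stage bookkeeping of the real algorithms are omitted):

```python
import math

def stage_lengths(H, total):
    """Stage schedule e_1 = H, e_{i+1} = floor((1 + 1/H) e_i), covering
    at least `total` visits to a given (s, a, h)."""
    e, out = H, []
    while sum(out) < total:
        out.append(e)
        e = math.floor((1 + 1.0 / H) * e)
    return out

def min_update(q_old, q_ucb, q_ra):
    """Monotone update: keep the tightest of the two optimistic estimates
    without ever letting the Q-estimate increase."""
    return min(q_old, q_ucb, q_ra)

# First stages for H = 10: each stage is ~(1 + 1/H) times the previous one,
# so the number of stages up to N visits is only O(H log N).
print(stage_lengths(10, 100))  # -> [10, 11, 12, 13, 14, 15, 16, 17]
```

The exponential stage growth is what makes updates infrequent enough to bound policy switching, while the min-update preserves optimism from whichever estimate is currently tighter.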

3. Error Decomposition and Variance Reduction

Regret analysis relies on decomposing the suboptimality for each $(s, a, h)$:

$$\Delta_h(s,a) = Q_h^\star(s,a) - r_h(s,a) - \mathbb{P}_h V_{h+1}(s,a) = [Q_h^\star - \widehat{Q}_h^1] + [\widehat{Q}_h^1 - \widehat{Q}_h^2] + [\widehat{Q}_h^2 - (r_h + \mathbb{P}_h V_{h+1})]$$

  • The first term is controlled by the standard UCB bonus.
  • The reference-advantage term is bounded by the variance of the advantage component and can be made logarithmic in $T$ and inversely proportional to $\Delta_{\min}$.
  • The difference between Q-estimates is non-negative due to the min-update and does not increase regret.

Reference-advantage decomposition is the source of these favorable variance properties. The reference part, once settled, enjoys variance decay $O(1/n)$, while the advantage has variance at most $\beta^2$, enabling much tighter confidence intervals than Hoeffding-only bonus schemes (Zheng et al., 2024).
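A small simulation (not from the papers) makes the variance contrast concrete: samples of the raw next-state value vary on the scale of the value range, while samples of the advantage residual vary only on the scale of $\beta$ once the reference is settled within $\beta$ of the running value:

```python
import numpy as np

rng = np.random.default_rng(1)
S = 50
V_next = rng.uniform(0.0, 10.0, S)           # running value estimate V_{h+1}
V_ref = V_next + rng.uniform(-0.5, 0.5, S)   # settled reference, within beta = 0.5

p = rng.random(S)
p /= p.sum()                                 # next-state distribution P_h(.|s,a)
s_next = rng.choice(S, size=10_000, p=p)     # simulated transitions

var_full = V_next[s_next].var()              # variance faced by a plain estimator
var_adv = (V_next - V_ref)[s_next].var()     # variance of the advantage residual
```

Since the residual values all lie in $[-\beta, \beta]$, `var_adv` is bounded by $\beta^2$ regardless of the transition distribution, whereas `var_full` scales with the spread of $V_{h+1}$ itself.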

4. Main Theoretical Guarantees

Gap-dependent bounds for stage-based Q-learning with reference-advantage decomposition substantially improve upon prior results:

  • UCB-Advantage: For $\Delta_{\min} > 0$, $\beta \in (0, H]$, and $T = KH$,

$$\mathbb{E}[\mathrm{Regret}(T)] \leq O\!\left(\frac{(\mathbb{Q}^\star + \beta^2 H) H^3 SA \ln(SAT)}{\Delta_{\min}} + \frac{H^8 S^2 A \ln(SAT) \ln T}{\beta^2}\right)$$

  • Q-EarlySettled-Advantage:

$$\mathbb{E}[\mathrm{Regret}(T)] \leq O\!\left(\frac{(\mathbb{Q}^\star + \beta^2 H) H^3 SA \ln(SAT)}{\Delta_{\min}} + \frac{H^7 SA \ln^2(SAT)}{\beta}\right)$$

  • UCB-Advantage policy-switching cost: With high probability,

$$N_\text{switch} \leq O\!\left( H |D_{\text{opt}}| \ln\!\left(\frac{T}{H |D_{\text{opt}}|} + 1\right) + H |D_{\text{opt}}^c| \ln \frac{H^4 S A^{1/2} \ln(SAT/\delta)}{\beta \sqrt{|D_{\text{opt}}^c|}\, \Delta_{\min}} \right)$$

(Zheng et al., 2024).

Comparison with Hoeffding-based methods (e.g., Q-Hoeffding) shows improvement from $O(H^6 SA \log(SAT)/\Delta_{\min})$ to $O(H^5 SA \log(SAT)/\Delta_{\min})$ in representative regimes. For deterministic or low-variance MDPs, careful $\beta$ tuning can further improve the dependence to $\Delta_{\min}^{-1/3}$.

FedQ-Advantage, a federated extension, provides $\tilde O(\sqrt{MSAH^2 T})$ regret with $O(M^2 H^3 S^2 A \ln H \ln T)$ total communication cost, reflecting nearly linear speedup in the number of agents $M$ while maintaining the variance-reduction advantages (Zheng et al., 2024).
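The log-scale communication typically comes from event-triggered synchronization: a round fires only when new local visits reach a constant fraction (roughly $1/H$) of the globally synced count, so the synced count grows geometrically and the number of rounds scales as $O(H \ln T)$ rather than $O(T)$. A toy single-counter sketch of this trigger (illustrative rule, not FedQ-Advantage's exact condition):

```python
def sync_rounds_for_counts(total_visits, H):
    """Count synchronization rounds under a doubling-style trigger: sync when
    visits since the last round exceed a 1/H fraction of the synced count.
    The synced count then grows by a factor ~(1 + 1/H) per round, so rounds
    scale as O(H log(total_visits)) instead of O(total_visits)."""
    synced, since, rounds = 1, 0, 0
    for _ in range(total_visits):
        since += 1
        if since >= max(1, synced / H):  # illustrative trigger condition
            synced += since
            since = 0
            rounds += 1
    return rounds
```

For example, with $H = 10$, ten thousand visits trigger on the order of a hundred rounds, not ten thousand, which is the mechanism behind the $\ln T$ factor in the communication bound.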

5. Multi-Agent and Zero-Sum Markov Game Extensions

Reference-advantage decomposition generalizes to multi-agent and zero-sum Markov games. In two-player zero-sum Markov games, model-free stage-based Q-learning with min-gap reference-advantage (tracking optimistic/pessimistic value estimates and settling references at the minimal observed gap) recovers the model-based $O(H^3 SAB/\epsilon^2)$ sample complexity previously unattainable with purely model-free methods (Feng et al., 2023).

Key innovations for Markov games include:

  • Maintaining optimistic/pessimistic $Q$ and $V$ estimates.
  • Updating reference pairs when the current value gap is minimal.
  • Applying coarse correlated equilibrium (CCE) policies to break single-agent monotonicity, with statistical control restored via min-gap reference updates.

Variance-reduced bonuses and reference-advantage separation ensure optimal horizon dependence and sublinear regret scaling in multi-agent contexts, with certified $\epsilon$-optimal Nash equilibria.
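The per-state equilibrium subroutine can be illustrated on a single matrix game. The papers use CCE computation; as a hedged stand-in, the sketch below approximates the value of a zero-sum matrix game by fictitious play, which is known to converge in value for zero-sum games:

```python
import numpy as np

def fictitious_play(G, iters=20_000):
    """Approximate the value of a zero-sum matrix game G (row player
    maximizes) via fictitious play: each player best-responds to the
    opponent's empirical action frequencies."""
    m, n = G.shape
    row_counts = np.zeros(m)
    col_counts = np.zeros(n)
    row_counts[0] = col_counts[0] = 1.0
    for _ in range(iters):
        row_counts[np.argmax(G @ col_counts)] += 1  # row best-responds
        col_counts[np.argmin(row_counts @ G)] += 1  # column best-responds
    x = row_counts / row_counts.sum()
    y = col_counts / col_counts.sum()
    return x @ G @ y                                # approximate game value

# Matching pennies: the unique equilibrium value is 0.
G = np.array([[1.0, -1.0], [-1.0, 1.0]])
v = fictitious_play(G)
```

In the Markov-game algorithms this role is played by a CCE oracle invoked at each $(s, h)$ on the optimistic/pessimistic Q-estimates; fictitious play here is only a simple illustrative substitute.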

6. Memory Efficiency and Practical Considerations

Stage-based reference-advantage methods (e.g., Q-EarlySettled-Advantage) achieve regret optimality ($O(\sqrt{H^2 SAT})$ up to logs) with space complexity $O(SAH)$ and much smaller sample-size requirements than previous memory-efficient algorithms. The early-settle rule for references allows aggressive stage-based freezing of $V_h^\text{ref}$ once the confidence interval converges, controlling drift and maintaining statistical guarantees.

Key properties:

  • Monotonicity and optimism: Q-estimates are non-increasing and preserve UCB/LCB sandwiching.
  • Reference closeness: early-settle logic ensures $\|V_h - V_h^\text{ref}\| \leq 2$ after burn-in.
  • Policy switching and rare switches: bounds on the number of policy changes are provided, which is critical in practical deployment (Li et al., 2021).
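The switching cost $N_\text{switch}$ defined above is straightforward to track empirically; a minimal sketch (hypothetical helper, assuming policies are stored as an integer array of greedy actions):

```python
import numpy as np

def switching_cost(policies):
    """N_switch = sum over consecutive episodes of per-(h, s) policy changes.

    policies: integer array of shape (K, H, S) giving the action pi_h^k(s)."""
    diffs = policies[1:] != policies[:-1]  # (K-1, H, S) indicator array
    return int(diffs.sum())

pi = np.array([
    [[0, 1], [1, 1]],  # episode 1: pi_h(s) for H = 2, S = 2
    [[0, 1], [1, 0]],  # episode 2: one entry changed
    [[0, 1], [1, 0]],  # episode 3: no change
])
```

Here exactly one $(h, s)$ entry changes across the three episodes, so the cost is 1; low-switching algorithms keep this count polylogarithmic in $T$.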

7. Significance and Comparison to Prior Work

Stage-based Q-learning with reference-advantage decomposition represents the first fully gap-dependent, logarithmic-in-$T$ regret analysis for variance-reduced Q-learning and provides the first gap-dependent bounds on policy switching cost. It generalizes efficiently to federated settings, zero-sum Markov games, and large-scale multi-agent scenarios, offering memory efficiency, favorable communication cost, and nearly information-theoretic sample complexity.

A plausible implication is that reference-advantage decomposition will remain central to future advances in model-free RL, particularly in scalable and distributed environments where low-variance estimation and fast stage-wise convergence are essential. The separation into reference and advantage estimation is the primary driver of improved gap-dependent rates and enables adaptive exploration while maintaining statistical control (Zheng et al., 2024, Zheng et al., 2024, Feng et al., 2023, Li et al., 2021).
