
Stage-Based Q-Learning: Ref-Advantage Decomposition

Updated 21 January 2026
  • The paper presents a stage-based Q-learning algorithm that decomposes Q-values into reference and advantage components to achieve significant variance reduction and gap-dependent regret bounds.
  • It employs dual estimates—with standard UCB and reference-advantage updates—to stabilize learning and enable nearly linear speedup in federated and multi-agent settings.
  • The method offers rigorous guarantees including improved policy switching cost bounds and better dependence on problem-specific gaps compared to classical Q-learning approaches.

Stage-based Q-Learning with Reference-Advantage Decomposition is a family of model-free reinforcement learning algorithms designed for finite-horizon episodic Markov Decision Processes (MDPs) that exploit structured error decomposition and variance reduction. These approaches, notably UCB-Advantage and Q-EarlySettled-Advantage, partition state-action-horizon visits into exponentially growing stages and use reference-advantage decomposed confidence intervals to obtain near-optimal and gap-dependent regret bounds. Extensions to federated and multi-agent settings have further demonstrated nearly linear speedup and log-scale communication with analogous theoretical guarantees.

1. Mathematical Foundation and Problem Setting

In these algorithms, the environment is a finite-horizon, episodic MDP $M = (\mathcal{S}, \mathcal{A}, H, \{\mathbb{P}_h\}_{h=1}^H, \{r_h\}_{h=1}^H)$, with $\mathcal{S}$ and $\mathcal{A}$ the state and action sets of cardinalities $S$ and $A$, and horizon $H$. Transitions are governed by $\mathbb{P}_h(\cdot \mid s, a)$, and rewards $r_h(s, a) \in [0, 1]$ may be deterministic. For each episode $k$ and step $h$, the policy $\pi$ yields $V_h^\pi(s)$ (expected cumulative future reward) and $Q_h^\pi(s, a) = r_h(s, a) + \mathbb{P}_h V_{h+1}^\pi(s, a)$.

Optimal values satisfy Bellman equations:

$$V_h^\star(s) = \max_{a} Q_h^\star(s, a), \qquad Q_h^\star(s, a) = r_h(s, a) + \mathbb{P}_h V_{h+1}^\star(s, a)$$

Suboptimality gaps are denoted $\Delta_h(s, a) = V_h^\star(s) - Q_h^\star(s, a)$, and the minimum nonzero gap is $\Delta_{\min}$. The maximum conditional variance of value propagation is $\mathbb{Q}^\star = \max_{s,a,h} \mathrm{Var}_{s' \sim \mathbb{P}_h(\cdot \mid s,a)}[V_{h+1}^\star(s')]$. The cumulative regret over $K$ episodes (total steps $T = KH$) is $\mathrm{Regret}(T) = \sum_{k=1}^K [V_1^\star(s_1^k) - V_1^{\pi^k}(s_1^k)]$. Additionally, the policy switching cost (the number of per-step policy changes) is

$$N_\text{switch} = \sum_{k=1}^{K-1} \sum_{h=1}^H \sum_{s \in \mathcal{S}} \mathbb{I}[\pi_h^{k+1}(s) \neq \pi_h^k(s)]$$

(Zheng et al., 2024).
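For small MDPs, the quantities above can be computed exactly by backward induction on the Bellman optimality equations. The sketch below (hypothetical helper names, assuming NumPy) evaluates $Q^\star$, $V^\star$, and the gaps $\Delta_h(s,a)$ on a randomly generated toy instance:

```python
import numpy as np

def optimal_values(P, r):
    """Backward induction for a finite-horizon MDP.

    P: transitions, shape (H, S, A, S); r: rewards, shape (H, S, A).
    Returns Q* (H, S, A), V* (H+1, S), and gaps Delta_h(s,a) = V*_h(s) - Q*_h(s,a).
    """
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))           # V*_{H+1} = 0 by convention
    for h in range(H - 1, -1, -1):     # Bellman optimality, backwards in h
        Q[h] = r[h] + P[h] @ V[h + 1]  # Q*_h = r_h + P_h V*_{h+1}
        V[h] = Q[h].max(axis=1)        # V*_h(s) = max_a Q*_h(s, a)
    gaps = V[:H, :, None] - Q          # Delta_h(s, a) >= 0, zero at optimal actions
    return Q, V, gaps

# Toy MDP: H=3, S=2, A=2 with a random (normalized) transition kernel.
rng = np.random.default_rng(0)
H, S, A = 3, 2, 2
P = rng.random((H, S, A, S))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((H, S, A))
Q, V, gaps = optimal_values(P, r)
```

Every row of `gaps` has minimum exactly zero, attained at the greedy action, which is the sense in which $\Delta_{\min}$ is the smallest *nonzero* gap.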

2. Reference-Advantage Decomposition and Algorithmic Structure

The reference-advantage methodology decomposes $Q$ as $Q_h(s, a) = V_h^\text{ref}(s) + A_h(s, a)$, with $V_h^\text{ref}(s)$ a stage-settled baseline and $A_h(s, a)$ the local advantage. By fixing $V_h^\text{ref}$ within a stage, variance in $\mathbb{P}_h V_h^\text{ref}$ estimates is minimized, and estimation focuses on the residual $\mathbb{P}_h(V_{h+1} - V_{h+1}^\text{ref})$.

Stage-based updates divide visits to each $(s, a, h)$ into stages of length $e_1 = H$, $e_{i+1} = \lfloor (1 + 1/H)\, e_i \rfloor$. Two Q-estimates are maintained:

  • Standard UCB estimate: $Q_h^{1,\text{new}} = r + \frac{1}{N} \sum_{i=1}^N V_{h+1}(s_{h+1}^{(i)}) + 2\sqrt{H^2 \iota / N}$,
  • Reference-advantage estimate: $Q_h^{2,\text{new}} = r + \mu_\text{ref} + \mu_\text{adv} + b_h(s,a)$,

where $\mu_\text{ref}$ and $\mu_\text{adv}$ are running means of $V_{h+1}^\text{ref}$ and $V_{h+1} - V_{h+1}^\text{ref}$, and $b_h(s,a)$ is a variance-based bonus:

$$b_h(s,a) = 2\sqrt{\frac{(\sigma_\text{ref} - \mu_\text{ref}^2)\,\iota}{N}} + 2\sqrt{\frac{(\sigma_\text{adv} - \mu_\text{adv}^2)\,\iota}{N}} + \mathcal{O}\!\left(\frac{H\iota}{N} + \frac{H\iota}{N^{3/4}}\right)$$

with $\iota = \log(SAT/\delta)$ (Zheng et al., 2024).
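As an illustration, the bonus's leading terms can be computed from running first and second moments. This is a hedged sketch (names are illustrative, not from the papers); $\sigma$ denotes the running mean of squares, so $\sigma - \mu^2$ is the empirical variance:

```python
import numpy as np

def ra_bonus(ref_samples, adv_samples, iota, H):
    """Leading terms of the reference-advantage bonus b_h(s,a) (constants in
    the O(.) remainder dropped).

    ref_samples: observed V_{h+1}^ref values at sampled next states;
    adv_samples: observed (V_{h+1} - V_{h+1}^ref) values;
    iota = log(SAT/delta).
    """
    N = len(ref_samples)
    # sigma - mu^2: empirical variance from running mean-of-squares and mean
    var_ref = np.mean(np.square(ref_samples)) - np.mean(ref_samples) ** 2
    var_adv = np.mean(np.square(adv_samples)) - np.mean(adv_samples) ** 2
    leading = 2 * np.sqrt(var_ref * iota / N) + 2 * np.sqrt(var_adv * iota / N)
    lower_order = H * iota / N + H * iota / N ** 0.75  # the O(.) remainder
    return leading + lower_order
```

With constant samples both variance terms vanish and only the $O(H\iota/N + H\iota/N^{3/4})$ remainder survives, which is why a well-settled reference with a small advantage range yields a much tighter bonus than a Hoeffding term of order $\sqrt{H^2\iota/N}$.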

Each stage update replaces $Q_h(s, a)$ with $\min\{Q_h(s, a), Q_h^{1,\text{new}}, Q_h^{2,\text{new}}\}$. Reference values $V_h^\text{ref}(s)$ are "settled" once total visits to $(s, h)$ surpass a threshold $N_0 = \Theta(SAH^5 \iota / \beta^2)$, ensuring the reference is close to optimal. Q-EarlySettled-Advantage uses similar updates but fixes $V_h^\text{ref}(s)$ once upper and lower confidence bounds converge within $\beta$.
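A minimal sketch of the stage schedule and the monotone min-update (illustrative names; the bonus computation and per-stage bookkeeping of the real algorithms are omitted):

```python
import math

def stage_lengths(H, total):
    """Stage schedule e_1 = H, e_{i+1} = floor((1 + 1/H) e_i), covering
    at least `total` visits to a given (s, a, h)."""
    e, out = H, []
    while sum(out) < total:
        out.append(e)
        e = math.floor((1 + 1.0 / H) * e)
    return out

def min_update(q_old, q_ucb, q_ra):
    """Monotone update: keep the tightest of the two optimistic estimates
    without ever letting the Q-estimate increase."""
    return min(q_old, q_ucb, q_ra)

# First stages for H = 10: each stage is ~(1 + 1/H) times the previous one,
# so the number of stages up to N visits is only O(H log N).
print(stage_lengths(10, 100))  # -> [10, 11, 12, 13, 14, 15, 16, 17]
```

The exponential stage growth is what makes updates infrequent enough to bound policy switching, while the min-update preserves optimism from whichever estimate is currently tighter.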

3. Error Decomposition and Variance Reduction

Regret analysis relies on decomposing the suboptimality for each $(s, a, h)$:

$$\Delta_h(s,a) = Q_h^\star(s,a) - r_h(s,a) - \mathbb{P}_h V_{h+1}(s,a) = [Q_h^\star - \widehat{Q}_h^1] + [\widehat{Q}_h^1 - \widehat{Q}_h^2] + [\widehat{Q}_h^2 - (r_h + \mathbb{P}_h V_{h+1})]$$

  • The first term is controlled by the standard UCB bonus.
  • The reference-advantage term is bounded by the variance of the advantage component and can be made logarithmic in $T$ and inversely proportional to $\Delta_{\min}$.
  • The difference between Q-estimates is non-negative due to the min-update and does not increase regret.

Reference-advantage decomposition is the source of these favorable variance properties. The reference part, once settled, enjoys variance decay $O(1/n)$, while the advantage has variance at most $\beta^2$, enabling much tighter confidence intervals than Hoeffding-only bonus schemes (Zheng et al., 2024).
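A small simulation (not from the papers) makes the variance contrast concrete: samples of the raw next-state value vary on the scale of the value range, while samples of the advantage residual vary only on the scale of $\beta$ once the reference is settled within $\beta$ of the running value:

```python
import numpy as np

rng = np.random.default_rng(1)
S = 50
V_next = rng.uniform(0.0, 10.0, S)           # running value estimate V_{h+1}
V_ref = V_next + rng.uniform(-0.5, 0.5, S)   # settled reference, within beta = 0.5

p = rng.random(S)
p /= p.sum()                                 # next-state distribution P_h(.|s,a)
s_next = rng.choice(S, size=10_000, p=p)     # simulated transitions

var_full = V_next[s_next].var()              # variance faced by a plain estimator
var_adv = (V_next - V_ref)[s_next].var()     # variance of the advantage residual
```

Since the residual values all lie in $[-\beta, \beta]$, `var_adv` is bounded by $\beta^2$ regardless of the transition distribution, whereas `var_full` scales with the spread of $V_{h+1}$ itself.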

4. Main Theoretical Guarantees

Gap-dependent bounds for stage-based Q-learning with reference-advantage decomposition substantially improve upon prior results:

  • UCB-Advantage: For $\Delta_{\min} > 0$, $\beta \in (0, H]$, and $T = KH$,

$$\mathbb{E}[\mathrm{Regret}(T)] \leq O\!\left(\frac{(\mathbb{Q}^\star + \beta^2 H) H^3 SA \ln(SAT)}{\Delta_{\min}} + \frac{H^8 S^2 A \ln(SAT) \ln T}{\beta^2}\right)$$

  • Q-EarlySettled-Advantage:

$$\mathbb{E}[\mathrm{Regret}(T)] \leq O\!\left(\frac{(\mathbb{Q}^\star + \beta^2 H) H^3 SA \ln(SAT)}{\Delta_{\min}} + \frac{H^7 SA \ln^2(SAT)}{\beta}\right)$$

  • UCB-Advantage policy-switching cost: With high probability,

$$N_\text{switch} \leq O\!\left( H |D_{\text{opt}}| \ln\!\left(\frac{T}{H |D_{\text{opt}}|} + 1\right) + H |D_{\text{opt}}^c| \ln \frac{H^4 S A^{1/2} \ln(SAT/\delta)}{\beta \sqrt{|D_{\text{opt}}^c|}\, \Delta_{\min}} \right)$$

(Zheng et al., 2024).

Comparison with Hoeffding-based methods (e.g., Q-Hoeffding) shows improvement from $O(H^6 SA \log(SAT)/\Delta_{\min})$ to $O(H^5 SA \log(SAT)/\Delta_{\min})$ in representative regimes. For deterministic or low-variance MDPs, careful $\beta$ tuning can further improve the dependence to $\Delta_{\min}^{-1/3}$.

FedQ-Advantage, a federated extension, provides $\tilde O(\sqrt{MSAH^2 T})$ regret with $O(M^2 H^3 S^2 A \ln H \ln T)$ total communication cost, reflecting nearly linear speedup in the number of agents $M$ while maintaining the variance-reduction advantages (Zheng et al., 2024).
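The log-scale communication typically comes from event-triggered synchronization: a round fires only when new local visits reach a constant fraction (roughly $1/H$) of the globally synced count, so the synced count grows geometrically and the number of rounds scales as $O(H \ln T)$ rather than $O(T)$. A toy single-counter sketch of this trigger (illustrative rule, not FedQ-Advantage's exact condition):

```python
def sync_rounds_for_counts(total_visits, H):
    """Count synchronization rounds under a doubling-style trigger: sync when
    visits since the last round exceed a 1/H fraction of the synced count.
    The synced count then grows by a factor ~(1 + 1/H) per round, so rounds
    scale as O(H log(total_visits)) instead of O(total_visits)."""
    synced, since, rounds = 1, 0, 0
    for _ in range(total_visits):
        since += 1
        if since >= max(1, synced / H):  # illustrative trigger condition
            synced += since
            since = 0
            rounds += 1
    return rounds
```

For example, with $H = 10$, ten thousand visits trigger on the order of a hundred rounds, not ten thousand, which is the mechanism behind the $\ln T$ factor in the communication bound.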

5. Multi-Agent and Zero-Sum Markov Game Extensions

Reference-advantage decomposition generalizes to multi-agent and zero-sum Markov games. In two-player zero-sum Markov games, model-free stage-based Q-learning with min-gap reference-advantage (tracking optimistic/pessimistic value estimates and settling references at the minimal observed gap) recovers the model-based $O(H^3 SAB/\epsilon^2)$ sample complexity previously unattainable with purely model-free methods (Feng et al., 2023).

Key innovations for Markov games include:

  • Maintaining optimistic/pessimistic $Q$ and $V$ estimates.
  • Updating reference pairs when the current value gap is minimal.
  • Applying coarse correlated equilibrium (CCE) policies to break single-agent monotonicity, with statistical control restored via min-gap reference updates.

Variance-reduced bonuses and reference-advantage separation ensure optimal horizon dependence and sublinear regret scaling in multi-agent contexts, with certified $\epsilon$-optimal Nash equilibria.
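The per-state equilibrium subroutine can be illustrated on a single matrix game. The papers use CCE computation; as a hedged stand-in, the sketch below approximates the value of a zero-sum matrix game by fictitious play, which is known to converge in value for zero-sum games:

```python
import numpy as np

def fictitious_play(G, iters=20_000):
    """Approximate the value of a zero-sum matrix game G (row player
    maximizes) via fictitious play: each player best-responds to the
    opponent's empirical action frequencies."""
    m, n = G.shape
    row_counts = np.zeros(m)
    col_counts = np.zeros(n)
    row_counts[0] = col_counts[0] = 1.0
    for _ in range(iters):
        row_counts[np.argmax(G @ col_counts)] += 1  # row best-responds
        col_counts[np.argmin(row_counts @ G)] += 1  # column best-responds
    x = row_counts / row_counts.sum()
    y = col_counts / col_counts.sum()
    return x @ G @ y                                # approximate game value

# Matching pennies: the unique equilibrium value is 0.
G = np.array([[1.0, -1.0], [-1.0, 1.0]])
v = fictitious_play(G)
```

In the Markov-game algorithms this role is played by a CCE oracle invoked at each $(s, h)$ on the optimistic/pessimistic Q-estimates; fictitious play here is only a simple illustrative substitute.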

6. Memory Efficiency and Practical Considerations

Stage-based reference-advantage methods (e.g., Q-EarlySettled-Advantage) achieve regret optimality ($O(\sqrt{H^2 SAT})$ up to logs) with space complexity $O(SAH)$ and much smaller sample-size requirements than previous memory-efficient algorithms. The early-settle rule for references allows aggressive stage-based freezing of $V_h^\text{ref}$ once the confidence interval converges, controlling drift and maintaining statistical guarantees.

Key properties:

  • Monotonicity and optimism: Q-estimates are non-increasing and preserve UCB/LCB sandwiching.
  • Reference closeness: early-settle logic ensures $\|V_h - V_h^\text{ref}\| \leq 2$ after burn-in.
  • Policy switching and rare switches: bounds on the number of policy changes are provided, which is critical in practical deployment (Li et al., 2021).
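The switching cost $N_\text{switch}$ defined above is straightforward to track empirically; a minimal sketch (hypothetical helper, assuming policies are stored as an integer array of greedy actions):

```python
import numpy as np

def switching_cost(policies):
    """N_switch = sum over consecutive episodes of per-(h, s) policy changes.

    policies: integer array of shape (K, H, S) giving the action pi_h^k(s)."""
    diffs = policies[1:] != policies[:-1]  # (K-1, H, S) indicator array
    return int(diffs.sum())

pi = np.array([
    [[0, 1], [1, 1]],  # episode 1: pi_h(s) for H = 2, S = 2
    [[0, 1], [1, 0]],  # episode 2: one entry changed
    [[0, 1], [1, 0]],  # episode 3: no change
])
```

Here exactly one $(h, s)$ entry changes across the three episodes, so the cost is 1; low-switching algorithms keep this count polylogarithmic in $T$.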

7. Significance and Comparison to Prior Work

Stage-based Q-learning with reference-advantage decomposition represents the first fully gap-dependent, logarithmic-in-$T$ regret analysis for variance-reduced Q-learning and provides the first gap-dependent bounds on policy switching cost. It generalizes efficiently to federated settings, zero-sum Markov games, and large-scale multi-agent scenarios, offering memory efficiency, favorable communication cost, and nearly information-theoretic sample complexity.

A plausible implication is that reference-advantage decomposition will remain central to future advances in model-free RL, particularly in scalable and distributed environments where low-variance estimation and fast stage-wise convergence are essential. The separation into reference and advantage estimation is the primary driver of improved gap-dependent rates and enables adaptive exploration while maintaining statistical control (Zheng et al., 2024, Zheng et al., 2024, Feng et al., 2023, Li et al., 2021).
