Periodic Regularized Q-Learning (PRQ)

Updated 10 February 2026
  • Periodic Regularized Q-Learning (PRQ) is a reinforcement learning method that introduces regularization at fixed intervals to improve the stability of Q-learning with function approximation.
  • The algorithm alternates between standard TD updates and periodic regularized projections, mitigating divergence and reducing variance in off-policy scenarios.
  • Empirical results and theoretical analysis demonstrate that PRQ effectively balances contraction properties with minimal fixed-point bias, even in complex agent-state and partially observable settings.

Periodic Regularized Q-Learning (PRQ) encompasses a class of reinforcement learning (RL) algorithms designed to stabilize Q-learning under function approximation by incorporating periodic regularization at the projection level. The core motivation is to achieve robust convergence—especially in off-policy or ill-conditioned scenarios—while mitigating the “regularization bias” introduced by continual regularization. This approach balances the competing objectives of contraction-based stability and fixed-point accuracy, and is applicable in both standard Markov Decision Processes (MDPs) and more general agent-state (e.g., recurrent) frameworks (Yang et al., 3 Feb 2026, Sinha et al., 29 Aug 2025).

1. Motivation and Background

Q-learning with linear function approximation is known to diverge in general off-policy settings, especially when the feature covariance matrix $\Phi^\top D \Phi$ is ill-conditioned (i.e., has small eigenvalues) and traditional projected Bellman iteration is used. This effect, part of the “deadly triad,” reflects the lack of contraction properties of the unregularized projected Bellman operator.

Regularization has been previously proposed to “lift” the smallest eigenvalues, thereby ensuring strict contraction of the operator. However, full regularization at every iteration leads to a persistent bias in the fixed-point solution (of order $O(\lambda)$). PRQ introduces regularization only at fixed periodic intervals, preserving stability with minimal increase in fixed-point bias (Yang et al., 3 Feb 2026).

In partially observable or agent-state-based RL (where the Q-value updates are based on a latent or recurrent “agent state” rather than the environment state), an analogous need for regularization exists. “Periodic regularized agent-state-based Q-learning” (RePASQL) applies the PRQ principles to these more general regimes (Sinha et al., 29 Aug 2025).

2. Mathematical Formulation

Given a discounted MDP $(\mathcal S, \mathcal A, P, r, \gamma)$ and a linear feature map $\phi: \mathcal S \times \mathcal A \to \mathbb{R}^d$, approximate Q-functions as $Q(s,a) \approx \phi(s,a)^\top \theta$.

Let $D$ be the diagonal matrix of visitation weights and $\Phi$ the feature matrix. Define:

  • Unregularized projection:

$$\Pi f = \Phi (\Phi^\top D \Phi)^{-1} \Phi^\top D f$$

  • Regularized projection ($\lambda > 0$):

$$\Pi_\lambda f = \Phi (\Phi^\top D \Phi + \lambda I_d)^{-1} \Phi^\top D f$$

  • Bellman operator:

$$(\mathcal{T} Q)(s,a) = \mathbb{E}\big[\, r(s,a) + \gamma \max_{a'} Q(s', a') \,\big|\, s, a \big]$$

The fully regularized projected value iteration is

$$Q_{k+1} = \Pi_\lambda \mathcal{T} Q_k$$

This iteration is a contraction but converges to a fixed point biased by $O(\lambda)$.

The periodic regularized scheme alternates between unregularized and regularized projections with fixed period $p$:

$$G_k = \begin{cases} \Pi_\lambda \mathcal{T}, & k \equiv 0 \pmod p \\ \Pi \mathcal{T}, & \text{otherwise} \end{cases}$$

Idealized batch iteration: $Q_{k+1} = G_k Q_k$.
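The idealized batch scheme can be sketched in a few lines of numpy. This is a toy illustration under stated assumptions, not the paper's code: the MDP, random features, and a linearized (expected) Bellman map are all made up, and $D$ is taken on-policy (the chain's stationary distribution) so that the unregularized projected iteration stays stable.

```python
import numpy as np

# Minimal sketch of the idealized batch iteration Q_{k+1} = G_k Q_k.
rng = np.random.default_rng(0)
n, d, gamma, lam, p = 20, 5, 0.9, 0.1, 10     # |S||A|, feature dim, discount, lambda, period

Phi = rng.standard_normal((n, d))             # feature matrix
P = rng.dirichlet(np.ones(n), size=n)         # toy transition kernel over state-action pairs
r = rng.standard_normal(n)                    # reward vector

# Visitation weights D from the stationary distribution of P (on-policy assumption)
evals, evecs = np.linalg.eig(P.T)
pi = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
pi = pi / pi.sum()
D = np.diag(pi)

C = Phi.T @ D @ Phi
Pi = Phi @ np.linalg.solve(C, Phi.T @ D)                        # unregularized projection
Pi_lam = Phi @ np.linalg.solve(C + lam * np.eye(d), Phi.T @ D)  # regularized projection

def bellman(Q):
    # Linearized (expected) Bellman map; the paper's T takes a max over a'.
    return r + gamma * P @ Q

Q = np.zeros(n)
for k in range(300):
    G = Pi_lam if k % p == 0 else Pi          # regularize only on every p-th step
    Q = G @ bellman(Q)
```

Note the design point the scheme exploits: $\Pi$ is applied on most steps (cheap, unbiased), while the matrix inversion behind $\Pi_\lambda$ is amortized over the period $p$.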

A stochastic approximation version is given by alternating between vanilla temporal difference updates and periodic regularized projections in the parameter space (Yang et al., 3 Feb 2026).

In POMDPs or RL with agent state $z_t \in Z$, PRQ maintains a collection of $L$ periodic Q-tables $\{ Q_t^\ell(z, a) \}$, one for each phase in a cycle of length $L$, with policy regularization (e.g., via entropy). The update employs the convex conjugate of the regularizer, replacing the standard $\max_{a'} Q$ term with the regularized “soft-max” (Sinha et al., 29 Aug 2025).

3. Algorithmic Details

A typical online PRQ algorithm proceeds as follows (Yang et al., 3 Feb 2026):

  1. Observe a sample $(s_t, a_t, r_t, s'_t)$.
  2. Form the TD target: $y_t \leftarrow r_t + \gamma \max_{a'} \phi(s'_t, a')^\top \theta_t$.
  3. Unprojected update: $\bar\theta_{t+1} \leftarrow \theta_t + \alpha_t \phi(s_t, a_t)\,( y_t - \phi(s_t, a_t)^\top \theta_t )$.
  4. Every $p$ steps, perform the regularized projection $\theta_{t+1} \leftarrow (C + \lambda I_d)^{-1} C \bar\theta_{t+1}$, where $C = \Phi^\top D \Phi$; otherwise, set $\theta_{t+1} \leftarrow \bar\theta_{t+1}$.
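The four steps above can be sketched as follows on a synthetic stream. The feature map, uniform toy transitions, and step-size schedule are my assumptions; since $D$ is unknown online, $C = \Phi^\top D \Phi$ is replaced here by a running estimate of the feature covariance from observed samples.

```python
import numpy as np

# Hedged sketch of the online PRQ loop (steps 1-4) on a toy environment.
rng = np.random.default_rng(1)
nS, nA, d, gamma, lam, p = 10, 3, 4, 0.9, 0.05, 50

phi = rng.standard_normal((nS, nA, d)) / np.sqrt(d)  # feature map phi(s, a)
R = rng.standard_normal((nS, nA))                    # toy reward table
theta = np.zeros(d)
C_hat = np.zeros((d, d))                             # running estimate of Phi^T D Phi
s = 0
for t in range(1, 2001):
    a = int(rng.integers(nA))                        # exploratory behavior policy
    s_next = int(rng.integers(nS))                   # toy transition
    # Step 2: TD target with a max over next actions
    y = R[s, a] + gamma * np.max(phi[s_next] @ theta)
    # Step 3: unprojected TD update
    alpha = 1.0 / t ** 0.75
    f = phi[s, a]
    theta = theta + alpha * f * (y - f @ theta)
    C_hat += (np.outer(f, f) - C_hat) / t            # update covariance estimate
    # Step 4: regularized projection every p steps, identity otherwise
    if t % p == 0:
        theta = np.linalg.solve(C_hat + lam * np.eye(d), C_hat @ theta)
    s = s_next
```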

In agent-state RL, at each time $t$, let $\ell = t \bmod L$, and update only $Q_t^\ell(z, a)$. The update is:

$$Q^\ell_{t+1}(z,a) = Q^\ell_t(z,a) + \alpha^\ell_t(z,a) \Big[\, r_t + \gamma\, \Omega^*\big( Q^{(\ell+1)\bmod L}_t(z_{t+1}, \cdot) \big) - Q^\ell_t(z,a) \Big]$$

where $\Omega^*(\cdot)$ is the convex conjugate of the policy regularizer (with entropy regularization yielding the soft-Q update) (Sinha et al., 29 Aug 2025).
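For entropy regularization, the conjugate $\Omega^*$ is the scaled log-sum-exp ("soft-max"), which the cyclic update above plugs in where vanilla Q-learning uses $\max_{a'}$. The sketch below illustrates this; the agent-state space, temperature `tau`, and the random transition stream are illustrative assumptions.

```python
import numpy as np

# Cyclic agent-state Q-learning with an entropy-regularized soft-max target.
rng = np.random.default_rng(2)
nZ, nA, L, gamma, tau = 6, 3, 4, 0.9, 0.5

def omega_star(q, tau):
    # Convex conjugate of (negative) entropy: tau * log sum_a exp(q_a / tau),
    # computed with a max-shift for numerical stability.
    m = q.max()
    return m + tau * np.log(np.exp((q - m) / tau).sum())

Q = np.zeros((L, nZ, nA))        # one Q-table per phase of the cycle
counts = np.ones((L, nZ, nA))    # visit counts for the step sizes alpha
z = 0
for t in range(5000):
    ell = t % L                  # current phase ell = t mod L
    a = int(rng.integers(nA))
    z_next = int(rng.integers(nZ))                   # toy agent-state transition
    r_t = float(rng.standard_normal())
    alpha = 1.0 / counts[ell, z, a]
    target = r_t + gamma * omega_star(Q[(ell + 1) % L, z_next], tau)
    Q[ell, z, a] += alpha * (target - Q[ell, z, a])  # update only the phase-ell table
    counts[ell, z, a] += 1
    z = z_next
```

As $\tau \to 0$ the soft-max approaches the hard $\max_{a'}$, recovering the unregularized target.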

4. Theoretical Properties and Convergence

PRQ achieves geometric convergence toward the optimal projected Q-function up to decomposable noise and bias terms. Under mild assumptions (bounded feature norms and rewards, decaying step-sizes, full-rank covariance), the main finite-time guarantee is:

$$\| Q_T - Q^* \|_D \leq \rho^{T/p}\, \|Q_0 - Q^*\|_D + B_{\mathrm{noise}} + B_{\mathrm{reg}}$$

where:

  • $\rho < 1$ is an explicit contraction parameter,
  • $B_{\mathrm{noise}} = O\left( \sqrt{ \frac{ d \log(1/\delta) }{ T^{1-2\beta} } } \right)$ for batch size $T$,
  • $B_{\mathrm{reg}} = O(\lambda/(1-\gamma))$ is the regularization bias.

The regularized Bellman operator is a contraction for any $\lambda > 0$: for all $Q, Q'$,

$$\| \Pi_\lambda \mathcal{T} Q - \Pi_\lambda \mathcal{T} Q' \|_D \leq \kappa(\lambda)\, \gamma\, \| Q - Q' \|_D,$$

with $\kappa(\lambda) = \| (C + \lambda I)^{-1} C \| \leq 1$ (Yang et al., 3 Feb 2026).
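The bound on $\kappa(\lambda)$ is easy to verify numerically: for symmetric positive semidefinite $C$, the eigenvalues of $(C+\lambda I)^{-1}C$ are $\mu/(\mu+\lambda) < 1$, so $\kappa$ is below 1 and decreases as $\lambda$ grows. A quick check with a random PSD matrix (an assumption standing in for $\Phi^\top D \Phi$):

```python
import numpy as np

# Contraction modulus kappa(lambda) = ||(C + lambda I)^{-1} C||_2.
rng = np.random.default_rng(3)
d = 8
A = rng.standard_normal((d, d))
C = A @ A.T                                   # random symmetric PSD stand-in for Phi^T D Phi

def kappa(C, lam):
    # Spectral norm of (C + lam I)^{-1} C
    return np.linalg.norm(np.linalg.solve(C + lam * np.eye(C.shape[0]), C), 2)
```

This is the lever behind the bias-variance trade-off in Section 5: larger $\lambda$ means a smaller $\kappa$ (faster contraction) but a larger fixed-point bias.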

In the agent-state RL setting, under periodic policy and exploration, the convergence is to the unique fixed point of a cycle-composed $\gamma^L$-contraction operator in each “phase” of the cycle. Under periodic step-sizes and ergodic exploratory policies, almost-sure convergence of the Q-tables to their respective fixed points is guaranteed (Sinha et al., 29 Aug 2025).

5. Sample Complexity and Bias–Variance Trade-offs

To ensure $\| Q_T - Q^* \|_D \leq \varepsilon$, it suffices to select:

  • $\lambda = O((1-\gamma)\varepsilon)$ to control bias,
  • $T = \tilde O(d \varepsilon^{-2})$ to suppress noise,
  • $T/p = O(\log(1/\varepsilon))$ to contract the initial error.

The total sample complexity, up to logarithmic factors, is:

$$T = \tilde O\left( d\, \varepsilon^{-2} + p \log(1/\varepsilon) \right), \qquad \lambda = \Theta\big((1-\gamma)\,\varepsilon\big)$$

For $p = O(1)$, this matches the minimax rates of unregularized Q-learning in the best-case regime.
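The selections above reduce to simple arithmetic. The helper below is a back-of-envelope illustration only: the paper's bounds carry hidden constants and logarithmic factors, which are set to 1 here.

```python
import math

# Map a target accuracy eps to the (lambda, T) schedule, constants set to 1.
def prq_schedule(eps, d, gamma, p):
    lam = (1 - gamma) * eps                         # lambda = Theta((1 - gamma) eps): bias
    T_noise = math.ceil(d / eps ** 2)               # T = O~(d eps^-2): noise
    T_contract = math.ceil(p * math.log(1 / eps))   # T/p = O(log(1/eps)): initial error
    return lam, max(T_noise, T_contract)

lam, T = prq_schedule(eps=0.1, d=10, gamma=0.9, p=50)
```

For moderate accuracies the noise term $d\varepsilon^{-2}$ dominates, so the $p\log(1/\varepsilon)$ contraction term is essentially free.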

Smaller $\lambda$ reduces long-term bias but pushes $\kappa(\lambda) \to 1$, making contraction slower. Larger $p$ reduces computational burden but can slow convergence per full contraction cycle. Empirically, $p \in [10, 100]$ and $\lambda \approx (1-\gamma)\varepsilon$ typically offer a favorable trade-off (Yang et al., 3 Feb 2026).

6. Practical Implementation and Empirical Behavior

Key algorithmic steps for PRQ are summarized as follows:

| Step | Operation | Frequency |
| --- | --- | --- |
| TD update | Stochastic gradient step on the TD error | Every iteration |
| Regularized projection | Apply $(C+\lambda I_d)^{-1} C$ | Every $p$ iterations |
| No projection | Identity (no explicit projection) | All other iterations |

Empirical evaluation on synthetic and benchmark environments demonstrates:

  • PRQ eliminates divergence and large oscillations observed in linear Q-learning under function approximation, particularly in off-policy regimes.
  • The bias–variance trade-off is explicit: increasing $\lambda$ yields more stable (lower-variance) learning but increases fixed-point bias.
  • Computational savings are substantial: periodic regularization (e.g., $p=50$) achieves near-identical accuracy to full regularization ($p=1$) with a fraction of the matrix-inversion overhead (Yang et al., 3 Feb 2026).

In partially observable or agent-state-based tasks, periodic regularized Q-learning achieves almost-sure convergence to the fixed points determined by an induced “regularized pseudo-MDP,” with experimental results aligning closely with theory (Sinha et al., 29 Aug 2025).

7. Extensions and Connections

PRQ establishes a general and practical template for stabilizing TD algorithms under function approximation, extending to settings with recurrent agent states and partial observability. The methodology encompasses both projected value iteration and sample-based stochastic approximation, and the contraction-based analysis framework provides a clear design guideline: regularize only as much as needed for stability.

In the agent-state literature, periodic regularization interacts fruitfully with trainable embedding schemes (e.g., RNN state updates) and with entropy-regularized or soft-Q policy learning, demonstrating robust convergence even under complex, non-Markov, or highly stochastic exploration patterns (Sinha et al., 29 Aug 2025).

Further generalizations may include adaptive selection of λ\lambda and pp (e.g., via cross-validation) and integration with deep RL architectures. A plausible implication is that periodic regularization provides a computationally efficient and theoretically sound mechanism for stabilizing high-dimensional off-policy RL under function approximation.
