Periodic Regularized Q-Learning (PRQ)

Updated 10 February 2026
  • Periodic Regularized Q-Learning (PRQ) is a reinforcement learning method that introduces regularization at fixed intervals to improve the stability of Q-learning with function approximation.
  • The algorithm alternates between standard TD updates and periodic regularized projections, mitigating divergence and reducing variance in off-policy scenarios.
  • Empirical results and theoretical analysis demonstrate that PRQ effectively balances contraction properties with minimal fixed-point bias, even in complex agent-state and partially observable settings.

Periodic Regularized Q-Learning (PRQ) encompasses a class of reinforcement learning (RL) algorithms designed to stabilize Q-learning under function approximation by incorporating periodic regularization at the projection level. The core motivation is to achieve robust convergence—especially in off-policy or ill-conditioned scenarios—while mitigating the “regularization bias” introduced by continual regularization. This approach balances the competing objectives of contraction-based stability and fixed-point accuracy, and is applicable in both standard Markov Decision Processes (MDPs) and more general agent-state (e.g., recurrent) frameworks (Yang et al., 3 Feb 2026, Sinha et al., 29 Aug 2025).

1. Motivation and Background

Q-learning with linear function approximation is known to diverge in general off-policy settings, especially when the feature covariance matrix $\Phi^\top D \Phi$ is ill-conditioned (i.e., has small eigenvalues) and traditional projected Bellman iteration is used. This effect, part of the “deadly triad,” reflects the lack of contraction properties of the unregularized projected Bellman operator.

Regularization has been previously proposed to “lift” the smallest eigenvalues, thereby ensuring strict contraction of the operator. However, full regularization at every iteration leads to a persistent bias in the fixed-point solution (of order $O(\lambda)$). PRQ introduces regularization only at fixed periodic intervals, preserving stability with minimal increase in fixed-point bias (Yang et al., 3 Feb 2026).

In partially observable or agent-state-based RL (where the Q-value updates are based on a latent or recurrent “agent state” rather than the environment state), an analogous need for regularization exists. “Periodic regularized agent-state-based Q-learning” (RePASQL) applies the PRQ principles to these more general regimes (Sinha et al., 29 Aug 2025).

2. Mathematical Formulation

Given a discounted MDP $(\mathcal S, \mathcal A, P, r, \gamma)$ and a linear feature map $\phi: \mathcal S \times \mathcal A \to \mathbb{R}^d$, approximate Q-functions as $Q(s,a) \approx \phi(s,a)^\top \theta$.

Let $D$ be the diagonal matrix of visitation weights and $\Phi$ the feature matrix. Define:

  • Unregularized projection:

$$\Pi f = \Phi (\Phi^\top D \Phi)^{-1} \Phi^\top D f$$

  • Regularized projection ($\lambda > 0$):

$$\Pi_\lambda f = \Phi (\Phi^\top D \Phi + \lambda I_d)^{-1} \Phi^\top D f$$

  • Bellman operator:

$$(\mathcal{T} Q)(s,a) = \mathbb{E}\big[\, r(s,a) + \gamma \max_{a'} Q(s', a') \,\big|\, s, a \big]$$

The fully regularized projected value iteration is

$$Q_{k+1} = \Pi_\lambda \mathcal{T} Q_k$$

This iteration is a contraction but converges to a fixed point biased by $O(\lambda)$.

The periodic regularized scheme alternates between unregularized and regularized projections with fixed period $p$:

$$G_k = \begin{cases} \Pi_\lambda \mathcal{T}, & k \equiv 0 \pmod p \\ \Pi \mathcal{T}, & \text{otherwise} \end{cases}$$

Idealized batch iteration: $Q_{k+1} = G_k Q_k$.
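The idealized batch scheme can be sketched in a few lines of numpy. This is a toy illustration under stated assumptions, not the paper's code: the MDP, random features, and a linearized (expected) Bellman map are all made up, and $D$ is taken on-policy (the chain's stationary distribution) so that the unregularized projected iteration stays stable.

```python
import numpy as np

# Minimal sketch of the idealized batch iteration Q_{k+1} = G_k Q_k.
rng = np.random.default_rng(0)
n, d, gamma, lam, p = 20, 5, 0.9, 0.1, 10     # |S||A|, feature dim, discount, lambda, period

Phi = rng.standard_normal((n, d))             # feature matrix
P = rng.dirichlet(np.ones(n), size=n)         # toy transition kernel over state-action pairs
r = rng.standard_normal(n)                    # reward vector

# Visitation weights D from the stationary distribution of P (on-policy assumption)
evals, evecs = np.linalg.eig(P.T)
pi = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
pi = pi / pi.sum()
D = np.diag(pi)

C = Phi.T @ D @ Phi
Pi = Phi @ np.linalg.solve(C, Phi.T @ D)                        # unregularized projection
Pi_lam = Phi @ np.linalg.solve(C + lam * np.eye(d), Phi.T @ D)  # regularized projection

def bellman(Q):
    # Linearized (expected) Bellman map; the paper's T takes a max over a'.
    return r + gamma * P @ Q

Q = np.zeros(n)
for k in range(300):
    G = Pi_lam if k % p == 0 else Pi          # regularize only on every p-th step
    Q = G @ bellman(Q)
```

Note the design point the scheme exploits: $\Pi$ is applied on most steps (cheap, unbiased), while the matrix inversion behind $\Pi_\lambda$ is amortized over the period $p$.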

A stochastic approximation version is given by alternating between vanilla temporal difference updates and periodic regularized projections in the parameter space (Yang et al., 3 Feb 2026).

In POMDPs or RL with agent state $z_t \in Z$, PRQ maintains a collection of $L$ periodic Q-tables $\{ Q_t^\ell(z, a) \}$, one for each phase in a cycle of length $L$, with policy regularization (e.g., via entropy). The update employs the convex conjugate of the regularizer, replacing the standard $\max_{a'} Q$ term with the regularized “soft-max” (Sinha et al., 29 Aug 2025).

3. Algorithmic Details

A typical online PRQ algorithm proceeds as follows (Yang et al., 3 Feb 2026):

  1. Observe a sample $(s_t, a_t, r_t, s'_t)$.
  2. Form the TD target: $y_t \leftarrow r_t + \gamma \max_{a'} \phi(s'_t, a')^\top \theta_t$.
  3. Unprojected update: $\bar\theta_{t+1} \leftarrow \theta_t + \alpha_t \phi(s_t, a_t)\,( y_t - \phi(s_t, a_t)^\top \theta_t )$.
  4. Every $p$ steps, perform the regularized projection $\theta_{t+1} \leftarrow (C + \lambda I_d)^{-1} C \bar\theta_{t+1}$, where $C = \Phi^\top D \Phi$; otherwise, set $\theta_{t+1} \leftarrow \bar\theta_{t+1}$.
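The four steps above can be sketched as follows on a synthetic stream. The feature map, uniform toy transitions, and step-size schedule are my assumptions; since $D$ is unknown online, $C = \Phi^\top D \Phi$ is replaced here by a running estimate of the feature covariance from observed samples.

```python
import numpy as np

# Hedged sketch of the online PRQ loop (steps 1-4) on a toy environment.
rng = np.random.default_rng(1)
nS, nA, d, gamma, lam, p = 10, 3, 4, 0.9, 0.05, 50

phi = rng.standard_normal((nS, nA, d)) / np.sqrt(d)  # feature map phi(s, a)
R = rng.standard_normal((nS, nA))                    # toy reward table
theta = np.zeros(d)
C_hat = np.zeros((d, d))                             # running estimate of Phi^T D Phi
s = 0
for t in range(1, 2001):
    a = int(rng.integers(nA))                        # exploratory behavior policy
    s_next = int(rng.integers(nS))                   # toy transition
    # Step 2: TD target with a max over next actions
    y = R[s, a] + gamma * np.max(phi[s_next] @ theta)
    # Step 3: unprojected TD update
    alpha = 1.0 / t ** 0.75
    f = phi[s, a]
    theta = theta + alpha * f * (y - f @ theta)
    C_hat += (np.outer(f, f) - C_hat) / t            # update covariance estimate
    # Step 4: regularized projection every p steps, identity otherwise
    if t % p == 0:
        theta = np.linalg.solve(C_hat + lam * np.eye(d), C_hat @ theta)
    s = s_next
```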

In agent-state RL, at each time $t$, let $\ell = t \bmod L$, and update only $Q_t^\ell(z, a)$. The update is:

$$Q^\ell_{t+1}(z,a) = Q^\ell_t(z,a) + \alpha^\ell_t(z,a) \Big[\, r_t + \gamma\, \Omega^*\big( Q^{(\ell+1)\bmod L}_t(z_{t+1}, \cdot) \big) - Q^\ell_t(z,a) \Big]$$

where $\Omega^*(\cdot)$ is the convex conjugate of the policy regularizer (with entropy regularization yielding the soft-Q update) (Sinha et al., 29 Aug 2025).
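For entropy regularization, the conjugate $\Omega^*$ is the scaled log-sum-exp ("soft-max"), which the cyclic update above plugs in where vanilla Q-learning uses $\max_{a'}$. The sketch below illustrates this; the agent-state space, temperature `tau`, and the random transition stream are illustrative assumptions.

```python
import numpy as np

# Cyclic agent-state Q-learning with an entropy-regularized soft-max target.
rng = np.random.default_rng(2)
nZ, nA, L, gamma, tau = 6, 3, 4, 0.9, 0.5

def omega_star(q, tau):
    # Convex conjugate of (negative) entropy: tau * log sum_a exp(q_a / tau),
    # computed with a max-shift for numerical stability.
    m = q.max()
    return m + tau * np.log(np.exp((q - m) / tau).sum())

Q = np.zeros((L, nZ, nA))        # one Q-table per phase of the cycle
counts = np.ones((L, nZ, nA))    # visit counts for the step sizes alpha
z = 0
for t in range(5000):
    ell = t % L                  # current phase ell = t mod L
    a = int(rng.integers(nA))
    z_next = int(rng.integers(nZ))                   # toy agent-state transition
    r_t = float(rng.standard_normal())
    alpha = 1.0 / counts[ell, z, a]
    target = r_t + gamma * omega_star(Q[(ell + 1) % L, z_next], tau)
    Q[ell, z, a] += alpha * (target - Q[ell, z, a])  # update only the phase-ell table
    counts[ell, z, a] += 1
    z = z_next
```

As $\tau \to 0$ the soft-max approaches the hard $\max_{a'}$, recovering the unregularized target.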

4. Theoretical Properties and Convergence

PRQ achieves geometric convergence toward the optimal projected Q-function up to decomposable noise and bias terms. Under mild assumptions (bounded feature norms and rewards, decaying step-sizes, full-rank covariance), the main finite-time guarantee is:

$$\| Q_T - Q^* \|_D \leq \rho^{T/p}\, \|Q_0 - Q^*\|_D + B_{\mathrm{noise}} + B_{\mathrm{reg}}$$

where:

  • $\rho < 1$ is an explicit contraction parameter,
  • $B_{\mathrm{noise}} = O\left( \sqrt{ \frac{ d \log(1/\delta) }{ T^{1-2\beta} } } \right)$ for batch size $T$,
  • $B_{\mathrm{reg}} = O(\lambda/(1-\gamma))$ is the regularization bias.

The regularized Bellman operator is a contraction for any $\lambda > 0$: for all $Q, Q'$,

$$\| \Pi_\lambda \mathcal{T} Q - \Pi_\lambda \mathcal{T} Q' \|_D \leq \kappa(\lambda)\, \gamma\, \| Q - Q' \|_D,$$

with $\kappa(\lambda) = \| (C + \lambda I)^{-1} C \| \leq 1$ (Yang et al., 3 Feb 2026).
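The bound on $\kappa(\lambda)$ is easy to verify numerically: for symmetric positive semidefinite $C$, the eigenvalues of $(C+\lambda I)^{-1}C$ are $\mu/(\mu+\lambda) < 1$, so $\kappa$ is below 1 and decreases as $\lambda$ grows. A quick check with a random PSD matrix (an assumption standing in for $\Phi^\top D \Phi$):

```python
import numpy as np

# Contraction modulus kappa(lambda) = ||(C + lambda I)^{-1} C||_2.
rng = np.random.default_rng(3)
d = 8
A = rng.standard_normal((d, d))
C = A @ A.T                                   # random symmetric PSD stand-in for Phi^T D Phi

def kappa(C, lam):
    # Spectral norm of (C + lam I)^{-1} C
    return np.linalg.norm(np.linalg.solve(C + lam * np.eye(C.shape[0]), C), 2)
```

This is the lever behind the bias-variance trade-off in Section 5: larger $\lambda$ means a smaller $\kappa$ (faster contraction) but a larger fixed-point bias.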

In the agent-state RL setting, under periodic policy and exploration, the convergence is to the unique fixed point of a cycle-composed $\gamma^L$-contraction operator in each “phase” of the cycle. Under periodic step-sizes and ergodic exploratory policies, almost-sure convergence of the Q-tables to their respective fixed points is guaranteed (Sinha et al., 29 Aug 2025).

5. Sample Complexity and Bias–Variance Trade-offs

To ensure $\| Q_T - Q^* \|_D \leq \varepsilon$, it suffices to select:

  • $\lambda = O((1-\gamma)\varepsilon)$ to control bias,
  • $T = \tilde O(d \varepsilon^{-2})$ to suppress noise,
  • $T/p = O(\log(1/\varepsilon))$ to contract the initial error.

The total sample complexity, up to logarithmic factors, is:

$$T = \tilde O\left( d\, \varepsilon^{-2} + p \log(1/\varepsilon) \right), \qquad \lambda = \Theta\big((1-\gamma)\,\varepsilon\big)$$

For $p = O(1)$, this matches the minimax rates of unregularized Q-learning in the best-case regime.
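The selections above reduce to simple arithmetic. The helper below is a back-of-envelope illustration only: the paper's bounds carry hidden constants and logarithmic factors, which are set to 1 here.

```python
import math

# Map a target accuracy eps to the (lambda, T) schedule, constants set to 1.
def prq_schedule(eps, d, gamma, p):
    lam = (1 - gamma) * eps                         # lambda = Theta((1 - gamma) eps): bias
    T_noise = math.ceil(d / eps ** 2)               # T = O~(d eps^-2): noise
    T_contract = math.ceil(p * math.log(1 / eps))   # T/p = O(log(1/eps)): initial error
    return lam, max(T_noise, T_contract)

lam, T = prq_schedule(eps=0.1, d=10, gamma=0.9, p=50)
```

For moderate accuracies the noise term $d\varepsilon^{-2}$ dominates, so the $p\log(1/\varepsilon)$ contraction term is essentially free.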

Smaller $\lambda$ reduces long-term bias but pushes $\kappa(\lambda) \to 1$, making contraction slower. Larger $p$ reduces computational burden but can slow convergence per full contraction cycle. Empirically, $p \in [10, 100]$ and $\lambda \approx (1-\gamma)\varepsilon$ typically offer a favorable trade-off (Yang et al., 3 Feb 2026).

6. Practical Implementation and Empirical Behavior

Key algorithmic steps for PRQ are summarized as follows:

| Step | Operation | Frequency |
| --- | --- | --- |
| TD update | Stochastic gradient step on the TD error | Every iteration |
| Regularized projection | Apply $(C+\lambda I_d)^{-1} C$ | Every $p$ iterations |
| No projection | Identity (no explicit projection) | All other iterations |

Empirical evaluation on synthetic and benchmark environments demonstrates:

  • PRQ eliminates divergence and large oscillations observed in linear Q-learning under function approximation, particularly in off-policy regimes.
  • The bias–variance trade-off is explicit: increasing $\lambda$ yields more stable (lower-variance) learning but increases fixed-point bias.
  • Computational savings are substantial: periodic regularization (e.g., $p=50$) achieves near-identical accuracy to full regularization ($p=1$) with a fraction of the matrix-inversion overhead (Yang et al., 3 Feb 2026).

In partially observable or agent-state-based tasks, periodic regularized Q-learning achieves almost-sure convergence to the fixed points determined by an induced “regularized pseudo-MDP,” with experimental results aligning closely with theory (Sinha et al., 29 Aug 2025).

7. Extensions and Connections

PRQ establishes a general and practical template for stabilizing TD algorithms under function approximation, extending to settings with recurrent agent states and partial observability. The methodology encompasses both projected value iteration and sample-based stochastic approximation, and the contraction-based analysis framework provides a clear design guideline: regularize only as much as needed for stability.

In the agent-state literature, periodic regularization interacts fruitfully with trainable embedding schemes (e.g., RNN state updates) and with entropy-regularized or soft-Q policy learning, demonstrating robust convergence even under complex, non-Markov, or highly stochastic exploration patterns (Sinha et al., 29 Aug 2025).

Further generalizations may include adaptive selection of λ\lambda and pp (e.g., via cross-validation) and integration with deep RL architectures. A plausible implication is that periodic regularization provides a computationally efficient and theoretically sound mechanism for stabilizing high-dimensional off-policy RL under function approximation.
