Periodic Regularized Q-Learning (PRQ)
- Periodic Regularized Q-Learning (PRQ) is a reinforcement learning method that introduces regularization at fixed intervals to improve the stability of Q-learning with function approximation.
- The algorithm alternates between standard TD updates and periodic regularized projections, mitigating divergence and reducing variance in off-policy scenarios.
- Empirical results and theoretical analysis demonstrate that PRQ effectively balances contraction properties with minimal fixed-point bias, even in complex agent-state and partially observable settings.
Periodic Regularized Q-Learning (PRQ) encompasses a class of reinforcement learning (RL) algorithms designed to stabilize Q-learning under function approximation by incorporating periodic regularization at the projection level. The core motivation is to achieve robust convergence, especially in off-policy or ill-conditioned scenarios, while mitigating the “regularization bias” introduced by continual regularization. This approach balances the competing objectives of contraction-based stability and fixed-point accuracy, and is applicable in both standard Markov Decision Processes (MDPs) and more general agent-state (e.g., recurrent) frameworks (Yang et al., 3 Feb 2026; Sinha et al., 29 Aug 2025).
1. Motivation and Background
Q-learning with linear function approximation is known to diverge in general off-policy settings, especially when the feature covariance matrix is ill-conditioned (i.e., has small eigenvalues) and traditional projected Bellman iteration is used. This failure mode is an instance of the “deadly triad” (function approximation, bootstrapping, and off-policy learning) and stems from the lack of contraction of the unregularized projected Bellman operator.
Regularization has been previously proposed to “lift” the smallest eigenvalues, thereby ensuring strict contraction of the operator. However, full regularization at every iteration leads to a persistent bias in the fixed-point solution, of an order set by the regularization parameter. PRQ introduces regularization only at fixed periodic intervals, preserving stability with minimal increase in fixed-point bias (Yang et al., 3 Feb 2026).
In partially observable or agent-state-based RL (where the Q-value updates are based on a latent or recurrent “agent state” rather than the environment state), an analogous need for regularization exists. “Periodic regularized agent-state-based Q-learning” (RePASQL) applies the PRQ principles to these more general regimes (Sinha et al., 29 Aug 2025).
2. Mathematical Formulation
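Given a discounted MDP with state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $P$, reward $r$, and discount factor $\gamma \in (0,1)$, and a linear feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$, approximate Q-functions as $Q_\theta(s,a) = \phi(s,a)^\top \theta$.
Let $D$ be the diagonal matrix of visitation weights and $\Phi$ the feature matrix whose rows are $\phi(s,a)^\top$. Define:
- Unregularized projection: $\Pi = \Phi(\Phi^\top D \Phi)^{-1}\Phi^\top D$
- Regularized projection ($\eta > 0$): $\Pi_\eta = \Phi(\Phi^\top D \Phi + \eta I)^{-1}\Phi^\top D$
- Bellman operator: $(TQ)(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')$
The regularized projected Bellman operator $\Pi_\eta T$ is a contraction, but its fixed point is biased by an amount that depends on $\eta$.
The periodic regularized scheme alternates between unregularized and regularized projections with a fixed period.
In the idealized batch iteration, the regularized projection $\Pi_\eta$ is applied once per period and the unregularized projection $\Pi$ on the remaining steps.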
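The two projections can be compared numerically. The following is a minimal NumPy sketch, not the authors' code: the helper name `projection`, the symbols `Phi`, `d`, and `eta`, and the random, deliberately near-collinear features are all illustrative assumptions. It shows how adding `eta * I` lifts the smallest eigenvalue of the feature covariance and how the regularized projection shrinks the result relative to the unregularized one.

```python
import numpy as np

def projection(Phi, d, eta=0.0):
    """Return Phi (Phi^T D Phi + eta I)^{-1} Phi^T D with D = diag(d).

    eta = 0 gives the unregularized projection (assumes full column rank).
    """
    D = np.diag(d)
    gram = Phi.T @ D @ Phi                       # weighted feature covariance
    reg = gram + eta * np.eye(Phi.shape[1])      # ridge term lifts every eigenvalue by eta
    return Phi @ np.linalg.solve(reg, Phi.T @ D)

# Hypothetical toy problem: 12 state-action pairs, 3 features.
rng = np.random.default_rng(0)
n, k = 12, 3
Phi = rng.normal(size=(n, k))
Phi[:, 2] = Phi[:, 1] + 1e-3 * rng.normal(size=n)   # near-collinear columns: ill-conditioned
d = rng.dirichlet(np.ones(n))                        # visitation weights, positive, sum to 1

eta = 0.1
gram = Phi.T @ np.diag(d) @ Phi
min_eig = np.linalg.eigvalsh(gram).min()
min_eig_reg = np.linalg.eigvalsh(gram + eta * np.eye(k)).min()

Pi = projection(Phi, d)            # unregularized
Pi_eta = projection(Phi, d, eta)   # regularized

q = rng.normal(size=n)             # an arbitrary Q-vector to project
gap = np.linalg.norm(Pi @ q - Pi_eta @ q)
print("smallest eigenvalue:", min_eig, "->", min_eig_reg)
print("distance between the two projections of q:", gap)
```

The distance printed on the last line is the price paid for the better conditioning: it is zero only when `eta` is zero, which is the bias that the periodic scheme tries to limit.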
A stochastic approximation version is given by alternating between vanilla temporal difference updates and periodic regularized projections in the parameter space (Yang et al., 3 Feb 2026).
In POMDPs or RL with an agent state, PRQ maintains a collection of periodic Q-tables, one for each phase in a cycle of fixed length, with policy regularization (e.g., via entropy). The update employs the convex conjugate of the regularizer, replacing the standard max term with the regularized “soft-max” (Sinha et al., 29 Aug 2025).
3. Algorithmic Details
A typical online PRQ algorithm proceeds as follows (Yang et al., 3 Feb 2026):
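- Observe a transition sample (state, action, reward, next state)
- Form the TD target: the reward plus the discounted maximum of the current Q-estimate over actions at the next state
- Unprojected update: move the parameter vector along the feature vector, scaled by the step size and the TD error
- Once every period, perform a “regularized projection” of the updated parameters; otherwise, keep the unprojected iterate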
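The steps above can be sketched as a short loop. This is a hedged illustration under stated assumptions rather than the algorithm as published: the function name `prq_linear`, the uniform behaviour policy, the constant step size, the running covariance estimate `cov`, and in particular the parameter-space form of the regularized projection, `theta <- (cov + eta I)^{-1} cov theta`, are choices made here for the example (the last one is what the ridge-regularized projection of a function already in the feature span reduces to).

```python
import numpy as np

def prq_linear(P, R, Phi, gamma=0.9, eta=0.05, period=10,
               steps=2000, alpha=0.05, seed=0):
    """Sketch of online PRQ with linear features on a small tabular MDP.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    Phi: (S, A, k) feature vectors. All names are illustrative.
    """
    rng = np.random.default_rng(seed)
    S, A, k = Phi.shape
    theta = np.zeros(k)
    cov = np.zeros((k, k))                     # running estimate of Phi^T D Phi
    s = rng.integers(S)
    for t in range(1, steps + 1):
        a = rng.integers(A)                    # uniform behaviour policy (off-policy data)
        s_next = rng.choice(S, p=P[s, a])      # observe the transition sample
        phi = Phi[s, a]
        cov += (np.outer(phi, phi) - cov) / t  # incremental average of phi phi^T
        # TD target: reward plus discounted max over next-state actions
        target = R[s, a] + gamma * np.max(Phi[s_next] @ theta)
        # Unprojected semi-gradient TD update
        theta = theta + alpha * (target - phi @ theta) * phi
        if t % period == 0:
            # Assumed parameter-space regularized projection (ridge shrinkage)
            theta = np.linalg.solve(cov + eta * np.eye(k), cov @ theta)
        s = s_next
    return theta

# Hypothetical random MDP with unit-norm random features.
rng = np.random.default_rng(1)
S, A, k = 5, 2, 3
P = rng.dirichlet(np.ones(S), size=(S, A))     # shape (S, A, S), rows sum to 1
R = rng.uniform(size=(S, A))
Phi = rng.normal(size=(S, A, k))
Phi /= np.linalg.norm(Phi, axis=-1, keepdims=True)
theta = prq_linear(P, R, Phi)
print("learned parameters:", theta)
```

Setting `period=1` in this sketch regularizes at every step, while a very large `period` approaches plain linear Q-learning, so the same function can be used to compare the regimes discussed in this article.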
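In agent-state RL, at each time step the active phase is the time index modulo the cycle length, and only the Q-table for that phase is updated. The update is a TD step whose bootstrap term applies $\Omega^\star$ to the Q-values at the next agent state,
where $\Omega^\star$ is the convex conjugate of the policy regularizer (with entropy regularization yielding the soft-Q update) (Sinha et al., 29 Aug 2025).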
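For the entropy-regularized case, the conjugate is the temperature-scaled log-sum-exp, which gives the following minimal sketch of one periodic update. The function names, the array layout `Q[phase, agent_state, action]`, the temperature `tau`, and the choice to bootstrap from the next phase's table are assumptions made for illustration; the cited paper should be consulted for the exact update.

```python
import numpy as np

def soft_max(q_row, tau):
    """tau * log(sum(exp(q / tau))): conjugate of the tau-scaled negative entropy."""
    m = np.max(q_row)                      # subtract the max for numerical stability
    return m + tau * np.log(np.sum(np.exp((q_row - m) / tau)))

def periodic_soft_q_step(Q, t, z, a, r, z_next, alpha=0.1, gamma=0.9, tau=0.5):
    """Update, in place, only the table of the active phase t mod L.

    Q has shape (L, Z, A): one table per phase, indexed by agent state and action.
    Bootstrapping from the table of phase (t + 1) mod L is an assumption here.
    """
    L = Q.shape[0]
    phase, next_phase = t % L, (t + 1) % L
    target = r + gamma * soft_max(Q[next_phase, z_next], tau)
    Q[phase, z, a] += alpha * (target - Q[phase, z, a])
    return Q

# Hypothetical sizes: cycle length 2, 3 agent states, 2 actions.
Q = np.zeros((2, 3, 2))
periodic_soft_q_step(Q, t=0, z=1, a=0, r=1.0, z_next=2)
print(Q[0, 1, 0])   # only this entry of the phase-0 table has changed
```

As `tau` shrinks, `soft_max` approaches the ordinary maximum, so the unregularized periodic update is recovered as a limiting case of this sketch.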
4. Theoretical Properties and Convergence
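PRQ achieves geometric convergence toward the optimal projected Q-function up to decomposable noise and bias terms. Under mild assumptions (bounded feature norms and rewards, decaying step-sizes, full-rank covariance), the main finite-time guarantee bounds the error by a combination of three terms:
- a geometrically decaying term governed by an explicit contraction parameter,
- a sampling-noise term that depends on the batch size,
- the regularization bias.
The regularized Bellman operator is a contraction for any admissible regularization parameter: applied to any two Q-functions, it shrinks the distance between them by a factor strictly less than one (Yang et al., 3 Feb 2026).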
In the agent-state RL setting, under periodic policy and exploration, the convergence is to the unique fixed point of a cycle-composed -contraction operator in each “phase” of the cycle. Under periodic step-sizes and ergodic exploratory policies, almost-sure convergence of the Q-tables to their respective fixed points is guaranteed (Sinha et al., 29 Aug 2025).
5. Sample Complexity and Bias–Variance Trade-offs
To reach a target accuracy, it suffices to select:
- the regularization parameter small enough to control bias,
- the batch size large enough to suppress noise,
- the number of iterations large enough to contract the initial error.
The total sample complexity is stated up to logarithmic factors.
In the best-case regime, it matches the minimax rates of unregularized Q-learning.
A smaller regularization parameter reduces long-term bias but pushes the contraction factor toward one, making contraction slower. A larger period reduces computational burden but can slow convergence per full contraction cycle. Empirically, particular settings of these two parameters typically offer a favorable trade-off (Yang et al., 3 Feb 2026).
6. Practical Implementation and Empirical Behavior
Key algorithmic steps for PRQ are summarized as follows:
| Step | Operation | Frequency |
|---|---|---|
| TD Update | Stochastic gradient on TD error | Every iteration |
| Projection | Regularized projection of the parameters | Once every period |
| Unprojection | No explicit projection (identity) | All other iterations |
Empirical evaluation on synthetic and benchmark environments demonstrates:
- PRQ eliminates divergence and large oscillations observed in linear Q-learning under function approximation, particularly in off-policy regimes.
- The bias–variance trade-off is explicit: increasing the regularization strength yields more stable (lower-variance) learning but increases fixed-point bias.
- Computational savings are substantial: periodic regularization achieves near-identical accuracy to full regularization (applied at every iteration) with a fraction of the matrix-inversion overhead (Yang et al., 3 Feb 2026).
In partially observable or agent-state-based tasks, periodic regularized Q-learning achieves almost-sure convergence to the fixed points determined by an induced “regularized pseudo-MDP,” with experimental results aligning closely with theory (Sinha et al., 29 Aug 2025).
7. Extensions and Connections
PRQ establishes a general and practical template for stabilizing TD algorithms under function approximation, extending to settings with recurrent agent states and partial observability. The methodology encompasses both projected value iteration and sample-based stochastic approximation, and the contraction-based analysis framework provides a clear design guideline: regularize only as much as needed for stability.
In the agent-state literature, periodic regularization interacts fruitfully with trainable embedding schemes (e.g., RNN state updates) and with entropy-regularized or soft-Q policy learning, demonstrating robust convergence even under complex, non-Markov, or highly stochastic exploration patterns (Sinha et al., 29 Aug 2025).
Further generalizations may include adaptive selection of the regularization strength and the period (e.g., via cross-validation) and integration with deep RL architectures. A plausible implication is that periodic regularization provides a computationally efficient and theoretically sound mechanism for stabilizing high-dimensional off-policy RL under function approximation.