Regularized Projected Value Iteration (RP-VI)
- Regularized Projected Value Iteration (RP-VI) is an RL framework that stabilizes the projection step using strongly convex regularization for reliable convergence.
- It ensures contraction and a unique fixed point by integrating convex regularizers, balancing the bias-variance trade-off in linear function approximation.
- RP-VI underpins periodic regularized Q-learning (PRQ), which carries its contraction and regularization techniques over to sample-based, off-policy reinforcement learning with finite-time guarantees.
Regularized Projected Value Iteration (RP-VI) is an algorithmic framework in reinforcement learning (RL) that augments traditional projected value iteration with strongly convex regularization at the level of the projection operator. It is motivated by the instability of classical projected value iteration under linear function approximation and provides rigorous guarantees for contraction, existence of a unique fixed point, and finite-time error bounds. RP-VI is central to the development of periodic regularized Q-learning (PRQ), where its contraction and regularization techniques are carried over to sample-based, off-policy RL with Q-function approximation (Yang et al., 3 Feb 2026).
1. Problem Setting and Formalization
The discounted Markov decision process (MDP) is defined by a finite (or countable) state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a transition kernel $P(s' \mid s, a)$, and a bounded reward function $r$ with $\|r\|_\infty < \infty$. The discount factor is $\gamma \in (0, 1)$. The value function approximates the optimal value $V^*$ solving the Bellman optimality equation. Under linear function approximation, one fixes a feature map $\phi : \mathcal{S} \to \mathbb{R}^d$, compiling state features in a matrix $\Phi \in \mathbb{R}^{|\mathcal{S}| \times d}$, with the approximation $V_\theta = \Phi\theta$ for parameter $\theta \in \mathbb{R}^d$. A state distribution $\mu$ induces a weighted inner product $\langle u, v \rangle_\mu = \sum_s \mu(s)\, u(s)\, v(s)$ and norm $\|v\|_\mu = \langle v, v \rangle_\mu^{1/2}$.
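The objects above can be written out concretely; in this sketch the state count, feature dimension, uniform distribution, and random features are all illustrative assumptions, not quantities from the paper.

```python
import numpy as np

n_states, d = 5, 2
rng = np.random.default_rng(0)
Phi = rng.standard_normal((n_states, d))   # rows are feature vectors phi(s)
mu = np.full(n_states, 1.0 / n_states)     # state distribution mu (uniform here)

def inner_mu(u, v):
    """mu-weighted inner product <u, v>_mu = sum_s mu(s) u(s) v(s)."""
    return float(np.sum(mu * u * v))

def norm_mu(v):
    """mu-weighted 2-norm ||v||_mu."""
    return inner_mu(v, v) ** 0.5

theta = rng.standard_normal(d)
V_theta = Phi @ theta                      # linear approximation V_theta = Phi theta
```

Note that with uniform $\mu$, the constant function $\mathbf{1}$ has $\|\mathbf{1}\|_\mu = 1$, a quick consistency check on the weighting.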
2. Classical Projected Bellman Operator and Regularization
The Bellman optimality operator
$$ (TV)(s) = \max_{a \in \mathcal{A}} \Big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big] $$
is $\gamma$-contractive in the $\ell_\infty$ norm, and also in the $\mu$-weighted 2-norm $\|\cdot\|_\mu$. The classical projected Bellman operator is given by $\Pi_\mu T$, where the projection onto the span of $\Phi$ is
$$ \Pi_\mu V = \Phi\,\theta^\star, \qquad \theta^\star = \arg\min_{\theta \in \mathbb{R}^d} \|\Phi\theta - V\|_\mu^2 . $$
To stabilize this procedure, RP-VI introduces the regularized projection operator $\Pi_\mu^{\lambda, \Omega}$: for a convex regularizer $\Omega$ and penalty $\lambda > 0$,
$$ \Pi_\mu^{\lambda, \Omega} V = \Phi\,\theta^\star, \qquad \theta^\star = \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \|\Phi\theta - V\|_\mu^2 + \lambda\, \Omega(\theta) \Big\}, $$
commonly with $\Omega(\theta) = \|\theta\|_2^2$ (ridge regression).
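For the ridge case, the regularized projection has the standard closed form $\theta^\star = (\Phi^\top D_\mu \Phi + \lambda I)^{-1} \Phi^\top D_\mu V$, where $D_\mu = \mathrm{diag}(\mu)$. A minimal sketch, under that assumption (not the paper's code):

```python
import numpy as np

def regularized_projection(Phi, mu, V, lam):
    """Ridge-regularized projection: minimize ||Phi theta - V||_mu^2 + lam ||theta||_2^2.

    Returns (projected value vector Phi theta*, minimizing parameter theta*).
    """
    d = Phi.shape[1]
    D = np.diag(mu)                         # diagonal weighting D_mu
    A = Phi.T @ D @ Phi + lam * np.eye(d)   # positive definite for lam > 0
    theta = np.linalg.solve(A, Phi.T @ D @ V)
    return Phi @ theta, theta
```

Setting `lam = 0` recovers the unregularized projection $\Pi_\mu$ whenever the feature covariance is full rank; for `lam > 0` the system matrix is always invertible, which is the stabilization RP-VI exploits.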
3. RP-VI Iteration and Pseudocode
RP-VI begins at $V_0 = \Phi\theta_0$ and iterates for $k = 0, 1, 2, \ldots$ through the recursive update
$$ V_{k+1} = \Pi_\mu^{\lambda, \Omega}\, T V_k, $$
or in parameter space,
$$ \theta_{k+1} = \arg\min_{\theta \in \mathbb{R}^d} \Big\{ \|\Phi\theta - T V_k\|_\mu^2 + \lambda\, \Omega(\theta) \Big\}. $$
The canonical pseudocode is:
| Step | Operation | Output |
|---|---|---|
| 1 | Evaluate $V_k = \Phi\theta_k$ | Value approximation |
| 2 | Compute Bellman backup $T V_k$ | Updated value per state |
| 3 | Solve regularized least squares $\theta_{k+1} = \arg\min_\theta \{ \|\Phi\theta - T V_k\|_\mu^2 + \lambda\,\Omega(\theta) \}$ | Next iterate $\theta_{k+1}$ |

Step 3 projects the Bellman update back into the span of $\Phi$ with convex regularization.
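The three steps above can be instantiated as a short loop. The random MDP, the state-aggregation features, and all hyperparameters below are illustrative choices, not the paper's setup; aggregation (one-hot group) features are used because they make the composed update provably contractive, so the loop is guaranteed to settle.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, d, gamma, lam = 6, 2, 3, 0.9, 0.1

P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)           # transition kernel P(s' | s, a)
r = rng.random((nS, nA))                    # bounded rewards in [0, 1)
Phi = np.zeros((nS, d))                     # state-aggregation features
Phi[np.arange(nS), np.arange(nS) % d] = 1.0
mu = np.full(nS, 1.0 / nS)
D = np.diag(mu)

def bellman_backup(V):
    """(TV)(s) = max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    return np.max(r + gamma * (P @ V), axis=1)

# The ridge system matrix is fixed across iterations.
A = Phi.T @ D @ Phi + lam * np.eye(d)
theta = np.zeros(d)
for k in range(200):
    TV = bellman_backup(Phi @ theta)               # Step 2: Bellman backup
    theta = np.linalg.solve(A, Phi.T @ D @ TV)     # Step 3: regularized LS
V_final = Phi @ theta                              # Step 1: evaluate iterate
```

After enough iterations the parameter is (numerically) a fixed point: applying one more backup-and-project step leaves it essentially unchanged.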
4. Contraction Properties and Theoretical Guarantees
A key property is that, under mild conditions, the mapping $\Pi_\mu^{\lambda, \Omega} T$ is a contraction in $\|\cdot\|_\mu$. Specifically, if $\Omega$ is $\rho$-strongly convex and the feature covariance matrix $\Sigma = \Phi^\top D_\mu \Phi$ has smallest eigenvalue $\sigma_{\min} > 0$, then
$$ \big\| \Pi_\mu^{\lambda, \Omega} T V - \Pi_\mu^{\lambda, \Omega} T V' \big\|_\mu \le \kappa\, \|V - V'\|_\mu, \qquad \kappa = \frac{\gamma\, \sigma_{\max}}{\sigma_{\max} + \lambda\rho/2} < \gamma, $$
where $\sigma_{\max}$ is the largest eigenvalue of $\Sigma$. When $\Omega(\theta) = \|\theta\|_2^2$ ($\rho = 2$), this specializes to $\kappa = \gamma\,\sigma_{\max} / (\sigma_{\max} + \lambda)$. These properties hold under:
- Full-rank feature covariance ($\sigma_{\min}(\Sigma) > 0$)
- Bounded features ($\sup_s \|\phi(s)\|_2 < \infty$)
- $\Omega$ convex and $\rho$-strongly convex
- Penalty $\lambda > 0$, ensuring $\kappa < \gamma < 1$
This strong contraction leads to linear convergence and well-posedness of every projection step.
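The ridge contraction factor can be checked numerically: for $\Omega(\theta) = \|\theta\|_2^2$, the $\mu$-norm of the regularized projection equals the spectral norm of the symmetrized operator $D_\mu^{1/2} \Phi (\Sigma + \lambda I)^{-1} \Phi^\top D_\mu^{1/2}$, which is exactly $\sigma_{\max}/(\sigma_{\max} + \lambda)$. The sizes and seed below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, d, gamma, lam = 8, 3, 0.95, 0.2

Phi = rng.standard_normal((nS, d))
mu = np.full(nS, 1.0 / nS)
Sigma = Phi.T @ np.diag(mu) @ Phi                  # feature covariance
sigma_max = float(np.linalg.eigvalsh(Sigma)[-1])   # largest eigenvalue

# mu-norm of the ridge projection = spectral norm of the symmetrized operator.
Dsqrt = np.diag(np.sqrt(mu))
M = Dsqrt @ Phi @ np.linalg.solve(Sigma + lam * np.eye(d), Phi.T) @ Dsqrt
proj_norm = float(np.linalg.norm(M, 2))

kappa = gamma * sigma_max / (sigma_max + lam)      # composed contraction factor
```

Since the projection norm is strictly below 1 for any $\lambda > 0$, the composed factor $\kappa$ sits strictly below $\gamma$, which is the quantitative gain over unregularized projected value iteration.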
5. Finite-Time Error Analysis
Let $\bar V_\lambda$ be the unique fixed point of $\Pi_\mu^{\lambda, \Omega} T$. Then for all $k \ge 0$,
$$ \|V_k - \bar V_\lambda\|_\mu \le \kappa^k\, \|V_0 - \bar V_\lambda\|_\mu, $$
and $\bar V_\lambda$ itself is an $\varepsilon(\lambda)$-approximation of $\bar V$, the fixed point of the unregularized operator $\Pi_\mu T$, with $\varepsilon(\lambda) = \|\bar V_\lambda - \bar V\|_\mu \to 0$ as $\lambda \to 0$. Together, the finite-time error bound is
$$ \|V_k - \bar V\|_\mu \le \kappa^k\, \|V_0 - \bar V_\lambda\|_\mu + \varepsilon(\lambda). $$
For $\Omega(\theta) = \|\theta\|_2^2$, this yields a regularization bias of order
$$ \varepsilon(\lambda) = O\!\Big( \frac{\lambda}{\sigma_{\min} + \lambda} \Big), $$
where $\Pi_\mu$ projects without regularization. The error decays geometrically in $k$, with bias vanishing as $\lambda \to 0$.
6. Relation to Sample-Based Algorithms and PRQ
The RP-VI framework presumes access to exact Bellman backups and projections. Periodic Regularized Q-Learning (PRQ) extends RP-VI to off-policy, sample-based Q-learning with parameterized Q-functions $Q_\theta$ by:
- Periodically collecting a batch of transitions $\{(s_i, a_i, r_i, s_i')\}_{i=1}^{N}$.
- Constructing empirical targets $y_i = r_i + \gamma \max_{a'} Q_{\theta^-}(s_i', a')$ using a target network $\theta^-$ held fixed across the batch.
- Solving the finite-sample regularized regression
$$ \theta_{k+1} = \arg\min_{\theta} \Big\{ \frac{1}{N} \sum_{i=1}^{N} \big( Q_\theta(s_i, a_i) - y_i \big)^2 + \lambda\, \Omega(\theta) \Big\}. $$
- Updating the target parameter $\theta^-$ every fixed number of steps (the "periodic" aspect).
This process replicates RP-VI's regularized projection step but uses empirical Bellman targets and sample-based regression. Under standard mixing and sample-size conditions, finite-time high-probability guarantees extend RP-VI's error bounds to PRQ.
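One PRQ outer step can be sketched as follows, under illustrative assumptions: linear Q-features $\phi(s, a)$ stored as a lookup table, a held-fixed target parameter, and a synthetic batch. All names and sizes are hypothetical, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, d, gamma, lam, N = 5, 3, 4, 0.9, 0.1, 256

feat = rng.standard_normal((nS, nA, d))   # phi(s, a) lookup table

def prq_step(theta_tgt, batch):
    """Solve one batch ridge regression against fixed empirical targets."""
    states, actions, rewards, next_states = batch
    X = feat[states, actions]                                    # (N, d) design
    # Targets y_i = r_i + gamma * max_a' phi(s'_i, a')^T theta_tgt
    y = rewards + gamma * np.max(feat[next_states] @ theta_tgt, axis=1)
    A = X.T @ X / N + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y / N)

# One synthetic batch; in PRQ the target parameter theta_tgt is refreshed
# only after each such batch solve (the "periodic" aspect).
states = rng.integers(0, nS, N)
actions = rng.integers(0, nA, N)
rewards = rng.random(N)
next_states = rng.integers(0, nS, N)
theta = prq_step(np.zeros(d), (states, actions, rewards, next_states))
```

Holding `theta_tgt` fixed across the batch is what makes each outer step a (noisy) instance of the RP-VI regularized regression rather than a moving-target update.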
7. Implications and Bias-Variance Control
RP-VI systematically replaces the non-regularized projection step with a $\rho$-strongly convex regularized projection, ensuring strict contraction and uniqueness of the fixed point $\bar V_\lambda$. The parameter $\lambda$ directly mediates the bias–variance trade-off: increasing $\lambda$ favors stability and contraction at the expense of introducing bias, while as $\lambda$ decreases the bias vanishes but the contraction factor approaches $\gamma$. A plausible implication is that RP-VI enables practitioners to select $\lambda$ to attain desired rates of convergence and regularization-induced smoothing, particularly in high-dimensional or ill-conditioned feature regimes (Yang et al., 3 Feb 2026).