
Regularized Projected Value Iteration (RP-VI)

Updated 10 February 2026
  • Regularized Projected Value Iteration (RP-VI) is an RL framework that stabilizes the projection step using strongly convex regularization for reliable convergence.
  • It ensures contraction and a unique fixed point by integrating convex regularizers, balancing the bias-variance trade-off in linear function approximation.
  • RP-VI underpins periodic regularized Q-learning by extending its robust, sample-based techniques to off-policy reinforcement learning with finite-time guarantees.

Regularized Projected Value Iteration (RP-VI) is an algorithmic framework in reinforcement learning (RL) that augments traditional projected value iteration with strongly convex regularization at the level of the projection operator. It is motivated by the instability of classical projected value iteration under linear function approximation and provides rigorous guarantees for contraction, existence of a unique fixed point, and finite-time error bounds. RP-VI is central to the development of periodic regularized Q-learning (PRQ), where its contraction and regularization techniques are carried over to sample-based, off-policy RL with Q-function approximation (Yang et al., 3 Feb 2026).

1. Problem Setting and Formalization

The discounted Markov decision process (MDP) is defined by a finite (or countable) state space $S$, a finite action space $A$, a transition kernel $P(s'|s,a)$, and a bounded reward function $r: S \times A \to \mathbb{R}$ with $|r(s,a)| \leq R_{\max}$. The discount factor is $\gamma \in [0,1)$. The value function $v: S \to \mathbb{R}$ approximates the optimal value $v^*$ solving the Bellman optimality equation. Under linear function approximation, one fixes a feature map $\varphi: S \to \mathbb{R}^d$, compiling state features into a matrix $\Phi \in \mathbb{R}^{|S| \times d}$, with the approximation $v \approx \Phi w$ for parameter $w \in \mathbb{R}^d$. A state distribution $\mu$ induces a weighted inner product $\langle v, u \rangle_\mu = \sum_{s \in S} \mu(s)\, v(s)\, u(s)$ and norm $\|v\|_\mu = \sqrt{\langle v, v \rangle_\mu}$.

2. Classical Projected Bellman Operator and Regularization

The Bellman operator

$$(T v)(s) = \max_{a \in A} \left[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}[v(s')] \right]$$

is $\gamma$-contractive in the $\|\cdot\|_\infty$ norm, though not in general in the $\mu$-weighted 2-norm, which is the root of the instability of classical projected value iteration. The classical projected Bellman operator is given by $\Pi \circ T$, where the projection $\Pi$ onto the span of the features is

$$\Pi v = \Phi w_*, \qquad w_* = \arg\min_{w \in \mathbb{R}^d} \|\Phi w - v\|_\mu^2.$$

To stabilize this procedure, RP-VI introduces the regularized projection operator $\Pi_\lambda$:

$$\Pi_\lambda v = \Phi w_\lambda, \qquad w_\lambda = \arg\min_{w \in \mathbb{R}^d} \left\{ \|\Phi w - v\|_\mu^2 + \lambda R(w) \right\},$$

for a convex regularizer $R(w)$ and penalty $\lambda > 0$, commonly $R(w) = \|w\|_2^2$ (ridge regression).
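For ridge regularization the regularized projection has a closed form via the normal equations $(\Phi^\top D_\mu \Phi + \lambda I)\, w_\lambda = \Phi^\top D_\mu v$. A minimal NumPy sketch (the function name and tabular test setup are illustrative, not from the paper):

```python
import numpy as np

def regularized_projection(v, Phi, mu, lam):
    """Ridge-regularized weighted projection: Pi_lambda v = Phi w_lambda.

    Solves  min_w ||Phi w - v||_mu^2 + lam * ||w||_2^2,
    whose normal equations are (Phi^T D_mu Phi + lam I) w = Phi^T D_mu v.
    """
    D = np.diag(mu)
    C = Phi.T @ D @ Phi  # feature covariance under mu
    w = np.linalg.solve(C + lam * np.eye(Phi.shape[1]), Phi.T @ D @ v)
    return Phi @ w, w
```

With $\lambda = 0$ and full-rank features this reduces to the ordinary $\mu$-weighted projection; for $\lambda > 0$ the solution is shrunk toward the origin, which is exactly the effect the contraction analysis below exploits.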

3. RP-VI Iteration and Pseudocode

RP-VI begins at $v_0 = \Phi w_0$ and iterates for $k = 0, 1, 2, \ldots$ through the recursive update

$$v_{k+1} = \Pi_\lambda (T v_k),$$

or, in parameter space,

$$w_{k+1} = \arg\min_{w} \left\{ \|\Phi w - T(\Phi w_k)\|_\mu^2 + \lambda R(w) \right\}.$$

The canonical pseudocode is:

1. Evaluate $v_k \leftarrow \Phi w_k$ (current value approximation).
2. Compute the Bellman backup $T v_k$ (updated value per state).
3. Solve the regularized least squares problem $w_{k+1} \leftarrow \arg\min_{w} \left\{ \sum_{s} \mu(s) \left[ \varphi(s)^\top w - (T v_k)(s) \right]^2 + \lambda R(w) \right\}$ (next iterate).

Step 3 projects the Bellman update back into the span of Φ\Phi with convex regularization.
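The loop can be sketched end to end on a small finite MDP. The following is an illustrative NumPy implementation (not the paper's code), assuming exact access to the transition tensor $P$ and reward matrix $r$:

```python
import numpy as np

def rp_vi(P, r, Phi, mu, gamma, lam, iters=200):
    """RP-VI on a finite MDP (illustrative sketch).

    P: (A, S, S) transition tensor, r: (S, A) reward matrix,
    Phi: (S, d) feature matrix, mu: (S,) state weights, lam: ridge penalty.
    """
    S, d = Phi.shape
    D = np.diag(mu)
    # Parameter-space form of Pi_lambda: w = (C + lam I)^{-1} Phi^T D v
    M = np.linalg.solve(Phi.T @ D @ Phi + lam * np.eye(d), Phi.T @ D)
    w = np.zeros(d)
    for _ in range(iters):
        v = Phi @ w
        Tv = (r + gamma * (P @ v).T).max(axis=1)  # exact Bellman backup (step 2)
        w = M @ Tv                                # regularized projection (step 3)
    return Phi @ w
```

As a sanity check, with tabular features ($\Phi = I$) and $\lambda = 0$ the update coincides with ordinary value iteration.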

4. Contraction Properties and Theoretical Guarantees

A key property is that under mild conditions, the mapping $v \mapsto \Pi_\lambda T v$ is a contraction in $\|\cdot\|_\mu$. Specifically, if $R$ is $\sigma_R$-strongly convex and the feature covariance matrix $C = \Phi^\top D_\mu \Phi$ has smallest eigenvalue $\alpha > 0$,

$$\|\Pi_\lambda T v - \Pi_\lambda T u\|_\mu \leq \rho\, \|v - u\|_\mu, \qquad \rho = \frac{\gamma}{1 + \lambda \sigma_R \alpha} < 1.$$

When $R(w) = \|w\|_2^2$ (so $\sigma_R = 2$), this specializes to $\rho = \gamma / (1 + 2\lambda\alpha)$. These properties hold under:

  • Full-rank feature covariance ($\lambda_{\min}(C) > 0$)
  • Bounded features ($\|\varphi(s)\|_2 \leq B$)
  • A strongly convex regularizer $R$
  • $\lambda > 0$ ensuring $\rho < 1$

This strong contraction leads to linear convergence and well-posedness of every projection step.
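The contraction modulus can be checked numerically on a toy tabular MDP. The sketch below (illustrative values, not from the paper) builds $\Pi_\lambda$ as a matrix and verifies that the composite map contracts at least as fast as $\rho$:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d = 6, 3, 6
gamma, lam = 0.9, 0.5

# Random toy MDP with tabular features (Phi = I); values are illustrative only.
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
Phi = np.eye(S)
mu = np.full(S, 1.0 / S)
D = np.diag(mu)
C = Phi.T @ D @ Phi
alpha = np.linalg.eigvalsh(C).min()
rho = gamma / (1 + 2 * lam * alpha)  # modulus for R(w) = ||w||_2^2

# Pi_lambda as an explicit matrix acting on value vectors.
proj = Phi @ np.linalg.inv(C + lam * np.eye(d)) @ Phi.T @ D

def T(v):
    return (r + gamma * (P @ v).T).max(axis=1)  # Bellman optimality backup

def mu_norm(v):
    return np.sqrt(v @ D @ v)

# Empirically, the contraction ratio never exceeds the theoretical modulus rho.
for _ in range(100):
    v, u = rng.standard_normal(S), rng.standard_normal(S)
    assert mu_norm(proj @ T(v) - proj @ T(u)) <= rho * mu_norm(v - u) + 1e-9
```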

5. Finite-Time Error Analysis

Let $v_\lambda$ be the unique fixed point of $v = \Pi_\lambda T v$. Then for all $k \geq 0$,

$$\|v_k - v_\lambda\|_\mu \leq \rho^k \|v_0 - v_\lambda\|_\mu,$$

and $v_\lambda$ itself is an $\epsilon$-approximation of $v^*$ with $\epsilon = O(\lambda)$. Together, the finite-time error bound is

$$\|v_k - v^*\|_\mu \leq \rho^k \|v_0 - v^*\|_\mu + \frac{1}{1 - \rho} \|v_\lambda - v^*\|_\mu.$$

For $R(w) = \|w\|_2^2$, this yields

$$\|v_\lambda - v^*\|_\mu \leq \frac{\gamma}{1 - \gamma} \cdot \frac{\lambda \|w^*\|_2}{\alpha},$$

where $w^*$ is the parameter of the unregularized projection of $v^*$. The error decays geometrically in $k$, with the $O(\lambda)$ bias vanishing as $\lambda \to 0$.
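One way to see how the two error sources combine is a triangle-inequality argument: splitting the error at the regularized fixed point gives

$$\|v_k - v^*\|_\mu \leq \|v_k - v_\lambda\|_\mu + \|v_\lambda - v^*\|_\mu \leq \rho^k \|v_0 - v_\lambda\|_\mu + \|v_\lambda - v^*\|_\mu,$$

and bounding $\|v_0 - v_\lambda\|_\mu \leq \|v_0 - v^*\|_\mu + \|v^* - v_\lambda\|_\mu$ absorbs the extra bias term into the $\tfrac{1}{1-\rho}$ prefactor.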

6. Relation to Sample-Based Algorithms and PRQ

The RP-VI framework presumes access to exact Bellman backups and projections. Periodic Regularized Q-Learning (PRQ) extends RP-VI to off-policy, sample-based learning with linear Q-function approximation by:

  • Periodically collecting a batch of $N$ transitions $\{(s_i, a_i, r_i, s'_i)\}$.
  • Constructing empirical targets $y_i = r_i + \gamma \max_{b \in A} \varphi(s'_i, b)^\top w_k$ using a target network held fixed across the batch.
  • Solving the finite-sample regularized regression

$$\hat{w}_{k+1} = \arg\min_{w} \left\{ \frac{1}{N} \sum_{i=1}^{N} \left[ \varphi(s_i, a_i)^\top w - y_i \right]^2 + \lambda R(w) \right\}.$$

  • Updating the target parameter $w_k$ every $N$ steps (the "periodic" aspect).

This process replicates RP-VI's $\Pi_\lambda T$ but uses empirical Bellman targets and sample-based regression. Under standard mixing and sample-size conditions, finite-time high-probability guarantees extend RP-VI's error bounds to PRQ.
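One PRQ parameter update can be sketched as follows. This is an illustrative implementation for ridge $R(w) = \|w\|_2^2$; `prq_update`, the feature function `phi`, and the batch format are assumptions, not the paper's interface:

```python
import numpy as np

def prq_update(batch, phi, w_target, gamma, lam, actions):
    """One periodic regularized Q-learning (PRQ) parameter update (sketch).

    batch: list of (s, a, r, s_next) transitions; phi(s, a) -> feature vector;
    w_target: parameters held fixed across the batch (the target network).
    """
    X = np.stack([phi(s, a) for (s, a, _, _) in batch])            # (N, d)
    y = np.array([r + gamma * max(phi(s2, b) @ w_target for b in actions)
                  for (_, _, r, s2) in batch])                     # empirical targets
    N, d = X.shape
    # Ridge solution of (1/N) sum_i (phi_i^T w - y_i)^2 + lam ||w||_2^2.
    return np.linalg.solve(X.T @ X / N + lam * np.eye(d), X.T @ y / N)
```

The target parameter `w_target` would then be replaced by the returned solution once every $N$ environment steps, mirroring the periodic schedule described above.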

7. Implications and Bias-Variance Control

RP-VI systematically replaces the unregularized projection step with a projection under $\lambda$-strongly convex regularization, ensuring strict contraction $\rho < \gamma$ and uniqueness of $v_\lambda$. The parameter $\lambda$ directly mediates the bias–variance trade-off: increasing $\lambda$ favors stability and contraction at the expense of introducing $O(\lambda)$ bias, while as $\lambda$ decreases the bias vanishes but the contraction factor approaches $\gamma$. A plausible implication is that RP-VI enables practitioners to select $\lambda$ to attain desired rates of convergence and regularization-induced smoothing, particularly in high-dimensional or ill-conditioned feature regimes (Yang et al., 3 Feb 2026).
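As a quick illustration of the trade-off (toy numbers, not from the paper), one can tabulate the ridge modulus $\rho = \gamma/(1 + 2\lambda\alpha)$ against the bias bound from Section 5 across several values of $\lambda$:

```python
import numpy as np

# Toy lambda sweep: contraction modulus vs. fixed-point bias bound for ridge.
# gamma, alpha, and ||w*|| are illustrative values.
gamma, alpha, w_star_norm = 0.9, 0.2, 5.0
lams = np.array([1e-3, 1e-2, 1e-1, 1.0])
rhos = gamma / (1 + 2 * lams * alpha)                     # smaller is faster
biases = gamma / (1 - gamma) * lams * w_star_norm / alpha  # O(lambda) bias bound
for lam, rho, bias in zip(lams, rhos, biases):
    print(f"lam={lam:g}  rho={rho:.3f}  bias_bound={bias:.3f}")
```

Larger $\lambda$ strictly shrinks $\rho$ (faster contraction) while the bias bound grows linearly, so the sweep makes the selection problem concrete.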

References (1)
