
Regularized Projected Value Iteration (RP-VI)

Updated 10 February 2026
  • Regularized Projected Value Iteration (RP-VI) is an RL framework that stabilizes the projection step using strongly convex regularization for reliable convergence.
  • It ensures contraction and a unique fixed point by integrating convex regularizers, balancing the bias-variance trade-off in linear function approximation.
  • RP-VI underpins periodic regularized Q-learning by extending its robust, sample-based techniques to off-policy reinforcement learning with finite-time guarantees.

Regularized Projected Value Iteration (RP-VI) is an algorithmic framework in reinforcement learning (RL) that augments traditional projected value iteration with strongly convex regularization at the level of the projection operator. It is motivated by the instability of classical projected value iteration under linear function approximation and provides rigorous guarantees for contraction, existence of a unique fixed point, and finite-time error bounds. RP-VI is central to the development of periodic regularized Q-learning (PRQ), where its contraction and regularization techniques are carried over to sample-based, off-policy RL with Q-function approximation (Yang et al., 3 Feb 2026).

1. Problem Setting and Formalization

The discounted Markov decision process (MDP) is defined by a finite (or countable) state space $S$, a finite action space $A$, a transition kernel $P(s'|s,a)$, and a bounded reward function $r: S \times A \to \mathbb{R}$ with $|r(s,a)| \leq R_{\max}$. The discount factor is $\gamma \in [0,1)$. The value function $v: S \to \mathbb{R}$ approximates the optimal value $v^*$ solving the Bellman optimality equation. Under linear function approximation, one fixes a feature map $\varphi: S \to \mathbb{R}^d$, compiling state features into a matrix $\Phi \in \mathbb{R}^{|S| \times d}$, with the approximation $v \approx \Phi w$ for parameter $w \in \mathbb{R}^d$. A state distribution $\mu$ induces a weighted inner product $\langle v, u \rangle_\mu = \sum_{s \in S} \mu(s)\, v(s)\, u(s)$ and norm $\|v\|_\mu = \sqrt{\langle v, v \rangle_\mu}$.

2. Classical Projected Bellman Operator and Regularization

The Bellman operator

$$(T v)(s) = \max_{a \in A} \left[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}[v(s')] \right]$$

is $\gamma$-contractive in the $\|\cdot\|_\infty$ norm, though not in general in the $\mu$-weighted 2-norm, which is the root of the instability of classical projected value iteration. The classical projected Bellman operator is given by $\Pi \circ T$, where the projection $\Pi$ onto the span of the features is

$$\Pi v = \Phi w_*, \qquad w_* = \arg\min_{w \in \mathbb{R}^d} \|\Phi w - v\|_\mu^2.$$

To stabilize this procedure, RP-VI introduces the regularized projection operator $\Pi_\lambda$:

$$\Pi_\lambda v = \Phi w_\lambda, \qquad w_\lambda = \arg\min_{w \in \mathbb{R}^d} \left\{ \|\Phi w - v\|_\mu^2 + \lambda R(w) \right\},$$

for a convex regularizer $R(w)$ and penalty $\lambda > 0$, commonly $R(w) = \|w\|_2^2$ (ridge regression).
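For ridge regularization the regularized projection has a closed form via the normal equations $(\Phi^\top D_\mu \Phi + \lambda I)\, w_\lambda = \Phi^\top D_\mu v$. A minimal NumPy sketch (the function name and tabular test setup are illustrative, not from the paper):

```python
import numpy as np

def regularized_projection(v, Phi, mu, lam):
    """Ridge-regularized weighted projection: Pi_lambda v = Phi w_lambda.

    Solves  min_w ||Phi w - v||_mu^2 + lam * ||w||_2^2,
    whose normal equations are (Phi^T D_mu Phi + lam I) w = Phi^T D_mu v.
    """
    D = np.diag(mu)
    C = Phi.T @ D @ Phi  # feature covariance under mu
    w = np.linalg.solve(C + lam * np.eye(Phi.shape[1]), Phi.T @ D @ v)
    return Phi @ w, w
```

With $\lambda = 0$ and full-rank features this reduces to the ordinary $\mu$-weighted projection; for $\lambda > 0$ the solution is shrunk toward the origin, which is exactly the effect the contraction analysis below exploits.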

3. RP-VI Iteration and Pseudocode

RP-VI begins at $v_0 = \Phi w_0$ and iterates for $k = 0, 1, 2, \ldots$ through the recursive update

$$v_{k+1} = \Pi_\lambda (T v_k),$$

or, in parameter space,

$$w_{k+1} = \arg\min_{w} \left\{ \|\Phi w - T(\Phi w_k)\|_\mu^2 + \lambda R(w) \right\}.$$

The canonical pseudocode is:

1. Evaluate $v_k \leftarrow \Phi w_k$ (current value approximation).
2. Compute the Bellman backup $T v_k$ (updated value per state).
3. Solve the regularized least squares problem $w_{k+1} \leftarrow \arg\min_{w} \left\{ \sum_{s} \mu(s) \left[ \varphi(s)^\top w - (T v_k)(s) \right]^2 + \lambda R(w) \right\}$ (next iterate).

Step 3 projects the Bellman update back into the span of Φ\Phi with convex regularization.
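The loop can be sketched end to end on a small finite MDP. The following is an illustrative NumPy implementation (not the paper's code), assuming exact access to the transition tensor $P$ and reward matrix $r$:

```python
import numpy as np

def rp_vi(P, r, Phi, mu, gamma, lam, iters=200):
    """RP-VI on a finite MDP (illustrative sketch).

    P: (A, S, S) transition tensor, r: (S, A) reward matrix,
    Phi: (S, d) feature matrix, mu: (S,) state weights, lam: ridge penalty.
    """
    S, d = Phi.shape
    D = np.diag(mu)
    # Parameter-space form of Pi_lambda: w = (C + lam I)^{-1} Phi^T D v
    M = np.linalg.solve(Phi.T @ D @ Phi + lam * np.eye(d), Phi.T @ D)
    w = np.zeros(d)
    for _ in range(iters):
        v = Phi @ w
        Tv = (r + gamma * (P @ v).T).max(axis=1)  # exact Bellman backup (step 2)
        w = M @ Tv                                # regularized projection (step 3)
    return Phi @ w
```

As a sanity check, with tabular features ($\Phi = I$) and $\lambda = 0$ the update coincides with ordinary value iteration.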

4. Contraction Properties and Theoretical Guarantees

A key property is that under mild conditions, the mapping $v \mapsto \Pi_\lambda T v$ is a contraction in $\|\cdot\|_\mu$. Specifically, if $R$ is $\sigma_R$-strongly convex and the feature covariance matrix $C = \Phi^\top D_\mu \Phi$ has smallest eigenvalue $\alpha > 0$,

$$\|\Pi_\lambda T v - \Pi_\lambda T u\|_\mu \leq \rho\, \|v - u\|_\mu, \qquad \rho = \frac{\gamma}{1 + \lambda \sigma_R \alpha} < 1.$$

When $R(w) = \|w\|_2^2$ (so $\sigma_R = 2$), this specializes to $\rho = \gamma / (1 + 2\lambda\alpha)$. These properties hold under:

  • Full-rank feature covariance ($\lambda_{\min}(C) > 0$)
  • Bounded features ($\|\varphi(s)\|_2 \leq B$)
  • A strongly convex regularizer $R$
  • $\lambda > 0$ ensuring $\rho < 1$

This strong contraction leads to linear convergence and well-posedness of every projection step.
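The contraction modulus can be checked numerically on a toy tabular MDP. The sketch below (illustrative values, not from the paper) builds $\Pi_\lambda$ as a matrix and verifies that the composite map contracts at least as fast as $\rho$:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d = 6, 3, 6
gamma, lam = 0.9, 0.5

# Random toy MDP with tabular features (Phi = I); values are illustrative only.
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
Phi = np.eye(S)
mu = np.full(S, 1.0 / S)
D = np.diag(mu)
C = Phi.T @ D @ Phi
alpha = np.linalg.eigvalsh(C).min()
rho = gamma / (1 + 2 * lam * alpha)  # modulus for R(w) = ||w||_2^2

# Pi_lambda as an explicit matrix acting on value vectors.
proj = Phi @ np.linalg.inv(C + lam * np.eye(d)) @ Phi.T @ D

def T(v):
    return (r + gamma * (P @ v).T).max(axis=1)  # Bellman optimality backup

def mu_norm(v):
    return np.sqrt(v @ D @ v)

# Empirically, the contraction ratio never exceeds the theoretical modulus rho.
for _ in range(100):
    v, u = rng.standard_normal(S), rng.standard_normal(S)
    assert mu_norm(proj @ T(v) - proj @ T(u)) <= rho * mu_norm(v - u) + 1e-9
```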

5. Finite-Time Error Analysis

Let $v_\lambda$ be the unique fixed point of $v = \Pi_\lambda T v$. Then for all $k \geq 0$,

$$\|v_k - v_\lambda\|_\mu \leq \rho^k \|v_0 - v_\lambda\|_\mu,$$

and $v_\lambda$ itself is an $\epsilon$-approximation of $v^*$ with $\epsilon = O(\lambda)$. Together, the finite-time error bound is

$$\|v_k - v^*\|_\mu \leq \rho^k \|v_0 - v^*\|_\mu + \frac{1}{1 - \rho} \|v_\lambda - v^*\|_\mu.$$

For $R(w) = \|w\|_2^2$, this yields

$$\|v_\lambda - v^*\|_\mu \leq \frac{\gamma}{1 - \gamma} \cdot \frac{\lambda \|w^*\|_2}{\alpha},$$

where $w^*$ is the parameter of the unregularized projection of $v^*$. The error decays geometrically in $k$, with the $O(\lambda)$ bias vanishing as $\lambda \to 0$.
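One way to see how the two error sources combine is a triangle-inequality argument: splitting the error at the regularized fixed point gives

$$\|v_k - v^*\|_\mu \leq \|v_k - v_\lambda\|_\mu + \|v_\lambda - v^*\|_\mu \leq \rho^k \|v_0 - v_\lambda\|_\mu + \|v_\lambda - v^*\|_\mu,$$

and bounding $\|v_0 - v_\lambda\|_\mu \leq \|v_0 - v^*\|_\mu + \|v^* - v_\lambda\|_\mu$ absorbs the extra bias term into the $\tfrac{1}{1-\rho}$ prefactor.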

6. Relation to Sample-Based Algorithms and PRQ

The RP-VI framework presumes access to exact Bellman backups and projections. Periodic Regularized Q-Learning (PRQ) extends RP-VI to off-policy, sample-based learning with linear Q-function approximation by:

  • Periodically collecting a batch of $N$ transitions $\{(s_i, a_i, r_i, s'_i)\}$.
  • Constructing empirical targets $y_i = r_i + \gamma \max_{b \in A} \varphi(s'_i, b)^\top w_k$ using a target network held fixed across the batch.
  • Solving the finite-sample regularized regression

$$\hat{w}_{k+1} = \arg\min_{w} \left\{ \frac{1}{N} \sum_{i=1}^{N} \left[ \varphi(s_i, a_i)^\top w - y_i \right]^2 + \lambda R(w) \right\}.$$

  • Updating the target parameter $w_k$ every $N$ steps (the "periodic" aspect).

This process replicates RP-VI's $\Pi_\lambda T$ but uses empirical Bellman targets and sample-based regression. Under standard mixing and sample-size conditions, finite-time high-probability guarantees extend RP-VI's error bounds to PRQ.
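One PRQ parameter update can be sketched as follows. This is an illustrative implementation for ridge $R(w) = \|w\|_2^2$; `prq_update`, the feature function `phi`, and the batch format are assumptions, not the paper's interface:

```python
import numpy as np

def prq_update(batch, phi, w_target, gamma, lam, actions):
    """One periodic regularized Q-learning (PRQ) parameter update (sketch).

    batch: list of (s, a, r, s_next) transitions; phi(s, a) -> feature vector;
    w_target: parameters held fixed across the batch (the target network).
    """
    X = np.stack([phi(s, a) for (s, a, _, _) in batch])            # (N, d)
    y = np.array([r + gamma * max(phi(s2, b) @ w_target for b in actions)
                  for (_, _, r, s2) in batch])                     # empirical targets
    N, d = X.shape
    # Ridge solution of (1/N) sum_i (phi_i^T w - y_i)^2 + lam ||w||_2^2.
    return np.linalg.solve(X.T @ X / N + lam * np.eye(d), X.T @ y / N)
```

The target parameter `w_target` would then be replaced by the returned solution once every $N$ environment steps, mirroring the periodic schedule described above.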

7. Implications and Bias-Variance Control

RP-VI systematically replaces the unregularized projection step with a projection under $\lambda$-strongly convex regularization, ensuring strict contraction $\rho < \gamma$ and uniqueness of $v_\lambda$. The parameter $\lambda$ directly mediates the bias–variance trade-off: increasing $\lambda$ favors stability and contraction at the expense of introducing $O(\lambda)$ bias, while as $\lambda$ decreases the bias vanishes but the contraction factor approaches $\gamma$. A plausible implication is that RP-VI enables practitioners to select $\lambda$ to attain desired rates of convergence and regularization-induced smoothing, particularly in high-dimensional or ill-conditioned feature regimes (Yang et al., 3 Feb 2026).
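As a quick illustration of the trade-off (toy numbers, not from the paper), one can tabulate the ridge modulus $\rho = \gamma/(1 + 2\lambda\alpha)$ against the bias bound from Section 5 across several values of $\lambda$:

```python
import numpy as np

# Toy lambda sweep: contraction modulus vs. fixed-point bias bound for ridge.
# gamma, alpha, and ||w*|| are illustrative values.
gamma, alpha, w_star_norm = 0.9, 0.2, 5.0
lams = np.array([1e-3, 1e-2, 1e-1, 1.0])
rhos = gamma / (1 + 2 * lams * alpha)                     # smaller is faster
biases = gamma / (1 - gamma) * lams * w_star_norm / alpha  # O(lambda) bias bound
for lam, rho, bias in zip(lams, rhos, biases):
    print(f"lam={lam:g}  rho={rho:.3f}  bias_bound={bias:.3f}")
```

Larger $\lambda$ strictly shrinks $\rho$ (faster contraction) while the bias bound grows linearly, so the sweep makes the selection problem concrete.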

References (1)
