
V-Trace Off-Policy Actor-Critic Algorithm

Updated 2 February 2026
  • The algorithm’s main contribution is stabilizing deep reinforcement learning by applying truncated importance-sampling corrections to reconcile off-policy data with policy lag.
  • It enhances training efficiency in distributed settings by explicitly managing the bias-variance trade-off through controlled hyperparameters.
  • Empirical performance on benchmarks like Atari and DMLab and its adaptability in multi-agent extensions validate its robust and scalable design.

The V-Trace off-policy actor-critic algorithm is a widely adopted approach in deep reinforcement learning for stabilizing and accelerating training in scenarios with policy lag, distributed data collection, or experience replay. The algorithm leverages importance-sampling corrections with carefully chosen truncation to reconcile updates from off-policy trajectories, enabling robust actor-critic learning even with stale or highly heterogeneous data. V-Trace serves as the foundation for distributed agents such as IMPALA, LASER, and subsequent multi-agent extensions, and is characterized by explicit bias-variance trade-off control, efficient implementation, and strong empirical performance in large-scale benchmarks such as Atari and DMLab.

1. Formal Definition and Core Mechanism

V-Trace operates in a Markov Decision Process (MDP) where agent-environment interaction is driven by a behavior policy $\mu(a|s)$, while the objective is to evaluate or improve a (parameterized) target policy $\pi_\theta(a|s)$ with potentially nonzero policy lag. The method introduces two levels of importance-sampling ratio truncation:

  • Truncated correction weight: For each step $t$, define

$$\rho_t := \min\!\Bigl(\bar\rho,\ \frac{\pi_\theta(a_t|s_t)}{\mu(a_t|s_t)}\Bigr)$$

  • Truncated trace weight: Similarly,

$$c_t := \min\!\Bigl(\bar c,\ \frac{\pi_\theta(a_t|s_t)}{\mu(a_t|s_t)}\Bigr)$$

where $\bar\rho, \bar c > 0$, with $\bar\rho \ge \bar c$ assumed, and typically $\bar\rho = \bar c = 1.0$.

Given a value function $V_\phi(s)$ (the critic), the $n$-step V-Trace target $v_t$ at time $t$ for an episode of length $T \ge t + n$ is

$$v_t = V_\phi(s_t) + \sum_{u=t}^{t+n-1} \gamma^{u-t} \Bigl(\prod_{i=t}^{u-1} c_i\Bigr)\, \rho_u\, \bigl[r_u + \gamma V_\phi(s_{u+1}) - V_\phi(s_u)\bigr]$$

with $\gamma$ the discount factor, and the convention $\prod_{i=t}^{t-1} c_i := 1$.
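In practice this target is computed by the equivalent backward recursion $v_t - V(s_t) = \delta_t + \gamma c_t \,(v_{t+1} - V(s_{t+1}))$ with $\delta_t = \rho_t (r_t + \gamma V(s_{t+1}) - V(s_t))$. A minimal NumPy sketch of that recursion for a single trajectory (full-length unroll, i.e. $n = T$; the function and argument names are illustrative):

```python
import numpy as np

def vtrace_targets(values, rewards, log_pi, log_mu,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-Trace targets v_t for one trajectory.

    values:  V_phi(s_0 .. s_T), length T+1 (last entry bootstraps the tail)
    rewards: r_0 .. r_{T-1}, length T
    log_pi, log_mu: log-probabilities of the taken actions, length T
    """
    T = len(rewards)
    ratios = np.exp(log_pi - log_mu)          # pi(a_t|s_t) / mu(a_t|s_t)
    rhos = np.minimum(rho_bar, ratios)        # truncated correction weights
    cs = np.minimum(c_bar, ratios)            # truncated trace weights
    # rho-weighted TD errors delta_t
    deltas = rhos * (rewards + gamma * values[1:] - values[:-1])
    # Backward recursion: v_t - V(s_t) = delta_t + gamma * c_t * (v_{t+1} - V(s_{t+1}))
    acc = 0.0
    vs_minus_v = np.zeros(T)
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values[:-1] + vs_minus_v
```

On-policy (log_pi == log_mu), all weights equal 1 and the targets reduce to ordinary discounted returns bootstrapped by the final value, which is a quick sanity check for an implementation.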

The actor is updated by a policy gradient step, leveraging advantage estimates obtained from the V-Trace returns:

$$A_t = v_t - V_\phi(s_t)\,,$$

and the surrogate gradient is:

$$\hat{G} = \sum_{t=0}^{T-1} \rho_t\, \nabla_\theta \log \pi_\theta(a_t|s_t)\, A_t\,.$$

Optionally, an entropy bonus is included to promote exploration:

$$L_{\text{actor}} = -\sum_t \rho_t \log \pi_\theta(a_t|s_t)\, A_t \;-\; c_{\text{ent}} \sum_t H(\pi_\theta(\cdot|s_t))$$

where $H$ denotes the policy entropy and $c_{\text{ent}}$ the entropy-regularization scale.
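As a value-level illustration of the loss above (in practice the gradient $\hat G$ comes from automatic differentiation, and advantages are treated as constants), the following sketch evaluates $L_{\text{actor}}$ for a batch; the function name and the fixed $c_{\text{ent}}$ are illustrative:

```python
import numpy as np

def actor_loss(log_pi_taken, log_mu_taken, advantages, policy_probs,
               rho_bar=1.0, c_ent=0.01):
    """V-Trace actor loss (to be minimised by gradient descent).

    log_pi_taken, log_mu_taken: log-probs of taken actions, shape (T,)
    advantages: A_t = v_t - V_phi(s_t), treated as constants
    policy_probs: full action distributions pi(.|s_t), shape (T, A)
    """
    rhos = np.minimum(rho_bar, np.exp(log_pi_taken - log_mu_taken))
    # Truncated-IS policy-gradient term
    pg_loss = -np.sum(rhos * log_pi_taken * advantages)
    # Entropy bonus summed over the batch
    entropy = -np.sum(policy_probs * np.log(policy_probs))
    return pg_loss - c_ent * entropy
```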

2. Pseudocode and Implementation Structure

The standard single-agent V-Trace actor-critic procedure can be summarized as follows (Zawalski et al., 2021, Chen et al., 2022, Schmitt et al., 2019):

Initialize policy network θ and value network φ.
Set hyperparameters ρ̄, c̄, γ, n-step length, batch size B, learning rates, entropy-weight schedule.
while not converged:
    # Data collection
    Collect N trajectories under behavior policy μ (possibly a lagged copy of π_θ).
    for t in 0...T-1:
        w_t = π_θ(a_t|s_t) / μ(a_t|s_t)
        ρ_t = min(ρ̄, w_t)
        c_t = min(c̄, w_t)
    # Critic target computation
    For each t, compute the V-Trace target v_t as above.
    # Advantage
    A_t = v_t - V_φ(s_t)
    # Critic step
    φ ← φ - α_critic ∇_φ (1/B) Σ_t (v_t - V_φ(s_t))^2
    # Actor step
    θ ← θ + α_actor (1/B) Σ_t [ ρ_t ∇_θ log π_θ(a_t|s_t)·A_t + c_ent ∇_θ H(π_θ(·|s_t)) ]
    Optionally update μ ← π_θ

In distributed architectures, many actors collect trajectories in parallel under various policy lags. The V-Trace correction ensures stability of the learner even as μ diverges from π_θ (Schmitt et al., 2019, Zawalski et al., 2021).

3. Bias-Variance Trade-Off and Theoretical Properties

V-Trace explicitly controls the trade-off between bias and variance via the choice of truncation levels $(\bar\rho, \bar c)$. High truncation levels (large $\bar\rho, \bar c$) reduce bias but permit high variance, since importance ratios can grow without bound when μ differs greatly from π. Lower truncation levels (e.g., $\bar\rho = \bar c = 1$) yield low-variance estimates but increase bias toward an effective "implied policy":

$$\tilde{\pi}(a|s) \propto \min\bigl[\bar\rho\,\mu(a|s),\ \pi(a|s)\bigr]$$

(as shown in Proposition 1 of (Schmitt et al., 2019)), so V-Trace returns converge to the value of $\tilde{\pi}_\mu$, not of $\pi$ per se.

This bias can be strictly quantified; for example, increasing $\bar\rho$ systematically reduces bias, at the cost of higher variance in the multi-step product $\prod_i c_i$. On-policy learning ($\mu = \pi$) is always unbiased. Mixed on-policy/off-policy batches restore policy optimality under mild state-visitation conditions (Schmitt et al., 2019). Empirically, settings with $\bar\rho = \bar c = 1.0$ are preferred due to effective variance reduction and favorable sample efficiency on benchmarks.
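The implied policy can be computed directly, which makes the bias mechanism concrete: actions where $\pi$ far exceeds $\bar\rho\,\mu$ are clipped, pulling $\tilde\pi$ toward $\mu$, while for large $\bar\rho$ the cap never binds and $\tilde\pi = \pi$. A small sketch (function name is illustrative):

```python
import numpy as np

def implied_policy(pi, mu, rho_bar):
    """pi_tilde(a|s) proportional to min(rho_bar * mu, pi), per Proposition 1.

    pi, mu: probability vectors over actions at a single state.
    """
    unnorm = np.minimum(rho_bar * mu, pi)   # clip where pi exceeds rho_bar * mu
    return unnorm / unnorm.sum()            # renormalize to a distribution
```

For instance, with $\pi = (0.8, 0.2)$, $\mu = (0.5, 0.5)$, and $\bar\rho = 1$, the first action is clipped to 0.5 before renormalization, so $\tilde\pi$ sits between $\pi$ and $\mu$; raising $\bar\rho$ recovers $\pi$ exactly.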

The V-Trace critic update forms a contractive operator (see “trust-region IS” in (Schmitt et al., 2019, Chen et al., 2022)), guaranteeing geometric convergence to a unique fixed point under an ergodic behavior policy.

4. Sample Complexity and Convergence Guarantees

Under standard assumptions (ergodic μ, compatible function approximation, contraction of the Bellman-trace operator), V-Trace actor-critic achieves a total sample complexity

$$T \times K = \tilde O(\epsilon^{-2})$$

to reach a policy $\pi_{\theta_T}$ such that $\mathbb{E}\,\|Q^* - Q^{\pi_{\theta_T}}\|_\infty \le \epsilon$ (Chen et al., 2022). The bias from truncated IS ratios is $O(1/\bar\rho)$ and thus can be controlled by increasing $\bar\rho$. The critic iterates converge at an $\ell_2$-rate of $O((1-\eta)^t) + O(\alpha_w \log(1/\alpha_w))$, and the actor converges geometrically with a rate dictated by rollout length $K$ and step-size schedules.

The approach thus matches the minimax lower bounds for policy-based and $Q$-learning methods, up to logarithmic factors, even in the presence of off-policy sampling and linear function approximation (Chen et al., 2022).

5. Experience Replay, Distributed Training, and Practical Considerations

The algorithm is designed to support uniform large-scale experience replay and distributed architectures with policy lag (Schmitt et al., 2019). Trajectories collected both on-policy (current π_θ) and off-policy (older μ) are pooled in shared replay, and V-Trace corrections compensate for nonstationarity. Stability is further improved by:

  • Mixing on-policy and replay: Each learner batch contains a fixed fraction α of on-policy trajectories to mitigate policy bias. In practice, α = 0.125 (12.5% on-policy) is effective.
  • Trust-region clipping: Highly off-policy transitions are censored using a KL-divergence trust region $\beta(\pi,\mu,s) = \mathrm{KL}\bigl[\pi(\cdot|s)\,\Vert\,\tilde{\pi}_\mu(\cdot|s)\bigr]$; multi-step traces are truncated at states exceeding a threshold.
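A sketch of that trust-region test, assuming the full action distributions are available per state (the function name and threshold value are illustrative, not values from the cited papers):

```python
import numpy as np

def trust_region_mask(pi, mu, rho_bar=1.0, threshold=0.5):
    """True where KL[pi || pi_tilde_mu] <= threshold; traces are cut at False states.

    pi, mu: arrays of shape (T, A) holding full action distributions per state.
    """
    # Implied policy pi_tilde proportional to min(rho_bar * mu, pi), per state
    unnorm = np.minimum(rho_bar * mu, pi)
    pi_tilde = unnorm / unnorm.sum(axis=1, keepdims=True)
    # Per-state KL divergence between pi and the implied policy
    kl = np.sum(pi * np.log(pi / pi_tilde), axis=1)
    return kl <= threshold
```

On-policy states give KL = 0 and always pass; states where μ has drifted far from π exceed the threshold and terminate the multi-step trace there.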

Empirical studies confirm that these variants yield robust performance and state-of-the-art efficiency, with shared replay and distributed sampling further enhancing exploration and data efficiency on large benchmarks (Schmitt et al., 2019).

6. Hyperparameterization and Implementation Details

Critical hyperparameters and recommended settings, as distilled from distributed benchmarks, are tabulated below:

| Parameter | Recommended Value | Role/Notes |
|---|---|---|
| $\bar\rho, \bar c$ | 1.0 | Truncation caps for importance weights |
| $\gamma$ | 0.99 | Discount factor |
| $n$ | 20 (typical) | n-step unroll length |
| $\alpha_{\text{actor}}, \alpha_{\text{critic}}$ | $1\times10^{-3}$, $7\times10^{-4}$ | Learning rates |
| $c_{\text{ent}}$ | annealed $1.0 \to 10^{-5}$ | Entropy regularization |
| Batch size | 32 (on-policy), $>200$ (replay) | Strategic batch mixing |
| Replay buffer | $10^7$ frames | Scalability |

Optimizers such as Adam and RMSProp are typically used. For architectures, IMPALA-style convolutions with LSTM heads are effective for pixel-based environments, with LSTM state recomputation from episode start for each batch. No gradient clipping is required in the standard setup (Schmitt et al., 2019, Zawalski et al., 2021).

7. Extensions, Variants, and Relations

V-Trace provides the foundation for several generalizations:

  • MA-Trace: A direct multi-agent extension with distributed corrections and proven fixed-point convergence (Zawalski et al., 2021).
  • Q-Trace: An explicit Bellman-equation-based modification serving as a drop-in critic for off-policy natural actor-critic (NAC) with $\mathcal{O}(\epsilon^{-3}\log^2(1/\epsilon))$ sample complexity (Khodadadian et al., 2021).
  • Lambda-averaged and two-sided Q-Trace: Multi-step, generalized IS correctors with finite-sample analysis in function approximation settings (Chen et al., 2022).

The critical distinction lies in the placement and form of importance weighting and the specific operator fixed point (V-Trace converges to an “implied” policy’s value; Q-Trace converges to a modified Bellman fixed point not generally corresponding to any policy). These differences inform both practical implementation and theoretical convergence under off-policy sampling.


References:

  • Schmitt et al., 2019
  • Zawalski et al., 2021
  • Khodadadian et al., 2021
  • Chen et al., 2022
