
Truncated Quantile Critics (TQC)

Updated 19 January 2026
  • TQC is an off-policy, distributional reinforcement learning algorithm that uses quantile critics and truncation to mitigate overestimation bias.
  • It employs an ensemble of critics and selectively discards extreme quantile values, ensuring fine-grained, stable value estimation in continuous control tasks.
  • Empirical results demonstrate that TQC significantly improves sample efficiency and performance over benchmarks like SAC and TD3, with up to 46% gains in some environments.

Truncated Quantile Critics (TQC) is an off-policy, distributional reinforcement learning algorithm designed to address overestimation bias in continuous control. TQC combines a quantile-based critic architecture with truncated target distributions and an ensemble of critics, providing fine-grained and stable value estimation. Building upon the Soft Actor-Critic (SAC) framework, TQC achieves substantial improvements in sample efficiency and performance across standard RL benchmarks as well as in power-constrained robotic control.

1. Theoretical Foundations and Problem Motivation

Efficient off-policy actor-critic algorithms frequently suffer from Q-value overestimation. This bias primarily arises from two mechanisms: the maximization operator applied to noisy value estimates (by Jensen’s inequality) and the accumulation of bootstrapping errors in temporal-difference (TD) learning. In discrete action spaces, methods such as Double DQN partially alleviate the bias; in continuous control, Twin Delayed DDPG (TD3) uses a minimum-of-two-critics strategy, but this provides only coarse bias control and can induce underestimation. TQC addresses these limitations by modeling the full distribution of returns and applying a truncation operator to an ensemble of quantile critics, allowing arbitrarily fine-grained control over the degree and effect of bias correction (Kuznetsov et al., 2020, Dorka et al., 2021).

2. Mathematical Formulation

TQC operates in the Markov Decision Process setting with state space $\mathcal{S}$, action space $\mathcal{A}$, stochastic transition kernel $\mathcal{P}(s' \mid s, a)$, reward $R(s, a)$, and discount factor $\gamma \in [0,1)$. The critic models the full random return $Z^\pi(s,a) = \sum_{t\ge0}\gamma^t R(s_t, a_t)$ via a quantile representation:

$$Z_\psi(s,a) = \frac{1}{M} \sum_{m=1}^M \delta\left(\theta^m_\psi(s,a)\right)$$

with sorted, learnable quantile locations $\{\theta^m_\psi\}_{m=1}^M$. Training minimizes the 1-Wasserstein (quantile regression) loss using the asymmetric quantile Huber loss:

$$\rho^H_\tau(u) = |\tau - \mathbf{1}_{\{u<0\}}|\, \mathcal{L}^1_H(u),\quad \mathcal{L}^1_H(u) = \begin{cases} \tfrac{1}{2}u^2 & |u|\le 1 \\ |u| - \tfrac{1}{2} & |u|>1 \end{cases}$$
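As a concrete sketch, the loss above translates directly into NumPy (our own illustrative implementation; the function name and the exposed Huber threshold `kappa` are our conventions):

```python
import numpy as np

def quantile_huber_loss(u, tau, kappa=1.0):
    """Asymmetric quantile Huber loss rho^H_tau(u).

    u: TD error(s); tau: quantile fraction in (0, 1);
    kappa: Huber threshold (1.0 in the formula above).
    """
    u = np.asarray(u, dtype=float)
    # Huber part: quadratic near zero, linear in the tails.
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric weight |tau - 1{u < 0}| tilts the penalty so that the
    # minimizing location converges to the tau-quantile of the target.
    return np.abs(tau - (u < 0.0).astype(float)) * huber
```

For example, an error of $u=0.5$ at the median ($\tau=0.5$) incurs loss $0.5\cdot\tfrac12(0.5)^2 = 0.0625$, while a negative error at a high quantile ($\tau=0.9$) is down-weighted by $|0.9-1|=0.1$.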

For enhanced robustness, TQC maintains $N$ independent quantile critics, each parameterized as $Z_{\psi_n}(s,a) = \frac{1}{M} \sum_{m=1}^M \delta(\theta^m_{\psi_n}(s,a))$. The Bellman target construction is as follows:

  • For a transition $(s,a,r,s')$, sample $a'\sim\pi_\phi(\cdot\mid s')$ and extract target quantile atoms $\{\theta^m_{\overline\psi_n}(s', a')\}$ from the target networks $\{\overline\psi_n\}$.
  • Pool all $N\times M$ atoms, sort them in ascending order, and drop the largest $(M-k)N$ atoms (for some $k \le M$).
  • The surviving $kN$ atoms define the truncated target distribution:

$$Y(s,a) = \frac{1}{kN} \sum_{i=1}^{kN} \delta\left( r + \gamma \left[ z_{(i)}(s', a') - \alpha\log\pi_\phi(a'\mid s') \right] \right)$$

where $z_{(i)}$ are the sorted pooled atoms.
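The pool–sort–truncate–shift construction above can be sketched in a few lines (a minimal NumPy illustration; the function name and array layout are our own convention):

```python
import numpy as np

def truncated_target_atoms(target_atoms, r, log_pi, gamma=0.99,
                           alpha=0.2, d=2):
    """Build the atoms of the truncated target distribution Y(s, a).

    target_atoms: shape (N, M) -- atoms theta from the N target critics
    evaluated at (s', a'). Drops the largest d*N pooled atoms in total
    (k = M - d atoms per critic survive on average).
    """
    N, M = target_atoms.shape
    k = M - d
    pooled = np.sort(target_atoms.ravel())   # ascending N*M atoms
    kept = pooled[:k * N]                    # discard the right tail
    # Entropy-regularized distributional Bellman backup.
    return r + gamma * (kept - alpha * log_pi)
```

With $\gamma=1$, $\alpha=0$, and $N=2$ critics of $M=5$ atoms each, dropping $d=1$ atom per critic keeps the $kN=8$ smallest pooled atoms and simply shifts them by the reward.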

Each critic learns to fit its own quantiles to the truncated mixture through the loss:

$$J_Z(\psi_n) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[ \frac{1}{kNM} \sum_{m=1}^M\sum_{i=1}^{kN} \rho^H_{\tau_m}\left(y_i(s, a) - \theta^m_{\psi_n}(s, a)\right) \right]$$

This approach enables precise adjustment of the degree of pessimism, and hence of overestimation, via the truncation hyperparameter $d = M - k$, the number of atoms dropped per critic.
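For a single sampled transition, the inner double sum of this loss reduces to a pairwise computation between one critic's $M$ atoms and the $kN$ truncated target atoms (a self-contained NumPy sketch under our own naming; $\tau_m = \tfrac{2m-1}{2M}$ are the usual quantile midpoints):

```python
import numpy as np

def critic_loss(theta, y, kappa=1.0):
    """Single-sample J_Z(psi_n): fit one critic's M atoms `theta`
    (shape (M,)) to the kN truncated target atoms `y` (shape (kN,))."""
    M = theta.shape[0]
    tau = (2.0 * np.arange(1, M + 1) - 1.0) / (2.0 * M)  # quantile midpoints
    u = y[None, :] - theta[:, None]                      # pairwise TD errors (M, kN)
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    loss = np.abs(tau[:, None] - (u < 0.0)) * huber
    return loss.mean()                                   # the 1/(kNM) normalization
```

Averaging over the $(M, kN)$ grid implements the $\tfrac{1}{kNM}$ factor; in practice this quantity is averaged over a minibatch and differentiated with respect to the critic parameters.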

3. Algorithmic Structure and Implementation

TQC’s learning cycle can be summarized as follows:

  1. Initialization: Actor $\pi_\phi$, $N$ ensemble critics $\{\psi_n\}$, target critics $\{\overline{\psi}_n\}$, entropy coefficient $\alpha$.
  2. Experience gathering: Interact using $\pi_\phi$, accumulate $(s,a,r,s')$ tuples.
  3. Critic update per step:
    • Sample a minibatch from the replay buffer.
    • For each $(s,a,r,s')$, sample $a'\sim\pi_\phi(\cdot\mid s')$, then assemble and truncate the pooled quantile atoms.
    • For each critic, perform a gradient update with the quantile regression loss against the truncated target.
  4. Policy improvement:
    • Minimize $\mathbb{E}_{s,a}\left[\alpha\log\pi_\phi(a\mid s) - \tfrac{1}{NM}\sum_{n,m} \theta^m_{\psi_n}(s,a)\right]$.
  5. Entropy temperature: Tune $\alpha$ automatically, as in SAC.
  6. Target update: Apply Polyak averaging $\overline\psi_n \gets \tau\psi_n + (1-\tau)\overline\psi_n$.
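Steps 4 and 6 admit one-line numerical sketches (illustrative only; the array arguments stand in for network outputs and parameters, and the names are ours):

```python
import numpy as np

def policy_objective(log_pi, atoms, alpha=0.2):
    """Step 4: the actor minimizes alpha * log pi(a|s) minus the mean of
    all N*M atoms (non-truncated); `atoms` has shape (N, M)."""
    return alpha * log_pi - atoms.mean()

def polyak_update(target_params, params, tau=0.005):
    """Step 6: target networks slowly track the online critics."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, params)]
```

Note that the policy objective averages over all atoms of all critics without truncation; only the critic targets are truncated.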

Default hyperparameters: $N=5$ critics, $M=25$ quantiles per critic, truncation $d=2$, hidden layers of 3×512 for each critic and 2×256 for the actor, batch size 256, replay buffer of $10^6$ transitions, $\gamma=0.99$, $\tau=0.005$, and the Adam optimizer with learning rate $3\times10^{-4}$ (Kuznetsov et al., 2020, Dorka et al., 2021, Boré et al., 25 Feb 2025).

4. Overestimation Bias Control and Truncation Dynamics

TQC’s core mechanism for bias control is truncation of the pooled right tail of the quantile distribution. By discarding the highest $(M-k)N$ quantile atoms, regardless of which critics produced them, the algorithm eliminates the extreme upward outliers characteristic of overestimation. This strategy keeps the TD target (and thus the expected Q-value estimate) bounded away from the maximum, providing tighter, more reliable estimates throughout training.

The truncation parameter $d$ controls the tradeoff between bias and variance: increasing $d$ induces greater pessimism (potential underestimation), while $d=0$ corresponds to no bias correction and recovers standard distributional SAC. Empirical evidence shows that the optimal $d$ is environment dependent and typically small ($d\approx2$–3 for $M=25$) (Kuznetsov et al., 2020, Dorka et al., 2021). Ensembling multiple critics serves to decorrelate estimation errors, so truncation robustly targets only outlier atoms instead of systematically biasing a single critic’s distribution.
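The pessimism effect is easy to verify numerically: dropping more right-tail atoms monotonically lowers the mean of the target distribution (synthetic atoms; the seed and shapes below are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pooled, sorted atoms from N=5 critics with M=25 quantiles each.
pooled = np.sort(rng.normal(size=(5, 25)).ravel())

def truncated_mean(pooled_atoms, N=5, M=25, d=0):
    """Mean of the target distribution after dropping the top d*N atoms."""
    return pooled_atoms[:(M - d) * N].mean()

# Each increment of d removes right-tail mass, so the mean decreases.
assert truncated_mean(pooled, d=0) > truncated_mean(pooled, d=2) > truncated_mean(pooled, d=5)
```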

5. Empirical Performance and Benchmark Results

TQC exhibits state-of-the-art performance across the MuJoCo continuous control suite relative to SAC and TD3. Key results at 1 million environment steps (mean ± std over 10 seeds):

| Environment | SAC | TQC ($d=2$) | Improvement |
|---|---|---|---|
| Humanoid | 7.76 ± 0.46 | 9.54 ± 1.18 | +25% |
| Walker2d | 5.76 ± 0.46 | 7.03 ± 0.62 | +22% |
| HalfCheetah | 12.41 ± 5.14 | 18.09 ± 0.34 | +46% |

Ablations confirm that the performance gains result from both the ensembling and the tuned truncation mechanism. Notably, $N\ge3$ critics capture most of the variance reduction, and the benefit of larger $N$ saturates quickly.

On 6-DOF underwater vehicle control, TQC (with an appropriate reward function) achieves superior or comparable RMS error and settling time relative to a finely tuned PID controller, while also supporting explicit power-consumption constraints (Boré et al., 25 Feb 2025). The energy-aware TQC variant reduces power usage by 30% at the expense of slight accuracy reductions.

6. Practical Considerations and Hyperparameters

Critical hyperparameters include the number of ensemble critics $N$, the number of quantile atoms $M$, and the truncation size $d$; the defaults are $N=5$, $M=25$, and $d=2$.

Other settings: actor/critic MLP sizes (2×256 actor, 3×512 critic), Adam optimizer, Huber loss threshold $\kappa=1.0$, discount $\gamma=0.99$, batch size 256, and Polyak averaging coefficient $\tau=0.005$. Environment-specific tuning of $d$ can be required, though adaptive variants such as ACC automate this process (Dorka et al., 2021).

Best practices include always using target networks for quantile extraction during Bellman updates, monitoring the truncation ratio to avoid dominance by any single critic, and using non-truncated Q-values for policy updates to prevent compounding conservatism.

7. Extensions, Applications, and Limitations

Adaptive calibration of the truncation parameter, as explored in the ACC framework, removes the need for per-environment dd selection and improves generality (Dorka et al., 2021). TQC has proven effective beyond classical RL benchmarks, such as power-aware, direct end-to-end control of high-dimensional robotic systems without explicit system identification or thruster modeling (Boré et al., 25 Feb 2025).

Limitations include the need for hyperparameter tuning in standard settings, and the fact that results in certain domains, such as AUV control, have so far been demonstrated primarily in simulation rather than in physical deployments. Additional ablation studies on quantile count, critic architecture, and reward shaping remain open for further investigation. A plausible implication is that TQC’s architectural motif of distributional critics, truncation, and ensembling can be ported to other domains requiring robust bias correction.

