
Truncated Quantile Critics (TQC)

Updated 19 January 2026
  • TQC is an off-policy, distributional reinforcement learning algorithm that uses quantile critics and truncation to mitigate overestimation bias.
  • It employs an ensemble of critics and selectively discards extreme quantile values, ensuring fine-grained, stable value estimation in continuous control tasks.
  • Empirical results demonstrate that TQC significantly improves sample efficiency and performance over benchmarks like SAC and TD3, with up to 46% gains in some environments.

Truncated Quantile Critics (TQC) is an off-policy, distributional reinforcement learning algorithm designed to address overestimation bias in continuous control. TQC combines a quantile-based critic architecture with truncated target distributions and an ensemble of critics, providing fine-grained and stable value estimation. Building upon the Soft Actor-Critic (SAC) framework, TQC achieves substantial improvements in sample efficiency and performance across standard RL benchmarks as well as in power-constrained robotic control.

1. Theoretical Foundations and Problem Motivation

Efficient off-policy actor-critic algorithms frequently suffer from Q-value overestimation. This bias primarily arises from two mechanisms: the maximization operator applied to noisy value estimates (by Jensen’s inequality) and the accumulation of bootstrapping errors in temporal-difference (TD) learning. In discrete action spaces, methods such as Double DQN partially alleviate the bias; in continuous control, Twin Delayed DDPG (TD3) uses a minimum-of-two-critics strategy, but this provides only coarse bias control and can induce underestimation. TQC addresses these limitations by modeling the full distribution of returns and applying a truncation operator to an ensemble of quantile critics, allowing arbitrarily fine-grained control over the degree and effect of bias correction (Kuznetsov et al., 2020, Dorka et al., 2021).

2. Mathematical Formulation

TQC operates in the Markov Decision Process setting with state space $\mathcal{S}$, action space $\mathcal{A}$, stochastic transition kernel $\mathcal{P}(s' \mid s, a)$, reward $R(s, a)$, and discount factor $\gamma \in [0,1)$. The critic models the full random return $Z^\pi(s,a) = \sum_{t\ge0}\gamma^t R(s_t, a_t)$ via a quantile representation:

$$Z_\psi(s,a) = \frac{1}{M} \sum_{m=1}^M \delta\left(\theta^m_\psi(s,a)\right)$$

with sorted, learnable quantile locations $\{\theta^m_\psi\}_{m=1}^M$. Training minimizes the 1-Wasserstein (quantile regression) loss using the asymmetric quantile Huber loss:

$$\rho^H_\tau(u) = |\tau - \mathbf{1}_{\{u<0\}}|\, \mathcal{L}^1_H(u),\quad \mathcal{L}^1_H(u) = \begin{cases} \tfrac{1}{2}u^2 & |u|\le 1 \\ |u| - \tfrac{1}{2} & |u|>1 \end{cases}$$
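As a concrete sketch, the loss above translates directly into NumPy (our own illustrative implementation; the function name and the exposed Huber threshold `kappa` are our conventions):

```python
import numpy as np

def quantile_huber_loss(u, tau, kappa=1.0):
    """Asymmetric quantile Huber loss rho^H_tau(u).

    u: TD error(s); tau: quantile fraction in (0, 1);
    kappa: Huber threshold (1.0 in the formula above).
    """
    u = np.asarray(u, dtype=float)
    # Huber part: quadratic near zero, linear in the tails.
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric weight |tau - 1{u < 0}| tilts the penalty so that the
    # minimizing location converges to the tau-quantile of the target.
    return np.abs(tau - (u < 0.0).astype(float)) * huber
```

For example, an error of $u=0.5$ at the median ($\tau=0.5$) incurs loss $0.5\cdot\tfrac12(0.5)^2 = 0.0625$, while a negative error at a high quantile ($\tau=0.9$) is down-weighted by $|0.9-1|=0.1$.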

For enhanced robustness, TQC maintains $N$ independent quantile critics, each parameterized as $Z_{\psi_n}(s,a) = \frac{1}{M} \sum_{m=1}^M \delta(\theta^m_{\psi_n}(s,a))$. The Bellman target construction is as follows:

  • For a transition $(s,a,r,s')$, sample $a'\sim\pi_\phi(\cdot\mid s')$ and extract target quantile atoms $\{\theta^m_{\overline\psi_n}(s', a')\}$ from the target networks $\{\overline\psi_n\}$.
  • Pool all $N\times M$ atoms, sort them in ascending order, and drop the largest $(M-k)N$ atoms (for some $k \le M$).
  • The surviving $kN$ atoms define the truncated target distribution:

$$Y(s,a) = \frac{1}{kN} \sum_{i=1}^{kN} \delta\left( r + \gamma \left[ z_{(i)}(s', a') - \alpha\log\pi_\phi(a'\mid s') \right] \right)$$

where $z_{(i)}$ are the sorted pooled atoms.
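The pool–sort–truncate–shift construction above can be sketched in a few lines (a minimal NumPy illustration; the function name and array layout are our own convention):

```python
import numpy as np

def truncated_target_atoms(target_atoms, r, log_pi, gamma=0.99,
                           alpha=0.2, d=2):
    """Build the atoms of the truncated target distribution Y(s, a).

    target_atoms: shape (N, M) -- atoms theta from the N target critics
    evaluated at (s', a'). Drops the largest d*N pooled atoms in total
    (k = M - d atoms per critic survive on average).
    """
    N, M = target_atoms.shape
    k = M - d
    pooled = np.sort(target_atoms.ravel())   # ascending N*M atoms
    kept = pooled[:k * N]                    # discard the right tail
    # Entropy-regularized distributional Bellman backup.
    return r + gamma * (kept - alpha * log_pi)
```

With $\gamma=1$, $\alpha=0$, and $N=2$ critics of $M=5$ atoms each, dropping $d=1$ atom per critic keeps the $kN=8$ smallest pooled atoms and simply shifts them by the reward.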

Each critic learns to fit its own quantiles to the truncated mixture through the loss:

$$J_Z(\psi_n) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[ \frac{1}{kNM} \sum_{m=1}^M\sum_{i=1}^{kN} \rho^H_{\tau_m}\left(y_i(s, a) - \theta^m_{\psi_n}(s, a)\right) \right]$$

This approach enables precise adjustment of the degree of pessimism, and hence of overestimation, via the truncation hyperparameter $d = M - k$, the number of atoms dropped per critic.
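For a single sampled transition, the inner double sum of this loss reduces to a pairwise computation between one critic's $M$ atoms and the $kN$ truncated target atoms (a self-contained NumPy sketch under our own naming; $\tau_m = \tfrac{2m-1}{2M}$ are the usual quantile midpoints):

```python
import numpy as np

def critic_loss(theta, y, kappa=1.0):
    """Single-sample J_Z(psi_n): fit one critic's M atoms `theta`
    (shape (M,)) to the kN truncated target atoms `y` (shape (kN,))."""
    M = theta.shape[0]
    tau = (2.0 * np.arange(1, M + 1) - 1.0) / (2.0 * M)  # quantile midpoints
    u = y[None, :] - theta[:, None]                      # pairwise TD errors (M, kN)
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    loss = np.abs(tau[:, None] - (u < 0.0)) * huber
    return loss.mean()                                   # the 1/(kNM) normalization
```

Averaging over the $(M, kN)$ grid implements the $\tfrac{1}{kNM}$ factor; in practice this quantity is averaged over a minibatch and differentiated with respect to the critic parameters.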

3. Algorithmic Structure and Implementation

TQC’s learning cycle can be summarized as follows:

  1. Initialization: Actor $\pi_\phi$, $N$ ensemble critics $\{\psi_n\}$, target critics $\{\overline{\psi}_n\}$, entropy coefficient $\alpha$.
  2. Experience gathering: Interact using $\pi_\phi$, accumulate $(s,a,r,s')$ tuples.
  3. Critic update per step:
    • Sample a minibatch from the replay buffer.
    • For each $(s,a,r,s')$, sample $a'\sim\pi_\phi(\cdot\mid s')$, then assemble and truncate the pooled quantile atoms.
    • For each critic, perform a gradient update with the quantile regression loss against the truncated target.
  4. Policy improvement:
    • Minimize $\mathbb{E}_{s,a}\left[\alpha\log\pi_\phi(a\mid s) - \tfrac{1}{NM}\sum_{n,m} \theta^m_{\psi_n}(s,a)\right]$.
  5. Entropy temperature: Tune $\alpha$ automatically, as in SAC.
  6. Target update: Apply Polyak averaging $\overline\psi_n \gets \tau\psi_n + (1-\tau)\overline\psi_n$.
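Steps 4 and 6 admit one-line numerical sketches (illustrative only; the array arguments stand in for network outputs and parameters, and the names are ours):

```python
import numpy as np

def policy_objective(log_pi, atoms, alpha=0.2):
    """Step 4: the actor minimizes alpha * log pi(a|s) minus the mean of
    all N*M atoms (non-truncated); `atoms` has shape (N, M)."""
    return alpha * log_pi - atoms.mean()

def polyak_update(target_params, params, tau=0.005):
    """Step 6: target networks slowly track the online critics."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, params)]
```

Note that the policy objective averages over all atoms of all critics without truncation; only the critic targets are truncated.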

Default hyperparameters: $N=5$ critics, $M=25$ quantiles per critic, truncation $d=2$, hidden layers of 3×512 for each critic and 2×256 for the actor, batch size 256, replay buffer of $10^6$ transitions, $\gamma=0.99$, $\tau=0.005$, and the Adam optimizer with learning rate $3\times10^{-4}$ (Kuznetsov et al., 2020, Dorka et al., 2021, Boré et al., 25 Feb 2025).

4. Overestimation Bias Control and Truncation Dynamics

TQC’s core mechanism for bias control is truncation of the pooled right tail of the quantile distribution. By discarding the highest $(M-k)N$ quantile atoms, regardless of which critics produced them, the algorithm eliminates the extreme upward outliers characteristic of overestimation. This strategy keeps the TD target (and thus the expected Q-value estimate) bounded away from the maximum, providing tighter, more reliable estimates throughout training.

The truncation parameter $d$ controls the tradeoff between bias and variance: increasing $d$ induces greater pessimism (potential underestimation), while $d=0$ corresponds to no bias correction and recovers standard distributional SAC. Empirical evidence shows that the optimal $d$ is environment dependent and typically small ($d\approx2$–3 for $M=25$) (Kuznetsov et al., 2020, Dorka et al., 2021). Ensembling multiple critics serves to decorrelate estimation errors, so truncation robustly targets only outlier atoms instead of systematically biasing a single critic’s distribution.
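The pessimism effect is easy to verify numerically: dropping more right-tail atoms monotonically lowers the mean of the target distribution (synthetic atoms; the seed and shapes below are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pooled, sorted atoms from N=5 critics with M=25 quantiles each.
pooled = np.sort(rng.normal(size=(5, 25)).ravel())

def truncated_mean(pooled_atoms, N=5, M=25, d=0):
    """Mean of the target distribution after dropping the top d*N atoms."""
    return pooled_atoms[:(M - d) * N].mean()

# Each increment of d removes right-tail mass, so the mean decreases.
assert truncated_mean(pooled, d=0) > truncated_mean(pooled, d=2) > truncated_mean(pooled, d=5)
```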

5. Empirical Performance and Benchmark Results

TQC exhibits state-of-the-art performance across the MuJoCo continuous control suite relative to SAC and TD3. Key results at 1 million environment steps (mean ± std over 10 seeds):

| Environment | SAC | TQC ($d=2$) | Improvement |
|---|---|---|---|
| Humanoid | 7.76 ± 0.46 | 9.54 ± 1.18 | +25% |
| Walker2d | 5.76 ± 0.46 | 7.03 ± 0.62 | +22% |
| HalfCheetah | 12.41 ± 5.14 | 18.09 ± 0.34 | +46% |

Ablations confirm that the performance gains result from both the ensembling and the tuned truncation mechanism. Notably, $N\ge3$ critics capture most of the variance reduction, and the benefit of larger $N$ saturates quickly.

On 6-DOF underwater vehicle control, TQC (with an appropriate reward function) achieves superior or comparable RMS error and settling time relative to a finely tuned PID controller, while also supporting explicit power-consumption constraints (Boré et al., 25 Feb 2025). The energy-aware TQC variant reduces power usage by 30% at the expense of slight accuracy reductions.

6. Practical Considerations and Hyperparameters

Critical hyperparameters include the number of ensemble critics $N$, the number of quantile atoms $M$, and the truncation size $d$; the defaults are $N=5$, $M=25$, and $d=2$.

Other settings: actor/critic MLP sizes (2×256 actor, 3×512 critic), Adam optimizer, Huber loss threshold $\kappa=1.0$, discount $\gamma=0.99$, batch size 256, and Polyak averaging coefficient $\tau=0.005$. Environment-specific tuning of $d$ can be required, though adaptive variants such as ACC automate this process (Dorka et al., 2021).

Best practices include always using target networks for quantile extraction during Bellman updates, monitoring the truncation ratio to avoid dominance by any single critic, and using non-truncated Q-values for policy updates to prevent compounding conservatism.

7. Extensions, Applications, and Limitations

Adaptive calibration of the truncation parameter, as explored in the ACC framework, removes the need for per-environment dd selection and improves generality (Dorka et al., 2021). TQC has proven effective beyond classical RL benchmarks, such as power-aware, direct end-to-end control of high-dimensional robotic systems without explicit system identification or thruster modeling (Boré et al., 25 Feb 2025).

Limitations include the need for hyperparameter tuning in standard settings, and the fact that results in certain domains, such as AUV control, have so far been demonstrated primarily in simulation rather than in physical deployments. Additional ablation studies on quantile count, critic architecture, and reward shaping remain open for further investigation. A plausible implication is that TQC’s architectural motif of distributional critics, truncation, and ensembling can be ported to other domains requiring robust bias correction.

