
Transformer Q-Learning (TQL)

Updated 3 February 2026
  • Transformer Q-Learning (TQL) models Q-functions with transformer architectures, replacing traditional MLPs or RNNs and operating on tokenized representations of state and action vectors.
  • It integrates attention-entropy regularization and temperature control to prevent attention collapse, ensuring stable gradient updates and improved learning dynamics.
  • Instantiations such as QDT, Q-Transformer, and QT-TDM demonstrate the paradigm's versatility, with performance improvements of up to 43% in scalable offline and multi-task reinforcement learning.

Transformer Q-Learning (TQL) refers to a class of reinforcement learning algorithms that leverage transformer architectures for representing and learning Q-functions, with architectural and algorithmic modifications designed to address the unique stability and scaling challenges of RL. The paradigm encompasses several approaches spanning value-based, policy-conditioned, and sequence-modeling methods, unified by the use of transformers either as direct Q-function approximators or as part of hybrid frameworks leveraging offline datasets, temporal abstraction, or in-context adaptation.

1. Formulation and Core Principles

At its core, Transformer Q-Learning replaces the conventional function approximators of Q-learning (MLPs, RNNs) with transformer networks, aiming to harness their scalability, sequence modeling, and representational power. The Q-function $Q_\phi(s, a)$ is modeled as a function of state and action vectors, which are tokenized into sequences and processed by the transformer for value estimation.

Key components include:

  • Tokenization: Each scalar of the state vector $s \in \mathbb{R}^{n_s}$ and action vector $a \in \mathbb{R}^{n_a}$ forms a token. A special $[\text{VALUE}]$ token is prepended, yielding a total token sequence of length $n = 1 + n_s + n_a$ (Dong et al., 1 Feb 2026).
  • Embedding Layer: Scalar tokens are projected into a hidden dimension, augmented with learned modality (state/action) and positional embeddings.
  • Transformer Stack: The sequence is fed through $L$ layers of (pre-layernorm) multi-head self-attention and MLP. Architectures vary from full encoder-decoders to GPT-style causal stacks (Liu et al., 2 Jun 2025).
  • Output Head: The final $[\text{VALUE}]$ token is decoded via an MLP (or an ensemble) to yield the scalar Q-value. Delayed target networks ($Q_{\phi'}$) stabilize updates.
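The tokenization and value-readout pipeline above can be sketched as follows. This is a minimal single-block, single-head NumPy illustration: the dimensions, random initializations, residual wiring, and linear output head are illustrative assumptions, not the exact architecture of any cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16
n_s, n_a = 3, 2          # state and action dimensionalities
n = 1 + n_s + n_a        # [VALUE] token + one token per scalar

def tokenize(s, a):
    # prepend a [VALUE] placeholder token (zeros here; learned in practice)
    return np.concatenate([np.zeros(1), s, a])   # length n = 1 + n_s + n_a

# scalar-to-vector projection plus modality (0=[VALUE], 1=state, 2=action)
# and positional embeddings
W_in = rng.normal(size=(1, d_model)) * 0.1
modality = rng.normal(size=(3, d_model)) * 0.1
pos = rng.normal(size=(n, d_model)) * 0.1
mod_ids = np.array([0] + [1] * n_s + [2] * n_a)

def embed(tokens):
    return tokens[:, None] @ W_in + modality[mod_ids] + pos   # (n, d_model)

def attention(X):
    # single-head self-attention (sketch; real TQL stacks L pre-LN blocks)
    scores = X @ X.T / np.sqrt(d_model)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)   # row-stochastic attention matrix
    return A @ X

w_out = rng.normal(size=d_model) * 0.1

def q_value(s, a):
    X = embed(tokenize(s, a))
    X = X + attention(X)            # residual attention block
    return float(X[0] @ w_out)      # decode the [VALUE] token to a scalar Q

q = q_value(rng.normal(size=n_s), rng.normal(size=n_a))
```

In a full implementation the attention block would be repeated $L$ times with layernorm and MLP sublayers, and the readout would be an MLP (or ensemble) rather than a single linear map.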

These principles underpin a range of models, from direct Q-function transformers (Dong et al., 1 Feb 2026, Stein et al., 2020) to autoregressive action sequence transformers (Chebotar et al., 2023, Kotb et al., 2024) and in-context RL agents (Liu et al., 2 Jun 2025).

2. Regularization and Stability: Preventing Attention Collapse

Scaling transformers as Q-function approximators surfaces unique pathologies, notably attention score collapse—as model capacity increases, softmax attention sharpens, causing most attention mass to concentrate on a vanishing subset of tokens. This yields brittle Q-landscapes and learning instability (Dong et al., 1 Feb 2026).

TQL addresses this via attention-entropy regularization:

  • Entropy Computation: For each layer $\ell$, the row-normalized attention matrix $A^{\ell}$ yields per-token and layer-averaged entropies:

$$H^{\ell}_i = -\sum_{j=1}^n A^{\ell}_{ij} \log A^{\ell}_{ij}, \qquad H^{\ell} = \frac{1}{n}\sum_{i=1}^n H^{\ell}_i$$

  • Temperature Regularization: With layerwise, learnable inverse-temperatures $\alpha^{\ell}$, losses enforce a fixed target entropy $\bar H$ per layer:

$$\mathcal{L}_{\text{temp}}(\alpha) = \frac{1}{L} \sum_\ell \alpha^\ell \left(H^\ell - \bar H\right)$$

$$\mathcal{L}_{\text{attn}}(\phi) = -\frac{1}{L}\sum_\ell \alpha^\ell H^\ell$$

  • Total Critic Loss: The full objective combines Q-learning with entropy control:

$$\mathcal{L}_{\text{critic}}(\phi, \alpha) = \underbrace{\mathbb{E}_{(s,a,r,s')}\left[\left(Q_\phi(s,a) - r - \gamma Q_{\phi'}(s', a')\right)^2\right]}_{\text{Bellman MSE}} + \mathcal{L}_{\text{attn}}(\phi) + \mathcal{L}_{\text{temp}}(\alpha)$$
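The layerwise entropies and the two regularizers can be computed as in the following NumPy sketch; the toy attention matrices and unit inverse-temperatures are illustrative assumptions.

```python
import numpy as np

def attention_entropy(A):
    # A: row-stochastic attention matrix (n, n); per-token entropy H_i,
    # averaged over tokens to give the layer entropy H^ell
    H_i = -(A * np.log(A + 1e-12)).sum(axis=1)
    return H_i.mean()

def tql_regularizers(attn_mats, alphas, H_bar):
    # attn_mats: one attention matrix per layer; alphas: learnable inverse
    # temperatures; H_bar: target entropy per layer
    H = np.array([attention_entropy(A) for A in attn_mats])
    L_temp = np.mean(alphas * (H - H_bar))   # drives alpha to track the entropy gap
    L_attn = -np.mean(alphas * H)            # pushes attention toward high entropy
    return L_attn, L_temp

# toy example: two layers, one nearly uniform, one nearly collapsed
n = 4
uniform = np.full((n, n), 1.0 / n)
collapsed = np.eye(n) * 0.97 + 0.01
collapsed /= collapsed.sum(axis=1, keepdims=True)
L_attn, L_temp = tql_regularizers([uniform, collapsed],
                                  np.array([1.0, 1.0]), np.log(n))
```

A uniform row attains the maximum entropy $\log n$, so the collapsed layer contributes a negative entropy gap and is penalized relative to the target.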

This design ensures broad, informative attention distributions and stable scaling to tens of millions of parameters. Empirically, it yields up to 43% improvement in performance when scaling from 0.4M to 26M parameters, in contrast with performance degradations in baseline architectures (Dong et al., 1 Feb 2026).

3. Architectures and Algorithmic Instantiations

Several concrete instantiations of Transformer Q-Learning have been proposed:

| Model | Core Q-Function Paradigm | Notable Regularization | Primary Use-Case/Domain |
|---|---|---|---|
| TQL | Direct state-action transformer | Attention entropy | Scalable offline RL; OGBench |
| Q-learning Decision Transformer (QDT) | Sequence model, RTG-conditioned | Q-value RTG relabeling | Offline RL, sub-optimal data |
| Q-Transformer | Autoregressive action tokens | Conservative Q-learning | Multi-task robotics, language-vision control |
| QT-TDM | Q-Transformer + planning model | Smooth-L1 TD loss | Sample-efficient, high-dim continuous tasks |
| SICQL | Multi-head in-context transformer | Upper-expectile regression | Few-shot, in-context RL |

  • TQL (Dong et al., 1 Feb 2026): Emphasizes entropy-regularized attention to unlock large-scale transformers for critic networks.
  • QDT (Yamagata et al., 2022): Relabels returns-to-go in datasets via an offline-trained Q-function, enabling policy transformers to stitch together optimal trajectories from suboptimal data.
  • Q-Transformer (Chebotar et al., 2023): Discretizes each action dimension, representing Q-values as a sequence of tokens for autoregressive max-backup; incorporates language/image tokens for multi-modal applications and utilizes a CQL-style penalty.
  • QT-TDM (Kotb et al., 2024): Combines a short-horizon Transformer Dynamics Model with an autoregressive Q-Transformer for efficient MPC planning in continuous spaces.
  • SICQL (Liu et al., 2 Jun 2025): Combines multitask in-context sequence modeling with Q-function heads and value function heads, using a pre-trained world model for compact prompting and advantage-weighted supervised policy regression.

4. Training Procedures, Objectives, and Losses

While all variants rely fundamentally on temporal-difference (TD) learning, they differ substantially in detail:

Vanilla Transformer Q-Learning (Direct Q-learning)

  • Standard Bellman loss on immediate state-action pairs, with target networks, experience replay, and entropy/variance regularization (Stein et al., 2020, Dong et al., 1 Feb 2026).
  • Variants utilize layernorm, learning rate tuning, and careful initialization for stability.
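The TD update with a delayed (EMA) target can be sketched as below. A linear function stands in for the transformer critic $Q_\phi$; the learning rate, discount, and EMA coefficient are illustrative, and the single-transition replay is a deliberate simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5
phi = rng.normal(size=dim)      # online critic parameters
phi_targ = phi.copy()           # delayed target parameters
gamma, tau, lr = 0.99, 0.005, 0.01

def q(params, sa):
    # linear stand-in for the transformer Q-function on a state-action encoding
    return params @ sa

def td_step(phi, phi_targ, sa, r, sa_next):
    target = r + gamma * q(phi_targ, sa_next)    # bootstrap; no grad through target
    err = q(phi, sa) - target
    phi = phi - lr * 2 * err * sa                # gradient of the squared Bellman error
    phi_targ = (1 - tau) * phi_targ + tau * phi  # EMA target-network update
    return phi, phi_targ, err ** 2

sa, sa_next = rng.normal(size=dim), rng.normal(size=dim)
losses = []
for _ in range(200):
    phi, phi_targ, loss = td_step(phi, phi_targ, sa, 1.0, sa_next)
    losses.append(loss)
```

The slowly moving target is what keeps the bootstrapped regression stable; with $\tau$ too large the target chases the online network and the Bellman error can oscillate.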

Sequence-Conditioned Transformer Approaches (QDT/DT)

  • Train on token sequences representing historical states, actions, and either observed or relabeled RTGs (Yamagata et al., 2022).
  • Loss is negative log-likelihood for next-action prediction. Key innovation is to “splice” optimal returns (from Q-learning) into RTGs for more effective bootstrapping.
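The RTG-relabeling step can be sketched as a backward pass over a trajectory, taking at each step the larger of the Q-estimate and the one-step bootstrapped return. This is a simplified rendering of the idea, with made-up toy values; the exact relabeling rule in QDT differs in detail.

```python
import numpy as np

def relabel_rtg(rewards, q_estimates, gamma=1.0):
    # walk the trajectory backwards; each return-to-go is the max of the
    # (assumed, offline-trained) Q-estimate and the reward plus the already
    # relabeled downstream RTG, so optimistic values propagate upstream and
    # "stitch" suboptimal segments
    T = len(rewards)
    rtg = np.zeros(T + 1)   # rtg[T] = 0 at the terminal step
    for t in reversed(range(T)):
        rtg[t] = max(rewards[t] + gamma * rtg[t + 1], q_estimates[t])
    return rtg[:-1]

# toy trajectory: observed returns are poor, but Q promises more from step 1
rewards = np.array([0.0, 0.0, 1.0])
q_est = np.array([0.5, 2.0, 0.0])
rtg = relabel_rtg(rewards, q_est)
```

Here the Q-estimate at step 1 overrides the observed return and is propagated back to step 0, which is exactly the stitching behavior the relabeling is meant to induce.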

Autoregressive Q-Transformers

  • Discretize each action dimension ($N$ bins per $D$-dimensional action) and represent the Q-values for each action coordinate as a separate token (Chebotar et al., 2023, Kotb et al., 2024).
  • The Bellman backup is performed sequentially per dimension: for $i < D$, the target is $\max_{a^{i+1}} Q(\cdot)$, and for $i = D$, it is $r + \gamma \max_{a^1} Q(\cdot)$.
  • Conservative Q-learning penalties on OOD actions stabilize offline training. Target networks use EMA updates.
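The per-dimension backup can be sketched as follows; the function signature and toy Q-values are illustrative assumptions, not the papers' interfaces.

```python
import numpy as np

def backup_targets(Q_prefix, Q_next_first, r, gamma, D):
    # Sketch of the autoregressive per-dimension Bellman backup: for i < D the
    # target is the max over the next dimension's bins given the current prefix
    # (no reward or discount yet); only the last dimension i = D bootstraps
    # from the next state, where the reward and discount enter once.
    targets = {}
    for i in range(1, D):
        targets[i] = Q_prefix[i].max()               # max over bins of a^{i+1}
    targets[D] = r + gamma * Q_next_first.max()      # max over bins of a^1 at s'
    return targets

# toy: D = 2 action dims, 2 bins each (values are made up for illustration)
targets = backup_targets({1: np.array([0.2, 0.7])}, np.array([1.0, 0.5]),
                         r=1.0, gamma=0.9, D=2)
```

Because each maximization is over a single dimension's $N$ bins, the backup costs $O(DN)$ rather than the $O(N^D)$ of a joint discrete maximization.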

In-Context and Multi-Head Architectures (SICQL)

  • Leverage causal transformers with policy, value, and Q heads, all conditioned on a prompt derived from recent trajectory chunks compressed by a world model encoder (Liu et al., 2 Jun 2025).
  • Bellman loss for $Q$, expectile regression for $V$, and an advantage-weighted supervised loss for the policy, all conditioned on the context prompt.
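The expectile-regression piece of this objective can be written as an asymmetric squared loss; the sketch below assumes `diff` is the target-minus-estimate residual and uses an illustrative $\tau = 0.9$.

```python
import numpy as np

def expectile_loss(diff, tau=0.9):
    # asymmetric squared loss for expectile regression: residuals where the
    # target exceeds the estimate (diff > 0) are weighted by tau, the rest by
    # 1 - tau, so the minimizer is the tau-expectile rather than the mean
    weight = np.where(diff > 0, tau, 1 - tau)
    return float((weight * diff ** 2).mean())
```

With $\tau$ close to 1 the loss penalizes under-estimation far more than over-estimation, which is what pushes $V$ toward an upper expectile of the target distribution.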

5. Empirical Findings and Benchmarks

TQL models demonstrate robust empirical results:

  • Scaling: Whereas baseline Q-value transformers degrade with increased parameter count, TQL shows up to 43% improvement on OGBench (Dong et al., 1 Feb 2026). Vanilla transformers experience layerwise entropy collapse, corresponding to over-sharp, less expressive Q-landscapes.
  • Multi-task Robustness: Q-Transformer achieves 56% average success on a 700-task robotics benchmark, outperforming Decision Transformer (33%) and IQL (27%) (Chebotar et al., 2023).
  • Stitching Ability: QDT closes the gap between sequence-model and dynamic-programming methods, achieving near-optimality in grid-world and MuJoCo benchmarks characterized by suboptimal data, where vanilla DT fails (Yamagata et al., 2022).
  • Sample Efficiency: QT-TDM attains order-of-magnitude improvements in wall-clock planning over prior token-dense planners, while matching or surpassing performance in sparse-reward, high-dimensional control (Kotb et al., 2024).
  • In-context RL: SICQL attains improved task generalization and learning from suboptimal data via dynamic world model prompting and value-policy multi-heads (Liu et al., 2 Jun 2025).

6. Limitations, Challenges, and Design Considerations

Although TQL unlocks the scaling of transformers for reinforcement learning, several technical factors are critical:

  • Architectural Sensitivity: Transformer-based RL is fragile to placement of layer norm, initialization, gradient clipping, and regularization (Stein et al., 2020).
  • Action Discretization: Autoregressive Q-transformers mitigate the curse of dimensionality of joint discrete action spaces but remain sensitive to bin count and sequential backup bias (Chebotar et al., 2023, Kotb et al., 2024).
  • Off-policy and OOD Data: Conservative Q-learning penalties and entropy regularization are often necessary to prevent divergence on offline or mixed-quality data (Dong et al., 1 Feb 2026, Chebotar et al., 2023).
  • Prompting and Multi-task Generalization: Effective world model compression for prompting is crucial for in-context variants; poor context encodings can limit policy adaptation (Liu et al., 2 Jun 2025).
  • Planning Overhead: Models requiring long sequence tokenization or long-horizon model-predictive planning (e.g., Generalist TDM) incur significant inference costs; hybrid approaches like QT-TDM alleviate this via short-horizon rollouts with Q-terminal value approximation (Kotb et al., 2024).
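The short-horizon rollout with a Q-terminal value can be sketched as below; the 1-D dynamics, reward, and terminal Q are stand-ins invented for illustration.

```python
def evaluate_plan(s0, actions, dynamics, reward, terminal_q, gamma=0.99):
    # QT-TDM-style scoring sketch: roll a candidate action sequence through a
    # learned dynamics model for a short horizon, then bootstrap the truncated
    # tail with a terminal Q-value instead of planning to the full horizon
    s, total, disc = s0, 0.0, 1.0
    for a in actions:
        total += disc * reward(s, a)
        s = dynamics(s, a)
        disc *= gamma
    return total + disc * terminal_q(s)   # Q approximates the remaining return

# toy 1-D example (all models are stand-ins): moving toward the origin
# scores higher than moving away
dynamics = lambda s, a: s + a
reward = lambda s, a: -abs(s)
terminal_q = lambda s: -abs(s)
toward = evaluate_plan(1.0, [-1.0], dynamics, reward, terminal_q, gamma=1.0)
away = evaluate_plan(1.0, [1.0], dynamics, reward, terminal_q, gamma=1.0)
```

An MPC loop would score many sampled action sequences this way, execute the first action of the best one, and replan; the terminal Q is what lets the horizon stay short without ignoring long-term return.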

7. Outlook and Broader Implications

Transformer Q-Learning establishes a scalable, stable foundation for integrating high-capacity sequence models into Q-learning and value-based RL. Its versatility is highlighted by:

  • Strong empirical scaling in high-dimensional offline RL.
  • Hybridization with world models and MPC, enabling both effective learning and sample-efficient planning.
  • Applicability to multi-modal, language-conditioned, and multi-task settings via flexible tokenization and context embedding.
  • Theoretical and mechanistic foundations via entropy maximization, which tie attention regularization to improved landscape smoothness and avoidance of collapse.

Ongoing research continues to investigate further improvements in sample efficiency, bridging value-based transformers with actor-critic and policy-gradient paradigms, and pushing generalization toward few-shot and in-context regimes. Attention entropy control and autoregressive Q-learning are now central tools in closing the scalability and robustness gap between transformers in supervised settings and reinforcement learning applications.


