
Decision Transformer: Sequence Modeling in RL

Updated 9 February 2026
  • The paper presents the Decision Transformer, which recasts reinforcement learning as a supervised sequence modeling problem by integrating return-to-go, state, and action tokens.
  • Methodological innovations include predictive coding, hierarchical prompts, and counterfactual data augmentation to enhance trajectory stitching and robustness.
  • Empirical results demonstrate that DT achieves competitive or superior performance across diverse domains such as robotics, scheduling, and autonomous driving.

A Decision Transformer (DT) is a causal, autoregressive transformer architecture that models sequential decision-making by casting offline reinforcement learning (RL) as a sequence modeling problem. DTs consume interleaved trajectories of return-to-go, state, and action tokens, employing supervised learning to predict optimal actions conditioned on desired future returns, states, and past actions. The approach leverages the scalability and expressiveness of Transformer models, enabling competitive or superior performance relative to specialized offline RL algorithms across diverse benchmarks including discrete, continuous, multi-goal, multi-agent, and safety-critical domains (Chen et al., 2021).

1. Sequence Modeling Paradigm and Core Framework

DT recasts RL via supervised sequence modeling. Given an offline dataset of trajectories, each episode is represented as a sequence

$(\widehat{R}_1, s_1, a_1, \ldots, \widehat{R}_T, s_T, a_T),$

where $\widehat{R}_t = \sum_{t'=t}^{T} r_{t'}$ is the return-to-go at time $t$, $s_t$ is the state, and $a_t$ is the action. Each of these modalities is projected to a common embedding space, positionally encoded, and fed into a causal transformer. The architecture predicts the action token $a_t$ at each step $t$ conditioned on prior returns, states, and actions:

$P(a_t \mid \widehat{R}_{1:t}, s_{1:t}, a_{1:t-1}).$

Actions can be discrete (predicted via softmax/categorical cross-entropy loss) or continuous (via regression losses).

The training objective is standard cross-entropy or mean-squared error:

$\mathcal{L} = -\sum_{t=1}^{T} \log P_\theta(a_t \mid \widehat{R}_{1:t}, s_{1:t}, a_{1:t-1}),$

where $\theta$ denotes the transformer parameters. At inference, a desired return is specified as a prompt; after each step, the return prompt is decremented by the received reward (Chen et al., 2021).
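
The return-to-go sequence and the supervised objective above can be sketched concretely. This is a minimal NumPy illustration, not the reference implementation; the helper names and toy shapes are our own:

```python
import numpy as np

def returns_to_go(rewards):
    """R_hat[t] = sum of rewards from step t through T (inclusive)."""
    return np.cumsum(rewards[::-1])[::-1]

def nll_loss(logits, actions):
    """Categorical cross-entropy over predicted action logits.

    logits: (T, num_actions) raw scores; actions: (T,) integer indices.
    """
    z = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(actions)), actions].sum()

rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(returns_to_go(rewards))  # [4. 3. 3. 1.]
```

For continuous actions the cross-entropy term is replaced by a regression loss (e.g. mean-squared error) on the predicted action vector, as noted above.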

2. Architectural and Algorithmic Innovations

Input Tokenization and Conditioning

DTs interleave three token types per decision step: return-to-go, state, and action. Each token type has a dedicated embedding; time-step (or positional) encodings are added and tokens are causally masked, restricting attention to previous events. The number of previous steps (context length $k$) and architecture hyperparameters (layers, heads, embedding dimension) are task-dependent.
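
The causal mask over the interleaved token stream can be illustrated with a small sketch. The three-tokens-per-step layout follows the text; the helper name and NumPy representation are our own:

```python
import numpy as np

def causal_mask(context_len_k):
    """Lower-triangular attention mask over the 3*k interleaved tokens
    (return-to-go, state, action per timestep): token i may attend
    only to tokens j <= i, so an action prediction never sees the future."""
    n = 3 * context_len_k
    return np.tril(np.ones((n, n), dtype=bool))

mask = causal_mask(2)  # 2 timesteps -> 6 interleaved tokens
print(mask.shape)      # (6, 6)
```

In transformer libraries this mask is typically passed to the attention layer as an additive `-inf` mask rather than a boolean matrix, but the triangular structure is the same.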

During inference, the model is conditioned on an initial state and a target return (return-to-go); this mechanism enables return-conditioned imitation and flexible reward shaping, but precise target return choice may be nontrivial and highly sensitive (Luu et al., 2024, Jiang et al., 27 Jun 2025).
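
The return-conditioned inference loop decrements the prompt by each observed reward. A schematic rollout, where `model` and `env` are hypothetical stand-ins (the model maps token histories to an action; the environment has a Gym-style `step` returning `(state, reward, done)`):

```python
def rollout(model, env, target_return, max_steps=1000):
    """Return-conditioned rollout: after each step, the return-to-go
    prompt is reduced by the reward just received (Chen et al., 2021)."""
    rtgs, states, actions = [target_return], [env.reset()], []
    total = 0.0
    for _ in range(max_steps):
        a = model(rtgs, states, actions)  # predict next action token
        s, r, done = env.step(a)
        actions.append(a)
        total += r
        rtgs.append(rtgs[-1] - r)         # decrement the return prompt
        states.append(s)
        if done:
            break
    return total
```

In practice only the last $k$ timesteps of each history are fed to the model, matching the training context length.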

Extensions for Robustness and Expressivity

Several architectural and algorithmic augmentations have been proposed:

  • Predictive coding: Replaces or augments scalar return-to-go with learned, structured latent codes $z_t$ that encode both goal and temporal future context. The latent code is produced by an auxiliary trajectory autoencoder (often a bidirectional transformer) and is incorporated alongside standard tokens. This strengthens temporal compositionality and generalization in long-horizon and sparse-reward tasks (Luu et al., 2024).
  • Counterfactual data augmentation: Auxiliary models generate high-value counterfactual trajectories by proposing actions not seen in the dataset, evaluating them with learned models, and selectively augmenting the training distribution. Retraining DTs with these counterfactuals, as in CRDT, significantly improves “stitching” ability and performance under limited data (Nguyen et al., 14 May 2025).
  • Hierarchical prompts and goal-conditioning: DT is extended to handle hierarchical RL by incorporating high-level prompts (value, goal states, or symbolic subgoals) predicted by a separate policy. This enables effective trajectory stitching and superior performance over standard DTs in heterogeneous or multi-stage decision problems (Ma et al., 2023, Baheri et al., 10 Mar 2025, Rasanji et al., 19 Aug 2025).
  • Robustness enhancements: DT-based policies relabel return-to-go via minimax expectile regression to produce adversarially robust agents (ARDT), which achieve higher worst-case returns in both zero-sum games and continuous adversarial RL benchmarks (Tang et al., 2024).
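
The expectile regression underlying ARDT's relabeling can be sketched generically. This shows only the asymmetric loss itself, not ARDT's minimax relabeling procedure; the `tau` value and toy usage are illustrative:

```python
import numpy as np

def expectile_loss(pred, target, tau=0.9):
    """Asymmetric squared loss: residuals where target > pred are
    weighted by tau, the rest by (1 - tau). tau > 0.5 biases the
    estimate toward upper expectiles; tau < 0.5 toward lower ones.
    ARDT applies such losses inside an in-sample minimax relabeling
    of returns-to-go over adversary actions (Tang et al., 2024)."""
    diff = target - pred
    weight = np.where(diff > 0, tau, 1 - tau)
    return (weight * diff ** 2).mean()
```

At `tau = 0.5` this reduces to ordinary (halved) mean-squared error; the asymmetry is what lets the relabeled returns approximate worst-case rather than average outcomes.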

3. Empirical Performance and Domain-Specific Adaptations

DTs demonstrate strong empirical performance across diverse RL domains:

  • Classic RL (Atari, MuJoCo, D4RL): Matches or outperforms state-of-the-art offline RL (e.g., CQL, IQL) in standard score metrics, robustness to reward sparsity and delays, and long-horizon credit assignment (Chen et al., 2021).
  • Multi-Agent and Industrial Dispatching: DTs are effective as decentralized dispatching policies in large-scale material handling systems. Each agent’s local state embedding includes aggregated global statistics. Deterministic, moderate-quality demonstration data enables DTs to surpass heuristic baselines (4–6% throughput gain), but DTs fail to outperform when trained on low-skill or stochastic data (Lee et al., 2024).
  • Multi-Goal and Robotic Tasks: DTs conditioned on goals embedded in state tokens achieve results matching or exceeding online RL with HER (TQC+HER), particularly in sparse-reward and data-efficient regimes (Gajewski et al., 2024). Symbolically guided and hierarchical DTs further improve multi-agent collaboration and manipulation by combining high-level plans with tokenized subgoals (Rasanji et al., 19 Aug 2025, Baheri et al., 10 Mar 2025).
  • Scheduling and Combinatorial Optimization: DTs trained on neural local search (NLS) trajectories for job-shop scheduling outperform their teacher when longer search times are amortizable, leveraging richer temporal context and optimistic return priors (Puiseau et al., 2024).
  • Autonomous Driving: Uncertainty-weighted DTs that upweight high-entropy (uncertain, high-impact) tokens, such as UWDT, achieve improved safety and efficiency in dense roundabout navigation. Per-token entropy is estimated by a frozen teacher, then used to weight student model updates (Zhang et al., 16 Sep 2025).
  • Auto-Bidding in Ad Auctions: Extensions to RTG guidance (memorization, learned forecasting, high-RTG data augmentation) yield roughly 15% improvements in RTG/ROI over vanilla DT in auto-bidding applications (Jiang et al., 27 Jun 2025).
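
The uncertainty weighting described for UWDT can be sketched generically: per-token entropy from a frozen teacher's action distribution reweights the student's loss. The function names and the specific weighting form here are our assumptions, not the paper's exact scheme:

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of each row of a (T, num_actions) distribution."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def weighted_nll(student_log_probs, actions, teacher_probs):
    """Upweight tokens where the frozen teacher is uncertain.

    Entropy is normalized by log(num_actions) so weights lie in [1, 2]:
    confident teacher -> weight ~1; maximally uncertain -> weight ~2.
    """
    h = entropy(teacher_probs)
    w = 1.0 + h / np.log(teacher_probs.shape[1])
    nll = -student_log_probs[np.arange(len(actions)), actions]
    return (w * nll).mean()
```

The intent is that rare, high-impact decision points (e.g. merging into a roundabout) receive larger gradient contributions than routine tokens.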

4. Theoretical Analysis, Failure Modes, and Limitations

DTs possess several theoretical and practical properties unique to their autoregressive, return-conditioned paradigm:

  • Deadly triad avoidance: No Bellman backup or bootstrapping is required; simple supervised losses avoid divergence in off-policy RL (Chen et al., 2021).
  • Stitching limitations: Standard DTs cannot stitch optimal trajectories from suboptimal fragments. The return-to-go prompt provides only a lower bound for in-dataset values; without an explicit value function or model, fine-grained trajectory recombination is generally absent. Hierarchical or value-prompted DTs (ADT, V-ADT/G-ADT) address this with RL-based auxiliary critics and prompt-learning (Nguyen et al., 14 May 2025, Ma et al., 2023).
  • RTG redundancy: Conditioning on the full RTG history is theoretically redundant; only the latest RTG is necessary, as shown by the Decoupled Decision Transformer (DDT). This yields significant compute savings and performance improvements (Wang et al., 22 Jan 2026).
  • Adversarial settings: DTs trained solely on realizations from (potentially weak or stochastic) adversaries are brittle under stronger adversarial agents. ARDT relabels returns-to-go via in-sample minimax expectile regression, achieving Nash equilibrium strategies (Tang et al., 2024).
  • Generalization and out-of-distribution: Performance is highly sensitive to the quality and diversity of training data. Stochasticity, low-quality, and noisy data degrade DT’s ability to improve upon the generating policy (Lee et al., 2024). Adaptations such as counterfactual augmentation, uncertainty weighting, and predictive coding can mitigate these deficits.

5. DT Variants, Modifications, and Architectural Implications

Several DT variants have been developed:

| Variant | Key Modification | Primary Effect |
| --- | --- | --- |
| Predictive Coding DT | Condition on predictive codes $z_t$ | Structured, temporally rich conditioning |
| Hierarchical/Goal-ADT | HRL with learned prompts/critic | Enables value propagation/stitching |
| Counterfactual DT | Model-generated counterfactuals | Trajectory recombination, OOD generalization |
| Decoupled DT (DDT) | Only latest RTG in conditioning | Lower computation, stronger local attention |
| Adversarial Robust DT | Minimax return-to-go relabeling | Nash strategy, worst-case robustness |
| Uncertainty Weighted DT | Per-token entropy loss reweighting | Focus on rare/safety-critical states |

Beyond transformers, empirical studies such as Decision LSTM (DLSTM) show that RNN-based sequence models with the same sequence-tokenization paradigm can match or outperform transformer-based DTs in continuous control, with lower inference latency (Siebenborn et al., 2022). This suggests that sequence modeling, rather than self-attention itself, is the key enabling factor in many regimes.

6. Practical Methodological Considerations and Future Directions

Effective application of DTs involves several methodological choices:

  • Trajectory format and pre-processing: States, actions, and return-to-go must be standardized; episodic structure should be preserved; asynchronous events require agent-level tokenization in multi-agent systems (Lee et al., 2024).
  • Data quality and filtering: Deterministic, high-quality demonstrations are critical; removing noisy or stochastic segments enhances learning (Lee et al., 2024).
  • Target return selection: Test-time performance is not reliably monotonic in the prompted return, especially in safety-critical or low-diversity datasets (Lee et al., 2024, Luu et al., 2024).
  • Reward and state conditioning: Predictive coding or goal-planning tokens can replace or supplement RTG prompts for richer supervision (Luu et al., 2024, Baheri et al., 10 Mar 2025).
  • Compute trade-offs: Larger transformers confer stronger modeling but higher latency. Tasks with less stringent real-time constraints favor DTs over lightweight policies when long-term decision quality dominates (Puiseau et al., 2024).
  • Domain transfer and generalization: LoRA-style low-rank adapters and pretraining on large, diverse control datasets enable rapid few-shot adaptation and robust parameter-agnostic generalization (Zhang et al., 2024).
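
The standardization step mentioned above is routine but easy to get subtly wrong (statistics must come from the offline dataset, not the evaluation stream). A minimal sketch, with names of our choosing:

```python
import numpy as np

def standardize_states(states, eps=1e-8):
    """Z-score normalize states using dataset-level statistics, as is
    common when preparing offline trajectories for a DT. Returns the
    normalized states plus (mean, std) so the same transform can be
    applied unchanged at evaluation time."""
    mean = states.mean(axis=0)
    std = states.std(axis=0) + eps
    return (states - mean) / std, (mean, std)
```

Return-to-go values are typically rescaled by a fixed per-task constant rather than standardized, so that the target-return prompt remains interpretable.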

Future research highlights include integrating value-prediction with sequence modeling, automated prompt and subgoal discovery, more expressive reward and uncertainty conditioning strategies, active adaptation to data drift, and unified frameworks for multi-agent and hierarchical environments.


The Decision Transformer sits at the intersection of supervised sequence modeling and reinforcement learning, enabling powerful, scalable, and flexible policies in offline domains. However, targeted algorithmic extensions are required to handle data bias, stochasticity, adversarial environments, trajectory stitching, and real-world deployment. Ongoing work is refining DTs along the axes of theoretical grounding, efficient conditioning, and practical deployment across complex decision-making tasks (Chen et al., 2021, Lee et al., 2024, Luu et al., 2024, Zhang et al., 16 Sep 2025, Wang et al., 22 Jan 2026, Nguyen et al., 14 May 2025, Ma et al., 2023, Baheri et al., 10 Mar 2025).
