Decision-Pretrained Transformers
- Decision-Pretrained Transformers are transformer-based models pretrained on historical trajectory data to predict optimal actions in sequential decision tasks.
- They combine supervised, unsupervised, and prompt-based pretraining protocols to improve sample efficiency and long-range credit assignment in RL and bandit settings.
- Recent enhancements, including working memory modules and hybrid architectures, address scalability challenges and domain mismatches for robust real-world applications.
A decision-pretrained transformer is a transformer-based model that has been pretrained to act as a decision-making algorithm in sequential problems—typically reinforcement learning (RL), bandit tasks, or sequential decision processes—by leveraging large, diverse datasets of prior trajectories, decisions, and outcomes. This paradigm can span both supervised and unsupervised learning, utilize prompt or context-based adaptation, and may incorporate specialized architectural modifications for efficient long-range memory, sample efficiency, or in-context learning. The core principle is to distill the mapping from observed histories to high-quality actions directly into the transformer's parameters via pretraining, yielding a foundation policy that generalizes, adapts, and/or explores in new environments with minimal downstream fine-tuning.
1. Foundations: Sequence Modeling for Decision Making
Transformers for decision-making abstract sequential problems as autoregressive sequence modeling. Classical RL can be expressed as modeling a trajectory with a return-to-go tokenization, $\tau = (\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots, \hat{R}_T, s_T, a_T)$, where $\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$ and the conditional distribution over actions is factorized autoregressively as $p_\theta(a_t \mid \hat{R}_{\le t}, s_{\le t}, a_{<t})$. The Decision Transformer (DT) operationalizes this by embedding tokens for state, action, and return-to-go, interleaving modalities, and fitting a causally masked transformer to offline datasets (Chen et al., 2021). This generic framework removes policy gradients and Bellman targets, instead learning direct action prediction conditioned on a desired outcome (return), enabling effective credit assignment in long-horizon tasks.
Pretraining a transformer with a supervised action-prediction loss,

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{\tau \sim \mathcal{D}} \Big[ \sum_{t} \log p_\theta\big(a_t \mid \hat{R}_{\le t}, s_{\le t}, a_{<t}\big) \Big],$$

engenders a strong inductive bias toward sequential reasoning, in-context adaptation, and policy generalization.
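As a concrete illustration, the return-to-go computation and interleaved token layout described above can be sketched in a few lines (a minimal NumPy sketch of the $(\hat{R}, s, a)$ interleaving, not any specific library's implementation):

```python
import numpy as np

def returns_to_go(rewards):
    """Suffix sums: rtg[t] = sum of rewards from step t to the end."""
    return np.cumsum(rewards[::-1])[::-1]

def interleave_tokens(rtg, states, actions):
    """Interleave (rtg_t, s_t, a_t) triples into one token stream,
    as in Decision Transformer-style trajectory modeling."""
    tokens = []
    for r, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", float(r)), ("state", s), ("action", a)])
    return tokens

rewards = np.array([0.0, 1.0, 0.0, 2.0])
rtg = returns_to_go(rewards)
print(rtg.tolist())  # [3.0, 3.0, 2.0, 2.0]
```

A causal transformer trained on such streams predicts the action token at each position given everything to its left.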
2. Supervised and In-Context Pretraining Protocols
Decision-pretrained transformers rely on extensive pretraining protocols over diverse tasks and contexts:
- Supervised Decision Pretraining: Models such as the Decision-Pretrained Transformer (DPT) and variants (Lee et al., 2023, Lin et al., 2023) learn from tuples of a query state, a prompt of historical transitions, and the optimal next action across diverse tasks. The loss is usually cross-entropy over optimal actions (or MSE for continuous actions), and increasingly leverages in-context datasets to enable meta-learning. The transformer thus internalizes the algorithmic mapping from dataset to optimal actions, yielding emergent online exploration and offline conservatism (Lee et al., 2023).
- Unsupervised and Future-Conditioned Pretraining: Methods such as the Pretrained Decision Transformer (PDT) depart from reward conditioning by using future trajectory embeddings: a latent $z$ summarizing possible futures. A GPT backbone jointly learns to predict actions given the history and $z$, even from reward-free datasets (Xie et al., 2023).
- Prompt-Based Multi-task Pretraining: Prompting Decision Transformer (Prompt-DT) introduces stochastic trajectory prompts, serving as conditioning tokens that steer the model toward different tasks during multi-task RL (Rietz et al., 7 Feb 2025). These prompts, sampled from demonstrations, disambiguate tasks and guide downstream adaptation.
- Reward Prediction and Algorithmic Distillation: Recent protocols extend beyond cross-entropy action labels by predicting per-arm rewards directly (Mukherjee et al., 2024), allowing models to learn efficient in-context multi-task bandit strategies without access to optimal actions.
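The supervised decision-pretraining recipe can be sketched for a Bernoulli bandit: sample a hidden task, roll out an exploratory in-context dataset, and label the example with the task's optimal action (a hedged sketch; the sampling distributions and tuple shape are illustrative, not the papers' exact protocol):

```python
import random

def sample_pretraining_tuple(n_arms=5, context_len=20, seed=0):
    """One DPT-style training example for a Bernoulli bandit:
    an in-context dataset of (arm, reward) pairs plus the
    optimal-action label the model is trained to predict."""
    rng = random.Random(seed)
    means = [rng.random() for _ in range(n_arms)]   # hidden task parameters
    context = []
    for _ in range(context_len):                    # exploratory rollout
        arm = rng.randrange(n_arms)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        context.append((arm, reward))
    optimal_arm = max(range(n_arms), key=lambda a: means[a])
    return context, optimal_arm                     # cross-entropy target

context, label = sample_pretraining_tuple()
print(len(context), label)
```

Training on many such tuples across tasks is what lets the transformer internalize the dataset-to-optimal-action mapping.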
3. Model Architectures and Efficiency Enhancements
Standard decision-pretrained transformers use GPT-style causal architectures with interleaved modality embeddings and autoregressive tokenization. However, computational and memory bottlenecks of quadratic attention have led to several architectural advances:
- Decision Mamba: Replaces the transformer backbone with Mamba, a state-space model (SSM)-based sequence processor. This reduces computational cost to near-linear in context length, with strong long-term dependency modeling (Huang et al., 2024). DM-H is a hybrid design that generates latent sub-goals via Mamba for long contexts, then uses a small transformer for high-quality local predictions, achieving state-of-the-art sample efficiency and a substantial throughput speedup over transformer baselines.
- Working Memory Extensions: Explicit memory modules augment the transformer’s hidden state, providing distributed, content-addressable storage and adaptive retrieval to counteract catastrophic forgetting in multi-task regimes (Kang et al., 2023). LoRA fine-tuning adapts memory modules to new tasks with minimal parameter updates.
- LLM Initializations and Markov Heads: Pretrained LLMs (e.g., GPT-2, DistilGPT2) provide powerful inductive priors to DTs for few-shot adaptation (Yang et al., 2024, Zhao et al., 2024). However, inherent Markov heads (strong diagonal attention to last token) may bias such models toward short-term dependencies, requiring attention-reweighting schemes (Mixture of Attention) for long-horizon RL.
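At its core, the working-memory idea above (content-addressable storage with soft retrieval) reduces to attention over a persistent memory matrix; a minimal NumPy sketch, with illustrative slot counts and no relation to any particular paper's module sizes:

```python
import numpy as np

def memory_read(memory, query):
    """Soft, content-addressable read: softmax similarity over memory
    slots, then a convex combination of slot contents."""
    scores = memory @ query                  # similarity to each slot
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax addressing
    return weights @ memory                  # retrieved vector

memory = np.eye(4)                           # 4 slots, 4-dim contents (toy)
query = np.array([10.0, 0.0, 0.0, 0.0])     # strongly matches slot 0
out = memory_read(memory, query)
print(out.round(3))
```

A write operation would update slots by a similar addressing scheme; LoRA-style adaptation then only touches the small projection matrices around this module.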
4. Adaptation, Prompt Tuning, and Online Finetuning
Decision-pretrained transformers excel at in-context learning and prompt-based adaptation:
- Bandit-Based Prompt Tuning: In Prompt-DT, adaptation to new tasks is handled by bandit algorithms that select high-quality demonstration segments as prompts at inference time, tuning performance without retraining massive transformer backbones (Rietz et al., 7 Feb 2025).
- Online Finetuning and RL Gradient Augmentation: Online Decision Transformer (ODT) blends offline pretraining with online entropy-regularized finetuning. Integrating RL gradients (e.g., TD3) into finetuning provides local improvement directions, eliminating performance stalls when RTG prompts are out-of-distribution (Yan et al., 2024, Zheng et al., 2022).
- Curiosity Regularization: Augmenting DPT with an auxiliary reward-prediction module (forming a Prediction-Powered Transformer, PPT) injects a curiosity-based exploration bonus, distilling exploration via prediction error during pretraining and enhancing robustness under data distribution shift (Yang et al., 30 Sep 2025).
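Bandit-based prompt tuning treats each candidate demonstration segment as an arm and selects prompts by, e.g., UCB1 on observed episode returns; a hedged sketch in which the per-prompt return functions stand in for actually running the frozen model with that prompt as context:

```python
import math

def ucb_prompt_selection(prompt_returns, n_rounds=200, c=1.0):
    """UCB1 over candidate prompts; prompt_returns[i]() simulates the
    episode return obtained when prompt i is used as context."""
    k = len(prompt_returns)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, n_rounds + 1):
        if t <= k:
            i = t - 1                        # pull each arm once first
        else:
            i = max(range(k), key=lambda a: sums[a] / counts[a]
                    + c * math.sqrt(math.log(t) / counts[a]))
        sums[i] += prompt_returns[i]()
        counts[i] += 1
    return counts

# Deterministic stand-in returns: prompt 2 is the best demonstration.
counts = ucb_prompt_selection([lambda: 0.1, lambda: 0.5, lambda: 0.9])
print(counts)
```

The selector concentrates pulls on the highest-return prompt without ever updating the transformer's weights.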
5. Theoretical Analysis and Algorithmic Guarantees
Recent work provides theoretical justification for decision-pretrained transformers as efficient, sample-optimal RL and bandit solvers:
- Equivalence to Bayesian Posterior Sampling: Under appropriate data sampling and model realizability, supervised decision-pretraining can be rigorously viewed as Bayesian posterior sampling, achieving provably sublinear regret in bandits and finite-horizon MDPs (Lee et al., 2023, Lin et al., 2023). Transformers with ReLU attention heads can efficiently approximate LinUCB, Thompson Sampling, and UCB-VI algorithms.
- Generalization Bounds under Distribution Mismatch: When the pretraining and test-time data distributions diverge, the excess regret scales with a distribution-divergence factor and the model's covering number; pretraining under exploratory contexts reduces sample complexity (Lin et al., 2023).
- Performative Prediction and OOD Risk: Pretraining transformers on simulated data, including self-generated rollouts, closes train-test distribution gaps and ensures robust performative risk bounds (Wang et al., 2024).
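The posterior-sampling view can be made concrete: for a Bernoulli bandit with Beta priors, the target behavior a decision-pretrained model approximates is Thompson Sampling, sketched below in standard form (not the papers' code):

```python
import random

def thompson_step(successes, failures, rng):
    """One Thompson Sampling decision: sample a mean from each arm's
    Beta posterior and act greedily on the samples."""
    samples = [rng.betavariate(s + 1, f + 1)    # Beta(1,1) prior
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])

rng = random.Random(0)
means = [0.2, 0.8]                              # hidden arm means
succ, fail = [0, 0], [0, 0]
for _ in range(500):
    arm = thompson_step(succ, fail, rng)
    if rng.random() < means[arm]:
        succ[arm] += 1
    else:
        fail[arm] += 1
print(succ, fail)  # arm 1 should accumulate most of the pulls
```

A pretrained transformer that matches the posterior over optimal actions reproduces exactly this sample-then-act behavior in context.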
6. Practical Applications and Empirical Performance
Decision-pretrained transformers are state-of-the-art across a range of benchmarks:
| Model | Domain | Key Metric | Performance |
|---|---|---|---|
| DM-H (Decision Mamba–Hybrid) | D4RL, Grid World, Tmaze | Online speed / test return | Faster online rollouts; SOTA sample efficiency |
| Prompt-DT + Bandit | Sparse 2D point, Cheetah | Average return, variance | Outperforms uniform prompt sampling; lower variance |
| LPDT (LM-init. Prompt-DT) | MuJoCo, MW ML1 | Test episode return | Strong in low-data regimes; SOTA few-shot |
| ODT + RL Gradient | Adroit, MuJoCo, AntMaze | Final normalized score | Improves over pretraining alone; SOTA finetuning |
| MetaTree (Decision-Tree DT) | Tabular datasets | Test acc. (depth-2 trees) | Higher accuracy than GOSDT/CART; lower variance |
| PFN-Boost (TabPFN+GBDT) | UCI tabular | Mean AUC / z-score | SOTA on all but the smallest datasets |
In all cases, decision-pretraining enables rapid generalization, high sample efficiency, and often superior transfer or zero-shot adaptation compared with classical RL/bandit methods, behavior cloning, or model-based baselines.
7. Limitations, Extensions, and Future Directions
Current challenges and frontiers for decision-pretrained transformers include:
- Handling Domain Mismatch: LM-initialized architectures require careful management of domain gaps (language vs. trajectories), and Markov heads bias short-term attention, necessitating MoA or hybrid architectures for robust planning (Zhao et al., 2024).
- Scaling and Data Requirements: Performance increases with model, data volume, and data quality, suggesting joint scaling laws. Offline data diversity and quality remain bottlenecks in generalization (Lv et al., 15 Jan 2026).
- Exploration versus Exploitation Balance: Implicit exploration can emerge via stochastic pretraining and in-context sampling, but explicit exploration regularizers (e.g., curiosity bonuses, bandit prompt tuning) may be required for robustness in strongly out-of-distribution settings (Yang et al., 30 Sep 2025).
- Extending to New Domains: Decision-pretrained transformers are being extended to tabular, sequential, and multi-modal (language plus trajectory) data, fusion with GBDTs for scalable tabular learning, and hybrid memory architectures for lifelong and hierarchical RL (Zhuang et al., 2024, Jayawardhana et al., 4 Feb 2025).
- Algorithmic Meta-Learning: By meta-learning to imitate or surpass algorithmic baselines (e.g., LinUCB, Oracle Trees), transformers can inherit or exceed the inductive biases of classical methods.
In summary, decision-pretrained transformers provide a unifying foundation for RL, bandit problems, and general sequential decision-making by internalizing decision algorithms into context-aware sequence models, enabling sample-efficient, adaptable, and theoretically grounded policy learning across a wide spectrum of environments (Huang et al., 2024, Rietz et al., 7 Feb 2025, Lin et al., 2023, Lv et al., 15 Jan 2026).