
Trajectory-Mixed Supervision (TMS)

Updated 10 February 2026
  • Trajectory-Mixed Supervision (TMS) is a paradigm that fuses diverse supervision signals from multiple trajectories, policies, or modalities to address supervision mismatch.
  • It reduces issues like catastrophic forgetting and mode collapse by integrating intermediate checkpoints and multi-modal data, enhancing sequential prediction and planning.
  • Empirical results show TMS delivers robust performance improvements in LLM tuning, trajectory prediction, retrieval, and game-theoretic planning with state-of-the-art metrics.

Trajectory-Mixed Supervision (TMS) is a paradigm for leveraging diverse or dynamically constructed supervision derived from multiple trajectories, policies, or modalities, rather than fixed pointwise labels or single references, in order to improve generalization, robustness, and task transfer in sequential prediction, planning, and control tasks. TMS has been advanced in various fields, including reinforcement learning (RL), LLM alignment, trajectory prediction, game-theoretic planning, and large-scale trajectory retrieval. Central to TMS is the idea of constructing supervision signals from a mixture or fusion of trajectory-level information—either sampled from intermediate policies, alternative future branches, or multimodal semantic encodings—to better capture the structure, diversity, and underlying invariances of the domain compared to standard, static, pointwise supervision.

1. Theoretical Motivation and Supervision Mismatch

Supervision mismatch arises when there is a divergence between the model’s evolving policy $\pi_\theta$ and the fixed supervision distribution $q(y \mid x)$, a phenomenon particularly acute in methods such as Supervised Fine-Tuning (SFT) for LLMs. In these regimes, catastrophic forgetting and mode collapse are observed because the model is repeatedly corrected toward a single reference or narrow support while its proper solution set is inherently multimodal. TMS directly addresses this by mixing supervision from the model’s own trajectory history, intermediate checkpoints, or diverse modalities to minimize the policy–label divergence (PLD):

\mathrm{PLD}\bigl(\pi_{\theta}, \hat{\pi}\bigr) = \mathbb{E}_{x \sim \mathcal{D}_x} \left[ D_{\mathrm{KL}}\bigl(\hat{\pi}(\cdot \mid x) \,\|\, \pi_{\theta}(\cdot \mid x)\bigr) \right]

By aligning supervision with the policy support, TMS mitigates late-stage drift in PLD and prevents the loss of general capabilities, thus bridging the gap between RL and SFT with respect to stability and retention (Khan et al., 3 Feb 2026).
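As a concrete illustration, the PLD above can be estimated as the mean KL divergence between label and policy distributions over a batch of inputs. The following is a minimal sketch using toy categorical distributions; the function names and the example values are illustrative, not from the cited work.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as probability arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def policy_label_divergence(label_dists, policy_dists):
    """PLD: mean KL(label || policy) over a batch of inputs x ~ D_x."""
    return float(np.mean([kl_divergence(p_hat, pi)
                          for p_hat, pi in zip(label_dists, policy_dists)]))

# Toy example: a one-hot supervision distribution vs. a multimodal policy.
labels = [np.array([1.0, 0.0, 0.0])]
policy = [np.array([0.5, 0.3, 0.2])]
pld = policy_label_divergence(labels, policy)  # = -log(0.5) for a one-hot label
```

A high PLD under one-hot labels, as here, is exactly the mismatch regime that mixed-trajectory supervision is designed to reduce.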

2. Algorithmic Instantiations Across Domains

TMS has been implemented using several algorithmic designs tailored to different domains:

2.1 Reward-Free, On-Policy SFT for LLMs

TMS operates as a two-stage curriculum:

  • Trajectory harvesting: Save $T$ intermediate checkpoints $\{\theta_1, \dots, \theta_T\}$ during SFT and, for each input $x$, sample predictions $\hat{y}^{(t)}(x)$ from each checkpoint.
  • Trajectory mixture supervision: Define

m(\cdot \mid x) = \tfrac{1}{T} \sum_{t=1}^{T} \delta_{\hat{y}^{(t)}(x)}(\cdot)

Optionally interpolate this mixture with the original oracle labels via weight $\alpha$:

q_\alpha(\cdot \mid x) = \alpha\, q(\cdot \mid x) + (1 - \alpha)\, m(\cdot \mid x)

  • Student training: Minimize the forward KL to $q_\alpha$, yielding a fine-tuned policy with substantially reduced catastrophic forgetting and mode collapse compared to SFT, and performance within 1–2% of RL on accuracy–retention metrics (Khan et al., 3 Feb 2026).
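The two-stage curriculum above can be sketched in a few lines for categorical outputs. This is a simplified illustration, assuming each checkpoint's predictive distribution is available directly (rather than via sampled tokens); the function names are hypothetical.

```python
import numpy as np

def trajectory_mixture(checkpoint_dists):
    """Uniform mixture m(.|x) over per-checkpoint predictive distributions."""
    return np.mean(np.stack(checkpoint_dists), axis=0)

def interpolated_target(oracle_dist, mixture_dist, alpha):
    """q_alpha = alpha * q + (1 - alpha) * m, the interpolated supervision target."""
    return alpha * np.asarray(oracle_dist) + (1 - alpha) * np.asarray(mixture_dist)

def forward_kl(target, policy, eps=1e-12):
    """Forward KL D_KL(q_alpha || pi_theta), the student training objective."""
    target = np.asarray(target)
    policy = np.asarray(policy)
    return float(np.sum(target * (np.log(target + eps) - np.log(policy + eps))))

# Toy: T = 3 checkpoints over a 3-way output space; oracle label is one-hot.
ckpts = [np.array([0.7, 0.2, 0.1]),
         np.array([0.4, 0.4, 0.2]),
         np.array([0.1, 0.6, 0.3])]
m = trajectory_mixture(ckpts)                        # mixture of checkpoint outputs
q_alpha = interpolated_target([1.0, 0.0, 0.0], m, alpha=0.5)
loss = forward_kl(q_alpha, np.array([0.5, 0.3, 0.2]))
```

Because $q_\alpha$ keeps mass on modes the model itself produces, the forward-KL gradient no longer pushes the policy toward a single narrow reference.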

2.2 Trajectory Segment Chaining and Branching for Prediction

In vehicle trajectory prediction, TMS is instantiated as a multi-branch self-supervised predictor that chains predictions across multiple future segments and launches additional “overshooting” branches at intermediate points. Training objectives combine:

  • Future trajectory regression
  • Multi-modal mode classification
  • Latent context consistency
  • Predict-the-past reconstruction

Tree-like multi-branching enables fusion of multi-modal future rollouts, with beam-search heuristics pruning $k^N$ multi-branch possibilities down to practical candidate sets. Quantitatively, this yields a roughly 25% reduction in average displacement error compared to one-shot predictors, with meaningful, segment-wise uncertainty metrics derived from reconstruction and dropout variances (Janjoš et al., 2023).
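The $k^N$-to-beam pruning step can be sketched as standard beam search over per-segment mode scores. The scoring signal here (per-segment mode log-probabilities) is a stand-in for whatever branch-quality heuristic the predictor uses; the concrete values are illustrative.

```python
import numpy as np

def beam_prune(mode_scores, beam_width):
    """
    Prune the k^N tree of multi-branch rollouts with beam search.
    mode_scores: list of length N; entry i is a length-k array of
    per-segment mode log-scores (an assumed, hypothetical scoring signal).
    Returns the beam_width best mode sequences with their cumulative scores.
    """
    beams = [((), 0.0)]
    for seg_scores in mode_scores:
        candidates = [(seq + (k,), score + float(seg_scores[k]))
                      for seq, score in beams
                      for k in range(len(seg_scores))]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the best partial rollouts
    return beams

# Toy: N = 3 segments with k = 3 modes each -> 27 rollouts pruned to 4.
scores = [np.log([0.6, 0.3, 0.1]) for _ in range(3)]
best = beam_prune(scores, beam_width=4)
```

With a beam width of 4, only 12 candidates per segment are ever scored instead of the full 27-leaf tree, which is what keeps multi-segment chaining tractable.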

2.3 Generalized Modality Fusion for Retrieval

For large-scale trajectory retrieval, TMS is manifested as omni-semantic supervision—embedding four complementary modalities: raw trajectories, topological keypoints, road IDs, and geographic regions. Each modality has its own Transformer-based encoder, and representations are aligned using bidirectional InfoNCE losses:

L_{m \to n} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\bigl(\mathrm{sim}(h_m^{(i)}, h_n^{(i)})/\tau\bigr)}{\sum_{j=1}^{N} \exp\bigl(\mathrm{sim}(h_m^{(i)}, h_n^{(j)})/\tau\bigr)}

At retrieval, trajectories can be queried or fused via any subset of modalities, with cross-modal alignment ensuring flexible condition-based retrieval and disentanglement of semantic subspaces. Ablations show that incorporating all four modalities delivers up to 0.909 MRR and 0.857 Hit@1, outperforming single-modality baselines (Zhu et al., 23 May 2025).
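The bidirectional InfoNCE objective above can be sketched directly from the formula, using cosine similarity for $\mathrm{sim}(\cdot,\cdot)$. The embedding dimensions and the modality pairing in the toy example below are assumptions for illustration.

```python
import numpy as np

def info_nce(h_m, h_n, tau=0.1):
    """One direction L_{m->n}: matched pairs are positives, the rest negatives."""
    h_m = h_m / np.linalg.norm(h_m, axis=1, keepdims=True)
    h_n = h_n / np.linalg.norm(h_n, axis=1, keepdims=True)
    sim = (h_m @ h_n.T) / tau                     # cosine similarity / temperature
    sim = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # -log softmax on the diagonal

def bidirectional_info_nce(h_m, h_n, tau=0.1):
    """Symmetric alignment: average of L_{m->n} and L_{n->m}."""
    return 0.5 * (info_nce(h_m, h_n, tau) + info_nce(h_n, h_m, tau))

rng = np.random.default_rng(0)
h_traj = rng.normal(size=(8, 16))                     # raw-trajectory embeddings
h_road = h_traj + 0.01 * rng.normal(size=(8, 16))     # near-aligned road-ID embeddings
loss_aligned = bidirectional_info_nce(h_traj, h_road)
loss_random = bidirectional_info_nce(h_traj, rng.normal(size=(8, 16)))
```

Aligned modality pairs produce a much lower loss than unrelated embeddings, which is the signal that pulls the four modality encoders into a shared space.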

2.4 Masked Trajectory Reconstruction for Universal Control and Representation

Masked Trajectory Models (MTM) operationalize TMS by training Transformers to reconstruct arbitrary subsets of state–action trajectories, conditioned on random mask patterns. A single pretrained MTM can be repurposed for forward/inverse modeling, behavior cloning, return-conditioned control, and representation learning, selected solely by mask configuration. This design unifies and often matches or surpasses task-specific models across D4RL and Adroit benchmarks in behavior cloning and offline RL, and delivers features that accelerate downstream RL learning (Wu et al., 2023).
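The mask-selects-the-task idea can be made concrete with a small mask generator over (state, action) token streams. The specific mask patterns below are an illustrative reading of the roles described above, not the exact configurations used in the cited work.

```python
import numpy as np

def task_mask(task, horizon, seed=0):
    """
    Boolean visibility masks over a length-`horizon` (state, action) trajectory;
    True = token given to the model, False = token to be reconstructed.
    Illustrative MTM-style mask patterns (an assumed simplification).
    """
    s = np.zeros(horizon, dtype=bool)
    a = np.zeros(horizon, dtype=bool)
    if task == "forward_dynamics":       # reconstruct s_{t+1} from s_t, a_t
        s[:-1] = True
        a[:-1] = True
    elif task == "inverse_dynamics":     # reconstruct actions from all states
        s[:] = True
    elif task == "pretrain_random":      # pretraining: random mask pattern
        rng = np.random.default_rng(seed)
        s = rng.random(horizon) < 0.5
        a = rng.random(horizon) < 0.5
    return s, a

s_mask, a_mask = task_mask("forward_dynamics", horizon=5)
```

The same pretrained reconstruction network is then queried with whichever mask matches the downstream role, with no task-specific retraining of the backbone.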

2.5 Mixed-Strategy Lifting in Game-Theoretic Planning

In two-player trajectory games, TMS arises by lifting each player’s strategy space to mixtures over a learned set of $n_i$ trajectory candidates. An offline phase trains generators that produce trajectory references; an online phase solves a bimatrix game over payoff matrices derived from the agents’ cost functions. This yields Nash equilibria in the space of mixed trajectory distributions, delivering higher robustness and strategic diversity, and achieving computation speeds comparable to standard MPC (Peters et al., 2022).
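The online bimatrix stage can be sketched with a simple equilibrium solver. The cited work solves the game with a dedicated bimatrix solver; fictitious play is used here only as a minimal, self-contained stand-in, and the toy payoff matrices are illustrative.

```python
import numpy as np

def fictitious_play(A, B, iters=2000):
    """
    Approximate a mixed-strategy equilibrium of the bimatrix game (A, B),
    where A[i, j] is player 1's payoff and B[i, j] player 2's payoff when
    they pick trajectory candidates i and j. Each round, both players best
    respond to the opponent's empirical candidate frequencies.
    """
    n1, n2 = A.shape
    counts1 = np.ones(n1)
    counts2 = np.ones(n2)
    for _ in range(iters):
        p1 = counts1 / counts1.sum()
        p2 = counts2 / counts2.sum()
        counts1[np.argmax(A @ p2)] += 1   # player 1 best responds to p2
        counts2[np.argmax(p1 @ B)] += 1   # player 2 best responds to p1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

# Toy zero-sum interaction over two trajectory candidates whose unique
# equilibrium mixes both candidates equally (matching-pennies structure).
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x, y = fictitious_play(A, -A)
```

The resulting mixed strategies are distributions over trajectory candidates, which is precisely the "lifted" strategy space the planner randomizes over at execution time.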

3. Core Loss Functions and Fusion Mechanisms

TMS frameworks rely on trajectory-level objectives, hierarchical losses, and flexible encoder-fusion networks. Key elements include:

  • Mean-squared/token-level $\ell_2$ losses for masked token/object regression
  • Forward KL or InfoNCE for distributional alignment
  • Winner-takes-all regression for multi-modal imitation
  • Cross-modal projection heads for aligning representations in a shared space
  • Gated or concatenated fusion of modality embeddings for flexible downstream queries

These architectural elements are agnostic to backbone (Transformer, CNN, RNN), with empirical results showing Transformer-based encoders substantially outperforming other choices in large-scale retrieval tasks (Zhu et al., 23 May 2025).
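Of the loss components listed above, winner-takes-all regression is the least standard; a minimal sketch follows. The shapes (k candidate trajectories of 2-D waypoints) and the toy values are assumptions for illustration.

```python
import numpy as np

def winner_takes_all_l2(predictions, target):
    """
    Winner-takes-all regression over k predicted trajectory modes: only the
    mode closest to the ground truth receives the L2 loss, so the remaining
    modes stay free to cover other futures (preserving multimodality).
    predictions: (k, horizon, 2) candidate trajectories; target: (horizon, 2).
    """
    errors = np.sum((predictions - target[None]) ** 2, axis=(1, 2))
    winner = int(np.argmin(errors))
    return float(errors[winner]), winner

# Toy: two candidate modes; the second is closer to the observed future.
preds = np.stack([np.zeros((4, 2)), np.ones((4, 2))])
target = np.full((4, 2), 0.9)
loss, winner = winner_takes_all_l2(preds, target)
```

Averaging the loss over all modes instead would drag every candidate toward the mean future, collapsing exactly the diversity that TMS aims to preserve.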

4. Empirical Performance and Analysis

TMS has demonstrated strong, often state-of-the-art, results across domains:

| Task/Domain | Key Metric(s) | TMS Performance | Baseline | Source |
|---|---|---|---|---|
| LLM tuning | Cross-task drop, $\mathcal{F}$ | ~2–3% drop, ~79% retention | SFT: 39% drop, 43% retention; RL: 80% | (Khan et al., 3 Feb 2026) |
| Offline RL | D4RL mean returns | 78.7 (MTM) | CQL: 77.6, DT: 74.7 | (Wu et al., 2023) |
| Trajectory retrieval | MRR, Hit@1 | 0.909, 0.857 | Best single: 0.846, 0.791 | (Zhu et al., 23 May 2025) |
| Trajectory prediction | minADE, minFDE | 0.30, 0.65 | SS-ASP: 0.42, 0.90 | (Janjoš et al., 2023) |
| Game-theoretic planning | Value, robustness | 1.58–1.67 (lifted) | 1.37 (pure) | (Peters et al., 2022) |

TMS consistently either closes the performance gap with more costly RL or achieves notable improvements over single-supervision or single-modality strategies, while delivering additional robustness through diversity preservation and uncertainty estimation.

5. Practical Advantages and Open Challenges

Trajectory-Mixed Supervision enables:

  • Reward-free, on-policy-like adaptation without defining explicit external reward models
  • Significant reduction in catastrophic forgetting and improved retention of auxiliary/generalist capabilities
  • Cross-modal, condition-based, and multi-modal querying in retrieval and planning
  • Unified modeling for prediction, representation learning, and planning within a single architecture

However, TMS imposes extra compute for trajectory harvesting, increased storage (buffering intermediate outputs or modalities), and introduces hyperparameters (mixing weights, number of checkpoints/modalities, mask ratios) whose optimal settings can be task-dependent. Open questions remain regarding scalability to very large model families, compression of trajectory buffers, theoretical bounds on PLD-forgetting, and extension to high-dimensional or multimodal non-language domains (Khan et al., 3 Feb 2026).

6. Representative Implementations and Field Impact

TMS has been applied notably in:

  • LLM alignment, where it bridges most of the retention–accuracy gap to RL via a reward-free process (Khan et al., 3 Feb 2026)
  • Trajectory retrieval engines capable of flexible condition-based search at urban scale (Zhu et al., 23 May 2025)
  • Offline RL as a universal backbone for multi-role control and rapid downstream learning (Wu et al., 2023)
  • Game-theoretic robotic planning via Nash equilibria over lifted strategy spaces (Peters et al., 2022)
  • Multi-stage vehicular trajectory prediction with uncertainty quantification (Janjoš et al., 2023)

The paradigm’s generality and empirical performance suggest it will remain central to sequential modeling where diversity, robustness, and adaptability are critical. Continued research is directed at scaling, theoretical guarantees, and expansion into multi-agent, multimodal, and interactively supervised domains.
