
Trajectory-Mixed Supervision (TMS)

Updated 10 February 2026
  • Trajectory-Mixed Supervision (TMS) is a paradigm that fuses diverse supervision signals from multiple trajectories, policies, or modalities to address supervision mismatch.
  • It reduces issues like catastrophic forgetting and mode collapse by integrating intermediate checkpoints and multi-modal data, enhancing sequential prediction and planning.
  • Empirical results show TMS delivers robust performance improvements in LLM tuning, trajectory prediction, retrieval, and game-theoretic planning with state-of-the-art metrics.

Trajectory-Mixed Supervision (TMS) is a paradigm for leveraging diverse or dynamically constructed supervision derived from multiple trajectories, policies, or modalities, rather than fixed pointwise labels or single references, in order to improve generalization, robustness, and task transfer in sequential prediction, planning, and control tasks. TMS has been advanced in various fields, including reinforcement learning (RL), LLM alignment, trajectory prediction, game-theoretic planning, and large-scale trajectory retrieval. Central to TMS is the idea of constructing supervision signals from a mixture or fusion of trajectory-level information—either sampled from intermediate policies, alternative future branches, or multimodal semantic encodings—to better capture the structure, diversity, and underlying invariances of the domain compared to standard, static, pointwise supervision.

1. Theoretical Motivation and Supervision Mismatch

Supervision mismatch arises when there is a divergence between the model’s evolving policy $\pi_\theta$ and the fixed supervision distribution $q(y \mid x)$, a phenomenon particularly acute in methods such as Supervised Fine-Tuning (SFT) for LLMs. In these regimes, catastrophic forgetting and mode collapse are observed because the model is repeatedly corrected toward a single reference or narrow support while its proper solution set is inherently multimodal. TMS directly addresses this by mixing supervision from the model’s own trajectory history, intermediate checkpoints, or diverse modalities to minimize the policy–label divergence (PLD):

\mathrm{PLD}\bigl(\pi_{\theta}, \hat{\pi}\bigr) = \mathbb{E}_{x \sim \mathcal{D}_x} \left[ D_{\mathrm{KL}}\bigl(\hat{\pi}(\cdot \mid x) \,\|\, \pi_{\theta}(\cdot \mid x)\bigr) \right]

By aligning supervision with the policy support, TMS mitigates late-stage drift in PLD and prevents the loss of general capabilities, thus bridging the gap between RL and SFT with respect to stability and retention (Khan et al., 3 Feb 2026).
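As a concrete illustration, the PLD above can be estimated as the mean KL divergence between label and policy distributions over a batch of inputs. The following is a minimal sketch using toy categorical distributions; the function names and the example values are illustrative, not from the cited work.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as probability arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def policy_label_divergence(label_dists, policy_dists):
    """PLD: mean KL(label || policy) over a batch of inputs x ~ D_x."""
    return float(np.mean([kl_divergence(p_hat, pi)
                          for p_hat, pi in zip(label_dists, policy_dists)]))

# Toy example: a one-hot supervision distribution vs. a multimodal policy.
labels = [np.array([1.0, 0.0, 0.0])]
policy = [np.array([0.5, 0.3, 0.2])]
pld = policy_label_divergence(labels, policy)  # = -log(0.5) for a one-hot label
```

A high PLD under one-hot labels, as here, is exactly the mismatch regime that mixed-trajectory supervision is designed to reduce.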

2. Algorithmic Instantiations Across Domains

TMS has been implemented using several algorithmic designs tailored to different domains:

2.1 Reward-Free, On-Policy SFT for LLMs

TMS operates as a two-stage curriculum:

  • Trajectory harvesting: Save $T$ intermediate checkpoints $\{\theta_1, \dots, \theta_T\}$ during SFT and, for each input $x$, sample predictions $\hat{y}^{(t)}(x)$ from each checkpoint.
  • Trajectory mixture supervision: Define

m(\cdot \mid x) = \tfrac{1}{T} \sum_{t=1}^{T} \delta_{\hat{y}^{(t)}(x)}(\cdot)

Optionally interpolate this mixture with the original oracle labels via weight $\alpha$:

q_\alpha(\cdot \mid x) = \alpha\, q(\cdot \mid x) + (1 - \alpha)\, m(\cdot \mid x)

  • Student training: Minimize the forward KL to $q_\alpha$, yielding a fine-tuned policy with substantially reduced catastrophic forgetting and mode collapse compared to SFT, and performance within 1–2% of RL on accuracy–retention metrics (Khan et al., 3 Feb 2026).
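The two-stage curriculum above can be sketched in a few lines for categorical outputs. This is a simplified illustration, assuming each checkpoint's predictive distribution is available directly (rather than via sampled tokens); the function names are hypothetical.

```python
import numpy as np

def trajectory_mixture(checkpoint_dists):
    """Uniform mixture m(.|x) over per-checkpoint predictive distributions."""
    return np.mean(np.stack(checkpoint_dists), axis=0)

def interpolated_target(oracle_dist, mixture_dist, alpha):
    """q_alpha = alpha * q + (1 - alpha) * m, the interpolated supervision target."""
    return alpha * np.asarray(oracle_dist) + (1 - alpha) * np.asarray(mixture_dist)

def forward_kl(target, policy, eps=1e-12):
    """Forward KL D_KL(q_alpha || pi_theta), the student training objective."""
    target = np.asarray(target)
    policy = np.asarray(policy)
    return float(np.sum(target * (np.log(target + eps) - np.log(policy + eps))))

# Toy: T = 3 checkpoints over a 3-way output space; oracle label is one-hot.
ckpts = [np.array([0.7, 0.2, 0.1]),
         np.array([0.4, 0.4, 0.2]),
         np.array([0.1, 0.6, 0.3])]
m = trajectory_mixture(ckpts)                        # mixture of checkpoint outputs
q_alpha = interpolated_target([1.0, 0.0, 0.0], m, alpha=0.5)
loss = forward_kl(q_alpha, np.array([0.5, 0.3, 0.2]))
```

Because $q_\alpha$ keeps mass on modes the model itself produces, the forward-KL gradient no longer pushes the policy toward a single narrow reference.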

2.2 Trajectory Segment Chaining and Branching for Prediction

In vehicle trajectory prediction, TMS is instantiated as a multi-branch self-supervised predictor that chains predictions across multiple future segments and launches additional “overshooting” branches at intermediate points. Training objectives combine:

  • Future trajectory regression
  • Multi-modal mode classification
  • Latent context consistency
  • Predict-the-past reconstruction

Tree-like multi-branching enables fusion of multi-modal future rollouts, with beam-search heuristics pruning $k^N$ multi-branch possibilities down to practical candidate sets. Quantitatively, this yields a roughly 25% reduction in average displacement error compared to one-shot predictors, with meaningful, segment-wise uncertainty metrics derived from reconstruction and dropout variances (Janjoš et al., 2023).
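The $k^N$-to-beam pruning step can be sketched as standard beam search over per-segment mode scores. The scoring signal here (per-segment mode log-probabilities) is a stand-in for whatever branch-quality heuristic the predictor uses; the concrete values are illustrative.

```python
import numpy as np

def beam_prune(mode_scores, beam_width):
    """
    Prune the k^N tree of multi-branch rollouts with beam search.
    mode_scores: list of length N; entry i is a length-k array of
    per-segment mode log-scores (an assumed, hypothetical scoring signal).
    Returns the beam_width best mode sequences with their cumulative scores.
    """
    beams = [((), 0.0)]
    for seg_scores in mode_scores:
        candidates = [(seq + (k,), score + float(seg_scores[k]))
                      for seq, score in beams
                      for k in range(len(seg_scores))]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the best partial rollouts
    return beams

# Toy: N = 3 segments with k = 3 modes each -> 27 rollouts pruned to 4.
scores = [np.log([0.6, 0.3, 0.1]) for _ in range(3)]
best = beam_prune(scores, beam_width=4)
```

With a beam width of 4, only 12 candidates per segment are ever scored instead of the full 27-leaf tree, which is what keeps multi-segment chaining tractable.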

2.3 Generalized Modality Fusion for Retrieval

For large-scale trajectory retrieval, TMS is manifested as omni-semantic supervision—embedding four complementary modalities: raw trajectories, topological keypoints, road IDs, and geographic regions. Each modality has its own Transformer-based encoder, and representations are aligned using bidirectional InfoNCE losses:

L_{m \to n} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\bigl(\mathrm{sim}(h_m^{(i)}, h_n^{(i)})/\tau\bigr)}{\sum_{j=1}^{N} \exp\bigl(\mathrm{sim}(h_m^{(i)}, h_n^{(j)})/\tau\bigr)}

At retrieval, trajectories can be queried or fused via any subset of modalities, with cross-modal alignment ensuring flexible condition-based retrieval and disentanglement of semantic subspaces. Ablations show that incorporating all four modalities delivers up to 0.909 MRR and 0.857 Hit@1, outperforming single-modality baselines (Zhu et al., 23 May 2025).
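The bidirectional InfoNCE objective above can be sketched directly from the formula, using cosine similarity for $\mathrm{sim}(\cdot,\cdot)$. The embedding dimensions and the modality pairing in the toy example below are assumptions for illustration.

```python
import numpy as np

def info_nce(h_m, h_n, tau=0.1):
    """One direction L_{m->n}: matched pairs are positives, the rest negatives."""
    h_m = h_m / np.linalg.norm(h_m, axis=1, keepdims=True)
    h_n = h_n / np.linalg.norm(h_n, axis=1, keepdims=True)
    sim = (h_m @ h_n.T) / tau                     # cosine similarity / temperature
    sim = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # -log softmax on the diagonal

def bidirectional_info_nce(h_m, h_n, tau=0.1):
    """Symmetric alignment: average of L_{m->n} and L_{n->m}."""
    return 0.5 * (info_nce(h_m, h_n, tau) + info_nce(h_n, h_m, tau))

rng = np.random.default_rng(0)
h_traj = rng.normal(size=(8, 16))                     # raw-trajectory embeddings
h_road = h_traj + 0.01 * rng.normal(size=(8, 16))     # near-aligned road-ID embeddings
loss_aligned = bidirectional_info_nce(h_traj, h_road)
loss_random = bidirectional_info_nce(h_traj, rng.normal(size=(8, 16)))
```

Aligned modality pairs produce a much lower loss than unrelated embeddings, which is the signal that pulls the four modality encoders into a shared space.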

2.4 Masked Trajectory Reconstruction for Universal Control and Representation

Masked Trajectory Models (MTM) operationalize TMS by training Transformers to reconstruct arbitrary subsets of state–action trajectories, conditioned on random mask patterns. A single pretrained MTM can be repurposed for forward/inverse modeling, behavior cloning, return-conditioned control, and representation learning, selected solely by mask configuration. This design unifies and often matches or surpasses task-specific models across D4RL and Adroit benchmarks in behavior cloning and offline RL, and delivers features that accelerate downstream RL learning (Wu et al., 2023).
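The mask-selects-the-task idea can be made concrete with a small mask generator over (state, action) token streams. The specific mask patterns below are an illustrative reading of the roles described above, not the exact configurations used in the cited work.

```python
import numpy as np

def task_mask(task, horizon, seed=0):
    """
    Boolean visibility masks over a length-`horizon` (state, action) trajectory;
    True = token given to the model, False = token to be reconstructed.
    Illustrative MTM-style mask patterns (an assumed simplification).
    """
    s = np.zeros(horizon, dtype=bool)
    a = np.zeros(horizon, dtype=bool)
    if task == "forward_dynamics":       # reconstruct s_{t+1} from s_t, a_t
        s[:-1] = True
        a[:-1] = True
    elif task == "inverse_dynamics":     # reconstruct actions from all states
        s[:] = True
    elif task == "pretrain_random":      # pretraining: random mask pattern
        rng = np.random.default_rng(seed)
        s = rng.random(horizon) < 0.5
        a = rng.random(horizon) < 0.5
    return s, a

s_mask, a_mask = task_mask("forward_dynamics", horizon=5)
```

The same pretrained reconstruction network is then queried with whichever mask matches the downstream role, with no task-specific retraining of the backbone.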

2.5 Mixed-Strategy Lifting in Game-Theoretic Planning

In two-player trajectory games, TMS arises by lifting each player’s strategy space to mixtures over a learned set of $n_i$ trajectory candidates. An offline phase trains generators that produce trajectory references; an online phase solves a bimatrix game over payoff matrices derived from the agents’ cost functions. This yields Nash equilibria in the space of mixed trajectory distributions, delivering higher robustness and strategic diversity, and achieving computation speeds comparable to standard MPC (Peters et al., 2022).
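The online bimatrix stage can be sketched with a simple equilibrium solver. The cited work solves the game with a dedicated bimatrix solver; fictitious play is used here only as a minimal, self-contained stand-in, and the toy payoff matrices are illustrative.

```python
import numpy as np

def fictitious_play(A, B, iters=2000):
    """
    Approximate a mixed-strategy equilibrium of the bimatrix game (A, B),
    where A[i, j] is player 1's payoff and B[i, j] player 2's payoff when
    they pick trajectory candidates i and j. Each round, both players best
    respond to the opponent's empirical candidate frequencies.
    """
    n1, n2 = A.shape
    counts1 = np.ones(n1)
    counts2 = np.ones(n2)
    for _ in range(iters):
        p1 = counts1 / counts1.sum()
        p2 = counts2 / counts2.sum()
        counts1[np.argmax(A @ p2)] += 1   # player 1 best responds to p2
        counts2[np.argmax(p1 @ B)] += 1   # player 2 best responds to p1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

# Toy zero-sum interaction over two trajectory candidates whose unique
# equilibrium mixes both candidates equally (matching-pennies structure).
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x, y = fictitious_play(A, -A)
```

The resulting mixed strategies are distributions over trajectory candidates, which is precisely the "lifted" strategy space the planner randomizes over at execution time.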

3. Core Loss Functions and Fusion Mechanisms

TMS frameworks rely on trajectory-level objectives, hierarchical losses, and flexible encoder-fusion networks. Key elements include:

  • Mean-squared/token-level $\ell_2$ losses for masked token/object regression
  • Forward KL or InfoNCE for distributional alignment
  • Winner-takes-all regression for multi-modal imitation
  • Cross-modal projection heads for aligning representations in a shared space
  • Gated or concatenated fusion of modality embeddings for flexible downstream queries

These architectural elements are agnostic to backbone (Transformer, CNN, RNN), with empirical results showing Transformer-based encoders substantially outperforming other choices in large-scale retrieval tasks (Zhu et al., 23 May 2025).
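Of the loss components listed above, winner-takes-all regression is the least standard; a minimal sketch follows. The shapes (k candidate trajectories of 2-D waypoints) and the toy values are assumptions for illustration.

```python
import numpy as np

def winner_takes_all_l2(predictions, target):
    """
    Winner-takes-all regression over k predicted trajectory modes: only the
    mode closest to the ground truth receives the L2 loss, so the remaining
    modes stay free to cover other futures (preserving multimodality).
    predictions: (k, horizon, 2) candidate trajectories; target: (horizon, 2).
    """
    errors = np.sum((predictions - target[None]) ** 2, axis=(1, 2))
    winner = int(np.argmin(errors))
    return float(errors[winner]), winner

# Toy: two candidate modes; the second is closer to the observed future.
preds = np.stack([np.zeros((4, 2)), np.ones((4, 2))])
target = np.full((4, 2), 0.9)
loss, winner = winner_takes_all_l2(preds, target)
```

Averaging the loss over all modes instead would drag every candidate toward the mean future, collapsing exactly the diversity that TMS aims to preserve.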

4. Empirical Performance and Analysis

TMS has demonstrated strong, often state-of-the-art, results across domains:

| Task/Domain | Key Metric(s) | TMS Performance | Baseline | Source |
|---|---|---|---|---|
| LLM tuning | Cross-task drop, $\mathcal{F}$ | ~2–3% drop, ~79% retention | SFT: 39% drop, 43% retention; RL: 80% | (Khan et al., 3 Feb 2026) |
| Offline RL | D4RL mean returns | 78.7 (MTM) | CQL: 77.6, DT: 74.7 | (Wu et al., 2023) |
| Trajectory retrieval | MRR, Hit@1 | 0.909, 0.857 | Best single: 0.846, 0.791 | (Zhu et al., 23 May 2025) |
| Trajectory prediction | minADE, minFDE | 0.30, 0.65 | SS-ASP: 0.42, 0.90 | (Janjoš et al., 2023) |
| Game-theoretic planning | Value, robustness | 1.58–1.67 (lifted) | 1.37 (pure) | (Peters et al., 2022) |

TMS consistently either closes the performance gap with more costly RL or achieves notable improvements over single-supervision or single-modality strategies, while delivering additional robustness through diversity preservation and uncertainty estimation.

5. Practical Advantages and Open Challenges

Trajectory-Mixed Supervision enables:

  • Reward-free, on-policy-like adaptation without defining explicit external reward models
  • Significant reduction in catastrophic forgetting and improved retention of auxiliary/generalist capabilities
  • Cross-modal, condition-based, and multi-modal querying in retrieval and planning
  • Unified modeling for prediction, representation learning, and planning within a single architecture

However, TMS imposes extra compute for trajectory harvesting, increased storage (buffering intermediate outputs or modalities), and introduces hyperparameters (mixing weights, number of checkpoints/modalities, mask ratios) whose optimal settings can be task-dependent. Open questions remain regarding scalability to very large model families, compression of trajectory buffers, theoretical bounds on PLD-forgetting, and extension to high-dimensional or multimodal non-language domains (Khan et al., 3 Feb 2026).

6. Representative Implementations and Field Impact

TMS has been applied notably in:

  • LLM alignment, where it bridges most of the retention–accuracy gap to RL via a reward-free process (Khan et al., 3 Feb 2026)
  • Trajectory retrieval engines capable of flexible condition-based search at urban scale (Zhu et al., 23 May 2025)
  • Offline RL as a universal backbone for multi-role control and rapid downstream learning (Wu et al., 2023)
  • Game-theoretic robotic planning via Nash equilibria over lifted strategy spaces (Peters et al., 2022)
  • Multi-stage vehicular trajectory prediction with uncertainty quantification (Janjoš et al., 2023)

The paradigm’s generality and empirical performance suggest it will remain central to sequential modeling where diversity, robustness, and adaptability are critical. Continued research is directed at scaling, theoretical guarantees, and expansion into multi-agent, multimodal, and interactively supervised domains.
