
Masked Trajectory Models (MTM) Overview

Updated 7 February 2026
  • Masked Trajectory Models are self-supervised models that mask and reconstruct parts of sequential trajectories to enable diverse predictive tasks.
  • They integrate various masking strategies and modality-specific encoders within unified architectures like Transformers to enhance performance in RL, control, and forecasting.
  • Applications span autonomous driving, mobility analytics, and scientific data, providing measurable improvements over specialized models on benchmark tasks.

A Masked Trajectory Model (MTM) is a self-supervised, mask-based predictive model that reconstructs missing (masked) elements of a trajectory from the unmasked context, thereby learning generalizable representations of sequential dynamics. MTMs unify a suite of masking-based pretraining objectives—originally developed for natural language and vision—within the domain of trajectories, encompassing states, actions, and auxiliary modalities such as returns, GPS stops, or physical/sports agent positions. MTMs now constitute a generalized backbone for prediction, representation learning, planning, and control in sequential decision-making, time-series analysis, multi-agent scenarios, mobility analytics, and scientific data domains.

1. Core Principles and Formalism

The central operation in MTMs is the reconstruction of a trajectory $\tau$ from a randomly or selectively masked version $\mathrm{Mask}(\tau)$. For discrete or continuous trajectories,

$$\tau = \{x_1, \ldots, x_N\}, \quad \mathrm{Mask}(\tau) = \{x_i \mid i \notin \mathcal{M}\} \cup \{[\mathrm{MASK}]_i \mid i \in \mathcal{M}\}$$

where $\mathcal{M}$ is the set of masked indices. The model is trained to predict $x_i$ for $i \in \mathcal{M}$ given the unmasked context. Losses typically include MSE or NLL over masked tokens, and in more advanced settings, contrastive or manifold-structure objectives.
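The masked-reconstruction objective above can be sketched in a few lines. This is a minimal illustration, not any paper's actual code: the function names, the scalar tokens, and the "predict zero" baseline are all assumptions made for demonstration.

```python
import random

MASK = None  # stand-in for the [MASK] token


def mask_trajectory(traj, mask_ratio=0.5, rng=random.Random(0)):
    """Randomly mask a fraction of trajectory elements (indices M)."""
    n = len(traj)
    masked_idx = set(rng.sample(range(n), int(mask_ratio * n)))
    masked = [MASK if i in masked_idx else x for i, x in enumerate(traj)]
    return masked, masked_idx


def masked_mse(pred, target, masked_idx):
    """MSE computed only over the masked positions i in M."""
    errs = [(pred[i] - target[i]) ** 2 for i in masked_idx]
    return sum(errs) / len(errs)


traj = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
masked, idx = mask_trajectory(traj)
# Trivial "predict zero" model, scored only on the masked slots.
loss = masked_mse([0.0] * len(traj), traj, idx)
```

A real MTM would replace the zero predictor with a model that consumes the unmasked context; the loss, however, is computed over the masked positions exactly as here.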

For multivariate/heteromorphic trajectories, different modalities (states, actions, returns, maps, semantic categories, user histories) may be independently masked and reconstructed. Mask patterns can be randomized, contiguous (for temporal forecasting), modality-specific (e.g., state vs. action vs. POI), or semantically structured (e.g., “mask all future,” “linear holes,” or “periodic components”).

This masking objective generalizes across sequence domains: the same reconstruct-from-context formulation applies to RL trajectories, multi-agent scenes, mobility traces, and time series.

2. Architectural Variants and Input–Output Encodings

Generalized Architecture: MTMs use modality-specific encoders to lift each element into a shared latent space augmented by positional (temporal) and type (modality) embeddings. The backbone is usually a Transformer (bidirectional for BERT/MAE-style models), but state-space models (SSMs: BTM/Mamba), masked autoencoders (MAE), and conditional diffusion models also serve as MTM backbones (Wu et al., 2023, Long et al., 23 Jan 2025, Xu et al., 2024).

Tokenization/Embedding: Each token (e.g., $s_t$, $a_t$, $r_t$; or GPS POI/action; voxel group for TPCs) is embedded and tagged by time index, modality, and context (e.g., user-specific spatiotemporal preferences or ghost masking vectors). Inputs can be strictly sequential (state-action-reward), hierarchical (agent-specific tokens for multi-agent), or graph-derived embeddings (Garg et al., 28 Sep 2025, Xu et al., 2024, Young et al., 4 Feb 2025, Long et al., 23 Jan 2025).

Masking Dimensions:

  • Temporal: Random, block/contiguous, periodic, gap-aware.
  • Modal: State, action, reward, POI, map, energy, velocity, etc.
  • Spatial/Agent: Mask subsets of agents, polylines, spatial patches, or region blocks.
  • Semantic: Mask by seasonality/trend, or conditional on domain semantics.

Decoder: A specialized module (transformer decoder, MLP, or diffusion denoiser) operates over the union of unmasked and [MASK] tokens to predict the original trajectory tokens.
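The token assembly described above (a modality-specific lift into a shared latent space, plus time and type embeddings, with a learnable [MASK] vector substituted at masked slots) can be sketched as follows. All names, dimensions, and the random initialization are illustrative assumptions, not taken from any specific implementation.

```python
import numpy as np

D = 8  # shared latent width (assumed)
rng = np.random.default_rng(0)

# Modality-specific linear encoders: 3-D states, 1-D actions (assumed shapes).
W = {"state": rng.normal(size=(3, D)),
     "action": rng.normal(size=(1, D))}
time_emb = rng.normal(size=(16, D))   # one row per timestep
type_emb = {"state": rng.normal(size=D), "action": rng.normal(size=D)}
mask_tok = rng.normal(size=D)         # learnable [MASK] vector (frozen here)


def embed(x, modality, t, masked=False):
    """Lift one trajectory element into the shared latent space,
    tagging it with positional (time) and type (modality) embeddings."""
    core = mask_tok if masked else x @ W[modality]
    return core + time_emb[t] + type_emb[modality]


tok = embed(np.ones(3), "state", t=2)                  # unmasked state token
msk = embed(np.zeros(1), "action", t=2, masked=True)   # masked action slot
```

The Transformer (or SSM/diffusion) backbone then attends over the full token sequence, and the decoder reads out predictions at the masked slots.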

Losses: Reconstruction losses over masked positions (MSE for continuous tokens, NLL/cross-entropy for discrete tokens), optionally augmented with contrastive or manifold-structure objectives.

3. Masking Strategies and Task Unification

MTMs explicitly leverage task formulation through mask patterns, enabling unification of diverse tasks (Wu et al., 2023, Long et al., 23 Jan 2025):

| Task | Mask Pattern | Output (Prediction Target) |
|---|---|---|
| Forward dynamics | Mask next state $s_{t+1}$ | $s_{t+1}$ given $s_t, a_t$ |
| Inverse dynamics | Mask action $a_t$ | $a_t$ given $s_t, s_{t+1}$ |
| Policy/BC | Mask next action $a_{t+1}$ | $a_{t+1}$ conditioned on $(s_1, a_1, \ldots, s_t)$ |
| Reward model | Mask reward $r_t$ or $g_t$ | Reward given $(s_t, a_t)$ |
| Imputation | Mask arbitrary subsequence | Reconstruct missing parts anywhere in $\tau$ |
| Representation | Mask actions/returns, encode | Downstream embeddings for RL or transfer |
| Prediction | Mask all future timepoints | $x_{t+1:T}$ conditioned on observed $x_{1:t}$ |
| Multi-modal | Mask POI/actions separately | Cross-modal infilling; trajectory completion |
| Trajectory generation | Mask all steps | Unconditional sample from the learned mobility prior |

Editor's term: “mask-pattern prompt” — this mechanism allows a single model to emulate multiple specialized models by dynamically adjusting the mask during inference.
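As a concrete illustration of mask-pattern prompts, several of the patterns tabulated above can be written as boolean grids over a (timestep x modality) trajectory, where True marks a position the model must predict. The grid layout and function names are illustrative assumptions.

```python
T = 4                        # trajectory length (assumed)
MODS = ("state", "action")   # modalities tracked in the grid


def empty_mask():
    """No positions masked: the fully observed trajectory."""
    return {m: [False] * T for m in MODS}


def forward_dynamics_mask(t):
    """Predict s_{t+1} given s_t, a_t: mask only the next state."""
    m = empty_mask()
    m["state"][t + 1] = True
    return m


def inverse_dynamics_mask(t):
    """Predict a_t given s_t, s_{t+1}: mask only that action."""
    m = empty_mask()
    m["action"][t] = True
    return m


def prediction_mask(t):
    """Forecasting: mask every modality at all future timesteps."""
    m = empty_mask()
    for mod in MODS:
        for i in range(t + 1, T):
            m[mod][i] = True
    return m
```

Swapping one mask function for another at inference time is all that is needed to switch the same pretrained model between dynamics modeling, inverse dynamics, and forecasting.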

In conditional diffusion settings (e.g., (Long et al., 23 Jan 2025)), the mask schema recasts generation, recovery, and prediction all as conditional inpainting.

4. Applications and Representative Results

Reinforcement Learning and Control

  • Unified Backbone: A single pretrained MTM can serve as forward model, inverse model, policy (offline/RCBC), return-to-go predictor, and representation extractor from the same weights (Wu et al., 2023, Wen et al., 2024).
  • Empirical Results: On D4RL benchmarks, the same MTM backbone matches or surpasses Decision Transformer (DT), IQL, and CQL (e.g., average normalized score of 78.7 vs. 74.7 for DT) without specialized RL heads (Wu et al., 2023). In M³PC (Wen et al., 2024), test-time MPC atop a frozen MTM achieves +15.3% return gain over baseline transformers.

Trajectory Prediction and Multi-Agent Scenarios

  • Autonomous Driving: Masked autoencoders for trajectory prediction (Traj-MAE) reduce collision and miss rates by 9–23% relative to Autobots baselines, especially with high (50–60%) masking ratios and continual pretraining (Chen et al., 2023).
  • Sports: With ghost spatial masking and bidirectional temporal SSMs, MTMs achieve 10–28% lower average displacement and better spatial-temporal recovery on new sports datasets (Basketball-U, Football-U, Soccer-U) (Xu et al., 2024).

Mobility Analytics

  • Generalization: The GenMove diffusion-based MTM handles unconditional generation, in-filling, prediction, and recovery without task-specific architecture, outperforming strong baselines by up to 13% on generation, 6% on recovery, and 1–5% in prediction (Accuracy@k) (Long et al., 23 Jan 2025).
  • Semantic Reasoning: GPS-MTM models POI transitions and actions, yielding 10–30% higher accuracy and more uniform recall/bias on large-scale mobility datasets (Garg et al., 28 Sep 2025).

Scientific/Physical Trajectories

  • PoLAr-MAE: On LArTPC data, the MTM backend matches or exceeds supervised methods in track/shower classification with no labels, but sub-token microstructures remain underrepresented due to patching granularity (Young et al., 4 Feb 2025).

Vision-and-Language Navigation

  • Proxy Task: As a pretraining proxy for vision-and-language navigation, MTM yields a +2.3–3.7 point success rate (SR) improvement, outperforming alternative future-view or panorama masking proxies (Li et al., 2023).

Time Series Analysis

  • Manifold Learning: SimMTM’s neighbor-aggregation reconstruction yields 8–15% MSE reduction and 8% accuracy improvement in classification and forecasting over prior masked or contrastive pretraining methods (Dong et al., 2023).
  • Component-Aware Masking: ST-MTM’s trend/season decomposition with period/subseries masks and masked-contrastive losses produces the best average MSE/MAE across standard forecasting benchmarks (MSE=0.32 vs. 0.33–0.47) (Seo et al., 13 Jun 2025).

5. Advanced Masking Techniques and Model Extensions

  • Structured and Adaptive Masking: Task-aligned patterns, such as goal-masking (Tang et al., 2022), ghost spatial masking (Xu et al., 2024), periodic block masking for seasonality (Seo et al., 13 Jun 2025), and curriculum-driven/learned masks, allow MTMs to efficiently learn from domain structure.
  • Auxiliary Modalities and Losses: Incorporation of auxiliary tasks—energy regression (Young et al., 4 Feb 2025), classifier-free guidance (contextual embedding dropout) (Long et al., 23 Jan 2025), or entropy constraints (Wen et al., 2024)—broadens the applicability to rich scientific or control settings.
  • Continual Pretraining: Multi-stage or continual schedules mitigate catastrophic forgetting across masking strategies (Traj-MAE), preserving performance across social, temporal, or map-structure masking regimes (Chen et al., 2023).
  • Online/Adaptive Inference: Test-time training with actor-specific token memory or planning-at-inference (T4P, M³PC) enables adaptation to distributional shifts or downstream tasks without further environment interactions or task-specific retraining (Park et al., 2024, Wen et al., 2024).
  • Diffusion Backbones: Diffusion-based MTMs generalize to arbitrary mask-inpainting tasks, supporting controllable, zero-shot, and privacy-aware mobility modeling (Long et al., 23 Jan 2025).
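Two of the structured temporal patterns mentioned above, contiguous block masking and periodic masking for seasonal components, can be sketched as simple boolean generators. The parameterization is an illustrative assumption, not the scheme of any cited paper.

```python
def block_mask(n, start, length):
    """Mask a contiguous block [start, start + length) of an
    n-step trajectory, as used for temporal forecasting pretexts."""
    return [start <= i < start + length for i in range(n)]


def periodic_mask(n, period, phase=0):
    """Mask every `period`-th position (offset by `phase`), e.g. to
    hide one observation per seasonal cycle."""
    return [i % period == phase for i in range(n)]
```

Curriculum-driven or learned masking replaces such hand-written generators with schedules adapted during pretraining.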

6. Limitations and Future Directions

  • Tokenization and Coverage: Fixed-radius or patching approaches may underrepresent sub-token features or rare classes, especially in scientific or dense multi-agent scenarios (Young et al., 4 Feb 2025).
  • Scalability: Handling long-horizon, high-dimensional, or pixel-observation trajectories remains computationally challenging for transformer-based MTMs; scaling architectural capacity and designing more efficient tokenizations are open problems (Wu et al., 2023, Wen et al., 2024).
  • Mask Strategy Optimization: Hand-designed mask schedules may be suboptimal; learned, adaptive, or curriculum-based masking (with sensitivity to semantic or temporal structure) could further improve pretext task alignment (Seo et al., 13 Jun 2025, Park et al., 2024).
  • Domain-Specificity: Some domains (e.g., heavy-tailed POI distributions in GPS (Garg et al., 28 Sep 2025), subtle infilling in physical trajectories) may require specialized representations, hybrid models (e.g., fused SSM/Transformer architectures), or additional auxiliary tasks.
  • Online/Interactive Use: Efficient integration of MTM pretraining with online RL, planning, anomaly detection, and transfer remains a key area for foundation model generality, especially in non-stationary and multi-modal environments.
  • Privacy and Data Sharing: For decentralized or privacy-sensitive mobility settings, federated MTM approaches hold promise for multi-institutional pretraining without raw-trajectory sharing (Long et al., 23 Jan 2025).

7. Representative Model Variants and Benchmarks

| Model / Paper | Domain | Masking Type | Key Results Highlight |
|---|---|---|---|
| Masked Trajectory Model (MTM) (Wu et al., 2023) | RL/control | random, contrastive | One model matches/exceeds task-specific RL heads (D4RL, Adroit) |
| ST-MTM (Seo et al., 13 Jun 2025) | Time series | trend/seasonal | Best MSE/MAE on 9 forecasting datasets |
| Traj-MAE (Chen et al., 2023) | Multi-agent driving | social, temporal, map | Up to 23% miss-rate reduction, robust continual pretraining |
| GenMove (Long et al., 23 Jan 2025) | Mobility/trajectories | mask-unified, diffusion | Unified recovery/prediction/generation, +13% JSD vs. SOTA |
| GPS-MTM (Garg et al., 28 Sep 2025) | GPS/mobility | state-action, span | 10–30% gain in infilling, bias reduction |
| PoLAr-MAE (Young et al., 4 Feb 2025) | Particle physics | volumetric, energy | SOTA unsupervised track/shower classification F-scores |
| T4P (Park et al., 2024) | Driving/test-time | autoencoder, actor-memory | Best cross-dataset mADE/mFDE, fast adaptation |
| M³PC (Wen et al., 2024) | RL/control, planning | multi-mask, MPC | +15.3% normalized return with no retraining |

Extensions and further applications appear in goal-conditioned planning (Wen et al., 2024), vision-and-language navigation (Li et al., 2023), masked conditional mixture modeling (Tang et al., 2022), and physics-driven point trajectory modeling (Young et al., 4 Feb 2025).


In summary, Masked Trajectory Models provide a general-purpose foundation for trajectory-based learning, integrating a broad family of self-supervised mask-and-reconstruct paradigms that unify tasks ranging from forecasting to RL, prediction, imputation, and generation across a diverse set of domains (Wu et al., 2023, Chen et al., 2023, Seo et al., 13 Jun 2025, Long et al., 23 Jan 2025, Garg et al., 28 Sep 2025, Wen et al., 2024, Dong et al., 2023, Xu et al., 2024, Young et al., 4 Feb 2025).
