Masked Trajectory Models (MTM) Overview
- Masked Trajectory Models are self-supervised models that mask and reconstruct parts of sequential trajectories to enable diverse predictive tasks.
- They integrate various masking strategies and modality-specific encoders within unified architectures like Transformers to enhance performance in RL, control, and forecasting.
- Applications span autonomous driving, mobility analytics, and scientific data, providing measurable improvements over specialized models on benchmark tasks.
A Masked Trajectory Model (MTM) is a self-supervised, mask-based predictive model that reconstructs missing (masked) elements of a trajectory from the unmasked context, thereby learning generalizable representations of sequential dynamics. MTMs unify a suite of masking-based pretraining objectives—originally developed for natural language and vision—within the domain of trajectories, encompassing states, actions, and auxiliary modalities such as returns, GPS stops, or physical/sports agent positions. MTMs now constitute a generalized backbone for prediction, representation learning, planning, and control in sequential decision-making, time-series analysis, multi-agent scenarios, mobility analytics, and scientific data domains.
1. Core Principles and Formalism
The central operation in MTMs is the reconstruction of a trajectory $\tau = (x_1, \ldots, x_T)$ from a randomly or selectively masked version $\hat{\tau}$. For discrete/continuous trajectories,

$$\hat{x}_t = \begin{cases} x_t & t \notin M \\ \texttt{[MASK]} & t \in M, \end{cases}$$

where $M \subseteq \{1, \ldots, T\}$ is the set of masked indices. The model is trained to predict $x_t$ for $t \in M$ given the unmasked context. Losses typically include MSE or NLL over masked tokens, and in more advanced settings, contrastive or manifold-structure objectives.
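A minimal sketch of this masked-reconstruction objective, assuming a generic reconstruction model (the function names and the zero stand-in for the `[MASK]` token are illustrative, not from any cited paper):

```python
import numpy as np

def masked_mse(model, traj, mask_ratio=0.3, rng=None):
    """Masked-reconstruction objective: hide a random subset of
    timesteps and score the model's predictions only on them."""
    rng = rng or np.random.default_rng(0)
    T = traj.shape[0]
    masked_idx = rng.choice(T, size=max(1, int(mask_ratio * T)), replace=False)
    corrupted = traj.copy()
    corrupted[masked_idx] = 0.0   # stand-in for the [MASK] token
    pred = model(corrupted)       # model sees only the unmasked context
    return np.mean((pred[masked_idx] - traj[masked_idx]) ** 2)
```

An identity model incurs full error on masked positions, while an oracle that returns the original trajectory incurs none, which is exactly the gradient signal that drives representation learning.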
For multivariate/heteromorphic trajectories, different modalities (states, actions, returns, maps, semantic categories, user histories) may be independently masked and reconstructed. Mask patterns can be randomized, contiguous (for temporal forecasting), modality-specific (e.g., state vs. action vs. POI), or semantically structured (e.g., “mask all future,” “linear holes,” or “periodic components”).
The masking objective generalizes across sequence domains:
- RL and Control: Mask state, action, and return tokens to simultaneously learn forward/inverse models, policies, and representations (Wu et al., 2023, Cai et al., 2023, Wen et al., 2024).
- Mobility and GPS: Mask states (POIs) and actions (transition attributes) for semantic reasoning and prediction (Garg et al., 28 Sep 2025, Long et al., 23 Jan 2025).
- Time-Series Analysis: Mask temporal points or seasonal/trend components for forecasting/classification (Dong et al., 2023, Seo et al., 13 Jun 2025).
- Multi-Agent and Autonomous Driving: Mask trajectory or map tokens for social/temporal reasoning and imputation (Chen et al., 2023, Xu et al., 2024, Park et al., 2024, Tang et al., 2022).
- Particle and Scientific Trajectories: Mask 3D trajectory segments, with auxiliary tasks (e.g., energy regression) (Young et al., 4 Feb 2025).
2. Architectural Variants and Input–Output Encodings
Generalized Architecture: MTMs use modality-specific encoders to lift each element into a shared latent space augmented by positional (temporal) and type (modality) embeddings. The backbone is usually a Transformer (bidirectional for BERT/MAE-style models), but state-space models (SSMs: BTM/Mamba), masked autoencoders (MAE), and conditional diffusion models also serve as MTM backbones (Wu et al., 2023, Long et al., 23 Jan 2025, Xu et al., 2024).
Tokenization/Embedding: Each token (e.g., $s_t$, $a_t$, $r_t$; or a GPS POI/action; a voxel group for TPCs) is embedded and tagged by time index, modality, and context (e.g., user-specific spatiotemporal preferences or ghost masking vectors). Inputs can be strictly sequential (state-action-reward), hierarchical (agent-specific tokens for multi-agent), or graph-derived embeddings (Garg et al., 28 Sep 2025, Xu et al., 2024, Young et al., 4 Feb 2025, Long et al., 23 Jan 2025).
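The encoding step can be sketched as follows, with random linear maps standing in for learned modality-specific encoders (all names and dimensions here are illustrative):

```python
import numpy as np

def embed_trajectory(states, actions, d=16, rng=None):
    """Lift heterogeneous (state, action) tokens into a shared d-dim
    latent space, tagged with temporal and modality embeddings."""
    rng = rng or np.random.default_rng(0)
    T, ds = states.shape
    _, da = actions.shape
    W_s = rng.normal(size=(ds, d))   # state encoder (stand-in for a learned module)
    W_a = rng.normal(size=(da, d))   # action encoder (stand-in for a learned module)
    pos = rng.normal(size=(T, d))    # temporal (positional) embedding per timestep
    typ = rng.normal(size=(2, d))    # modality/type embedding: row 0=state, 1=action
    tokens = np.empty((2 * T, d))
    tokens[0::2] = states @ W_s + pos + typ[0]   # interleave s_t, a_t per timestep
    tokens[1::2] = actions @ W_a + pos + typ[1]
    return tokens
```

The interleaved `(2T, d)` token sequence is what a bidirectional Transformer (or SSM) backbone would consume.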
Masking Dimensions:
- Temporal: Random, block/contiguous, periodic, gap-aware.
- Modal: State, action, reward, POI, map, energy, velocity, etc.
- Spatial/Agent: Mask subsets of agents, polylines, spatial patches, or region blocks.
- Semantic: Mask by seasonality/trend, or conditional on domain semantics.
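The first three masking dimensions above can be sketched as simple boolean-mask generators (a minimal illustration; real implementations operate on token grids with richer semantics):

```python
import numpy as np

def random_mask(T, ratio, rng):
    """Temporal masking: hide a random subset of timesteps."""
    m = np.zeros(T, dtype=bool)
    m[rng.choice(T, size=int(ratio * T), replace=False)] = True
    return m

def block_mask(T, start, length):
    """Block/contiguous masking, e.g. hiding a future span for forecasting."""
    m = np.zeros(T, dtype=bool)
    m[start:start + length] = True
    return m

def modality_mask(T, n_modalities, masked_modality):
    """Modal masking: hide one modality (e.g. all actions) at every step."""
    m = np.zeros((T, n_modalities), dtype=bool)
    m[:, masked_modality] = True
    return m
```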
Decoder: A specialized module (transformer decoder, MLP, or diffusion denoiser) operates over the union of unmasked and [MASK] tokens to predict the original trajectory tokens.
Losses:
- Reconstruction: MSE (continuous states/positions), cross-entropy/NLL (actions, categories, codebook tokens), Chamfer Distance (3D geometry), or WTA.
- Contrastive/InfoNCE: Applied to series embeddings to align masked copies or semantic views (Dong et al., 2023, Seo et al., 13 Jun 2025).
- Auxiliary: Energy regression, topological constraints, KL-divergence (in CVAEs), entropy regularization for policy models (Young et al., 4 Feb 2025, Wen et al., 2024, Long et al., 23 Jan 2025, Wen et al., 2024).
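Because modalities differ in type, the reconstruction loss is typically a sum of per-modality terms restricted to masked positions. A minimal sketch mixing MSE (continuous states) with cross-entropy (discrete actions), under the assumption of a shared timestep mask:

```python
import numpy as np

def mixed_masked_loss(pred_states, true_states, logits_actions, true_actions, mask):
    """Per-modality masked losses: MSE for continuous states,
    cross-entropy (NLL) for discrete action tokens, summed."""
    mse = np.mean((pred_states[mask] - true_states[mask]) ** 2)
    # numerically stable softmax cross-entropy over masked action tokens
    z = logits_actions[mask]
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -np.mean(logp[np.arange(len(z)), true_actions[mask]])
    return mse + nll
```

With uniform logits over $K$ action classes the NLL term reduces to $\ln K$, a useful sanity check when wiring up such losses.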
3. Masking Strategies and Task Unification
MTMs explicitly leverage task formulation through mask patterns, enabling unification of diverse tasks (Wu et al., 2023, Long et al., 23 Jan 2025):
| Task | Mask Pattern | Output (Prediction Target) |
|---|---|---|
| Forward Dynamics | Mask next state | $s_{t+1}$ given $(s_t, a_t)$ |
| Inverse Dynamics | Mask action | $a_t$ given $(s_t, s_{t+1})$ |
| Policy/BC | Mask next action | $a_t$ conditioned on $s_{\le t}$ (and optionally return) |
| Reward Model | Mask reward | $r_t$ or return $R_t$ given $(s_t, a_t)$ |
| Imputation | Mask arbitrary subsequence | Reconstruct missing parts anywhere in $\tau$ |
| Representation | Mask actions/returns, encode | Downstream embeddings for RL or transfer |
| Prediction | Mask all future timepoints | $x_{>t}$ conditioned on observed $x_{\le t}$ |
| Multi-Modal | Mask POI/actions separately | Cross-modal infilling; trajectory completion |
| Trajectory Generation | Mask all steps | Unconditional sample from the learned mobility prior |
The mask pattern thus acts as a "mask-pattern prompt": a single model emulates multiple specialized models simply by adjusting the mask at inference time.
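A hedged sketch of how such mask-pattern prompts could be encoded over a (timestep, modality) token grid; the modality layout (0=state, 1=action, 2=reward) and task names are illustrative, not taken from any one paper:

```python
import numpy as np

S, A, R = 0, 1, 2  # modality columns: state, action, reward (illustrative)

def task_mask(task, T):
    """Build a boolean (T, 3) mask 'prompt' selecting a task at inference:
    True = token to predict, False = given context."""
    m = np.zeros((T, 3), dtype=bool)
    if task == "forward_dynamics":
        m[-1, S] = True          # predict the final state from earlier s, a
    elif task == "inverse_dynamics":
        m[:, A] = True           # predict actions between observed states
    elif task == "bc_policy":
        m[-1, A] = True          # predict the next action (behavior cloning)
    elif task == "prediction":
        m[T // 2:, :] = True     # mask the entire future
    elif task == "generation":
        m[:, :] = True           # mask everything: unconditional sampling
    return m
```

The same trained network consumes all of these masks; only the prompt changes per task.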
In conditional diffusion settings (e.g., (Long et al., 23 Jan 2025)), the mask schema recasts generation, recovery, and prediction all as conditional inpainting.
4. Applications and Representative Results
Reinforcement Learning and Control
- Unified Backbone: A single pretrained MTM can serve as forward model, inverse model, policy (offline/RCBC), return-to-go predictor, and representation extractor from the same weights (Wu et al., 2023, Wen et al., 2024).
- Empirical Results: On D4RL benchmarks, the same MTM backbone matches or surpasses Decision Transformer (DT), IQL, and CQL (e.g., average normalized score of 78.7 vs. 74.7 for DT) without specialized RL heads (Wu et al., 2023). In M³PC (Wen et al., 2024), test-time MPC atop a frozen MTM achieves +15.3% return gain over baseline transformers.
Trajectory Prediction and Multi-Agent Scenarios
- Autonomous Driving: Masked autoencoders for trajectory prediction (Traj-MAE) reduce collision and miss rates by 9–23% relative to Autobots baselines, especially with high (50–60%) masking ratios and continual pretraining (Chen et al., 2023).
- Sports: With ghost spatial masking and bidirectional temporal SSMs, MTMs achieve 10–28% lower average displacement and better spatial-temporal recovery on new sports datasets (Basketball-U, Football-U, Soccer-U) (Xu et al., 2024).
Mobility Analytics
- Generalization: The GenMove diffusion-based MTM handles unconditional generation, in-filling, prediction, and recovery without task-specific architecture, outperforming strong baselines by up to 13% on generation, 6% on recovery, and 1–5% in prediction (Accuracy@k) (Long et al., 23 Jan 2025).
- Semantic Reasoning: GPS-MTM models POI transitions and actions, yielding 10–30% higher accuracy and more uniform recall/bias on large-scale mobility datasets (Garg et al., 28 Sep 2025).
Scientific/Physical Trajectories
- PoLAr-MAE: On LArTPC data, the MTM backbone matches or exceeds supervised methods in track/shower classification with no labels, but sub-token microstructures remain underrepresented due to patching granularity (Young et al., 4 Feb 2025).
Vision-and-Language Navigation
- Performance: The MTM proxy task for VLN yields a +2.3–3.7 pt success-rate (SR) improvement, outperforming alternative future-view or panorama masking proxies (Li et al., 2023).
Time Series Analysis
- Manifold Learning: SimMTM’s neighbor-aggregation reconstruction yields 8–15% MSE reduction and 8% accuracy improvement in classification and forecasting over prior masked or contrastive pretraining methods (Dong et al., 2023).
- Component-Aware Masking: ST-MTM’s trend/season decomposition with period/subseries masks and masked-contrastive losses produces the best average MSE/MAE across standard forecasting benchmarks (MSE=0.32 vs. 0.33–0.47) (Seo et al., 13 Jun 2025).
5. Advanced Masking Techniques and Model Extensions
- Structured and Adaptive Masking: Task-aligned patterns, such as goal-masking (Tang et al., 2022), ghost spatial masking (Xu et al., 2024), periodic block masking for seasonality (Seo et al., 13 Jun 2025), and curriculum-driven/learned masks, allow MTMs to efficiently learn from domain structure.
- Auxiliary Modalities and Losses: Incorporation of auxiliary tasks—energy regression (Young et al., 4 Feb 2025), classifier-free guidance (contextual embedding dropout) (Long et al., 23 Jan 2025), or entropy constraints (Wen et al., 2024)—broadens the applicability to rich scientific or control settings.
- Continual Pretraining: Multi-stage or continual schedules mitigate catastrophic forgetting across masking strategies (Traj-MAE), preserving performance across social, temporal, or map-structure masking regimes (Chen et al., 2023).
- Online/Adaptive Inference: Test-time training with actor-specific token memory or planning-at-inference (T4P, M³PC) enables adaptation to distributional shifts or downstream tasks without further environment interactions or task-specific retraining (Park et al., 2024, Wen et al., 2024).
- Diffusion Backbones: Diffusion-based MTMs generalize to arbitrary mask-inpainting tasks, supporting controllable, zero-shot, and privacy-aware mobility modeling (Long et al., 23 Jan 2025).
6. Limitations and Future Directions
- Tokenization and Coverage: Fixed-radius or patching approaches may underrepresent sub-token features or rare classes, especially in scientific or dense multi-agent scenarios (Young et al., 4 Feb 2025).
- Scalability: Handling long-horizon, high-dimensional, or pixel-observation trajectories remains computationally challenging for transformer-based MTMs; scaling architectural capacity and designing more efficient tokenizations are open problems (Wu et al., 2023, Wen et al., 2024).
- Mask Strategy Optimization: Hand-designed mask schedules may be suboptimal; learned, adaptive, or curriculum-based masking (with sensitivity to semantic or temporal structure) could further improve pretext task alignment (Seo et al., 13 Jun 2025, Park et al., 2024).
- Domain-Specificity: Some domains (e.g., heavy-tailed POI distributions in GPS (Garg et al., 28 Sep 2025), subtle infilling in physical trajectories) may require specialized representations, hybrid models (e.g., fused SSM/Transformer architectures), or additional auxiliary tasks.
- Online/Interactive Use: Efficient integration of MTM pretraining with online RL, planning, anomaly detection, and transfer remains a key area for foundation model generality, especially in non-stationary and multi-modal environments.
- Privacy and Data Sharing: For decentralized or privacy-sensitive mobility settings, federated MTM approaches hold promise for multi-institutional pretraining without raw-trajectory sharing (Long et al., 23 Jan 2025).
7. Representative Model Variants and Benchmarks
| Model / Paper | Domain | Masking Type | Key Results Highlight |
|---|---|---|---|
| Masked Trajectory Model (MTM) (Wu et al., 2023) | RL/control | random, contrastive | One model matches/exceeds task-specific RL heads (D4RL, Adroit) |
| ST-MTM (Seo et al., 13 Jun 2025) | Time series | trend/seasonal | Best MSE/MAE on 9 forecasting datasets |
| Traj-MAE (Chen et al., 2023) | Multi-agent driving | social, temporal, map | Up to 23% miss-rate reduction, robust continual pretraining |
| GenMove (Long et al., 23 Jan 2025) | Mobility/trajectories | mask-unified, diffusion | Unified recovery/prediction/generation, +13% JSD vs. SOTA |
| GPS-MTM (Garg et al., 28 Sep 2025) | GPS/mobility | state-action, span | 10–30% gain in infilling, bias reduction |
| PoLAr-MAE (Young et al., 4 Feb 2025) | Particle physics | volumetric, energy | SOTA unsupervised track/shower classification F-scores |
| T4P (Park et al., 2024) | Driving/test-time | autoencoder, actor-memory | Best cross-dataset mADE/mFDE, fast adaptation |
| M³PC (Wen et al., 2024) | RL/control, planning | multi-mask, MPC | +15.3% normalized return with no retraining |
Extensions and further applications appear in goal-conditioned planning (Wen et al., 2024), vision-and-language navigation (Li et al., 2023), masked conditional mixture modeling (Tang et al., 2022), and physics-driven point trajectory modeling (Young et al., 4 Feb 2025).
In summary, Masked Trajectory Models provide a general-purpose foundation for trajectory-based learning, integrating a broad family of self-supervised mask-and-reconstruct paradigms that unify tasks ranging from forecasting to RL, prediction, imputation, and generation across a diverse set of domains (Wu et al., 2023, Chen et al., 2023, Seo et al., 13 Jun 2025, Long et al., 23 Jan 2025, Garg et al., 28 Sep 2025, Wen et al., 2024, Dong et al., 2023, Xu et al., 2024, Young et al., 4 Feb 2025).