
Multimodal Trajectory Predictions

Updated 10 February 2026
  • Multimodal trajectory prediction is the process of estimating multiple plausible future paths for agents by analyzing past movement and scene context.
  • Methodologies include latent-variable generative models, anchor-based frameworks, and grid-based approaches to capture diverse behaviors.
  • These techniques support autonomous driving and robotics by enabling risk-aware planning, evaluated using metrics like minADE, minFDE, and mAP.

Multimodal trajectory prediction refers to the data-driven estimation, from agent history and context, of multiple diverse and plausible candidate futures for an agent in dynamic, interactive environments. Unlike deterministic single-output models, which only provide a point estimate of future behavior, multimodal predictors produce a probability distribution or set of trajectory hypotheses, covering the space of feasible agent actions under the inherent uncertainty induced by ambiguous intent, interaction, and scene structure (Huang et al., 2023). This problem domain is central in autonomous driving, social robotics, and human–computer interaction, as it directly supports downstream safety-critical planning and risk assessment.

1. Problem Formulation and Multimodality

The core challenge in multimodal trajectory prediction is that, for any observed partial trajectory X = \{x_1, \dots, x_{T_\text{obs}}\}, there are often multiple non-exclusive, socially and physically plausible futures Y_{1:K} = \{Y^{(k)}\}_{k=1}^K. Multimodal methods define a conditional distribution p(Y \mid X, \mathcal{S}) over future paths Y, conditioned not only on agent history X but also on scene context \mathcal{S} (such as HD maps, raster images, or the trajectories of nearby agents) (Huang et al., 2023, Cui et al., 2018, Phan-Minh et al., 2019). The multimodal setting is characterized by:

  • Ambiguous intent: Multiple goals or maneuvers (e.g., turning left vs. continuing straight) may be equally likely given past data.
  • Social and physical interaction: The presence and anticipated behavior of other agents can lead to stochastic branching.
  • Complex context: The spatial layout, road geometry, and semantic scene understanding constrain possible goals and paths.

Models represent output multimodality via explicit enumeration of K trajectory hypotheses with probabilities, structured probabilistic output (e.g., mixture models), or implicit generative sampling (Eiffert et al., 2020, Wang et al., 2020, Takeyama et al., 2024).
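As a concrete illustration of the "structured probabilistic output" option, the sketch below evaluates a future trajectory under a K-mode Gaussian mixture head with diagonal covariance. All names and shapes are illustrative, not taken from any specific paper:

```python
import numpy as np

def mixture_nll(y_true, means, log_sigmas, logit_pi):
    """Negative log-likelihood of a future trajectory under a K-mode
    Gaussian mixture with diagonal covariance (illustrative sketch).

    y_true:     (T, 2) ground-truth future positions
    means:      (K, T, 2) per-mode predicted trajectories
    log_sigmas: (K, T, 2) per-mode log standard deviations
    logit_pi:   (K,) unnormalised mode scores
    """
    log_pi = logit_pi - np.logaddexp.reduce(logit_pi)  # log-softmax over modes
    # Per-mode Gaussian log-likelihood, summed over time steps and x/y.
    z = (y_true[None] - means) / np.exp(log_sigmas)
    log_comp = -0.5 * np.sum(z**2 + 2 * log_sigmas + np.log(2 * np.pi),
                             axis=(1, 2))              # shape (K,)
    # Log-sum-exp over modes gives the mixture likelihood.
    return -np.logaddexp.reduce(log_pi + log_comp)
```

Because the mixture marginalises over modes, no single hypothesis is forced to match the ground truth, which is what allows such heads to represent ambiguous intent.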

2. Model Taxonomy and Methodological Advances

A comprehensive taxonomy of multimodal trajectory prediction methods, as synthesized in the survey of Huang et al. (2023), comprises:

A. Latent-variable generative frameworks

  • GANs: A generator G maps history–noise pairs to future trajectories; diversity arises from sampling distinct z \sim p(z) (Eiffert et al., 2020).
  • CVAEs/VRNNs: Conditional variational auto-encoders sample z \sim q(z \mid X, Y), decode Y from (X, z), and train with ELBO objectives; VRNNs allow time-varying stochasticity (Brito et al., 2020).
  • Normalizing Flows/Diffusion: Invertible flow models or denoising diffusion models sample diverse outputs via stochastic paths (Yan et al., 10 Jun 2025).
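The shared mechanism across these latent-variable families can be sketched in a few lines: diversity comes from drawing several latent samples and decoding each one. The decoder below is a toy stand-in for a learned network, included only to make the sketch runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_futures(history, decoder, k=6, z_dim=16):
    """Draw k trajectory hypotheses by sampling the latent z ~ p(z) = N(0, I)
    and decoding each (history, z) pair, as in GAN/CVAE-style predictors."""
    return [decoder(history, rng.standard_normal(z_dim)) for _ in range(k)]

def toy_decoder(history, z):
    """Toy linear decoder standing in for a learned network (illustrative):
    extrapolates the last observed position, perturbed by the latent sample."""
    last = history[-1]
    return last + 0.1 * z[:2]
```

Each call to the decoder with a different z yields a different hypothesis, which is the source of output multimodality in this family.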

B. Anchor-conditioned and prototype-based frameworks

  • Endpoint-conditioned models: Predict multimodal endpoint distributions via heatmaps or set-based classification (e.g., “goal candidates” on vectorized lanes), then decode full trajectories conditioned on sampled endpoints (Yuan et al., 2021, Dendorfer et al., 2020).
  • Trajectory-set classification: Model fixed or dynamic sets of feasible trajectories as discrete classes; outputs are mode-probabilities over sets of physically realizable behaviors (Phan-Minh et al., 2019).
  • Prototype-based clustering/classification: Discover high-level behavior modes by clustering in latent space, then classify and synthesize specific trajectories for each mode (Sun et al., 2021).
  • Topological invariance: Collapse joint agent behaviors into a combinatorial set of modes using topological signatures (e.g., winding numbers), then learn to reconstruct continuous trajectories for each mode (Roh et al., 2020).
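A minimal sketch of the trajectory-set classification idea (in the spirit of CoverNet, though not its actual implementation): the network scores a precomputed set of feasible trajectories, and the top-k modes are read off from a softmax over those scores.

```python
import numpy as np

def classify_trajectory_set(scores, traj_set, top_k=3):
    """Rank a fixed set of candidate trajectories by predicted mode
    probability (softmax over per-candidate logits). Illustrative sketch.

    scores:   (N,) logits from the network, one per candidate
    traj_set: (N, T, 2) precomputed, physically feasible trajectories
    """
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(-probs)[:top_k]     # indices of the top_k modes
    return traj_set[order], probs[order]
```

Because every candidate in the set is feasible by construction, the predicted modes are physically realizable even when the classifier is uncertain.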

C. Grid-based and occupancy approaches

  • Predict discrete distributions over spatial grids or heatmaps at endpoints (or waypoints), followed by downstream trajectory synthesis and mode compression (Yuan et al., 2021).
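The "mode compression" step above can be sketched as a greedy pick-and-suppress over a predicted endpoint heatmap, so that the k selected endpoints cover distinct regions. This is a simple non-maximum-suppression-style scheme, not a specific paper's method:

```python
import numpy as np

def sample_endpoints_from_heatmap(heatmap, k=3, suppress_radius=2):
    """Greedily pick k endpoint cells from a predicted spatial probability
    grid, suppressing a neighbourhood around each pick so that the k modes
    cover distinct regions (illustrative sketch)."""
    h = heatmap.copy()
    picks = []
    for _ in range(k):
        idx = np.unravel_index(np.argmax(h), h.shape)
        picks.append(idx)
        r0, r1 = max(idx[0] - suppress_radius, 0), idx[0] + suppress_radius + 1
        c0, c1 = max(idx[1] - suppress_radius, 0), idx[1] + suppress_radius + 1
        h[r0:r1, c0:c1] = -np.inf  # suppress the picked neighbourhood
    return picks
```

A full trajectory is then decoded for each selected endpoint in the downstream synthesis stage.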

D. Mixture density and GMM-based output models

  • Predict GMM parameters (mode means, covariances, and mixture weights) directly from the network via mixture-density heads, yielding a closed-form multimodal output distribution (Eiffert et al., 2020, Brito et al., 2020).

E. Attention and context-adaptive models

  • Employ class-aware, lane-aware, or context-pruned attention to dynamically filter and fuse the influence of neighbors and map elements according to predicted intention and goal occupancy (Pathiraja et al., 2022, Sun et al., 12 Apr 2025).

3. Core Architectures and Training Objectives

Leading multimodal trajectory predictors typically combine an agent-history encoder, a scene/map context encoder, an agent-interaction module, and a multimodal decoder that emits the mode set.

Typical training objectives blend negative log-likelihood, winner-take-all regression over the K predicted modes, focal/mode-assignment loss, and auxiliary intent/occupancy cross-entropy terms (Cui et al., 2018, Sun et al., 2021, Sun et al., 12 Apr 2025). Explicit mode-diversity or coverage-enhancing losses are sometimes incorporated to prevent mode collapse (Wang et al., 2020, Yuan et al., 2021).
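The winner-take-all term can be sketched as follows: only the mode closest to the ground truth receives a regression penalty, which discourages the decoder from averaging all modes into one blurry prediction. Names and shapes are illustrative:

```python
import numpy as np

def winner_take_all_loss(y_true, y_modes):
    """Winner-take-all regression: only the closest of the K predicted
    modes is penalised, discouraging mode averaging (illustrative sketch).

    y_true:  (T, 2) ground-truth future
    y_modes: (K, T, 2) predicted mode trajectories
    Returns the best mode's mean displacement error and its index.
    """
    # Mean per-step Euclidean error for each mode, shape (K,).
    errs = np.mean(np.linalg.norm(y_modes - y_true[None], axis=-1), axis=-1)
    best = int(np.argmin(errs))
    return errs[best], best
```

In practice the returned index also supplies the target for the mode-classification (assignment) loss mentioned above.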

4. Datasets and Evaluation Metrics

Standard datasets provide the measurement backbone for comparative evaluation, including:

| Dataset           | Domain     | Obs / Pred (s) | #Scenes | Features           |
|-------------------|------------|----------------|---------|--------------------|
| ETH/UCY           | Pedestrian | 3.2 / 4.8      | 5,000+  | Social, open space |
| SDD               | Pedestrian | 2.0 / 4.0      | 8,000+  | Multi-agent, dense |
| Argoverse 1/2     | Vehicle    | 2 / 3          | 327k+   | HD maps, urban     |
| Waymo Open Motion | Vehicle    | 5 / 8          | 200k+   | Multi-agent, large |
| nuScenes          | Vehicle    | 2 / 6          | 1,000   | Map, lidar/radar   |

Metrics are tailored to the multimodal setting (Huang et al., 2023):

  • minADE_K: Minimum Average Displacement Error over the top K modes.
  • minFDE_K: Minimum Final Displacement Error over the top K modes.
  • Miss Rate @ d: Fraction of samples with no prediction within d meters.
  • Probability-aware metrics: mAP, Soft mAP, PCMD, KDE-NLL, evaluating the calibration and coverage of probabilistic outputs.
  • Distribution-aware metrics: EMD (Earth Mover’s Distance), multi-ground-truth precision/recall (when available) (Huang et al., 2023).
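The three displacement-based metrics above have short, standard definitions; a minimal NumPy sketch:

```python
import numpy as np

def min_ade(y_true, y_modes):
    """minADE_K: smallest average displacement error over the K modes."""
    return np.min(np.mean(np.linalg.norm(y_modes - y_true[None], axis=-1),
                          axis=-1))

def min_fde(y_true, y_modes):
    """minFDE_K: smallest final-step displacement error over the K modes."""
    return np.min(np.linalg.norm(y_modes[:, -1] - y_true[-1], axis=-1))

def miss_rate(y_trues, y_modes_batch, d=2.0):
    """Fraction of samples whose best mode ends more than d metres away."""
    misses = [min_fde(gt, modes) > d
              for gt, modes in zip(y_trues, y_modes_batch)]
    return float(np.mean(misses))
```

Note the min-over-K structure: a model is rewarded as long as any one of its K hypotheses is close, which is exactly the "information leak" concern raised in Section 7.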

5. Representative Algorithms and Empirical Performance

Exemplar methods highlight the diversity of methodological approaches and their empirical trade-offs:

Generative Methods:

  • PCGAN employs MDN heads and adversarial loss, with explicit social-vehicle attention (Eiffert et al., 2020).
  • Social-VRNN learns a one-shot latent-variable model, directly outputting GMM parameters; it achieves average ADE/FDE of 0.44/0.61 m on ETH/UCY (Brito et al., 2020).

Anchor and Classification Approaches:

  • CoverNet classifies the agent’s future over a dynamically constructed, physically feasible trajectory set (K \approx 1000), reaching minADE_5 ≈ 1.48 m on nuScenes (Phan-Minh et al., 2019).
  • PCCSNet (modality clustering + classification + synthesis) reduces ETH/UCY ADE by 19% over STAR through learned prototype assignment and modal synthesis (Sun et al., 2021).

Map-and-Goal-Conditioned Paradigms:

  • PGP employs discrete rollout over traversals in lane-graphs (lateral modes) and latent-variable modeling for longitudinal diversity; it attains state-of-the-art minADE_{10} = 1.00 m on nuScenes (Deo et al., 2021).
  • Goal-GAN factors prediction into interpretable goal estimation and local routing, achieving mode coverage >92% in synthetic 4-way tasks and surpassing Social-BiGAT on ETH/UCY (Dendorfer et al., 2020).

Attention and Context-pruned Frameworks:

  • Class-aware attention integrates agent class and dimensions into scene–neighbor weighting, improving minADE_5 to 1.67 m on nuScenes at more than 300 FPS (Pathiraja et al., 2022).
  • IMPACT jointly predicts intention and mode-conditioned trajectory, using learned adaptive context trimming for large-scale scenarios. On Waymo Open Motion, IMPACT achieves Soft mAP=0.4721 (without LiDAR), improving over BeTOP by 10% and supporting real-time vehicle deployment (Sun et al., 12 Apr 2025).

Flow-matching and Diffusion Models:

  • TrajFlow introduces flow matching with single-pass N_q-trajectory inference, Plackett–Luce ranking, and self-conditioning, reaching top-tier performance (minADE = 0.5712, minFDE = 1.1662) on Waymo, with real-time inference throughput (Yan et al., 10 Jun 2025).

6. Application Contexts and Planning Integration

Multimodal trajectory prediction enables:

  • Motion planning: Planners compute risk-minimizing actions by considering the probability and geometry of each predicted mode (Roh et al., 2020, Cui et al., 2018).
  • Risk assessment: Models such as AOI-augmented gaze predictors support early intent inference and lead to earlier collision warning at intersections (risk lead time ~3 s, 0 false alarms in simulation) (Zhang et al., 2022).
  • Human–robot interaction: Multimodal frameworks facilitate anticipation during cooperative/competitive navigation, including multi-agent scenarios with explicit topological intent encoding (Roh et al., 2020).
  • Embodied AI: Multimodal agent models (e.g., TR-LLM) fuse language, spatial, and kinematic context for robust action and object anticipation in partially observed scenes (Takeyama et al., 2024).

Explicitly mode-aware frameworks improve interpretability and provide calibrated uncertainty estimates, crucial for planners to hedge against rare but critical outcomes.
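As a sketch of how a planner can consume mode probabilities for risk-aware hedging: weight a binary per-mode collision check by each mode's probability to obtain the expected collision risk of a candidate ego plan. The function and its threshold are illustrative, not a specific planner's cost:

```python
import numpy as np

def expected_collision_risk(ego_plan, mode_trajs, mode_probs, radius=2.0):
    """Probability-weighted collision risk of one ego plan against the K
    predicted modes of another agent (illustrative sketch).

    ego_plan:   (T, 2) candidate ego trajectory
    mode_trajs: (K, T, 2) predicted modes for the other agent
    mode_probs: (K,) mode probabilities (summing to 1)
    """
    # Per-mode, per-step distance between ego and the other agent, (K, T).
    dists = np.linalg.norm(mode_trajs - ego_plan[None], axis=-1)
    # A mode "collides" if any step comes within `radius` metres.
    collides = (dists.min(axis=-1) < radius).astype(float)
    return float(np.dot(mode_probs, collides))
```

This is why calibration matters: an overconfident low-probability mode that actually occurs leads the planner to under-hedge exactly the rare, critical outcome.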

7. Open Challenges and Future Directions

Despite substantial advances, several challenges persist:

  • Evaluation: Existing min-over-K metrics can suffer from information leakage; distribution-aware metrics requiring multiple ground-truth trajectories remain uncommon (Huang et al., 2023).
  • Mode coverage vs plausibility: Generative models often maximize diversity at accuracy’s expense, whereas anchor-based and classification models may miss rare but valid behaviors.
  • Real-time constraints: High KK-mode decoders lead to heavy computational loads; non-autoregressive or joint decoding techniques (e.g., single-pass flow-matching, context-pruning) improve efficiency (Yan et al., 10 Jun 2025, Sun et al., 12 Apr 2025).
  • Explainability and semantic grounding: Integration of language-based intent, visual AOI, and explicit behavioral priors remains a research frontier (Zhang et al., 2022, Takeyama et al., 2024).
  • Coverage of interaction topologies: Topological and graph-based invariances offer strong guarantees for intersection navigation, but scaling to open-domain multi-agent scenes and rare maneuvers remains open (Roh et al., 2020).

Research continues toward joint prediction–planning architectures, explainable and human-interpretable mode discovery, and the design of metrics capturing both diversity and physical/social plausibility of predicted distributions (Huang et al., 2023).


For further methodological and empirical details, consult the foundational and recent modeling papers (Cui et al., 2018, Phan-Minh et al., 2019, Luo et al., 2020, Wang et al., 2020, Yuan et al., 2021, Sun et al., 2021, Deo et al., 2021, Dendorfer et al., 2020, Brito et al., 2020, Zhang et al., 2022, Pathiraja et al., 2022, Sun et al., 12 Apr 2025, Yan et al., 10 Jun 2025, Huang et al., 2023, Roh et al., 2020, Takeyama et al., 2024, Sharma et al., 2024).
