MTP Loss for Multi-Modal Trajectory Forecasting
- Multiple-Trajectory Prediction (MTP) Loss is a supervised objective that predicts multiple future trajectories with associated probabilities to handle inherent uncertainty.
- It integrates joint classification-regression loss terms and mode-wise winner selection, using hyperparameters such as the relaxation $\epsilon$ and trade-off $\alpha$ to balance diversity and precision.
- Widely applied in sports analytics, autonomous driving, and robotics, MTP Loss is extended with auxiliary losses to enhance rule adherence and mitigate off-road errors.
Multiple-Trajectory Prediction (MTP) Loss defines a supervised learning objective for multi-modal sequence forecasting, where the future evolution of an agent (e.g., player, vehicle) is inherently uncertain and best described as a finite set of alternative (mode) hypotheses, each with an associated probability. MTP loss optimizes both the diversity and accuracy of these hypotheses while providing a probabilistically interpretable selection mechanism using a mode-wise winner selection, joint classification-regression loss terms, and carefully weighted regularization. It has become a foundational component in state-of-the-art trajectory forecasting across sports analytics (Hauri et al., 2020), autonomous driving (Greer et al., 2020), and robotics (Thiede et al., 2019), with numerous extensions for rule-consistent, diverse, and robust multi-agent behavior prediction.
1. Formal Definition and Mathematical Structure
Given a ground-truth future sequence $\tau$ (e.g., velocity, position) and $M$ model-predicted hypotheses $\{\hat{\tau}_m\}_{m=1}^{M}$ with associated confidence logits and softmax probabilities $p_m$ per example, the canonical MTP loss (Hauri et al., 2020) is

$$\mathcal{L}_{\mathrm{MTP}} = \sum_{m=1}^{M} w_m \left[ \mathcal{L}_{\mathrm{cls}}(m) + \alpha \, \mathcal{L}_{\mathrm{reg}}(m) \right],$$

where for each mode $m$, $\mathcal{L}_{\mathrm{cls}}(m) = -\log p_m$ and $\mathcal{L}_{\mathrm{reg}}(m)$ is the regression error between $\hat{\tau}_m$ and $\tau$. The Kronecker-delta-like weights enforce "winner-takes-most":

$$w_m = \begin{cases} 1 - \epsilon & m = m^* \\ \epsilon / (M - 1) & m \neq m^* \end{cases}$$

with $m^* = \arg\min_m d(\hat{\tau}_m, \tau)$, using a user-selectable distance $d$ (MSE, final displacement, or final velocity). The relaxation $\epsilon \in [0, 1)$, annealed during training, prevents mode collapse and encourages initial mode diversity. The overall objective may be augmented by $\ell_2$ regularization:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{MTP}} + \lambda \lVert \theta \rVert_2^2,$$

where $\theta$ are model parameters (Hauri et al., 2020).
In the standard MTP formulation used in autonomous driving (Greer et al., 2020), the classification loss is paired with a regression loss evaluated exclusively on the winner; the weighting is a hard 1/0 selection (i.e., $\epsilon = 0$), but the mechanism is equivalent.
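The per-example objective above can be sketched in NumPy. This is a minimal illustration, not the reference implementation; the tensor shapes and the `alpha`/`eps` argument names are assumptions:

```python
import numpy as np

def mtp_loss(pred_trajs, logits, gt_traj, alpha=1.0, eps=0.5):
    """Per-example MTP loss (sketch).

    pred_trajs: (M, T, D) predicted hypotheses; logits: (M,) confidence logits;
    gt_traj: (T, D) ground truth. Uses MSE both as regression loss and as the
    winner-selection distance d (one of the allowed choices).
    """
    M = pred_trajs.shape[0]
    # Softmax over confidence logits -> mode probabilities p_m.
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    # Per-mode regression loss (MSE to the ground truth), shape (M,).
    reg = ((pred_trajs - gt_traj) ** 2).mean(axis=(1, 2))
    # Winner m*: mode with minimum distance under the same metric.
    winner = int(np.argmin(reg))
    # Winner-takes-most weights: 1 - eps for the winner, eps/(M-1) elsewhere.
    w = np.full(M, eps / (M - 1) if M > 1 else 0.0)
    w[winner] = 1.0 - eps
    # Joint classification (-log p_m) + alpha * regression, weighted by w_m.
    loss = float((w * (-np.log(p + 1e-12) + alpha * reg)).sum())
    return loss, winner
```

With $\epsilon = 0$ this reduces to the hard winner-takes-all variant: only the winning mode receives regression and classification gradient.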
2. Mode Selection, Loss Weighting, and Regularization
Selection of the "winning" mode is central: after predicting $M$ trajectories, the mode $m^*$ minimally distant from the ground truth under the selected metric is chosen. All gradient flow to predicted trajectories and their probabilities is then weighted by $w_m$. This restricts regression and classification penalties predominantly to the winner, while softly supervising sub-optimal modes early in training if $\epsilon > 0$ (Hauri et al., 2020).
The hyperparameter $\alpha$ scales the regression loss relative to classification and is typically set so both terms have comparable scale. Empirically, $\epsilon$ is critical: values of $0.25$–$0.75$ for early iterations, annealed to $0$ over several epochs, maximize multi-modality while permitting eventual specialization (Hauri et al., 2020). Weight decay provides additional regularization.
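A linear schedule is one simple way to realize this annealing. The schedule shape and argument names below are illustrative assumptions; the cited work only specifies that $\epsilon$ is annealed to zero over several epochs:

```python
def anneal_eps(epoch, eps0=0.5, anneal_epochs=10):
    """Linearly anneal the relaxation eps from eps0 at epoch 0 down to 0.

    Early epochs keep eps high so all modes receive some supervision
    (diversity); later epochs let the winner specialize.
    """
    return max(0.0, eps0 * (1.0 - epoch / anneal_epochs))
```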
Choice of $d$ is context-dependent: mean squared error is preferred for short-range predictions, while final-displacement metrics exhibit greater robustness over longer horizons.
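The candidate distance metrics can be sketched as follows; the function and metric names are illustrative, not from the cited implementations:

```python
import numpy as np

def winner_distance(pred, gt, metric="mse"):
    """Candidate distances d(pred, gt) for winner selection; pred, gt: (T, D)."""
    if metric == "mse":   # average squared error over the whole horizon
        return float(((pred - gt) ** 2).mean())
    if metric == "fde":   # final displacement: endpoint error only
        return float(np.linalg.norm(pred[-1] - gt[-1]))
    if metric == "fve":   # final velocity: error in the last-step displacement
        return float(np.linalg.norm((pred[-1] - pred[-2]) - (gt[-1] - gt[-2])))
    raise ValueError(f"unknown metric: {metric}")
```

Note that a trajectory deviating mid-horizon but ending at the correct point scores zero under `fde` while still being penalized under `mse`, which is why the endpoint metrics are more forgiving over long horizons.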
3. Connections to Minimum-over-N (MoN) Loss, Diversity, and Density Bias
MTP loss generalizes the Minimum-over-N (MoN) or "variety" loss (Thiede et al., 2019), which for $N$ i.i.d. samples $\{\hat{y}_i\}_{i=1}^{N}$ drawn from a learned density assigns as the per-example cost the minimum distance to the ground truth $y$:

$$\mathcal{L}_{\mathrm{MoN}} = \mathbb{E}_{y} \left[ \min_{i = 1, \dots, N} \lVert \hat{y}_i - y \rVert^2 \right].$$
This training strategy encourages diverse outputs but exhibits a critical bias: minimization of the population-level MoN loss yields a learned density proportional to $\sqrt{p}$ rather than the true density $p$, causing systematic "dilation" of the predicted density (Thiede et al., 2019). This effect is validated on both synthetic and complex real datasets (e.g., NGSIM, Social-GAN on Zara1) and is correctable, in part, via "squared-sampling" or KDE power-reweighting to optimize test log-likelihoods.
A plausible implication is that vanilla MTP best-of-$M$ training will always trade off calibration and likelihood against expressive diversity, particularly for large $M$ and high-dimensional outputs, necessitating density compensation for likelihood-centric evaluations.
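The MoN cost itself is simple to state in code. A minimal sketch, with assumed tensor shapes:

```python
import numpy as np

def mon_loss(samples, gt):
    """Minimum-over-N ('variety') loss for a single example.

    samples: (N, T, D) i.i.d. trajectory samples from the model;
    gt: (T, D) ground-truth trajectory. Only the best sample is penalized,
    which drives the diversity-inducing (and density-dilating) behavior.
    """
    d = ((samples - gt[None]) ** 2).sum(axis=(1, 2))  # squared distance per sample
    return float(d.min())
```

Because only the closest sample contributes, the model is rewarded for spreading samples wider than the true density, which is exactly the dilation bias described above.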
4. Extensions: Auxiliary Losses for Rule Adherence and Multi-Mode Supervision
Classic MTP propagates gradients primarily through the winning hypothesis, leaving non-winning modes only weakly structured. To address shortcomings such as off-road predictions or lack of directionality consistency, recent work applies additional loss terms to all modes. For instance, (Rahimi et al., 2024) introduces trajectory-wide Offroad Loss, Direction Consistency Error, and Mode Diversity Loss:
- Offroad Loss: Penalizes each mode’s predicted positions outside the drivable area using a signed distance function and a safety margin.
- Direction Consistency Error: Hinge-penalizes the heading and displacement from plausible map centerlines.
- Mode Diversity Loss: Promotes pairwise spread among all feasible, on-road modes.
The total loss becomes

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{MTP}} + \lambda_{\mathrm{off}} \mathcal{L}_{\mathrm{offroad}} + \lambda_{\mathrm{dir}} \mathcal{L}_{\mathrm{dir}} + \lambda_{\mathrm{div}} \mathcal{L}_{\mathrm{div}},$$

where $\mathcal{L}_{\mathrm{MTP}}$ is the original MTP objective (winner-takes-all, e.g., minADE or minFDE) and the auxiliary losses are densely averaged over all modes and timesteps. This increases overall mode feasibility and diversity and reduces pathologies (e.g., off-road or wrong-direction errors), typically without degrading canonical accuracy metrics (Rahimi et al., 2024).
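An offroad-style penalty can be sketched with a signed distance function. The `sdf` interface, the `margin` default, and the linear hinge are assumptions standing in for the map machinery of the cited work:

```python
import numpy as np

def offroad_loss(pred_trajs, sdf, margin=0.5):
    """Hypothetical offroad penalty averaged over ALL modes and timesteps.

    pred_trajs: (M, T, 2) predicted positions for every mode.
    sdf: callable mapping (P, 2) points to signed distances to the
         drivable-area boundary, positive outside the road.
    margin: safety margin; positions outside the road, or inside it but
            closer than `margin` to the boundary, incur a linear penalty.
    """
    d = sdf(pred_trajs.reshape(-1, 2))               # (M*T,) signed distances
    return float(np.maximum(d + margin, 0.0).mean())  # hinge on distance + margin
```

Unlike the winner-gated MTP term, this penalty touches every mode, which is what gives the non-winning hypotheses structure.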
(Greer et al., 2020) extends MTP with a lane-heading auxiliary loss "YawLoss" that enforces consistency of each predicted mode with road lane direction, further filtering out implausible (e.g., wrong-way) trajectories. Intersection regions are ignored using masks for further specificity.
5. Implementation Protocols and Hyperparameter Settings
Canonical implementation (see (Hauri et al., 2020) and (Greer et al., 2020)) follows a two-stage protocol for each mini-batch:
- Forward Pass:
- For each input, produce $M$ predicted hypotheses and their softmax probabilities $p_m$.
- Compute per-mode regression (MSE) losses and winner distances.
- Select the winner $m^*$; compute the weights $w_m$.
- Sum classification and regression losses over all modes $m = 1, \dots, M$, possibly including additional auxiliary losses per mode/timestep.
- Backward Pass:
- Combine per-example losses into batch loss, adding regularization.
- Backpropagate and apply optimization step.
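The forward-pass portion of this protocol, batched, can be sketched as follows (the automatic-differentiation backward pass is framework-specific and omitted). Tensor shapes and the `weight_decay`/`params` arguments are illustrative assumptions:

```python
import numpy as np

def batch_mtp_loss(pred, logits, gt, alpha=1.0, eps=0.5,
                   weight_decay=0.0, params=None):
    """Mini-batch MTP objective following the two-stage protocol (sketch).

    pred: (B, M, T, D) hypotheses; logits: (B, M); gt: (B, T, D).
    """
    B, M = pred.shape[:2]
    # Softmax probabilities per example.
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Per-mode regression losses double as winner distances (MSE metric), (B, M).
    reg = ((pred - gt[:, None]) ** 2).mean(axis=(2, 3))
    winner = reg.argmin(axis=1)                        # (B,) winning mode indices
    # Winner-takes-most weights per example.
    w = np.full((B, M), eps / (M - 1) if M > 1 else 0.0)
    w[np.arange(B), winner] = 1.0 - eps
    # Per-example joint loss, averaged into the batch loss.
    loss = (w * (-np.log(p + 1e-12) + alpha * reg)).sum(axis=1).mean()
    # Optional L2 regularization over model parameters.
    if params is not None:
        loss += weight_decay * sum(float((q ** 2).sum()) for q in params)
    return float(loss)
```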
Hyperparameter summary (recommended settings from (Hauri et al., 2020, Greer et al., 2020, Rahimi et al., 2024)):
| Parameter | Typical Value(s) | Comment |
|---|---|---|
| Number of Modes ($M$) | 4–10 (NBA), 6–15 (AV) | Larger for higher modality, at computational cost |
| Relaxation ($\epsilon$) | 0.25–0.75, annealed to 0 | Prevents mode collapse |
| Trade-off ($\alpha$) | – | Equalizes initial term scales |
| Weight Decay ($\lambda$) | – | Standard regularization |
| Offroad/Diversity Weights | $0.1$–$0.5$ | Auxiliary terms, empirically tuned |
| Batch Size | 256–1024 | Hardware-dependent |
| Learning Rate | – (base), – (fine-tune) | |
For auxiliary and diversity losses, weights are set via trade-off curves that optimize auxiliary metrics while preserving canonical accuracy (e.g., minADE ≤ baseline) (Rahimi et al., 2024).
6. Empirical Effects, Limitations, and Future Directions
Standard MTP achieves state-of-the-art accuracy across numerous forecasting domains, with key improvements:
- Diversity: Multiple plausible futures, matching ground-truth variability.
- Trajectory Realism: Player/vehicle behaviors better match context-specific multimodality, e.g., NBA player styles (Hauri et al., 2020).
- Robustness: Extensions with coverage-promoting and rule-based auxiliary losses halve typical offroad error rates and enhance trajectory diversity without loss of minADE/minFDE (Rahimi et al., 2024).
Reported results show, for example, offroad error drops of ≈47% on nuScenes and Argoverse 2, direction error reductions of ≈31%, and significant improvements in mean pairwise trajectory diversity (Rahimi et al., 2024). Synthesizing multiple ground-truth trajectories through Markov chain data augmentation and multimodality matching protocols further amplifies diversity and accuracy benefits (Berlincioni et al., 2020).
Unresolved issues remain:
- Density Calibration: The square-root bias of MoN/MTP prevents calibrated probability estimation; solution strategies rely on post hoc power reweighting or alternative objectives (Thiede et al., 2019).
- Coverage–Accuracy Tradeoff: Increasing $M$ improves coverage but complicates winner assignment and tuning.
- Auxiliary Loss Tuning: Overweighting auxiliary losses can destabilize training or degrade primary task performance; careful staged scheduling and validation are critical (Rahimi et al., 2024).
- Infrastructure Requirements: Marginal/auxiliary losses often require precise map data and geometric functions (e.g., signed distance fields, heading maps) (Rahimi et al., 2024, Greer et al., 2020).
Prospective directions include latent-variable decoders to sidestep explicit mode competition, soft mode assignment, and the integration of context-dependent auxiliary losses for improved rule-adherence and calibration.
7. Historical Development and Application Domains
The MTP framework originates in multi-agent prediction settings requiring the representation of multi-modal futures conditioned on ambiguous partial observations. Early state-of-the-art LSTM and CNN architectures incorporated MTP for human (NBA player) dynamics (Hauri et al., 2020).
Subsequent work in autonomous vehicles generalized MTP to map- and rule-aware frameworks (MultiPath, LaneGCN, etc.), leading to widespread adoption on the nuScenes, Argoverse, and KITTI benchmarks (Greer et al., 2020, Rahimi et al., 2024, Berlincioni et al., 2020). The method has also inspired parallel probabilistic approaches, including GAN-based and normalizing-flow architectures, that incorporate MTP or MoN losses as a core optimization principle.
The MTP paradigm remains a fundamental approach for scalable, multi-hypothesis, context-sensitive future prediction, continually extended for improved feasibility, diversity, and probability calibration.