Soft-Token Trajectory Forecasting (SoTra)
- SoTra is a forecasting framework that uses continuous soft tokens for uncertainty-aware, autoregressive prediction in safety-critical applications.
- It employs a two-stage training process (teacher-forced pre-training followed by a soft-token unroll) to mitigate exposure bias, complemented by risk-aware decoding that minimizes asymmetric clinical risk.
- Experimental results on glucose and blood pressure datasets demonstrate significant reductions in forecast risk and improved calibration for clinical decision support.
Soft-Token Trajectory Forecasting (SoTra) is a framework for uncertainty-aware, autoregressive time series prediction that addresses exposure bias by propagating continuous distributional representations throughout the forecasting trajectory. SoTra is particularly motivated by safety-critical healthcare domains, such as predictive control in diabetes and hemodynamic management, where correct handling of forecast uncertainty and clinical risk asymmetry is paramount. The approach introduces “soft tokens”—continuous probability-weighted embeddings—combined with a two-stage curriculum and post-hoc risk-aware decoding to generate calibrated, risk-minimizing forecasts across multiple steps (Namazi et al., 10 Dec 2025).
1. Soft Tokens and Distributional Propagation
Standard autoregressive sequence models, such as decoder-only Transformers, typically operate on discrete “hard” tokens: the predicted class at each time step is either the maximum-likelihood index or a stochastic sample from the token distribution. In contrast, SoTra introduces “soft tokens.” At each forecasting step $t$, instead of feeding a sampled token into the next step, SoTra computes a probability-weighted embedding:

$$\tilde{e}_t = E^\top p_t = \sum_{i=1}^{V} p_t(i)\, E_i,$$

where $E \in \mathbb{R}^{V \times d}$ is the token embedding matrix and $p_t$ is the model's output distribution over the $V$ token bins. This soft embedding is a differentiable function of the model output, so gradients propagate seamlessly across the entire multi-step trajectory. This construction eliminates the non-differentiability imposed by discrete sampling or argmax operations and enables fully differentiable autoregressive training unrolled across multiple steps.
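As a concrete illustration, the soft-token computation reduces to a single matrix-vector product. The sketch below (all names and shapes are illustrative, not from the paper) contrasts it with hard argmax selection:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                      # bin (vocabulary) count and embedding width
E = rng.normal(size=(V, d))      # token embedding matrix
logits = rng.normal(size=V)
p = np.exp(logits - logits.max())
p /= p.sum()                     # softmax output distribution over bins

# Hard token: embed the argmax index (non-differentiable selection).
hard = E[np.argmax(p)]

# Soft token: probability-weighted mixture of all embeddings, E^T p.
# This is a smooth function of the logits, so gradients can flow through it.
soft = E.T @ p
```

Note that a one-hot distribution recovers the corresponding hard embedding exactly, so soft tokens strictly generalize hard-token feeding.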
2. Exposure Bias and Curriculum Training
Exposure bias arises in autoregressive models trained with teacher forcing: they are exposed to ground-truth history during training but must rely on their own predictions at test time. In clinical forecasting, compounding deviations can yield unstable or unsafe output sequences. SoTra addresses this with a two-stage training process:
- Next-token Pre-training (Teacher Forcing): Standard next-token prediction using cross-entropy loss, where ground-truth tokens are available as inputs.
- Trajectory Fine-tuning (Soft-token Unroll): Multi-step trajectory rollouts without teacher forcing, but using soft tokens. At each time step, the predicted distribution’s embedding feeds forward, and cross-entropy is accumulated across all forecast steps. This trajectory-level objective remains differentiable due to soft-token propagation.
This two-stage curriculum facilitates initial learning stability and then trains the model to operate in its own distributional prediction regime.
3. Risk-Aware Decoding
Many clinical control applications exhibit asymmetric risk: misprediction in some zones (e.g., hypoglycemic or hypertensive regimes) is costlier than in others. Traditional loss functions such as MSE cannot encode these domain-specific priorities. SoTra decouples the predictive modeling phase from downstream utility by:
- Defining disjoint zones $Z_1, \dots, Z_K$ in the $(y, \hat{y})$ plane of true versus predicted values, where each zone $Z_k$ has a weight $w_k$ corresponding to its clinical risk.
- Introducing a zone-based risk function:

$$R(y, \hat{y}) = \sum_{k=1}^{K} w_k \, \mathbf{1}\big[(y, \hat{y}) \in Z_k\big].$$

- At inference, the risk-aware decoder returns the value that minimizes the expected sum of zone-based risk and MSE over the forecast distribution:

$$\hat{y}^{*} = \arg\min_{\hat{y}} \; \mathbb{E}_{b \sim p_t}\Big[\lambda\, R\big(v(b), \hat{y}\big) + \big(v(b) - \hat{y}\big)^2\Big],$$

where $v(b)$ gives the real-valued bin center of bin $b$ and $\lambda$ balances risk and MSE.
By applying this procedure only at decode time, SoTra maintains well-calibrated probabilistic forecasts while allowing maximal flexibility in downstream control.
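A minimal sketch of this decode-time procedure follows; the two-zone layout and weights are invented for illustration (the paper uses clinical error grids such as Clarke's), and the search is a brute-force scan over bin centers:

```python
import numpy as np

def zone_risk(y_true, y_pred):
    """Toy asymmetric zone risk: missing a low (hypo) value is costliest."""
    if y_true < 70 and y_pred >= 70:
        return 10.0   # dangerous miss of a hypoglycemic value
    if y_true >= 70 and y_pred < 70:
        return 4.0    # false alarm
    return 0.0

def risk_aware_decode(p, bin_centers, lam=1.0):
    """Return the candidate minimizing E_p[lam * risk + squared error]."""
    best_val, best_cost = None, np.inf
    for cand in bin_centers:                      # candidate point forecasts
        cost = sum(prob * (lam * zone_risk(v, cand) + (v - cand) ** 2)
                   for prob, v in zip(p, bin_centers))
        if cost < best_cost:
            best_val, best_cost = cand, cost
    return best_val

bin_centers = np.array([50.0, 65.0, 80.0, 95.0, 110.0])
p = np.array([0.15, 0.15, 0.40, 0.20, 0.10])  # some hypoglycemic mass
decoded = risk_aware_decode(p, bin_centers, lam=1.0)
```

With `lam=0` the decoder reduces to the MSE-optimal point forecast; as `lam` grows it shifts the output toward the risk-sensitive region, which is the trade-off the hyperparameter sweeps explore.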
4. Algorithmic Workflow
The SoTra framework is implemented using the following high-level training regime:
- Initialization: Discretize continuous targets into bins; supervise with cross-entropy loss.
- Stage 1: Train autoregressively with teacher forcing over observed history.
- Stage 2: Fine-tune unrolled multi-step sequences using soft-token embeddings as the recurrent input, enabling gradient computation through the entire prediction horizon.
- Decoding: At inference, apply the risk-aware decoder to translate forecasted distributions into point estimates minimizing domain-specific risk.
Pseudocode for the two-stage training incorporates cross-entropy accumulation across the full trajectory and avoids sampling or hard token selection at every step, enabling stable and efficient optimization.
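The accumulation described above can be sketched as forward passes of the two objectives. The toy linear "model" and all shapes below are illustrative assumptions, not the paper's architecture; the point is the difference in what each step consumes:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, T = 6, 3, 4                     # bins, embedding dim, forecast steps
E = rng.normal(size=(V, d))           # token embedding matrix
W = rng.normal(size=(d, V))           # toy next-step model: logits = h @ W

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ce(p, y):
    return -np.log(p[y] + 1e-12)

targets = rng.integers(0, V, size=T)  # ground-truth bin indices
h0 = rng.normal(size=d)               # context summary of observed history

# Stage 1 (teacher forcing): each step consumes the ground-truth embedding.
loss_tf, h = 0.0, h0
for t in range(T):
    p = softmax(h @ W)
    loss_tf += ce(p, targets[t])
    h = E[targets[t]]                 # feed the true token's embedding

# Stage 2 (soft-token unroll): each step consumes E^T p from the previous
# step, so the accumulated loss is differentiable through the whole rollout.
loss_soft, h = 0.0, h0
for t in range(T):
    p = softmax(h @ W)
    loss_soft += ce(p, targets[t])
    h = E.T @ p                       # feed the soft token forward
```

In a real implementation both loops would run under an autodiff framework so that the stage-2 loss backpropagates through every unrolled step; no sampling or argmax appears anywhere in the rollout.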
5. Experimental Evaluation
SoTra has been evaluated on clinical forecasting tasks, including:
- Glucose (DCLP3, PEDAP): 24-hour history, 0.5–4 hour prediction horizons.
- Blood Pressure (MBP, SBP): 30-minute history, 1–8 minute horizons.
Key metrics include zone-based clinical risk (Clarke Error Grid for glucose, Saugel's error grid for BP), risky forecast percentage, RMSE, and probabilistic calibration via CRPS.
Summary of quantitative findings:
| Dataset | SoTra Risk | Best-Baseline Risk | % Reduction |
|---|---|---|---|
| DCLP3 | 0.583 | 0.712 (PatchTST) | 18% |
| PEDAP | 0.791 | 0.964 (iTrans) | 18% |
| MBP | 1.103 | 1.255 (best) | 12% |
| SBP | 1.070 | 1.251 (PatchTST) | 15% |
SoTra reduces average zone-based risk by approximately 18% for glucose and 12–15% for blood pressure forecasting compared to state-of-the-art baselines. Reductions in risky forecasts (zones C–E) are up to 32% for glucose and 24% for blood pressure. RMSE remains within 4.7% of the best MSE-optimized models.
Ablation studies show that both soft-token trajectory training and risk-aware decoding are necessary for obtaining the full gains; removing either leads to deteriorated risk or RMSE performance. Hyperparameter sweeps reveal a trade-off between decreased clinical risk and increased RMSE as the risk weight $\lambda$ increases.
6. Uncertainty Calibration and Model Predictive Control
Calibration is evaluated using the Continuous Ranked Probability Score (CRPS) and reliability diagrams. SoTra achieves the lowest CRPS values across all datasets (e.g., 15.83 for DCLP3, compared to 17.08 for Chronos), confirming that soft-token trajectory training yields better-calibrated uncertainty than hard-sampled or non-trajectory baselines. Reliable uncertainty estimates allow model predictive control (MPC) systems to use zone-aware probabilistic forecasts to anticipate and mitigate clinically adverse events, enhancing the safety of closed-loop control strategies.
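For a binned forecast distribution, CRPS can be approximated directly on the bin grid as the integrated squared difference between the forecast CDF and the observation's step function. The grid and the two example distributions below are illustrative, not drawn from the paper:

```python
import numpy as np

def crps_binned(p, bin_centers, y_obs):
    """CRPS ~= integral of (F(x) - 1[x >= y_obs])^2 dx over the bin grid."""
    cdf = np.cumsum(p)                               # forecast CDF
    step = np.where(bin_centers >= y_obs, 1.0, 0.0)  # observation step CDF
    widths = np.gradient(bin_centers)                # per-bin widths
    return float(np.sum((cdf - step) ** 2 * widths))

centers = np.linspace(40.0, 200.0, 33)   # e.g. glucose bins in mg/dL
sharp = np.zeros(33); sharp[16] = 1.0    # all mass on the true bin
diffuse = np.full(33, 1.0 / 33)          # uninformative uniform forecast
y = centers[16]
```

A perfectly sharp, correct forecast scores zero, and a diffuse forecast scores higher, which is why lower CRPS indicates both sharper and better-calibrated predictive distributions.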
7. Contributions and Domain Impact
SoTra’s primary contributions are the integration of distributional soft tokens into autoregressive unrolling to overcome exposure bias, and the separation of calibrated probabilistic forecasting from risk-minimizing decoding aligned with real-world clinical priorities. In the clinical forecasting context, SoTra produces trajectories that (i) are less prone to compounding error over prediction horizons, (ii) maintain competitive conventional accuracy, and (iii) substantially reduce high-risk errors. These properties make it applicable to domains where reliable multi-step forecasting under distribution shift and risk asymmetry is critical (Namazi et al., 10 Dec 2025).