Soft-Token Trajectory Forecasting (SoTra)
- SoTra is a forecasting framework that uses continuous soft tokens for uncertainty-aware, autoregressive prediction in safety-critical applications.
- It employs a two-stage training process (teacher-forced pre-training followed by a soft-token unroll) to mitigate exposure bias, complemented by risk-aware decoding that minimizes asymmetric clinical risk.
- Experimental results on glucose and blood pressure datasets demonstrate significant reductions in forecast risk and improved calibration for clinical decision support.
Soft-Token Trajectory Forecasting (SoTra) is a framework for uncertainty-aware, autoregressive time series prediction that addresses exposure bias by propagating continuous distributional representations throughout the forecasting trajectory. SoTra is particularly motivated by safety-critical healthcare domains, such as predictive control in diabetes and hemodynamic management, where correct handling of forecast uncertainty and clinical risk asymmetry is paramount. The approach introduces “soft tokens”—continuous probability-weighted embeddings—combined with a two-stage curriculum and post-hoc risk-aware decoding to generate calibrated, risk-minimizing forecasts across multiple steps (Namazi et al., 10 Dec 2025).
1. Soft Tokens and Distributional Propagation
Standard autoregressive sequence models, such as decoder-only Transformers, typically operate on discrete “hard” tokens: the predicted class at each time step is either the maximum-likelihood index or a stochastic sample from the token distribution. In contrast, SoTra introduces “soft tokens.” At each forecasting step $t$, instead of feeding a sampled token into the next step, SoTra computes a probability-weighted embedding:

$$\tilde{e}_t = E^\top p_t = \sum_{i=1}^{V} p_t(i)\, E_i,$$

where $E \in \mathbb{R}^{V \times d}$ is the token embedding matrix and $p_t$ is the model's output distribution over the $V$ token bins. This soft embedding is a differentiable function of the model output, so gradients propagate seamlessly across the entire multi-step trajectory. This construction eliminates the non-differentiability imposed by discrete sampling or argmax operations and enables fully differentiable autoregressive training unrolled across multiple steps.
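As a concrete illustration, the soft-token computation reduces to a single matrix-vector product. The sketch below (all names and shapes are illustrative, not from the paper) contrasts it with hard argmax selection:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                      # bin (vocabulary) count and embedding width
E = rng.normal(size=(V, d))      # token embedding matrix
logits = rng.normal(size=V)
p = np.exp(logits - logits.max())
p /= p.sum()                     # softmax output distribution over bins

# Hard token: embed the argmax index (non-differentiable selection).
hard = E[np.argmax(p)]

# Soft token: probability-weighted mixture of all embeddings, E^T p.
# This is a smooth function of the logits, so gradients can flow through it.
soft = E.T @ p
```

Note that a one-hot distribution recovers the corresponding hard embedding exactly, so soft tokens strictly generalize hard-token feeding.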
2. Exposure Bias and Curriculum Training
Exposure bias arises in autoregressive models trained with teacher forcing: they are exposed to ground-truth history during training but must rely on their own predictions at test time. In clinical forecasting, compounding deviations can yield unstable or unsafe output sequences. SoTra addresses this with a two-stage training process:
- Next-token Pre-training (Teacher Forcing): Standard next-token prediction using cross-entropy loss, where ground-truth tokens are available as inputs.
- Trajectory Fine-tuning (Soft-token Unroll): Multi-step trajectory rollouts without teacher forcing, but using soft tokens. At each time step, the predicted distribution’s embedding feeds forward, and cross-entropy is accumulated across all forecast steps. This trajectory-level objective remains differentiable due to soft-token propagation.
This two-stage curriculum facilitates initial learning stability and then trains the model to operate in its own distributional prediction regime.
3. Risk-Aware Decoding
Many clinical control applications exhibit asymmetric risk: misprediction in some zones (e.g., hypoglycemic or hypertensive regimes) is costlier than in others. Traditional loss functions such as MSE cannot encode these domain-specific priorities. SoTra decouples the predictive modeling phase from downstream utility by:
- Defining disjoint zones $Z_1, \dots, Z_K$ in the $(y, \hat{y})$ plane of true versus predicted values, where each zone $Z_k$ has a weight $w_k$ corresponding to its clinical risk.
- Introducing a zone-based risk function:

$$R(y, \hat{y}) = \sum_{k=1}^{K} w_k \, \mathbf{1}\big[(y, \hat{y}) \in Z_k\big].$$

- At inference, the risk-aware decoder returns the value that minimizes the expected sum of zone-based risk and MSE over the forecast distribution:

$$\hat{y}^{*} = \arg\min_{\hat{y}} \; \mathbb{E}_{b \sim p_t}\Big[\lambda\, R\big(v(b), \hat{y}\big) + \big(v(b) - \hat{y}\big)^2\Big],$$

where $v(b)$ gives the real-valued bin center of bin $b$ and $\lambda$ balances risk and MSE.
By applying this procedure only at decode time, SoTra maintains well-calibrated probabilistic forecasts while allowing maximal flexibility in downstream control.
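A minimal sketch of this decode-time procedure follows; the two-zone layout and weights are invented for illustration (the paper uses clinical error grids such as Clarke's), and the search is a brute-force scan over bin centers:

```python
import numpy as np

def zone_risk(y_true, y_pred):
    """Toy asymmetric zone risk: missing a low (hypo) value is costliest."""
    if y_true < 70 and y_pred >= 70:
        return 10.0   # dangerous miss of a hypoglycemic value
    if y_true >= 70 and y_pred < 70:
        return 4.0    # false alarm
    return 0.0

def risk_aware_decode(p, bin_centers, lam=1.0):
    """Return the candidate minimizing E_p[lam * risk + squared error]."""
    best_val, best_cost = None, np.inf
    for cand in bin_centers:                      # candidate point forecasts
        cost = sum(prob * (lam * zone_risk(v, cand) + (v - cand) ** 2)
                   for prob, v in zip(p, bin_centers))
        if cost < best_cost:
            best_val, best_cost = cand, cost
    return best_val

bin_centers = np.array([50.0, 65.0, 80.0, 95.0, 110.0])
p = np.array([0.15, 0.15, 0.40, 0.20, 0.10])  # some hypoglycemic mass
decoded = risk_aware_decode(p, bin_centers, lam=1.0)
```

With `lam=0` the decoder reduces to the MSE-optimal point forecast; as `lam` grows it shifts the output toward the risk-sensitive region, which is the trade-off the hyperparameter sweeps explore.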
4. Algorithmic Workflow
The SoTra framework is implemented using the following high-level training regime:
- Initialization: Discretize continuous targets into bins; supervise with cross-entropy loss.
- Stage 1: Train autoregressively with teacher forcing over observed history.
- Stage 2: Fine-tune unrolled multi-step sequences using soft-token embeddings as the recurrent input, enabling gradient computation through the entire prediction horizon.
- Decoding: At inference, apply the risk-aware decoder to translate forecasted distributions into point estimates minimizing domain-specific risk.
Pseudocode for the two-stage training incorporates cross-entropy accumulation across the full trajectory and avoids sampling or hard token selection at every step, enabling stable and efficient optimization.
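The accumulation described above can be sketched as forward passes of the two objectives. The toy linear "model" and all shapes below are illustrative assumptions, not the paper's architecture; the point is the difference in what each step consumes:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, T = 6, 3, 4                     # bins, embedding dim, forecast steps
E = rng.normal(size=(V, d))           # token embedding matrix
W = rng.normal(size=(d, V))           # toy next-step model: logits = h @ W

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ce(p, y):
    return -np.log(p[y] + 1e-12)

targets = rng.integers(0, V, size=T)  # ground-truth bin indices
h0 = rng.normal(size=d)               # context summary of observed history

# Stage 1 (teacher forcing): each step consumes the ground-truth embedding.
loss_tf, h = 0.0, h0
for t in range(T):
    p = softmax(h @ W)
    loss_tf += ce(p, targets[t])
    h = E[targets[t]]                 # feed the true token's embedding

# Stage 2 (soft-token unroll): each step consumes E^T p from the previous
# step, so the accumulated loss is differentiable through the whole rollout.
loss_soft, h = 0.0, h0
for t in range(T):
    p = softmax(h @ W)
    loss_soft += ce(p, targets[t])
    h = E.T @ p                       # feed the soft token forward
```

In a real implementation both loops would run under an autodiff framework so that the stage-2 loss backpropagates through every unrolled step; no sampling or argmax appears anywhere in the rollout.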
5. Experimental Evaluation
SoTra has been evaluated on clinical forecasting tasks, including:
- Glucose (DCLP3, PEDAP): 24-hour history, 0.5–4 hour prediction horizons.
- Blood Pressure (MBP, SBP): 30-minute history, 1–8 minute horizons.
Key metrics include zone-based clinical risk (Clarke Error Grid for glucose, Saugel's error grid for BP), risky forecast percentage, RMSE, and probabilistic calibration via CRPS.
Summary of quantitative findings:
| Dataset | SoTra Risk | Best-Baseline Risk | % Reduction |
|---|---|---|---|
| DCLP3 | 0.583 | 0.712 (PatchTST) | 18% |
| PEDAP | 0.791 | 0.964 (iTrans) | 18% |
| MBP | 1.103 | 1.255 (best) | 12% |
| SBP | 1.070 | 1.251 (PatchTST) | 15% |
SoTra reduces average zone-based risk by approximately 18% for glucose and 12–15% for blood pressure forecasting compared to state-of-the-art baselines. Reductions in risky forecasts (zones C–E) are up to 32% for glucose and 24% for blood pressure. RMSE remains within 4.7% of the best MSE-optimized models.
Ablation studies show that both soft-token trajectory training and risk-aware decoding are necessary for obtaining the full gains; removing either leads to deteriorated risk or RMSE performance. Hyperparameter sweeps reveal a trade-off between decreased clinical risk and increased RMSE as the risk weight $\lambda$ increases.
6. Uncertainty Calibration and Model Predictive Control
Calibration is evaluated using the Continuous Ranked Probability Score (CRPS) and reliability diagrams. SoTra achieves the lowest CRPS values across all datasets (e.g., 15.83 for DCLP3, compared to 17.08 for Chronos), confirming that soft-token trajectory training yields better-calibrated uncertainty than hard-sampled or non-trajectory baselines. Reliable uncertainty estimates allow model predictive control (MPC) systems to use zone-aware probabilistic forecasts to anticipate and mitigate clinically adverse events, enhancing the safety of closed-loop control strategies.
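For a binned forecast distribution, CRPS can be approximated directly on the bin grid as the integrated squared difference between the forecast CDF and the observation's step function. The grid and the two example distributions below are illustrative, not drawn from the paper:

```python
import numpy as np

def crps_binned(p, bin_centers, y_obs):
    """CRPS ~= integral of (F(x) - 1[x >= y_obs])^2 dx over the bin grid."""
    cdf = np.cumsum(p)                               # forecast CDF
    step = np.where(bin_centers >= y_obs, 1.0, 0.0)  # observation step CDF
    widths = np.gradient(bin_centers)                # per-bin widths
    return float(np.sum((cdf - step) ** 2 * widths))

centers = np.linspace(40.0, 200.0, 33)   # e.g. glucose bins in mg/dL
sharp = np.zeros(33); sharp[16] = 1.0    # all mass on the true bin
diffuse = np.full(33, 1.0 / 33)          # uninformative uniform forecast
y = centers[16]
```

A perfectly sharp, correct forecast scores zero, and a diffuse forecast scores higher, which is why lower CRPS indicates both sharper and better-calibrated predictive distributions.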
7. Contributions and Domain Impact
SoTra’s primary contributions are the integration of distributional soft tokens into autoregressive unrolling to overcome exposure bias, and the separation of calibrated probabilistic forecasting from risk-minimizing decoding aligned with real-world clinical priorities. In the clinical forecasting context, SoTra produces trajectories that (i) are less prone to compounding error over prediction horizons, (ii) maintain competitive conventional accuracy, and (iii) substantially reduce high-risk errors. These properties make it applicable to domains where reliable multi-step forecasting under distribution shift and risk asymmetry is critical (Namazi et al., 10 Dec 2025).