DILATE: Loss for Shape & Time in Forecasting

Updated 9 January 2026

The paper introduces DILATE with dual loss terms that separately optimize waveform shape fidelity and temporal precision.
It implements a soft-DTW shape term and a soft-TDI temporal term using dynamic programming for efficient, end-to-end gradient-based training.
Empirical results show significant MAE reductions over MSE and classic DTW, validating its effectiveness in diverse forecasting applications.

DILATE (Distortion Loss Incorporating Shape and Time) is a differentiable training objective designed for deep models in multi-step time series forecasting, with explicit sensitivity to both waveform shape and event timing. Unlike conventional losses such as Mean Squared Error (MSE) or classic Dynamic Time Warping (DTW), DILATE disentangles amplitude/shape fidelity and temporal precision via two smoothly differentiable terms. Both terms are evaluated with dynamic programming, which keeps the loss efficient to compute and allows end-to-end gradient-based training in complex neural architectures. Empirical evaluations demonstrate that DILATE substantially improves the detection and prediction of sudden or sharp changes in diverse domains such as physiological waveform modeling and traffic or electricity forecasting (Guen et al., 2021, Guen et al., 2019, Sahoo et al., 2 Jan 2026).

1. Mathematical Formulation and Key Components

The DILATE loss computes a weighted sum of two distortion terms over a predicted sequence $\hat y$ and an observed sequence $y$ of equal length, parameterized by a tradeoff scalar $\alpha \in [0,1]$ and a soft-min smoothing parameter $\gamma > 0$:

$\mathcal{L}_{\mathrm{DILATE}}(\hat y, y) = \alpha\,\mathcal{L}_{\mathrm{shape}}(\hat y, y) + (1-\alpha)\,\mathcal{L}_{\mathrm{temporal}}(\hat y, y)$

Shape term (soft-DTW): Measures morphological similarity via a log-sum-exp relaxation of classic DTW. For $\Delta_{ij} = \lVert \hat y_i - y_j \rVert^2$ (or $|\hat y_i - y_j|$ in (Sahoo et al., 2 Jan 2026)), soft-DTW is:

$\widetilde{\mathrm{DTW}}_\gamma(\hat y, y) = -\gamma\,\log\left[\sum_{A \in \mathcal{A}} \exp\left(-\langle A, \Delta \rangle / \gamma\right)\right]$

where $\mathcal{A}$ denotes the set of monotonic alignment matrices and $\langle A, \Delta \rangle$ is the total cost of the warping path encoded by $A$. As $\gamma \to 0^+$, soft-DTW converges to classic DTW.
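The soft-DTW definition above can be evaluated without enumerating $\mathcal{A}$ by replacing the hard minimum of the classic DTW recurrence with a soft minimum. The following is a minimal pure-Python sketch under stated assumptions: scalar-valued sequences, the squared-difference cost $\Delta_{ij}$, and illustrative function names (`softmin`, `dtw`, `soft_dtw`) that are not taken from the cited implementations.

```python
import math

def softmin(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma)))."""
    m = min(values)  # shift by the minimum for numerical stability
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))

def _alignment_cost(pred, target, reduce):
    """Shared DTW recurrence; `reduce` is either min or a soft minimum."""
    n, m = len(pred), len(target)
    inf = float("inf")
    # R[i][j]: (soft) cost of aligning pred[:i] with target[:j]
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (pred[i - 1] - target[j - 1]) ** 2  # Delta_ij
            R[i][j] = cost + reduce([R[i - 1][j - 1], R[i - 1][j], R[i][j - 1]])
    return R[n][m]

def dtw(pred, target):
    """Classic DTW (hard minimum)."""
    return _alignment_cost(pred, target, min)

def soft_dtw(pred, target, gamma):
    """Soft-DTW with smoothing parameter gamma > 0."""
    return _alignment_cost(pred, target, lambda v: softmin(v, gamma))

if __name__ == "__main__":
    pred = [0.0, 1.0, 2.0, 1.0, 0.0]
    target = [0.0, 0.0, 1.0, 2.0, 1.0]
    for gamma in (1.0, 0.1, 0.001):
        print(gamma, soft_dtw(pred, target, gamma))
    print("classic DTW:", dtw(pred, target))
```

Because the soft minimum never exceeds the hard minimum, the soft-DTW value lies at or below classic DTW and approaches it as `gamma` shrinks, which is the limit stated above.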

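Temporal term (soft-TDI): Assesses alignment by penalizing off-diagonal associations, measuring timing errors. Using a penalty matrix $\Omega$ that grows with the index offset (in Guen et al., 2019, the squared offset $\Omega_{ij} = (i-j)^2 / n^2$ for sequences of length $n$):

$\mathcal{L}_{\mathrm{temporal}}(\hat y, y) = \langle A^*_\gamma, \Omega \rangle = \frac{1}{Z} \sum_{A \in \mathcal{A}} \langle A, \Omega \rangle \exp\left(-\langle A, \Delta \rangle / \gamma\right)$

with $Z = \sum_{A \in \mathcal{A}} \exp\left(-\langle A, \Delta \rangle / \gamma\right)$ and $A^*_\gamma = \nabla_\Delta \widetilde{\mathrm{DTW}}_\gamma(\hat y, y)$, the expected soft alignment under the Gibbs measure $p_\gamma(A) \propto \exp\left(-\langle A, \Delta \rangle / \gamma\right)$. This yields differentiability in both model output and input sequences.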
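For very short sequences, the temporal term can be checked directly from its definition as a Gibbs-weighted average over all alignment paths. The sketch below does exactly that by brute-force enumeration. It assumes scalar sequences of equal length, a squared-difference cost, and the normalized squared-offset penalty $(i-j)^2/n^2$; the names `alignment_paths` and `soft_tdi` are illustrative. It is meant as a reference for small cases, not as an efficient implementation.

```python
import math

def alignment_paths(n, m):
    """Yield every monotonic path from (0, 0) to (n-1, m-1) as a list of cells."""
    def extend(path):
        i, j = path[-1]
        if (i, j) == (n - 1, m - 1):
            yield list(path)
            return
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            if i + di < n and j + dj < m:
                path.append((i + di, j + dj))
                yield from extend(path)
                path.pop()
    yield from extend([(0, 0)])

def soft_tdi(pred, target, gamma):
    """Gibbs-weighted average of <A, Omega> over all alignment paths A."""
    n = len(pred)
    assert len(target) == n, "this sketch assumes equal-length sequences"
    costs, penalties = [], []
    for path in alignment_paths(n, n):
        # <A, Delta>: alignment cost of the path
        costs.append(sum((pred[i] - target[j]) ** 2 for i, j in path))
        # <A, Omega>: temporal penalty of the path (assumed squared offset)
        penalties.append(sum((i - j) ** 2 for i, j in path) / n ** 2)
    c_min = min(costs)  # shift for numerical stability
    weights = [math.exp(-(c - c_min) / gamma) for c in costs]
    z = sum(weights)
    return sum(w * p for w, p in zip(weights, penalties)) / z

if __name__ == "__main__":
    target = [0.0, 0.0, 1.0, 0.0, 0.0]
    late = [0.0, 0.0, 0.0, 1.0, 0.0]  # same peak, one step late
    print("aligned:", soft_tdi(target, target, 0.01))
    print("shifted:", soft_tdi(late, target, 0.01))
```

The shifted prediction has the same shape as the target, so a shape-only criterion would barely distinguish the two cases, while the temporal term assigns the shifted one a larger penalty.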

2. Computational Implementation and Differentiation

Both shape and temporal components leverage dynamic programming for efficient evaluation and gradient computation:

Forward DP: Constructs recurrence matrices to compute soft-DTW in $O(n^2)$ time and space for sequences of length $n$, avoiding enumeration of the exponentially many alignment paths in $\mathcal{A}$.
Backward DP: Derivatives with respect to $\Delta$ are computed via reverse dynamic programming. The temporal loss multiplies the soft alignment $A^*_\gamma = \nabla_\Delta \widetilde{\mathrm{DTW}}_\gamma$ with the penalty matrix $\Omega$ and backpropagates through the Hessian of $\widetilde{\mathrm{DTW}}_\gamma$ (Guen et al., 2019).
PyTorch integration: Custom autograd functions reuse the intermediate DP tables in both the forward and backward passes, yielding a speed-up over naïve autograd (Guen et al., 2019, Guen et al., 2021).
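The forward and backward passes described above can be sketched in plain Python. The code below uses the standard soft-DTW backward recursion to obtain the soft alignment matrix (the gradient of soft-DTW with respect to the cost matrix) and then forms the temporal term as its inner product with a penalty matrix. The function names and the squared-offset penalty are assumptions for illustration. The sketch covers only the forward value and first-order alignment; it does not implement the Hessian-based backward pass of the temporal term or any PyTorch autograd integration.

```python
import math

def softmin3(a, b, c, gamma):
    """Soft minimum of three values, shifted for numerical stability."""
    m = min(a, b, c)
    s = math.exp(-(a - m) / gamma) + math.exp(-(b - m) / gamma) + math.exp(-(c - m) / gamma)
    return m - gamma * math.log(s)

def soft_alignment(D, gamma):
    """Return (soft-DTW value, soft alignment matrix E) for cost matrix D."""
    n, m = len(D), len(D[0])
    inf = float("inf")
    # Forward DP: R[i][j] is the soft cost of reaching cell (i, j), 1-based.
    R = [[inf] * (m + 2) for _ in range(n + 2)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = D[i - 1][j - 1] + softmin3(
                R[i - 1][j - 1], R[i - 1][j], R[i][j - 1], gamma)
    value = R[n][m]
    # Padding for the backward pass: moves into the border get zero weight.
    for i in range(n + 2):
        R[i][m + 1] = -inf
    for j in range(m + 2):
        R[n + 1][j] = -inf
    R[n + 1][m + 1] = value
    E = [[0.0] * (m + 2) for _ in range(n + 2)]
    E[n + 1][m + 1] = 1.0

    def cost(i, j):  # 1-based lookup, zero on the padding
        return D[i - 1][j - 1] if i <= n and j <= m else 0.0

    # Backward DP: E[i][j] accumulates the weight of paths through (i, j).
    for j in range(m, 0, -1):
        for i in range(n, 0, -1):
            a = math.exp((R[i + 1][j] - R[i][j] - cost(i + 1, j)) / gamma)
            b = math.exp((R[i][j + 1] - R[i][j] - cost(i, j + 1)) / gamma)
            c = math.exp((R[i + 1][j + 1] - R[i][j] - cost(i + 1, j + 1)) / gamma)
            E[i][j] = E[i + 1][j] * a + E[i][j + 1] * b + E[i + 1][j + 1] * c
    return value, [row[1:m + 1] for row in E[1:n + 1]]

def soft_tdi(pred, target, gamma):
    """Temporal term <E, Omega> with an assumed penalty (i - j)^2 / n^2."""
    n = len(pred)
    D = [[(p - t) ** 2 for t in target] for p in pred]
    _, E = soft_alignment(D, gamma)
    return sum(E[i][j] * (i - j) ** 2 for i in range(n) for j in range(len(target))) / n ** 2
```

Both passes visit each cell of the table once, so the cost stays quadratic in the sequence length. A finite-difference check on one entry of `D` is a convenient way to confirm that `E` matches the gradient of the soft-DTW value.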

3. Parameterization and Hyperparameter Selection

DILATE introduces the following tunable elements for practical deployment:

Parameter	Purpose	Typical Values
$\alpha \in [0,1]$ 4	Shape/timing balance in total loss	$\alpha \in [0,1]$ 5
$\alpha \in [0,1]$ 6	Smoothing in soft-minimum	$\alpha \in [0,1]$ 7 to $\alpha \in [0,1]$ 8
$\alpha \in [0,1]$ 9	Temporal penalty matrix	$\gamma > 0$ 0, $\gamma > 0$ 1
$\gamma > 0$ 2	KD vs DILATE weighting (Sahoo et al., 2 Jan 2026)	$\gamma > 0$ 3, $\gamma > 0$ 4

Cross-validation or ablation can identify optimal hyperparameters. For KDPhys (rPPG), the best MAE was achieved at a mid-range $\alpha$ (Sahoo et al., 2 Jan 2026). For generic forecasting and biomedical signals, a similar mid-range $\alpha$ achieves robust shape and event detection.

4. Domain-Specific Roles and Significance

The shape term in DILATE enforces morphological fidelity, preserving oscillations, amplitudes, and salient event structure (e.g., systolic ramp in rPPG). The temporal term penalizes horizontal event shifts, ensuring sharp event localization—a property vital for domains where precise timing of peaks, drops, or change-points is critical.

For rPPG extraction, DILATE ensures predicted pulse cycles align in both morphology and event timing, outperforming MSE or frequency-domain losses that cannot separate these criteria (Sahoo et al., 2 Jan 2026). In traffic or ECG forecasting, DILATE detects sudden transitions and matches ground-truth event positions better than alternatives (Guen et al., 2019, Guen et al., 2021).

5. Comparative Evaluation and Ablation Results

Empirical studies confirm DILATE's advantages. On standard benchmarks:

Loss	MAE/DTW/TDI Improvement	Case
DILATE	MAE reduction by $\gamma > 0$ 8	KDPhys rPPG (Sahoo et al., 2 Jan 2026)
DILATE vs MSE	DTW down $\gamma > 0$ 9, TDI $\mathcal{L}_{\mathrm{DILATE}}(\hat y, y) = \alpha\,\mathcal{L}_{\mathrm{shape}}(\hat y, y) + (1-\alpha)\,\mathcal{L}_{\mathrm{temporal}}(\hat y, y)$ 0	ECG, traffic, electricity (Guen et al., 2021)
DILATE vs TD+FD	$\mathcal{L}_{\mathrm{DILATE}}(\hat y, y) = \alpha\,\mathcal{L}_{\mathrm{shape}}(\hat y, y) + (1-\alpha)\,\mathcal{L}_{\mathrm{temporal}}(\hat y, y)$ 1 lower MAE	UBFC rPPG (Sahoo et al., 2 Jan 2026)

In ablation, a sweep of $\alpha$ in DILATE shows the best MAE and RMSE at central values; shifting towards pure shape or pure timing degrades performance (Sahoo et al., 2 Jan 2026). Separate shape and temporal terms outperform "tangled" DILATE or band-constrained soft-DTW (Guen et al., 2019).

6. Integration into Neural Architectures and Applications

DILATE's differentiability, DP-based efficiency, and agnostic design allow integration with MLPs, RNNs, Seq2Seq, Transformer, and convolutional architectures. No changes to output layers are required—only the loss function is modified (Guen et al., 2019, Guen et al., 2021).

In KDPhys, DILATE is combined with an Attention Feature Distillation term for end-to-end knowledge transfer from 3D CNN teachers to lightweight 2D students (Sahoo et al., 2 Jan 2026). In STRIPE++ (Guen et al., 2021), DILATE-driven kernels enable determinantal point process (DPP) diversity in probabilistic multi-future forecasting, with quality-weighting to encourage sharp and temporally varied forecasts.

DILATE is robust to noise, benefiting from $\gamma$ annealing and gradient clipping strategies for numerically demanding signals.
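As an illustration of the two stabilization strategies mentioned above, the sketch below shows one possible annealing schedule for the soft-min smoothing parameter $\gamma$ together with norm-based gradient clipping. The geometric schedule, the start and end values, and the function names are all assumptions chosen for the example; the cited works may use different schedules and thresholds.

```python
import math

def annealed_gamma(step, total_steps, gamma_start=1.0, gamma_end=0.01):
    """Geometric interpolation from gamma_start to gamma_end (hypothetical values)."""
    t = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return gamma_start * (gamma_end / gamma_start) ** t

def clip_by_norm(grad, max_norm):
    """Rescale a gradient vector so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

if __name__ == "__main__":
    for step in range(0, 10, 3):
        print(step, annealed_gamma(step, 10))
    print(clip_by_norm([3.0, 4.0], 1.0))
```

Starting with a larger $\gamma$ gives a smoother loss surface early in training, and lowering it later brings the loss closer to the hard alignment; clipping limits the effect of the large gradients that small $\gamma$ values can produce.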

7. Limitations and Variants

DILATE's performance depends on correctly balancing shape and time: extreme values of $\alpha$ let one criterion dominate to the detriment of the other. The quadratic time complexity ($O(n^2)$ for sequences of length $n$) is manageable for moderate-length sequences but becomes limiting for very long horizons. Tangled variants that combine shape and time in a single soft-DTW term underperform the separated approach (Guen et al., 2019).
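To put the quadratic cost in perspective, the short sketch below counts the monotonic alignment paths between two sequences of length $n$ and compares that count with the number of table cells the dynamic program visits. The function name is illustrative, and the step set (down, right, diagonal) is the usual DTW convention, assumed here.

```python
def count_alignment_paths(n):
    """Number of monotonic paths from (0, 0) to (n-1, n-1) using
    down, right and diagonal steps."""
    counts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == 0 or j == 0:
                counts[i][j] = 1  # only one way along the first row or column
            else:
                counts[i][j] = (counts[i - 1][j] + counts[i][j - 1]
                                + counts[i - 1][j - 1])
    return counts[n - 1][n - 1]

if __name__ == "__main__":
    for n in (5, 10, 20):
        # path count grows exponentially, DP cell count only quadratically
        print(n, count_alignment_paths(n), n * n)
```

For $n = 5$ there are already 321 paths against 25 table cells, which is why the dynamic program is practical where direct enumeration is not, and also why the remaining $n^2$ term dominates for long horizons.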

For domain-specific applications such as noisy biomedical signals or traffic, asymmetric penalty matrices $\Omega$ or adaptive smoothing may further improve event localization, especially when early or late shifts have different functional interpretations.
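One way to build such an asymmetric penalty is sketched below. It assumes that rows index the prediction and columns index the target, so that a cell with $i > j$ corresponds to a predicted event arriving later than the true one. The weights, the squared-offset base penalty, and the function name are hypothetical choices for illustration.

```python
def asymmetric_penalty(n, late_weight=1.0, early_weight=2.0):
    """Penalty matrix Omega[i][j] for prediction index i and target index j.

    Cells with i > j (prediction late) are scaled by late_weight, cells with
    i < j (prediction early) by early_weight. Weights are hypothetical.
    """
    omega = []
    for i in range(n):
        row = []
        for j in range(n):
            weight = late_weight if i > j else early_weight
            row.append(weight * (i - j) ** 2 / n ** 2)
        omega.append(row)
    return omega

if __name__ == "__main__":
    for row in asymmetric_penalty(4):
        print(["%.3f" % v for v in row])
```

Setting both weights to the same value recovers a symmetric squared-offset penalty, so the asymmetric form can be introduced without changing the rest of the loss.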

DILATE establishes a rigorous, efficiently computable, and highly adaptable framework for deep time series learning with sensitivity to both feature morphology and timing. Its empirical validation across several domains positions it as a strong alternative to conventional loss formulations in forecasting and physiological waveform analysis (Guen et al., 2021, Sahoo et al., 2 Jan 2026, Guen et al., 2019).