Direct Forecasting Paradigm
- Direct Forecasting (DF) is a paradigm that directly maps input time series to a full forecast horizon in a single step, bypassing sequential prediction.
- It encounters optimization challenges due to conflicting gradients between near-term and long-term predictions, leading to underfitting of local dynamics.
- DF is widely used in architectures like MLP and Transformers yet faces limitations in flexibility and sample efficiency compared to evolutionary forecasting.
The Direct Forecasting (DF) paradigm is a foundational approach in long-term time series forecasting (LTSF), characterized by training models to predict the entire target horizon in a single, non-autoregressive forward pass. The methodology, its optimization challenges, and its evolving role in recent research on LTSF are summarized below, drawing primarily from comprehensive analyses and experimental evidence in the recent literature (Ma et al., 30 Jan 2026).
1. Definition and Operational Mechanism
In the DF paradigm, a parametric model is trained such that, given the last L observations of the series, it directly outputs the full H-step prediction for all channels. The entire output window is generated in a single pass; there is no sequential or feedback structure at inference time. The canonical training objective is empirical risk minimization over the target window, typically with mean squared or mean absolute error.
This direct multi-output approach is in contrast to iterative or autoregressive forecasting, which unrolls predictions stepwise, feeding each prediction as the next input. DF achieves high inference efficiency and enables clean information exchange across all steps in the horizon.
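The contrast between the two modes can be made concrete with a minimal sketch. The shapes, variable names (`L`, `H`, `C`), and the shared linear head below are illustrative assumptions, not any specific published model:

```python
import numpy as np

# Minimal sketch (illustrative, not a specific published model):
# a shared linear head maps each channel's L-step look-back window
# to all H future steps in one forward pass.
L, H, C = 96, 24, 3                      # look-back, horizon, channels
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(L, H))  # direct-mapping weights
b = np.zeros(H)

def direct_forecast(x):
    """x: (L, C) window -> (H, C) forecast, produced in a single pass."""
    return (x.T @ W + b).T               # per-channel linear map

def autoregressive_forecast(step_model, x, horizon):
    """Iterative alternative: predict one step, feed it back, repeat."""
    window = x.copy()
    preds = []
    for _ in range(horizon):
        y = step_model(window)                   # (C,) one-step prediction
        preds.append(y)
        window = np.vstack([window[1:], y])      # slide the window forward
    return np.stack(preds)                       # (horizon, C)
```

Because `direct_forecast` emits all H steps at once from the same look-back, every horizon step conditions on identical input information, which is the "clean information exchange" noted above.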
2. Optimization Pathology in Direct Forecasting
Empirical studies have revealed an optimization pathology intrinsic to DF, where joint training over long horizons induces a fundamental conflict between learning accurate near-term and long-term predictions (Ma et al., 30 Jan 2026). The underlying phenomena are:
- Adversarial Gradient Conflict:
When the horizon is partitioned into segments (e.g., near, medium, and far future), the gradients from the loss on near-term segments are nearly orthogonal to, or actively opposed to, the gradients from the overall horizon loss. In practice, as the forecast horizon grows, the cosine similarity between the gradient of the early-segment loss and the gradient of the total loss approaches zero or becomes negative, indicating strong optimization conflict.
- Distal Dominance:
Although the norm of the near-term segment gradient can be large (reflecting underfitting on recent steps), the model's parameter updates are dominated by the gradients from distant (far future) steps due to their larger alignment (cosine similarity) with the total gradient. This causes the optimization trajectory to overweight fitting the end of the forecast horizon at the expense of near-term accuracy.
- Empirical Consequence:
The net effect is chronic underfitting of local dynamics (i.e., the earliest part of the forecasting window), even when overall training loss is minimized. This underfitting cannot be fully addressed by modifying the loss function or adding auxiliary alignment terms without architectural changes.
3. Architectural Embodiment and Evaluation
DF has been widely adopted as the default output format in both MLP-based (e.g., TimeMixer, MDMixer) and Transformer-based architectures (Gao et al., 13 May 2025, Shen et al., 17 Jul 2025). Model classes using DF include:
- MLP-based patching frameworks with multi-scale or per-channel output heads.
- Transformer models (PatchTST, FEDformer, iTransformer) employing a direct mapping from the look-back window to the entire horizon, often with various normalization, aggregation, and attention strategies.
- Linear-centric and mixture-of-expert forecasters where the entirety of the future window is predicted analogously.
Comprehensive experiments consistently show that DF, as a training and evaluation paradigm, tends to outperform autoregressive models in raw accuracy, particularly when joint dependencies exist across target steps (Shen et al., 17 Jul 2025). However, the optimization challenges described above limit further scaling and generalization.
4. Theoretical Position and Relationship to Evolutionary Forecasting
The Evolutionary Forecasting (EF) paradigm (Ma et al., 30 Jan 2026) provides a strict formalization of the relationship between DF and more general generative approaches. EF decouples the model's output horizon from the evaluation horizon and generates predictions via sequential rollouts of a fixed K-step operator. The DF paradigm is mathematically a degenerate special case of EF, obtained by setting K = H: the entire forecast window is generated in one step, and during training, teacher forcing is applied to the full output window.
This result unifies previous approaches and demonstrates that DF lacks the flexibility and sample efficiency of block-wise, evolutionary generation, especially as the horizon H increases. Importantly, EF substantially mitigates the gradient-conflict pathology by:
- Allowing training on short output horizons (K ≪ H), isolating local dynamics.
- Rolling the operator forward iteratively at inference, enabling robust extrapolation.
- Providing larger effective sample counts (more training windows), especially at large H.
5. Experimental Comparison and Paradigm Shift
Key empirical findings, based on systematic benchmarks and ablation studies (Ma et al., 30 Jan 2026), include:
| Method | Retrain per H? | Stable for large H? | Near-term fit | Extrapolation | Sample efficiency |
|---|---|---|---|---|---|
| Direct (DF) | Yes | No | Poor | Collapses | Poor at large H |
| Evolutionary (EF) | No | Yes | Strong | Robust | Good |
Experiments reveal:
- DF underfits near-term targets for large H, while EF consistently maintains stability and better extrapolation, even as H becomes very large.
- A single EF model (trained at any reasonable K) can outperform an ensemble of DF models, each specifically retrained for its own H, in more than 80% of test cases.
- Asymptotic stability is achieved when using EF, with smooth error growth and strong robustness to extreme extrapolation.
6. Limitations and Ongoing Adaptation
DF's efficiency and synergy with direct-mapping architectures, multi-scale aggregation, and patching have underpinned many state-of-the-art results in recent years (Gao et al., 13 May 2025, Shen et al., 17 Jul 2025). Nevertheless, its rigid coupling of the architectural output horizon to the evaluation task yields inflexibility: complete retraining is needed for each H, and training samples become scarce for large H.
A major contemporary trend is the shift from passive static mapping (DF) to evolutionary reasoning (EF). This transition is driven by empirical evidence of the optimization pathology in DF and the demonstrated superior performance, generality, and stability of EF (Ma et al., 30 Jan 2026).
7. Summary Table: Direct Forecasting vs. Evolutionary Forecasting
| Aspect | Direct Forecasting (DF) | Evolutionary Forecasting (EF) |
|---|---|---|
| Output horizon in model | H (full evaluation horizon) | K ≤ H (decoupled from evaluation) |
| Training loss | On the entire H-step window | Only on the first K steps |
| Inference | Single forward pass for all H steps | Iterative block rollouts |
| Sample count (series length T) | T − L − H + 1 windows | T − L − K + 1 windows (more, since K ≤ H) |
| Near-term fit | Chronic underfit (worsens as H increases) | Robust |
| Extreme extrapolation | Unstable/collapses | Stable, “rolls out” prediction |
| Retraining per H | Required | Not required (“one-for-all”) |
References
- "To See Far, Look Close: Evolutionary Forecasting for Long-term Time Series" (Ma et al., 30 Jan 2026)
- "A Multi-scale Representation Learning Framework for Long-Term Time Series Forecasting" (Gao et al., 13 May 2025)
- "The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting" (Shen et al., 17 Jul 2025)