Direct Forecasting Paradigm
- Direct Forecasting (DF) is a paradigm that directly maps input time series to a full forecast horizon in a single step, bypassing sequential prediction.
- It encounters optimization challenges due to conflicting gradients between near-term and long-term predictions, leading to underfitting of local dynamics.
- DF is widely used in architectures like MLP and Transformers yet faces limitations in flexibility and sample efficiency compared to evolutionary forecasting.
The Direct Forecasting (DF) paradigm is a foundational approach in long-term time series forecasting (LTSF), characterized by training models to predict the entire target horizon in a single, non-autoregressive forward pass. The methodology, its optimization challenges, and its evolving role in recent research on LTSF are summarized below, drawing primarily from comprehensive analyses and experimental evidence in the recent literature (Ma et al., 30 Jan 2026).
1. Definition and Operational Mechanism
In the DF paradigm, a parametric model is trained such that, given the last L observations of the series, it directly outputs the full H-step prediction for all channels. The entire output window is generated in a single pass; there is no sequential or feedback structure at inference time. The canonical training objective is empirical risk minimization over the target window, typically with mean squared or mean absolute error.
This direct multi-output approach is in contrast to iterative or autoregressive forecasting, which unrolls predictions stepwise, feeding each prediction as the next input. DF achieves high inference efficiency and enables clean information exchange across all steps in the horizon.
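The contrast between the two modes can be made concrete with a minimal sketch. The shapes, variable names (`L`, `H`, `C`), and the shared linear head below are illustrative assumptions, not any specific published model:

```python
import numpy as np

# Minimal sketch (illustrative, not a specific published model):
# a shared linear head maps each channel's L-step look-back window
# to all H future steps in one forward pass.
L, H, C = 96, 24, 3                      # look-back, horizon, channels
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(L, H))  # direct-mapping weights
b = np.zeros(H)

def direct_forecast(x):
    """x: (L, C) window -> (H, C) forecast, produced in a single pass."""
    return (x.T @ W + b).T               # per-channel linear map

def autoregressive_forecast(step_model, x, horizon):
    """Iterative alternative: predict one step, feed it back, repeat."""
    window = x.copy()
    preds = []
    for _ in range(horizon):
        y = step_model(window)                   # (C,) one-step prediction
        preds.append(y)
        window = np.vstack([window[1:], y])      # slide the window forward
    return np.stack(preds)                       # (horizon, C)
```

Because `direct_forecast` emits all H steps at once from the same look-back, every horizon step conditions on identical input information, which is the "clean information exchange" noted above.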
2. Optimization Pathology in Direct Forecasting
Empirical studies have revealed an optimization pathology intrinsic to DF, where joint training over long horizons induces a fundamental conflict between learning accurate near-term and long-term predictions (Ma et al., 30 Jan 2026). The underlying phenomena are:
- Adversarial Gradient Conflict:
When the horizon is partitioned into segments (e.g., near, medium, and far future), the gradients from the loss on near-term segments are nearly orthogonal to, or actively opposed to, the gradients from the overall horizon loss. In practice, as the forecast horizon grows, the cosine similarity between the gradient of the early-segment loss and the gradient of the total loss approaches zero or becomes negative, indicating strong optimization conflict.
- Distal Dominance:
Although the norm of the near-term segment gradient can be large (reflecting underfitting on recent steps), the model's parameter updates are dominated by the gradients from distant (far future) steps due to their larger alignment (cosine similarity) with the total gradient. This causes the optimization trajectory to overweight fitting the end of the forecast horizon at the expense of near-term accuracy.
- Empirical Consequence:
The net effect is chronic underfitting of local dynamics (i.e., the earliest part of the forecasting window), even when overall training loss is minimized. This underfitting cannot be fully addressed by modifying the loss function or adding auxiliary alignment terms without architectural changes.
3. Architectural Embodiment and Evaluation
DF has been widely adopted as the default output format in both MLP-based (e.g., TimeMixer, MDMixer) and Transformer-based architectures (Gao et al., 13 May 2025, Shen et al., 17 Jul 2025). Model classes using DF include:
- MLP-based patching frameworks with multi-scale or per-channel output heads.
- Transformer models (PatchTST, FEDformer, iTransformer) employing a direct mapping from the look-back window to the entire horizon, often with various normalization, aggregation, and attention strategies.
- Linear-centric and mixture-of-expert forecasters where the entirety of the future window is predicted analogously.
Comprehensive experiments consistently show that DF, as a training and evaluation paradigm, tends to outperform autoregressive models in raw accuracy, particularly when joint dependencies exist across target steps (Shen et al., 17 Jul 2025). However, the optimization challenges described above limit further scaling and generalization.
4. Theoretical Position and Relationship to Evolutionary Forecasting
The Evolutionary Forecasting (EF) paradigm (Ma et al., 30 Jan 2026) provides a strict formalization of the relationship between DF and more general generative approaches. EF decouples the model's output horizon from the evaluation horizon and generates predictions via sequential rollouts of a fixed K-step operator. The DF paradigm is mathematically a degenerate special case of EF, obtained by setting K = H: the entire forecast window is generated in one step, and during training, teacher forcing is applied to the full output window.
This result unifies previous approaches and demonstrates that DF lacks the flexibility and sample efficiency of block-wise, evolutionary generation, especially as the horizon H increases. Importantly, EF substantially mitigates the gradient-conflict pathology by:
- Allowing training on short output horizons (K ≪ H), isolating local dynamics.
- Rolling the operator forward iteratively at inference, enabling robust extrapolation.
- Providing larger effective sample counts (more training windows), especially at large H.
5. Experimental Comparison and Paradigm Shift
Key empirical findings, based on systematic benchmarks and ablation studies (Ma et al., 30 Jan 2026), include:
| Method | Retrain per H? | Stable for large H? | Near-term fit | Extrapolation | Sample efficiency |
|---|---|---|---|---|---|
| Direct (DF) | Yes | No | Poor | Collapses | Poor at large H |
| Evolutionary (EF) | No | Yes | Strong | Robust | Good |
Experiments reveal:
- DF underfits near-term targets for large H, while EF consistently maintains stability and better extrapolation, even as H becomes very large.
- A single EF model (trained at any reasonable K) can outperform an ensemble of DF models, each specifically retrained for its own H, in more than 80% of test cases.
- Asymptotic stability is achieved when using EF, with smooth error growth and strong robustness to extreme extrapolation.
6. Limitations and Ongoing Adaptation
DF's efficiency and synergy with direct-mapping architectures, multi-scale aggregation, and patching have underpinned many state-of-the-art results in recent years (Gao et al., 13 May 2025, Shen et al., 17 Jul 2025). Nevertheless, its rigid coupling of the architectural output horizon to the evaluation task yields inflexibility: complete retraining is needed for each H, and training samples become scarce for large H.
A major contemporary trend is the shift from passive static mapping (DF) to evolutionary reasoning (EF). This transition is driven by empirical evidence of the optimization pathology in DF and the demonstrated superior performance, generality, and stability of EF (Ma et al., 30 Jan 2026).
7. Summary Table: Direct Forecasting vs. Evolutionary Forecasting
| Aspect | Direct Forecasting (DF) | Evolutionary Forecasting (EF) |
|---|---|---|
| Output horizon in model | H (full evaluation horizon) | K ≤ H (decoupled from evaluation) |
| Training loss | On the entire H-step window | Only on the first K steps |
| Inference | Single forward pass for all H steps | Iterative block rollouts |
| Sample count (series length T) | T − L − H + 1 windows | T − L − K + 1 windows (more, since K ≤ H) |
| Near-term fit | Chronic underfit (worsens as H increases) | Robust |
| Extreme extrapolation | Unstable/collapses | Stable, “rolls out” prediction |
| Retraining per H | Required | Not required (“one-for-all”) |
References
- "To See Far, Look Close: Evolutionary Forecasting for Long-term Time Series" (Ma et al., 30 Jan 2026)
- "A Multi-scale Representation Learning Framework for Long-Term Time Series Forecasting" (Gao et al., 13 May 2025)
- "The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting" (Shen et al., 17 Jul 2025)