Label Horizon Paradox: Forecasting Insight
- Label Horizon Paradox is defined as the phenomenon where an intermediate training label yields higher out-of-sample correlation with the forecast target than the canonical final label.
- A bi-level optimization framework jointly updates model weights and label weights, revealing that selecting an optimal intermediate horizon (h*) improves performance metrics like IC and Sharpe ratios.
- Empirical evidence on Chinese equity universes demonstrates that dynamically optimizing supervision signals mitigates noise accumulation and enhances predictive accuracy in financial forecasting.
The Label Horizon Paradox is a phenomenon in supervised learning for financial forecasting wherein the optimal supervision signal—the label provided to a machine learning model during training—does not coincide with the horizon of the prediction target for which generalization is ultimately required. Classic forecasting pipelines tacitly assume that the best training objective matches the inference goal (i.e., the label is constructed at horizon $H$ if predictions are evaluated at horizon $H$). Song et al. rigorously challenge this canon by demonstrating that, due to the interaction between time-varying signal realization and cumulative noise, superior predictive performance is consistently achieved by training with labels at an intermediate horizon $h^*$, where $0 < h^* < H$. This constitutes the Label Horizon Paradox: the training label that maximizes out-of-sample correlation with the final target does not align with the application horizon but rather with a dynamically optimized proxy (Song et al., 3 Feb 2026).
1. Formal Definition and Empirical Manifestation
The Label Horizon Paradox is defined as the empirical observation that "minimizing training error on the canonical target horizon $H$ does not guarantee optimal generalization on $H$." Instead, it is well documented that models trained on an intermediate horizon $h$, with $h < H$, yield higher out-of-sample predictive correlation (typically measured by the information coefficient, or IC) on the true inference target. This overturns the default label-matching doctrine pervasive in short-term financial modeling (Song et al., 3 Feb 2026).
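To fix ideas on the IC metric itself, the sketch below computes the correlation between a synthetic predictive signal and realized returns at two noise levels; all names and numbers are illustrative and not drawn from the paper.

```python
# Minimal illustration of the information coefficient (IC): the
# correlation between model predictions and realized forward returns.
# Synthetic data only; a noisier label lowers the measured IC.
import numpy as np

def information_coefficient(pred, realized):
    """Pearson correlation between predictions and realized returns."""
    pred = np.asarray(pred, dtype=float)
    realized = np.asarray(realized, dtype=float)
    return float(np.corrcoef(pred, realized)[0, 1])

rng = np.random.default_rng(0)
signal = rng.normal(size=500)
# Realized returns = signal plus idiosyncratic noise.
low_noise = signal + 0.5 * rng.normal(size=500)
high_noise = signal + 3.0 * rng.normal(size=500)

print(information_coefficient(signal, low_noise))
print(information_coefficient(signal, high_noise))
```

The same quantity, computed cross-sectionally per date and averaged, is the IC reported in the empirical sections below.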
2. Dynamic Signal–Noise Trade-Off: Theoretical Foundation
The origin of the paradox lies in a time-dependent trade-off between the rates at which market alpha (signal) and idiosyncratic risk (noise) accumulate as a function of the forecasting horizon. The dynamic is formalized using a continuous-time Arbitrage Pricing Theory (APT) model in which realized returns from $t$ to $t+h$ are given by

$$r_{t \to t+h} = S(h)\,\beta^\top x_t + \sigma W_h,$$

with $x_t$ denoting whitened factor exposures, $S(h)$ the cumulative signal realization function (monotonic, with $S(0)=0$ and $S(H)=1$), and $\sigma W_h$ modeling random-walk noise via a standard Brownian motion $W_h$.

The squared out-of-sample correlation (information coefficient) of the estimator trained at horizon $h$ and evaluated at horizon $H$ is

$$\mathrm{IC}^2(h) = c \cdot \frac{S(h)^2}{S(h)^2 + \sigma^2 h},$$

with $c$ a data-dependent constant.
Here, two marginal effects are distinguished:
- Signal Gain $S(h)$: with marginal gain $S'(h) = \mathrm{d}S/\mathrm{d}h$, capturing the incremental informativeness as $h$ increases;
- Noise Accumulation $N(h)$: with $N(h) = \sigma^2 h$, representing compounding uncertainty.
There exists a unique maximizing horizon $h^*$ that solves

$$h^* = \arg\max_{0 < h \le H} \; \frac{S(h)^2}{S(h)^2 + \sigma^2 h},$$

subject to the first-order condition

$$S'(h^*) = \frac{S(h^*)}{2h^*}.$$

When marginal signal gain falls below the pace of noise accumulation, moving to longer horizons hurts predictability, which explains why $h^*$ is typically less than $H$ (Song et al., 3 Feb 2026).
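The trade-off can be checked numerically. The sketch below assumes a concave signal curve $S(h) = 1 - e^{-kh}$ (an illustrative choice; the paper does not pin down the shape of $S$), accumulates noise variance linearly, and locates the hump of the squared IC, checking the first-order condition at the maximizer.

```python
# Numerical sketch of the signal-noise trade-off behind the paradox,
# under an assumed concave signal curve S(h) = 1 - exp(-k*h).
# IC^2(h) is proportional to S(h)^2 / (S(h)^2 + sigma^2 * h).
import numpy as np

k, sigma, H = 0.8, 0.5, 10.0
h = np.linspace(0.01, H, 2000)

S = 1.0 - np.exp(-k * h)      # cumulative signal realization
N = sigma**2 * h              # accumulated noise variance
ic2 = S**2 / (S**2 + N)       # squared out-of-sample IC (up to a constant)

h_star = h[np.argmax(ic2)]    # hump of the IC curve: interior maximizer
# First-order condition at the optimum: S'(h*) = S(h*) / (2 h*)
S_prime = k * np.exp(-k * h_star)
S_star = 1.0 - np.exp(-k * h_star)
print(h_star, S_prime, S_star / (2 * h_star))
```

With these illustrative parameters the maximizer lies well inside the window, i.e. $h^* < H$, matching the hump-shaped curve reported empirically.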
3. Bi-Level Optimization Framework for Adaptive Label Discovery
Because $S(h)$ and $N(h)$ are unknown a priori, the optimal proxy horizon must be learned from data rather than imposed by design. Song et al. introduce a bi-level optimization approach where the label horizon weights $w \in \Delta^{K-1}$ (the simplex over $K$ candidate horizons) are treated as parameters, optimized jointly with model weights $\theta$:

$$\min_{w \in \Delta^{K-1}} \; \mathcal{L}_H\big(\theta^*(w)\big) - \lambda\,\mathcal{H}(w), \qquad \theta^*(w) = \arg\min_{\theta} \sum_{k=1}^{K} w_k\,\mathcal{L}_{h_k}(\theta).$$

The upper-level (outer) objective evaluates generalization error on the canonical target $H$ using model parameters obtained via the lower-level (inner) objective, a label-weighted loss across the candidate horizons. To encourage exploration and prevent degenerate solutions, an entropy penalty $-\lambda\,\mathcal{H}(w)$ is applied on the simplex.
Optimization proceeds in two stages:
- A warm-up period with mean-field labels to stabilize features.
- Per-batch meta-optimization separating support and query samples, with $\theta$ updated on a $w$-weighted combination of proxy labels and $w$ updated to minimize out-of-sample error on the final target, regularized by entropy (Song et al., 3 Feb 2026).
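The two-stage procedure can be sketched for a linear model, where the inner fit has a closed form and the outer gradient through it is analytic. This is a first-order illustration on synthetic horizon labels, not the paper's implementation; all parameter names are assumptions.

```python
# Sketch of the bi-level label-weight update for a linear model. Label
# weights live on the simplex via a softmax over logits `alpha`; the inner
# step fits theta to a w-weighted mix of proxy-horizon labels on the support
# split, and the outer step nudges `alpha` to reduce query error on the
# final-horizon label, with an entropy penalty against degenerate weights.
import numpy as np

rng = np.random.default_rng(1)
n, d, K = 400, 5, 4                      # samples, features, candidate horizons
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
# Synthetic labels: signal grows with horizon, noise grows faster.
S = np.array([0.5, 0.8, 0.9, 1.0])       # cumulative signal per horizon
noise = np.array([0.3, 0.5, 1.5, 3.0])   # noise scale per horizon
Y = np.stack([S[k] * X @ beta + noise[k] * rng.normal(size=n)
              for k in range(K)], axis=1)

alpha = np.zeros(K)                      # logits over candidate horizons
lam_ent, lr_out, ridge = 0.01, 0.5, 1e-2
sup, qry = slice(0, 200), slice(200, 400)

for step in range(200):
    w = np.exp(alpha - alpha.max())
    w /= w.sum()
    # Inner step: closed-form ridge fit on the w-weighted proxy label.
    y_mix = Y[sup] @ w
    A = X[sup].T @ X[sup] + ridge * np.eye(d)
    theta = np.linalg.solve(A, X[sup].T @ y_mix)
    # Outer step: query loss against the final-horizon (k = K-1) label.
    resid = X[qry] @ theta - Y[qry, K - 1]
    G = np.linalg.solve(A, X[sup].T @ Y[sup])        # d theta / d w  (d x K)
    grad_w = 2.0 * (X[qry] @ G).T @ resid / len(resid)
    grad_w += lam_ent * (np.log(w + 1e-12) + 1.0)    # entropy-penalty gradient
    grad_alpha = w * (grad_w - w @ grad_w)           # softmax Jacobian
    alpha -= lr_out * grad_alpha

w = np.exp(alpha - alpha.max())
w /= w.sum()
print(np.round(w, 3))  # learned weights over the candidate horizons
```

The warm-up stage from the bullet list corresponds to freezing `w` at the uniform (mean-field) point before this loop begins.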
4. Empirical Evidence on Large-Scale Financial Forecasting
Empirical results on representative Chinese equity universes (CSI 300, CSI 500, CSI 1000) and multiple backbone architectures (LSTM, GRU, TCN, Transformer, SSM, and others) consistently demonstrate the paradox:
- In daily close-to-close prediction, the bi-level method discovers an intermediate optimum $h^* < H$ and achieves higher out-of-sample IC, ICIR, and Sharpe ratios than models trained on the final-horizon label. For instance, on CSI 300, IC increases from 0.637 to 0.720 (+13%) and ICIR from 0.443 to 0.562 (+27%).
- For intraday 90-minute prediction, the predicted hump-shaped IC curve is observed: intermediate horizons yield the most stable and informative predictive signals, reflected in higher ICIR and Sharpe ratios.
- Mean-field multi-task training or naive label averaging does not replicate these gains unless the aggregation is restricted to the most informative horizons as discovered by the adaptive process.
This effect is architecture-agnostic and robust across market regimes, but it is not universal: if marginal signal gain outpaces noise accumulation throughout the chosen window, $h^* = H$ can result (as verified for certain short intraday scenarios), and the bi-level method seamlessly reduces to the standard baseline (Song et al., 3 Feb 2026).
5. Operational Guidelines and Practical Considerations
Practical deployment of label horizon optimization involves:
- Discretizing the forecasting window into candidate horizons,
- Initial mean-aggregation pretraining for stability,
- Random partitioning of mini-batches for inner/outer loop optimization,
- Modest entropy regularization to prevent overfitting to noisy proxy labels,
- Sensitivity analysis for hyperparameters such as entropy weight and warm-up period.
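A minimal scaffold for the first three guidelines (horizon discretization, warm-up with mean-field weights, support/query partitioning) might look as follows; all function names are illustrative, not from the paper's code.

```python
# Operational helpers for label-horizon optimization: discretize the
# forecasting window, hold weights uniform during warm-up, and randomly
# split each mini-batch into support and query sets.
import numpy as np

def candidate_horizons(window, K):
    """Discretize the forecasting window [1, window] into K candidate horizons."""
    return np.unique(np.linspace(1, window, K).round().astype(int))

def label_weights(step, warmup, alpha):
    """Uniform (mean-field) weights during warm-up, softmax of logits after."""
    if step < warmup:
        return np.full(len(alpha), 1.0 / len(alpha))
    w = np.exp(alpha - alpha.max())
    return w / w.sum()

def support_query_split(batch_idx, frac=0.5, rng=None):
    """Randomly partition a mini-batch into support and query index sets."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(batch_idx)
    cut = int(len(perm) * frac)
    return perm[:cut], perm[cut:]

hs = candidate_horizons(window=10, K=5)
w0 = label_weights(step=0, warmup=100, alpha=np.zeros(len(hs)))
sup, qry = support_query_split(np.arange(64), rng=np.random.default_rng(0))
```

The entropy weight and warm-up length are exactly the hyperparameters the guidelines flag for sensitivity analysis.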
The scheme is computationally efficient, requiring only 5–20% additional training time compared to the final-label baseline, and does not require brute-force sweeps across training configurations (Song et al., 3 Feb 2026).
Key limitations include the reliance on linear APT-based models for intuition, which may inadequately capture extreme high-frequency or highly nonstationary market dynamics, and sensitivity to the precise shape of the realized signal-to-noise curve. For nonstationary domains or abrupt regime shifts, frequent re-optimization of proxy horizons may be necessary.
6. Implications for Financial Machine Learning
The Label Horizon Paradox refocuses attention from model architectures to the structure and timing of supervision signals in sequential prediction tasks. It demonstrates that, in information environments where price discovery and noise accumulation are dynamically imbalanced, judicious selection or online adaptation of training labels can yield persistent generalization benefits. The theoretical apparatus and practical meta-learning framework of Song et al. provide a systematic methodology for identifying and exploiting this structure, opening avenues for future label-centric research in domains characterized by low SNR, delayed information flow, or rapidly changing effective horizons (Song et al., 3 Feb 2026).