Calibrated Multi-Level Quantile Forecasting

Published 29 Dec 2025 in stat.ML, cs.LG, math.OC, and stat.ME | (2512.23671v1)

Abstract: We present an online method for guaranteeing calibration of quantile forecasts at multiple quantile levels simultaneously. A sequence of $α$-level quantile forecasts is calibrated if the forecasts are larger than the target value at an $α$-fraction of time steps. We introduce a lightweight method called Multi-Level Quantile Tracker (MultiQT) that wraps around any existing point or quantile forecaster to produce corrected forecasts guaranteed to achieve calibration, even against adversarial distribution shifts, while ensuring that the forecasts are ordered -- e.g., the 0.5-level quantile forecast is never larger than the 0.6-level forecast. Furthermore, the method comes with a no-regret guarantee that implies it will not worsen the performance of an existing forecaster, asymptotically, with respect to the quantile loss. In experiments, we find that MultiQT significantly improves the calibration of real forecasters in epidemic and energy forecasting problems.

Summary

The paper presents MultiQT, a framework that recalibrates quantile forecasts in real time to achieve multi-level calibration and non-crossing properties.
It leverages isotonic regression and lazy online gradient descent, delivering rigorous non-asymptotic guarantees on calibration error and quantile loss.
Empirical tests on COVID-19 mortality and wind energy forecasting demonstrate significant improvements in forecast accuracy and operational reliability.

Calibrated Multi-Level Quantile Forecasting: Theory and Practice

Introduction and Motivation

Quantile forecasts are a fundamental tool for probabilistic time series prediction and decision-making under uncertainty. A persistent challenge in practical deployment is achieving multi-level calibration, meaning that for every quantile level $\alpha$ the true value falls at or below the $\alpha$-level forecast at approximately an $\alpha$-fraction of time steps, while simultaneously guaranteeing distributional consistency (i.e., monotonic, non-crossing quantile estimates across levels). The paper "Calibrated Multi-Level Quantile Forecasting" (2512.23671) addresses this by proposing the Multi-Level Quantile Tracker (MultiQT), a theoretically grounded and computationally lightweight framework for recalibrating arbitrary quantile forecasters in real time. It comes with formal distribution-free calibration guarantees that hold even under adversarial or non-stationary distribution shift, and with no-regret guarantees with respect to quantile loss.

Problem Setup and Theoretical Foundations

Let $(y_t, x_t)$ be a sequence of targets and covariates. At each time $t$, a forecaster outputs $\{q_t^\alpha\}_{\alpha\in A}$, with $A$ a set of quantile levels. The core requirements are:

Calibration: For each $\alpha\in A$, $\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^T \mathbf{1}\{y_t \le q_t^\alpha\} = \alpha$.
No Crossings: For each $t$, the quantile vector is ordered, $q_t^{\alpha_1}\le \ldots \le q_t^{\alpha_{|A|}}$, where $\alpha_1 < \ldots < \alpha_{|A|}$ are the levels in $A$.
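
The two requirements above can be checked empirically on any finite forecast history. The following is a minimal sketch, not code from the paper: the function names `empirical_coverage` and `count_crossings` and the toy data are made up for illustration.

```python
def empirical_coverage(y, forecasts):
    """Fraction of time steps with y_t <= q_t, computed separately per level.

    y: list of T targets; forecasts: list of T lists, one forecast per level.
    """
    T = len(y)
    d = len(forecasts[0])
    return [sum(y[t] <= forecasts[t][j] for t in range(T)) / T for j in range(d)]


def count_crossings(forecasts):
    """Number of time steps whose forecast vector is not non-decreasing."""
    return sum(
        any(q[j] > q[j + 1] for j in range(len(q) - 1)) for q in forecasts
    )


# Toy data (invented for illustration): two levels, four time steps.
alphas = [0.25, 0.75]
y = [1.0, 2.0, 3.0, 4.0]
forecasts = [[1.5, 2.5], [1.5, 2.5], [1.5, 2.5], [2.6, 2.5]]

coverage = empirical_coverage(y, forecasts)  # [0.25, 0.5]
calibration_error = [abs(c - a) for c, a in zip(coverage, alphas)]
crossings = count_crossings(forecasts)  # 1 (the last row is out of order)
```

Here the lower level is calibrated on this toy history, the upper level under-covers by 0.25, and one time step violates the ordering requirement.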

The difficulty is that existing online calibration methods for individual quantiles do not extend trivially to the multi-quantile case; naively applying independent single-quantile trackers leads to order violations (quantile crossings), which are statistically incoherent and undesirable for downstream decision making.
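
To make the crossing problem concrete, here is a small sketch of independent per-level trackers. The update rule used, $\theta \leftarrow \theta - \eta(\mathbf{1}\{y_t \le q_t\} - \alpha)$, is the standard gradient step on the quantile loss and is an assumption here, as are the function name and the hand-picked inputs.

```python
def independent_trackers(base, y, alphas, lr):
    """Run one quantile tracker per level with no coupling between levels.

    Assumed per-level update: theta <- theta - lr * (1{y_t <= q_t} - alpha).
    """
    theta = [0.0] * len(alphas)
    forecasts = []
    for b_t, y_t in zip(base, y):
        # Each level simply adds its own offset; nothing enforces ordering.
        q_t = [b + th for b, th in zip(b_t, theta)]
        forecasts.append(q_t)
        for j, alpha in enumerate(alphas):
            covered = 1.0 if y_t <= q_t[j] else 0.0
            theta[j] -= lr * (covered - alpha)
    return forecasts


# Hand-picked inputs (invented) that trigger a crossing on the third step.
alphas = [0.5, 0.6]
base = [[0.0, 0.0] for _ in range(3)]
y = [1.0, 0.55, 0.0]
forecasts = independent_trackers(base, y, alphas, lr=1.0)
crossed = [q[0] > q[1] for q in forecasts]  # [False, False, True]
```

The crossing appears when a target lands between two adjacent forecasts: the lower level misses and moves up while the upper level covers and moves down, and with a large enough step the two pass each other.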

To formalize and solve this, the authors introduce a connection to constrained gradient equilibrium in online learning. They show that multi-level calibration with monotonicity is a constrained equilibrium problem, and that lazy online gradient descent, in which the unprojected iterate is updated and its projection onto the isotonic cone (the set of ordered vectors) is played at each step, achieves the joint objective of calibration and monotonicity under mild technical conditions (notably, an "inward flow" property for the loss and constraint set). They provide finite-sample and asymptotic calibration guarantees, as well as no-regret quantile loss guarantees.

The Multi-Level Quantile Tracker (MultiQT) Algorithm

The MultiQT wrapper is conceptually simple:

Given a base forecaster that outputs a $d$-dimensional vector of quantile forecasts $b_t = \{b_t^\alpha\}_{\alpha\in A}$, with $d = |A|$, maintain a vector of online-learned offsets $\theta_t\in\mathbb{R}^d$.
Each forecast is $q_t = \mathrm{IsoReg}(b_t+\theta_t)$, i.e., the offset-corrected forecasts are passed through isotonic regression to enforce monotonicity.
Update $\theta_t$ using a stochastic approximation step based only on the coverage error at each level, evaluated on the forecast actually played (i.e., after projection onto the isotonic cone).
For delayed-feedback settings, appropriately delay the offset update.
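
The steps above can be sketched in a few lines of plain Python. This is an illustrative reading of the description, not the authors' implementation: the update $\theta \leftarrow \theta - \eta(\mathbf{1}\{y_t \le q_t^\alpha\} - \alpha)$, the pool-adjacent-violators routine used for $\mathrm{IsoReg}$, the function names, and the synthetic Gaussian data are all assumptions, and delayed feedback is not handled.

```python
import random


def isotonic_projection(v):
    """Least-squares projection onto non-decreasing vectors (pool adjacent violators)."""
    blocks = []  # each block is [sum of values, number of values]
    for x in v:
        blocks.append([float(x), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out


def multiqt(base, y, alphas, lr):
    """Sketch of the wrapper: play IsoReg(b_t + theta_t), then update theta lazily."""
    theta = [0.0] * len(alphas)  # hidden, unprojected offsets
    forecasts = []
    for b_t, y_t in zip(base, y):
        q_t = isotonic_projection([b + th for b, th in zip(b_t, theta)])
        forecasts.append(q_t)
        for j, alpha in enumerate(alphas):
            # Coverage is evaluated on the played (projected) forecast.
            covered = 1.0 if y_t <= q_t[j] else 0.0
            theta[j] -= lr * (covered - alpha)
    return forecasts


# Synthetic check (invented data): a base forecaster that always predicts 0.
rng = random.Random(0)
alphas = [0.1, 0.5, 0.9]
T = 5000
y = [rng.gauss(0.0, 1.0) for _ in range(T)]
base = [[0.0, 0.0, 0.0] for _ in range(T)]
forecasts = multiqt(base, y, alphas, lr=0.05)
coverage = [sum(y[t] <= forecasts[t][j] for t in range(T)) / T for j in range(3)]
```

Note that the offsets `theta` are never projected themselves; only the played forecast is. This is what "lazy" refers to in this sketch. On the synthetic data, `coverage` should land close to `alphas` while every row of `forecasts` stays ordered.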

Key distinguishing properties:

Distribution-Free Calibration: Calibration error for each quantile decreases to zero under only boundedness of the forecaster residuals; no independence, stationarity, or mixing assumptions are required.
No-Regret Quantile Loss: The cumulative quantile (pinball) loss of MultiQT is asymptotically no worse than any constant offset strategy (including the uncorrected baseline), with explicit uniform error rates and hyperparameter control for aggressiveness.
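
For reference, the quantile (pinball) loss and the notion of regret against a constant offset can be written down directly. This is a minimal sketch with made-up helper names. Setting `offset=0.0` compares against the uncorrected base forecaster.

```python
def pinball_loss(y, q, alpha):
    """Quantile (pinball) loss of forecast q for target y at level alpha."""
    return alpha * (y - q) if y > q else (1.0 - alpha) * (q - y)


def average_loss(y, forecasts, alpha):
    """Mean pinball loss of a sequence of single-level forecasts."""
    return sum(pinball_loss(y_t, q_t, alpha) for y_t, q_t in zip(y, forecasts)) / len(y)


def average_regret(y, played, base, alpha, offset):
    """Average loss of the played forecasts minus that of base + a fixed offset."""
    shifted = [b + offset for b in base]
    return average_loss(y, played, alpha) - average_loss(y, shifted, alpha)


# Toy usage (invented numbers): the played forecasts are perfect, the base is not.
regret = average_regret(
    y=[1.0, 2.0], played=[1.0, 2.0], base=[0.0, 0.0], alpha=0.5, offset=0.0
)
```

A negative value means the played forecasts beat the comparator. A no-regret guarantee says the average regret against the best fixed offset cannot stay positive as $T$ grows.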

The authors derive precise non-asymptotic bounds for the calibration error and regret, and identify the fundamental calibration--regret trade-off through the learning rate. Optimal rates and adaptivity to delay are also analytically established.
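
One way to see where the learning rate enters is the following telescoping identity. It assumes the offset update $\theta_{t+1}^\alpha = \theta_t^\alpha - \eta\,(\mathbf{1}\{y_t \le q_t^\alpha\} - \alpha)$ with a constant learning rate $\eta$; the symbol $\eta$ and this exact form of the update are assumptions of the sketch and may differ from the paper's statement. Summing the update over $t = 1, \ldots, T$ gives

$$
\theta_{T+1}^\alpha - \theta_1^\alpha = -\eta \sum_{t=1}^T \left(\mathbf{1}\{y_t \le q_t^\alpha\} - \alpha\right)
\quad\Longrightarrow\quad
\left|\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{y_t \le q_t^\alpha\} - \alpha\right| = \frac{\left|\theta_{T+1}^\alpha - \theta_1^\alpha\right|}{\eta T}.
$$

Under this update, bounding the calibration error therefore reduces to bounding how far the offsets can drift. For a fixed bound on the offsets, the right-hand side scales as $1/(\eta T)$, so a larger $\eta$ tightens it, whereas standard regret bounds for online gradient descent typically contain a term that grows with $\eta$. This is one way to read the trade-off described above.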

Empirical Results: Epidemic and Energy Forecasting

The MultiQT method is validated on large-scale real-world forecasting problems in both epidemiology (COVID-19 state-level mortality targets) and energy grid management (day-ahead wind/solar energy forecasting). In both contexts, the baseline forecasts are consistently miscalibrated, often severely enough to matter for risk management: they under- or overestimate uncertainty and frequently produce quantile crossings.

COVID-19 Forecast Calibration

Comparing raw and recalibrated forecasts for COVID-19 deaths in California and other states, MultiQT yields striking improvements in both marginal calibration and cross-quantile coherence.

Figure 1: Raw forecasts and calibration for one-week-ahead COVID-19 deaths, showing substantial deviation from nominal quantile coverage.

Figure 2: Raw forecasts and their calibration, demonstrating biased and inappropriately sharp predictive intervals prior to MultiQT adjustment.

For a comprehensive multi-state, multi-forecaster performance comparison:

Figure 3: Each arrow shows average quantile loss and PIT entropy for raw (tail) and MultiQT-recalibrated (head) $h$-week-ahead COVID-19 death forecasts, $h\in\{1,2,3,4\}$; all points move towards high entropy (uniform PIT, i.e., calibration) and lower loss under MultiQT.

Detailed breakdowns for individual forecasters also demonstrate elimination of systematic under- or over-coverage and resolution of ordering issues:

Figure 4: Part 1/2, California, one-week-ahead: for each forecaster, columns show raw forecasts, MultiQT forecasts, coverage before/after; only MultiQT achieves empirical coverage matching nominal levels everywhere.

Figure 5: Part 2/2, continuation of the above, further supporting improved empirical reliability post-calibration.

Energy Forecasting: Wind Power Calibration

In the energy domain, MultiQT enhances both coverage guarantees and the interpretability of production risk for grid operation. Wind forecast calibration “before” and “after” is visualized as follows:

Figure 6: Wind energy forecast coverage versus nominal, before calibration—with consistent under-coverage and quantile crossings.

Figure 7: Post-calibration, empirical coverage aligns closely with nominal quantile levels, with forecast distributions rendered sharp yet reliable for all time blocks.

This transformation is critical for operational risk management, since ensuring that actual energy shortfalls are no more frequent than predicted by the forecast quantiles is essential for balancing and reserve planning.

Guarantees, Theory, and Algorithmic Insights

The theoretical contributions include:

Sharp Non-Asymptotic Bounds: For each quantile level $\alpha$ and any sequence $y_t$, the empirical coverage of the MultiQT forecasts converges to $\alpha$ at rate $O(T^{-1/2})$, with explicit dependence on the number of quantile levels, the learning rate, and the residual bounds.
Necessity of Isotonic Projection: Sorting or heuristic order-correction is insufficient; only projection onto the isotonic cone ensures calibration for all levels in the presence of adversarial data or persistent crossing events.
Inward Flow Criterion: The convergence and calibration proofs leverage the inward flow property, requiring that the negative gradient at constraint boundaries points into the feasible region, which is always realized for the non-crossing constraint set but not generally satisfied for stricter constraints (e.g., with enforced minimal quantile separation).
No-Regret Loss Minimization: MultiQT’s average quantile loss is asymptotically no worse than that of the base forecaster or of any constant offset correction in the constraint set, with average regret shrinking as $O(T^{-1/2})$ or better.
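
The inward flow property can be illustrated with a short calculation. This is a sketch under the assumption that the per-level gradient of the quantile loss is $g^\alpha = \mathbf{1}\{y \le q^\alpha\} - \alpha$; it is not the paper's formal statement. On the boundary of the non-crossing set, two adjacent levels $\alpha_i < \alpha_{i+1}$ have equal forecasts, $q^{\alpha_i} = q^{\alpha_{i+1}}$, so their coverage indicators coincide, and the negative gradient changes the gap $q^{\alpha_{i+1}} - q^{\alpha_i}$ in the direction

$$
-\left(g^{\alpha_{i+1}} - g^{\alpha_i}\right) = \alpha_{i+1} - \alpha_i > 0,
$$

which pushes the two forecasts apart, back into the feasible region. If instead a minimum separation $\delta > 0$ is enforced, the boundary is $q^{\alpha_{i+1}} = q^{\alpha_i} + \delta$, and a target with $q^{\alpha_i} < y \le q^{\alpha_{i+1}}$ gives indicators $0$ and $1$, so that

$$
-\left(g^{\alpha_{i+1}} - g^{\alpha_i}\right) = \alpha_{i+1} - \alpha_i - 1 < 0,
$$

which pushes the gap below $\delta$, out of the feasible region. This is consistent with the remark above that stricter constraints need not satisfy the property.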

These theoretical results are accompanied by rigorous empirical analyses that demonstrate calibration, sharpness, and reliability improvements across domains and timescales, even in the presence of feedback delays or abrupt non-stationarity.

Practical Implications and Future Outlook

MultiQT is fundamentally compositional: it can wrap classical time-series, modern machine learning, or even simple point forecasts, transforming arbitrary uncertainty estimates into reliable probabilistic tools suitable for ensemble forecasting, resource allocation, and decision-making under risk. It is computationally lightweight and trivially parallelizable across quantile sets and time series.

Potential avenues for future research include:

Extension to conditional calibration, enabling reliable distributional forecasts conditional on covariates.
Integration with dynamic error-predictive scorecasting, leveraging predictable structure in forecaster errors for faster adaptation.
Investigation of broader constraint classes, including minimal separation, monotonicity in higher dimensions, or quantile curve smoothness within the lazy gradient descent/inward flow paradigm.
Improved learnable meta-adaptation, e.g., dynamic or data-dependent learning rate scheduling for optimal calibration–sharpness trade-off under uncertainty.

Conclusion

This work resolves an open challenge in probabilistic time-series forecasting: providing a distribution-free, efficient, and theoretically sound procedure for multi-level quantile calibration with distributional consistency. The MultiQT framework delivers strong long-run guarantees, robust empirical performance, and seamless integration with arbitrary base models. These results advance the practical toolkit for uncertainty quantification and point toward further developments in online calibration, robust learning under distribution shift, and reliable AI-driven decision support systems.
