FluSight Challenge: Advancing Flu Forecasting
- The FluSight Challenge is an annual CDC-led exercise that evaluates and advances probabilistic influenza forecasting techniques using standardized targets and metrics.
- It integrates diverse data streams and ensemble methods, including machine learning and mechanistic models, to enhance forecast reliability.
- The initiative informs public health decisions by delivering actionable, location-specific predictions that improve outbreak preparedness and resource allocation.
The FluSight Challenge is an annual, prospective influenza forecasting exercise organized by the U.S. Centers for Disease Control and Prevention (CDC), designed to rigorously evaluate and advance probabilistic influenza forecasting methods. By soliciting real-time, location-specific, multi-horizon predictions of influenza dynamics from research teams around the world, it functions as a research platform for methodological innovation, ensemble model development, and actionable public health intelligence.
1. Historical Context and Objectives
The challenge emerged from the recognition that actionable, calibrated, real-time forecasts of influenza activity could improve resource allocation, public messaging, and mitigation efforts at both national and local scales (Ray et al., 2024; Wattanachit et al., 2022). From its inception in the 2013/14 season, the FluSight Challenge has defined and standardized targets, data sources, and scoring metrics, serving as a testbed for diverse methodological paradigms (mechanistic, statistical, hybrid, and human-in-the-loop models). Although originally focused on the influenza-like illness (ILI) percentage reported by ILINet, the primary target shifted in 2021/22 to weekly counts of laboratory-confirmed influenza hospitalizations, reflecting improvements in data infrastructure and evolving public health priorities (Wadsworth et al., 2024; Ray et al., 2024).
2. Forecast Targets, Data Streams, and Submission Format
Forecasting teams are tasked with probabilistic prediction of multiple public health-relevant targets at granular spatial and temporal resolutions:
- Short-term dynamics: 1–4 week-ahead ILI percentages (historically) or hospitalization counts.
- Seasonal milestones: Onset week (first sustained exceedance of a region-specific baseline), peak week (timing of maximum activity), and peak intensity (the maximum observed value) (Santillana et al., 2015; Osthus et al., 2019; Wadsworth et al., 2024).
- Geographies: National, the ten HHS regions, the 50 states, Washington, DC, and Puerto Rico.
- Submission format: Discrete probability vectors (for binned targets) or quantile-based distributions (23 quantiles in the hospitalization era), submitted weekly and conforming to precise data and schema requirements; a minimal construction is sketched below.
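To make the quantile format concrete, the sketch below collapses a predictive sample into the 23-quantile layout used in the hospitalization era. The column names follow a hub-style long format but are illustrative assumptions here, not the official submission schema.

```python
import numpy as np
import pandas as pd

# The 23 quantile levels of the hospitalization-era format:
# 0.01, 0.025, 0.05, 0.10, ..., 0.95, 0.975, 0.99.
QUANTILE_LEVELS = np.round(
    np.concatenate([[0.01, 0.025], np.arange(0.05, 0.951, 0.05), [0.975, 0.99]]), 3
)

def forecast_rows(location, reference_date, horizon, samples):
    """Collapse a predictive sample into one row per quantile level.

    Column names are illustrative, not the official FluSight schema.
    """
    values = np.quantile(samples, QUANTILE_LEVELS)
    return pd.DataFrame({
        "location": location,
        "reference_date": reference_date,
        "horizon": horizon,
        "output_type": "quantile",
        "output_type_id": QUANTILE_LEVELS,
        "value": np.maximum(values, 0.0),  # admission counts are nonnegative
    })

# Example: a lognormal predictive sample for 2-week-ahead admissions in one state.
rng = np.random.default_rng(0)
rows = forecast_rows("US-06", "2024-01-06", 2, rng.lognormal(5.0, 0.4, 5000))
print(rows.head())
```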
Data utilized encompass both traditional surveillance (ILINet, laboratory-confirmed hospitalizations, participatory surveillance, electronic health records) and diverse digital streams (Google Trends, Twitter, crowd-sourced platforms) (Santillana et al., 2015; Dong et al., 2019; Ray et al., 2024).
3. Methodological Innovation: Model Classes
The FluSight Challenge has catalyzed the development and rigorous comparison of a wide spectrum of modeling frameworks:
- Mechanistic epidemic models: SIR-type models (simulated in the sketch after this list), hierarchical compartmental models with process and discrepancy terms, and renewal-type dynamic systems for hospitalizations (Osthus et al., 2017; Wadsworth et al., 2024; Aawar et al., 2022).
- Empirical Bayes and semiparametric approaches: Historical-curve transformation, dynamic borrowing of seasonal structure, and flexible prior modeling without strict mechanistic assumptions (Brooks et al., 2014; Osthus et al., 2017).
- Machine learning and statistical models: Dynamic regression on digital signals, support vector regression, random forests of mechanistic predictors, and gradient boosting quantile regression (Santillana et al., 2015; Aawar et al., 2022; Ray et al., 2024).
- Hybrid and multi-source fusion: Automated frameworks that jointly model multiple surveillance streams, incorporate multiple locations, and share feature space across signals to leverage data richness under constraints of signal scarcity (Ray et al., 2024).
- Ensemble methods: Linear pools, beta-transformed or mixture ensembles, cluster-aggregate-pool (CAP) algorithms, Bayesian stacking, and winner-takes-all frameworks have become central for combining diverse model outputs and correcting for miscalibration or redundancy (Wattanachit et al., 2022; Wei et al., 2023; Wadsworth et al., 2025).
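As a minimal instance of the mechanistic class above, the following sketch simulates a discrete-time stochastic SIR model and maps incident infections to weekly hospitalization samples. All parameter values and the fixed hospitalization fraction are illustrative assumptions, not settings from any cited model.

```python
import numpy as np

def sir_hospitalization_forecast(s0, i0, beta, gamma, p_hosp, weeks, n_sims, rng):
    """Discrete-time stochastic SIR; returns weekly new-hospitalization samples.

    All parameters are illustrative assumptions, not fitted values.
    """
    n = s0 + i0
    hosp = np.zeros((n_sims, weeks))
    for sim in range(n_sims):
        s, i = s0, i0
        for w in range(weeks):
            # Binomial draws for weekly infections and recoveries.
            new_inf = rng.binomial(s, 1 - np.exp(-beta * i / n))
            new_rec = rng.binomial(i, 1 - np.exp(-gamma))
            s -= new_inf
            i += new_inf - new_rec
            # A fixed fraction of new infections is hospitalized.
            hosp[sim, w] = rng.binomial(new_inf, p_hosp)
    return hosp

rng = np.random.default_rng(1)
samples = sir_hospitalization_forecast(
    s0=900_000, i0=1_000, beta=1.4, gamma=1.0, p_hosp=0.01,
    weeks=4, n_sims=200, rng=rng,
)
# Quantiles of the 1-4 week-ahead trajectories feed the submission format above.
print(np.quantile(samples, [0.025, 0.5, 0.975], axis=0).round())
```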
4. Ensemble Forecast Construction and Calibration
A core principle of the FluSight Challenge is the reliance on ensemble models to improve robustness and forecast skill. Several ensemble methodologies have been canonicalized:
| Ensemble Method | Description/Citation | Calibration Feature |
|---|---|---|
| Equal-Weight Linear Pool | Unweighted average of components (Wattanachit et al., 2022) | Prone to overdispersion |
| Linear Pool (LP) | Optimally weighted average (MLE) (Wattanachit et al., 2022) | Overdispersion persists |
| Beta-Transformed LP (BLP) | Modifies the pooled CDF via a Beta transform (Wattanachit et al., 2022) | Corrects overdispersion; mild underprediction |
| Finite Beta Mixture (BMC_K) | Mixture of Beta transforms (Wattanachit et al., 2022) | Further flexibility |
| Cluster-Aggregate-Pool (CAP) | Cluster, aggregate, then pool (Wei et al., 2023) | Handles redundancy, improves calibration |
| Bayesian Stacking (SGP) | Gibbs posterior for stacking over proper scores (Wadsworth et al., 2025) | Regularizes weights |
Calibration, measured via the Probability Integral Transform (PIT) and Expected Calibration Error (ECE), is a central evaluation criterion, with calibration improvements on the order of 10% typical under CAP or BLP approaches relative to classical linear pools (Wattanachit et al., 2022; Wei et al., 2023). Adaptive weighting, regularization toward equal weighting (shrinkage), and the use of proper scoring rules such as the Weighted Interval Score (WIS), CRPS, and the strictly proper log score play a vital role in ensemble optimization and model selection (Bracher, 2019; Wadsworth et al., 2025).
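A minimal sketch of the beta-transformed linear pool (BLP) idea follows, assuming two normal component forecasts and fixed illustrative transform parameters; in practice the Beta parameters are fit by maximum likelihood on past forecast-observation pairs (Wattanachit et al., 2022).

```python
import numpy as np
from scipy.stats import beta, norm

def lp_cdf(x, components, weights):
    """Linear pool: weighted average of the component CDFs."""
    return sum(w * c.cdf(x) for w, c in zip(weights, components))

def blp_cdf(x, components, weights, a, b):
    """Beta-transformed linear pool: G(x) = Beta_{a,b}(LP(x)).

    a and b would normally be fit by maximum likelihood on held-out
    forecasts; here they are fixed illustrative values.
    """
    return beta.cdf(lp_cdf(x, components, weights), a, b)

components = [norm(100, 15), norm(130, 20)]  # two component forecasts
weights = [0.5, 0.5]
x = np.linspace(40, 200, 9)
# a = b > 1 sharpens the pooled forecast, counteracting LP overdispersion.
print(np.round(lp_cdf(x, components, weights), 3))
print(np.round(blp_cdf(x, components, weights, a=2.0, b=2.0), 3))
```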
5. Scoring Rules and Evaluation Metrics
Forecasts are evaluated on both accuracy and calibration using precisely defined scoring metrics, most of them proper:
- Logarithmic score (LogS): Log of probability assigned to the correct bin (strictly proper). The CDC’s multibin variant (MBlogS), while designed to reward proximity, is not strictly proper and may incentivize hedging (Bracher, 2019).
- Weighted Interval Score (WIS) and Continuous Ranked Probability Score (CRPS): Proper scores that generalize the mean absolute error to probabilistic forecasts (Wadsworth et al., 2024; Wadsworth et al., 2025).
- Probability Integral Transform (PIT) and ECE: Calibration assessments (Wattanachit et al., 2022; Wei et al., 2023).
- Cost-loss Value Score (VS): A decision-theoretic, asymmetric metric quantifying utility for decision-makers by comparing cost- and loss-weighted outcomes of forecasts against climatology and perfect information (Gerlee et al., 2026).
Empirical studies show that while proper scoring rules underpin forecast honesty and sharpness, context-sensitive metrics such as the cost-loss VS are critical for aligning forecast utility with operational decision requirements (Gerlee et al., 2026).
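Because WIS anchors evaluation in the hospitalization era, a compact reference implementation may be useful. The sketch below follows the standard decomposition of WIS into an absolute-median term plus weighted interval scores, and assumes a symmetric set of quantile levels that includes the median.

```python
import numpy as np

def interval_score(lower, upper, alpha, y):
    """Interval score for a central (1 - alpha) prediction interval."""
    return (
        (upper - lower)
        + (2.0 / alpha) * np.maximum(lower - y, 0.0)  # penalty: y below interval
        + (2.0 / alpha) * np.maximum(y - upper, 0.0)  # penalty: y above interval
    )

def weighted_interval_score(quantile_levels, quantile_values, y):
    """WIS from a symmetric set of quantile levels including the median.

    WIS = (|y - median|/2 + sum_k (alpha_k/2) * IS_{alpha_k}) / (K + 1/2).
    """
    levels = np.asarray(quantile_levels)
    values = np.asarray(quantile_values)
    median = values[np.isclose(levels, 0.5)][0]
    alphas = 2.0 * levels[levels < 0.5]  # one alpha per central interval
    total = 0.5 * abs(y - median)
    for a in alphas:
        lo = values[np.isclose(levels, a / 2.0)][0]
        hi = values[np.isclose(levels, 1.0 - a / 2.0)][0]
        total += (a / 2.0) * interval_score(lo, hi, a, y)
    return total / (len(alphas) + 0.5)

# Toy check: a forecast centered on the truth scores better than a shifted one.
levels = [0.025, 0.25, 0.5, 0.75, 0.975]
print(weighted_interval_score(levels, [80, 95, 100, 105, 120], y=100.0))
print(weighted_interval_score(levels, [110, 125, 130, 135, 150], y=100.0))
```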
6. Recent Advances and Key Results
Recent FluSight cycles have been marked by several methodologically significant developments and empirical findings:
- Data fusion and transfer learning: Joint training on multiple signals and locations (multi-task boosting) is the dominant skill driver in settings with limited historical target data (Ray et al., 2024); a minimal quantile-boosting sketch appears at the end of this section.
- Hybrid and discrepancy modeling: Flexible discrepancy components (a reverse random walk on weekly effects) are critical for short-range skill, especially around holidays and peak periods (Wadsworth et al., 2024).
- Automated mechanistic–machine learning hybridization: Tree-based ensembles over parameterized mechanistic models can outperform both hand-tuned mechanistic predictors and pure machine learning baselines, yielding superior WIS/MAE while automating tuning (Aawar et al., 2022).
- Human–machine hybrid ensembles: Chimeric models that combine human judgment densities (on peak timing or intensity) with standard surveillance-driven models yield improved long-horizon forecasts, and convex-mixture spatial mappings allow human forecasts elicited for a small sample of states to be transplanted to all states (McAndrew et al., 2024).
- Calibration-aware pooling: Beta-transformed linear pools and finite beta mixtures offer both sharpness and properly calibrated intervals, outperforming standard linear combinations for week-ahead and seasonal targets (Wattanachit et al., 2022).
- Winner-take-all and stacking innovations: Adaptive ensemble selection (e.g., the winner-takes-all ARGOX-Joint-Ensemble) and Bayesian stacking with regularized Gibbs posteriors yield consistent improvements over classical model averaging for both discrete and quantile-based targets (Ma et al., 2022; Wadsworth et al., 2025).
Notably, automation, ensemble transparency, and scalable computational pipelines are viewed as prerequisites for robust, reproducible deployment on FluSight’s real-time schedule (Aawar et al., 2022; Ray et al., 2024).
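To illustrate the boosting-based quantile approach referenced above, the sketch below fits one gradient-boosted quantile regressor per level on synthetic lagged features; the data and feature construction are illustrative stand-ins for the multi-signal, multi-location training described in Ray et al. (2024).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)

# Synthetic stand-in for surveillance history: seasonal signal plus noise.
t = np.arange(300)
y = 100 + 60 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 10, t.size)

# Lagged-value features predicting the 2-week-ahead target.
lags = np.column_stack([y[2:-2], y[1:-3], y[:-4]])  # y[t], y[t-1], y[t-2]
target = y[4:]                                      # y[t+2]

# One boosted model per quantile level; pinball loss via loss="quantile".
quantile_levels = [0.05, 0.5, 0.95]
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q, n_estimators=200).fit(
        lags, target
    )
    for q in quantile_levels
}
latest = lags[-1:].copy()
print({q: float(m.predict(latest)[0]) for q, m in models.items()})
```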
7. Impact and Future Directions
The FluSight Challenge continues to shape the landscape of infectious disease forecasting, influencing surveillance infrastructure, public health planning, and methodological research. Notable impacts include:
- Method harmonization: Standardized targets, data formats, and scoring protocols have catalyzed the development of benchmarkable models.
- Open comparative evaluation: Retrospective and real-time head-to-head evaluations drive incremental and major methodological advances.
- Adoption of decision-analytic metrics: Context-aware scoring, such as the cost-loss VS, is increasingly emphasized for real-world utility (Gerlee et al., 2026).
- Hybrid and crowd-based approaches: Chimeric and crowd-judgment models demonstrate that integrating tacit knowledge and surveillance data can measurably improve forecast skill, especially at horizons beyond the operational reach of mechanistic and statistical models (McAndrew et al., 2024).
- Calibration as a central goal: Ensembles and calibration-correction approaches have become prerequisites given the limitations of individual models and of naive pooling techniques (Wattanachit et al., 2022; Wei et al., 2023).
Ongoing frontiers include deeper integration of multi-signal data sources (lab-confirmed, syndromic, digital, participatory), spatially coherent hierarchical forecasting, expansion to co-circulating pathogens (influenza and COVID-19), and explicit modeling of surveillance noise, resource constraints, and intervention feedback (Ray et al., 2024; Ma et al., 2022).
References:
Brooks et al., 2014; Santillana et al., 2015; Osthus et al., 2017; Bracher, 2019; Dong et al., 2019; Osthus et al., 2019; Aawar et al., 2022; Ma et al., 2022; Wattanachit et al., 2022; Wei et al., 2023; McAndrew et al., 2024; Ray et al., 2024; Wadsworth et al., 2024; Wadsworth et al., 2025; Gerlee et al., 2026.