Brier Skill Score in Forecasting
- Brier Skill Score is a metric that evaluates probabilistic forecasts by comparing predicted probabilities with observed outcomes against a climatological baseline.
- It decomposes mean forecast error into reliability, resolution, and uncertainty, providing insight into forecast calibration and practical model improvements.
- BSS is applied in operational settings such as solar flare prediction, where enhanced calibration and bias correction optimize decision thresholds and risk analysis.
The Brier Skill Score (BSS) is a quantitative metric for evaluating the performance of probabilistic forecasts relative to a reference forecast, typically climatology. BSS is widely used in the verification of event-probability forecasts, such as solar flare prediction, to quantify gains in forecast accuracy and reliability over a naïve, frequency-based baseline. By construction, BSS provides a standardized, skill-relative evaluation that decomposes mean probabilistic forecast error into contributions from calibration (reliability), forecast sharpness (resolution), and intrinsic climatological uncertainty. Representative treatments of BSS in operational probabilistic verification include Nishizuka et al. (2020) and McCloskey et al. (2018).
1. Formal Definition and Mathematical Formulation
Let $N$ denote the number of independent forecasts, $f_i$ the predicted probability for event $i$, and $o_i \in \{0, 1\}$ the binary observation indicator. The Brier Score (BS) is defined as

$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (f_i - o_i)^2.$$
The reference, or climatological, Brier Score ($\mathrm{BS}_c$) is computed by substituting the climatological event rate $\bar{o}$ (empirically, the frequency of event occurrence in the absence of informative prediction) for each probability forecast:

$$\mathrm{BS}_c = \frac{1}{N} \sum_{i=1}^{N} (\bar{o} - o_i)^2 = \bar{o}\,(1 - \bar{o}).$$
The Brier Skill Score is then

$$\mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_c}.$$
A BSS value of 1 indicates perfect forecasts ($\mathrm{BS} = 0$), zero indicates parity with climatology, and negative values indicate less skill than climatology (Nishizuka et al., 2020; McCloskey et al., 2018).
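As an illustrative numerical sketch of these definitions (toy data and function names of my own, not code from either cited study):

```python
import numpy as np

def brier_score(f, o):
    """Mean squared difference between forecast probabilities f and binary outcomes o."""
    f, o = np.asarray(f, dtype=float), np.asarray(o, dtype=float)
    return np.mean((f - o) ** 2)

def brier_skill_score(f, o):
    """BSS = 1 - BS / BS_c, where BS_c scores the constant climatological forecast."""
    o = np.asarray(o, dtype=float)
    clim = np.full_like(o, o.mean())  # always issue the empirical event rate
    return 1.0 - brier_score(f, o) / brier_score(clim, o)

# Toy sample: 8 forecasts, 3 observed events (climatology = 3/8)
f = [0.9, 0.8, 0.7, 0.2, 0.1, 0.1, 0.3, 0.2]
o = [1,   1,   1,   0,   0,   0,   0,   0]
print(round(brier_skill_score(f, o), 3))  # -> 0.824
```

Here the forecasts place high probability on the realized events, so BS is far below the climatological BS$_c$ and the skill score is close to 1.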
2. Decomposition: Reliability, Resolution, and Uncertainty
BS can be decomposed, following Murphy's separation, into three terms: reliability, resolution, and uncertainty. Let the continuous probability forecasts be grouped into $K$ bins, with bin $k$ containing $n_k$ forecasts, $\bar{f}_k$ the mean forecast probability, and $\bar{o}_k$ the observed event frequency in bin $k$. Climatology is given by $\bar{o}$, the overall event frequency.
Rewriting BS in these terms,

$$\mathrm{BS} = \underbrace{\frac{1}{N} \sum_{k=1}^{K} n_k (\bar{f}_k - \bar{o}_k)^2}_{\text{reliability}} - \underbrace{\frac{1}{N} \sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2}_{\text{resolution}} + \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}},$$

and, since $\mathrm{BS}_c = \bar{o}(1 - \bar{o})$,

$$\mathrm{BSS} = \frac{\text{resolution} - \text{reliability}}{\text{uncertainty}}.$$
A small reliability term (close agreement between predicted probabilities and observed event frequencies within bins) and a large resolution term (strong deviation of binwise observed frequencies from the climatological mean) both increase BSS (McCloskey et al., 2018).
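The decomposition can be verified numerically. The sketch below (my own illustration, not the authors' code) bins by distinct forecast values, for which the identity BS = reliability − resolution + uncertainty holds exactly:

```python
import numpy as np

def murphy_decomposition(f, o):
    """Murphy decomposition of the Brier score, with one bin per distinct
    forecast value so that BS = reliability - resolution + uncertainty exactly."""
    f, o = np.asarray(f, float), np.asarray(o, float)
    N, obar = len(f), o.mean()
    rel = res = 0.0
    for value in np.unique(f):
        mask = f == value
        n_k, o_k = mask.sum(), o[mask].mean()
        rel += n_k * (value - o_k) ** 2   # calibration error within the bin
        res += n_k * (o_k - obar) ** 2    # binwise departure from climatology
    return rel / N, res / N, obar * (1 - obar)

# Toy sample: forecasts take two values, 0.8 (4 cases) and 0.2 (6 cases)
f = np.array([0.8] * 4 + [0.2] * 6)
o = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])
rel, res, unc = murphy_decomposition(f, o)
bss = (res - rel) / unc  # equals 1 - BS / BS_c for this sample
```

The same BSS results whether computed from the decomposition or directly from the two Brier scores, which is what makes the decomposition useful as a diagnostic.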
3. Reference Forecast and Climatology in BSS Calculation
The reference, or climatological, forecast consists of always issuing the empirical event frequency $\bar{o}$ as the predicted probability for every event. In Nishizuka et al. (2020), for example, separate climatological rates are used for solar flares exceeding M-class and for C-class flares, derived from the total training and testing samples. In McCloskey et al. (2018), analogous climatological frequencies for C- and M-class flares are computed from the test period.
The reference forecast forms the baseline against which all forecast systems are evaluated. BSS=0 denotes no improvement over this baseline; negative BSS denotes inferior performance.
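A short sketch (illustrative code, my own naming) makes the baseline property concrete: issuing the climatological rate itself yields BSS = 0 by construction, while a miscalibrated constant forecast scores below climatology:

```python
import numpy as np

def bss(f, o):
    """Brier Skill Score against the in-sample climatological reference."""
    f, o = np.asarray(f, float), np.asarray(o, float)
    bs = np.mean((f - o) ** 2)
    bs_c = np.mean((o.mean() - o) ** 2)  # reference: always forecast the event rate
    return 1.0 - bs / bs_c

o = np.array([1, 0, 0, 0, 1, 0, 0, 0])    # event rate 0.25
print(bss(np.full(8, 0.25), o))           # -> 0.0 (parity with climatology)
print(bss(np.full(8, 0.9), o))            # negative: worse than climatology
```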
4. Application and BSS Results in Operational Solar Flare Forecasting
BSS serves as a primary performance metric in solar flare probabilistic forecast verification. Nishizuka et al. (2020) report, for the Deep Flare Net-Reliable (DeFN-R) DNN, $\mathrm{BSS} \approx 0.30$ for M-class and $\mathrm{BSS} \approx 0.41$ for C-class flare predictions, corresponding to a 30–41% reduction in mean-squared probability error relative to the constant climatological forecast. The practical implication is that forecast probabilities are well calibrated: predicted probabilities can be used directly for threshold decision-making, with reliable interpretation and robust discrimination (ROC AUC $\approx$ 0.93–0.96 for M-class, $\approx$ 0.89 for C-class) (Nishizuka et al., 2020).
McCloskey et al. (2018) compare static versus evolution-dependent Poisson forecast methods using McIntosh sunspot classifications. For C1.0 flares, the evolution-dependent method yields $\mathrm{BSS} = 0.09$, outperforming both climatology and static forecasts, which achieve $\mathrm{BSS} = -0.09$. Bias correction via empirical rate scaling further improves BSS to $0.20$ for evolution-dependent forecasts, demonstrating the critical interaction of calibration and cycle-specific base rates in skill assessments (McCloskey et al., 2018).
Table: Empirically Reported BSS Values
| Study | Task | Method | BSS |
|---|---|---|---|
| (Nishizuka et al., 2020) | C-class flare | DeFN-R (DNN) | 0.41 |
| (Nishizuka et al., 2020) | M-class flare | DeFN-R (DNN) | 0.30 |
| (McCloskey et al., 2018) | C1.0 flare | Static Poisson | –0.09 |
| (McCloskey et al., 2018) | C1.0 flare | Evolution-Poisson | 0.09 |
| (McCloskey et al., 2018) | C1.0 flare | Evo-Poisson (corr.) | 0.20 |
5. Forecast Calibration, Over-forecasting, and Bias Correction
Misalignment between forecast probabilities and observed event frequencies produces reliability errors that reduce BSS. Systematic over-forecasting, in which forecasts consistently exceed realized frequencies, inflates the reliability term and results in negative or negligible BSS. McCloskey et al. (2018) observe that over-forecasting in Solar Cycle 23 arises from training-period Poisson rates that do not account for a cycle-to-cycle drop in true flare rates (about 80% and 50% for C- and M-classes, respectively, between cycles). Applying a bias-correction scaling factor brings forecast probabilities closer to observed rates, sharply reducing the reliability penalty and increasing BSS while leaving resolution fixed (McCloskey et al., 2018).
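The effect of such a correction can be sketched as follows. The multiplicative rescaling here is a simplified stand-in for the empirical rate scaling in McCloskey et al. (2018), applied to synthetic over-forecasts of my own construction:

```python
import numpy as np

def bss(f, o):
    f, o = np.asarray(f, float), np.asarray(o, float)
    return 1.0 - np.mean((f - o) ** 2) / np.mean((o.mean() - o) ** 2)

def rate_scale(f, o):
    """Multiplicative bias correction: rescale forecasts so their mean matches
    the observed event rate (simplified stand-in for empirical rate scaling)."""
    f = np.asarray(f, float)
    return np.clip(f * (np.mean(o) / np.mean(f)), 0.0, 1.0)

# Over-forecasting toy system: mean forecast 0.52 vs. observed event rate 0.2
f = np.array([0.8, 0.8, 0.4, 0.4, 0.8, 0.4, 0.4, 0.4, 0.4, 0.4])
o = np.array([1,   0,   0,   0,   1,   0,   0,   0,   0,   0])
print(bss(f, o), bss(rate_scale(f, o), o))  # negative before, positive after
```

Because the rescaling is order-preserving, the ranking of cases (and hence resolution) is essentially untouched; only the reliability penalty shrinks.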
In DNN-based approaches such as DeFN-R, calibration is further enhanced by removing class-value weighting in the loss function and directly training for probabilistic accuracy. This yields near-diagonal reliability diagrams and maximizes BSS without loss of ROC performance (Nishizuka et al., 2020).
6. Threshold Selection and Decision-analytic Implications
For probability models optimized under BSS, the operational forecast threshold may be set at the climatological event rate, maximally leveraging calibration: for DeFN-R, the optimal threshold occurs near $\bar{o}$, which also maximizes the True Skill Statistic (TSS) under well-calibrated conditions (Nishizuka et al., 2020). This decouples the threshold from arbitrary conventions (such as a fixed $p = 0.5$) and allows users to set operational parameters according to risk tolerance, application constraints, or the desired trade-off between false alarms and missed events.
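A small sketch of this threshold behavior, using synthetic, roughly calibrated rare-event forecasts of my own construction (not DeFN-R output):

```python
import numpy as np

def tss(f, o, thr):
    """True Skill Statistic = hit rate - false alarm rate at a probability threshold."""
    f, o = np.asarray(f, float), np.asarray(o, int)
    pred = f >= thr
    tpr = pred[o == 1].mean()  # fraction of events correctly forecast
    fpr = pred[o == 0].mean()  # fraction of non-events falsely flagged
    return tpr - fpr

# Three forecast levels, each roughly matching its observed event frequency
f = np.array([0.05] * 20 + [0.3] * 5 + [0.7] * 3)
o = np.array([1] + [0] * 19 + [1, 1, 0, 0, 0] + [1, 1, 0])
obar = o.mean()  # climatological rate ~0.18

grid = np.linspace(0.01, 0.99, 99)
best_tss = max(tss(f, o, t) for t in grid)
print(tss(f, o, obar) > tss(f, o, 0.5))  # True: climatological threshold wins
```

For these calibrated forecasts, thresholding at the climatological rate attains the grid-maximum TSS, whereas the conventional $p = 0.5$ cutoff misses most events.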
A well-calibrated, high-BSS forecast system allows direct interpretability of predicted probabilities: stated model confidence aligns robustly with event realization frequencies, enabling reliable downstream applications.
7. Contextual Role of BSS in Model Development and Verification
BSS is explicitly adopted as an optimization target in recent DNN-based probabilistic forecasting for solar phenomena. Nishizuka et al. (2020) tune model hyperparameters, architectural details (e.g., batch normalization, skip connections), and training strategies to maximize BSS, thereby prioritizing both calibration and discrimination. Classical deterministic systems (optimizing thresholded classification scores) do not in general maximize BSS and may exhibit poor reliability diagrams.
Decomposition of BSS enables rigorous diagnostic analysis of model errors: reliability can be improved by post-hoc calibration or architectural/algorithmic changes, while resolution depends mainly on the model's ability to meaningfully stratify risk between cases. The combination of these effects, normalized by climatological uncertainty, is directly captured by BSS, supporting transparent benchmarking across probabilistic systems and forecast applications (Nishizuka et al., 2020; McCloskey et al., 2018).