Brier Skill Score in Forecasting
- Brier Skill Score is a metric that evaluates probabilistic forecasts by comparing predicted probabilities with observed outcomes against a climatological baseline.
- It decomposes mean forecast error into reliability, resolution, and uncertainty, providing insight into forecast calibration and practical model improvements.
- BSS is applied in operational settings such as solar flare prediction, where enhanced calibration and bias correction optimize decision thresholds and risk analysis.
The Brier Skill Score (BSS) is a quantitative metric for evaluating the performance of probabilistic forecasts relative to a reference forecast, typically climatology. BSS is widely used in the verification of event-probability forecasts, such as solar flare prediction, to quantify gains in forecast accuracy and reliability over a naïve, frequency-based baseline. By construction, BSS provides a standardized, skill-relative evaluation that decomposes mean probabilistic forecast error into contributions from calibration (reliability), forecast sharpness (resolution), and intrinsic climatological uncertainty. Representative treatments of BSS in operational probabilistic verification include Nishizuka et al. (2020) and McCloskey et al. (2018).
1. Formal Definition and Mathematical Formulation
Let $N$ denote the number of independent forecasts, $f_i$ the predicted probability for event $i$, and $o_i \in \{0, 1\}$ the binary observation indicator. The Brier Score (BS) is defined as

$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (f_i - o_i)^2.$$
The reference, or climatological, Brier Score ($\mathrm{BS}_c$) is computed by substituting the climatological event rate $\bar{o}$ (empirically, the frequency of event occurrence in the absence of informative prediction) for each probability forecast:

$$\mathrm{BS}_c = \frac{1}{N} \sum_{i=1}^{N} (\bar{o} - o_i)^2 = \bar{o}\,(1 - \bar{o}).$$
The Brier Skill Score is then

$$\mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_c}.$$
A BSS value of 1 indicates perfect forecasts ($\mathrm{BS} = 0$), zero indicates parity with climatology, and negative values indicate less skill than climatology (Nishizuka et al., 2020; McCloskey et al., 2018).
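As an illustrative numerical sketch of these definitions (toy data and function names of my own, not code from either cited study):

```python
import numpy as np

def brier_score(f, o):
    """Mean squared difference between forecast probabilities f and binary outcomes o."""
    f, o = np.asarray(f, dtype=float), np.asarray(o, dtype=float)
    return np.mean((f - o) ** 2)

def brier_skill_score(f, o):
    """BSS = 1 - BS / BS_c, where BS_c scores the constant climatological forecast."""
    o = np.asarray(o, dtype=float)
    clim = np.full_like(o, o.mean())  # always issue the empirical event rate
    return 1.0 - brier_score(f, o) / brier_score(clim, o)

# Toy sample: 8 forecasts, 3 observed events (climatology = 3/8)
f = [0.9, 0.8, 0.7, 0.2, 0.1, 0.1, 0.3, 0.2]
o = [1,   1,   1,   0,   0,   0,   0,   0]
print(round(brier_skill_score(f, o), 3))  # -> 0.824
```

Here the forecasts place high probability on the realized events, so BS is far below the climatological BS$_c$ and the skill score is close to 1.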
2. Decomposition: Reliability, Resolution, and Uncertainty
BS can be decomposed, following Murphy's separation, into three terms: reliability, resolution, and uncertainty. Let the continuous probability forecasts be grouped into $K$ bins, with bin $k$ containing $n_k$ forecasts, $\bar{f}_k$ the mean forecast probability, and $\bar{o}_k$ the observed event frequency in bin $k$. Climatology is given by $\bar{o}$, the overall event frequency.
Rewriting BS in these terms,

$$\mathrm{BS} = \underbrace{\frac{1}{N} \sum_{k=1}^{K} n_k (\bar{f}_k - \bar{o}_k)^2}_{\text{reliability}} - \underbrace{\frac{1}{N} \sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2}_{\text{resolution}} + \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}},$$

and, since $\mathrm{BS}_c = \bar{o}(1 - \bar{o})$,

$$\mathrm{BSS} = \frac{\text{resolution} - \text{reliability}}{\text{uncertainty}}.$$
A small reliability term (close agreement between predicted probabilities and observed event frequencies within bins) and a large resolution term (strong deviation of binwise observed frequencies from the climatological mean) both increase BSS (McCloskey et al., 2018).
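The decomposition can be verified numerically. The sketch below (my own illustration, not the authors' code) bins by distinct forecast values, for which the identity BS = reliability − resolution + uncertainty holds exactly:

```python
import numpy as np

def murphy_decomposition(f, o):
    """Murphy decomposition of the Brier score, with one bin per distinct
    forecast value so that BS = reliability - resolution + uncertainty exactly."""
    f, o = np.asarray(f, float), np.asarray(o, float)
    N, obar = len(f), o.mean()
    rel = res = 0.0
    for value in np.unique(f):
        mask = f == value
        n_k, o_k = mask.sum(), o[mask].mean()
        rel += n_k * (value - o_k) ** 2   # calibration error within the bin
        res += n_k * (o_k - obar) ** 2    # binwise departure from climatology
    return rel / N, res / N, obar * (1 - obar)

# Toy sample: forecasts take two values, 0.8 (4 cases) and 0.2 (6 cases)
f = np.array([0.8] * 4 + [0.2] * 6)
o = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])
rel, res, unc = murphy_decomposition(f, o)
bss = (res - rel) / unc  # equals 1 - BS / BS_c for this sample
```

The same BSS results whether computed from the decomposition or directly from the two Brier scores, which is what makes the decomposition useful as a diagnostic.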
3. Reference Forecast and Climatology in BSS Calculation
The reference, or climatological, forecast consists of always issuing the empirical event frequency $\bar{o}$ as the predicted probability for every event. In Nishizuka et al. (2020), for example, separate climatological rates are used for solar flares exceeding M-class and for C-class flares, derived from the total training and testing samples. In McCloskey et al. (2018), analogous climatological frequencies for C- and M-class flares are computed from the test period.
The reference forecast forms the baseline against which all forecast systems are evaluated. BSS=0 denotes no improvement over this baseline; negative BSS denotes inferior performance.
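A short sketch (illustrative code, my own naming) makes the baseline property concrete: issuing the climatological rate itself yields BSS = 0 by construction, while a miscalibrated constant forecast scores below climatology:

```python
import numpy as np

def bss(f, o):
    """Brier Skill Score against the in-sample climatological reference."""
    f, o = np.asarray(f, float), np.asarray(o, float)
    bs = np.mean((f - o) ** 2)
    bs_c = np.mean((o.mean() - o) ** 2)  # reference: always forecast the event rate
    return 1.0 - bs / bs_c

o = np.array([1, 0, 0, 0, 1, 0, 0, 0])    # event rate 0.25
print(bss(np.full(8, 0.25), o))           # -> 0.0 (parity with climatology)
print(bss(np.full(8, 0.9), o))            # negative: worse than climatology
```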
4. Application and BSS Results in Operational Solar Flare Forecasting
BSS serves as a primary performance metric in solar flare probabilistic forecast verification. Nishizuka et al. (2020) report, for the Deep Flare Net-Reliable (DeFN-R) DNN, $\mathrm{BSS} \approx 0.30$ for M-class and $\mathrm{BSS} \approx 0.41$ for C-class flare predictions, corresponding to a 30–41% reduction in mean-squared probability error relative to the constant climatological forecast. The practical implication is that forecast probabilities are well calibrated: predicted probabilities can be used directly for threshold decision-making, with reliable interpretation and robust discrimination (ROC AUC $\approx$ 0.93–0.96 for M-class, $\approx$ 0.89 for C-class) (Nishizuka et al., 2020).
McCloskey et al. (2018) compare static versus evolution-dependent Poisson forecast methods using McIntosh sunspot classifications. For C1.0 flares, the evolution-dependent method yields $\mathrm{BSS} = 0.09$, outperforming both climatology and static forecasts, which achieve $\mathrm{BSS} = -0.09$. Bias correction via empirical rate scaling further improves BSS to $0.20$ for evolution-dependent forecasts, demonstrating the critical interaction of calibration and cycle-specific base rates in skill assessments (McCloskey et al., 2018).
Table: Empirically Reported BSS Values
| Study | Task | Method | BSS |
|---|---|---|---|
| (Nishizuka et al., 2020) | C-class flare | DeFN-R (DNN) | 0.41 |
| (Nishizuka et al., 2020) | M-class flare | DeFN-R (DNN) | 0.30 |
| (McCloskey et al., 2018) | C1.0 flare | Static Poisson | –0.09 |
| (McCloskey et al., 2018) | C1.0 flare | Evolution-Poisson | 0.09 |
| (McCloskey et al., 2018) | C1.0 flare | Evo-Poisson (corr.) | 0.20 |
5. Forecast Calibration, Over-forecasting, and Bias Correction
Misalignment between forecast probabilities and observed event frequencies produces reliability errors that reduce BSS. Systematic over-forecasting, in which forecasts consistently exceed realized frequencies, inflates the reliability term and results in negative or negligible BSS. McCloskey et al. (2018) observe that over-forecasting in Solar Cycle 23 arises from training-period Poisson rates that do not account for a cycle-to-cycle drop in true flare rates (about 80% and 50% for C- and M-classes, respectively, between cycles). Applying a bias-correction scaling factor brings forecast probabilities closer to observed rates, sharply reducing the reliability penalty and increasing BSS while leaving resolution fixed (McCloskey et al., 2018).
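The effect of such a correction can be sketched as follows. The multiplicative rescaling here is a simplified stand-in for the empirical rate scaling in McCloskey et al. (2018), applied to synthetic over-forecasts of my own construction:

```python
import numpy as np

def bss(f, o):
    f, o = np.asarray(f, float), np.asarray(o, float)
    return 1.0 - np.mean((f - o) ** 2) / np.mean((o.mean() - o) ** 2)

def rate_scale(f, o):
    """Multiplicative bias correction: rescale forecasts so their mean matches
    the observed event rate (simplified stand-in for empirical rate scaling)."""
    f = np.asarray(f, float)
    return np.clip(f * (np.mean(o) / np.mean(f)), 0.0, 1.0)

# Over-forecasting toy system: mean forecast 0.52 vs. observed event rate 0.2
f = np.array([0.8, 0.8, 0.4, 0.4, 0.8, 0.4, 0.4, 0.4, 0.4, 0.4])
o = np.array([1,   0,   0,   0,   1,   0,   0,   0,   0,   0])
print(bss(f, o), bss(rate_scale(f, o), o))  # negative before, positive after
```

Because the rescaling is order-preserving, the ranking of cases (and hence resolution) is essentially untouched; only the reliability penalty shrinks.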
In DNN-based approaches such as DeFN-R, calibration is further enhanced by removing class-value weighting in the loss function and directly training for probabilistic accuracy. This yields near-diagonal reliability diagrams and maximizes BSS without loss of ROC performance (Nishizuka et al., 2020).
6. Threshold Selection and Decision-analytic Implications
For probability models optimized under BSS, the operational forecast threshold may be set at the climatological event rate, maximally leveraging calibration: for DeFN-R, the optimal threshold occurs near $\bar{o}$, which also maximizes the True Skill Statistic (TSS) under well-calibrated conditions (Nishizuka et al., 2020). This decouples the threshold from arbitrary conventions (such as a fixed $p = 0.5$) and allows users to set operational parameters according to risk tolerance, application constraints, or the desired trade-off between false alarms and missed events.
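A small sketch of this threshold behavior, using synthetic, roughly calibrated rare-event forecasts of my own construction (not DeFN-R output):

```python
import numpy as np

def tss(f, o, thr):
    """True Skill Statistic = hit rate - false alarm rate at a probability threshold."""
    f, o = np.asarray(f, float), np.asarray(o, int)
    pred = f >= thr
    tpr = pred[o == 1].mean()  # fraction of events correctly forecast
    fpr = pred[o == 0].mean()  # fraction of non-events falsely flagged
    return tpr - fpr

# Three forecast levels, each roughly matching its observed event frequency
f = np.array([0.05] * 20 + [0.3] * 5 + [0.7] * 3)
o = np.array([1] + [0] * 19 + [1, 1, 0, 0, 0] + [1, 1, 0])
obar = o.mean()  # climatological rate ~0.18

grid = np.linspace(0.01, 0.99, 99)
best_tss = max(tss(f, o, t) for t in grid)
print(tss(f, o, obar) > tss(f, o, 0.5))  # True: climatological threshold wins
```

For these calibrated forecasts, thresholding at the climatological rate attains the grid-maximum TSS, whereas the conventional $p = 0.5$ cutoff misses most events.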
A well-calibrated, high-BSS forecast system allows direct interpretability of predicted probabilities: stated model confidence aligns robustly with event realization frequencies, enabling reliable downstream applications.
7. Contextual Role of BSS in Model Development and Verification
BSS is explicitly adopted as an optimization target in recent DNN-based probabilistic forecasting for solar phenomena. Nishizuka et al. (2020) tune model hyperparameters, architectural details (e.g., batch normalization, skip connections), and training strategies to maximize BSS, thereby prioritizing both calibration and discrimination. Classical deterministic systems (optimizing thresholded classification scores) do not in general maximize BSS and may exhibit poor reliability diagrams.
Decomposition of BSS enables rigorous diagnostic analysis of model errors: reliability can be improved by post-hoc calibration or architectural/algorithmic changes, while resolution depends mainly on the model's ability to meaningfully stratify risk between cases. The combination of these effects, normalized by climatological uncertainty, is directly captured by BSS, supporting transparent benchmarking across probabilistic systems and forecast applications (Nishizuka et al., 2020; McCloskey et al., 2018).