Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance

Published 31 Mar 2026 in cs.CV and cs.LG | (2603.29941v1)

Abstract: Uncertainty Quantification (UQ) is crucial for ensuring the reliability of automated image segmentations in safety-critical domains like biomedical image analysis or autonomous driving. In segmentation, UQ generates pixel-wise uncertainty scores that must be aggregated into image-level scores for downstream tasks like Out-of-Distribution (OoD) or failure detection. Despite routine use of aggregation strategies, their properties and impact on downstream task performance have not yet been comprehensively studied. Global Average is the default choice, yet it does not account for spatial and structural features of segmentation uncertainty. Alternatives like patch-, class- and threshold-based strategies exist, but lack systematic comparison, leading to inconsistent reporting and unclear best practices. We address this gap by (1) formally analyzing properties, limitations, and pitfalls of common strategies; (2) proposing novel strategies that incorporate spatial uncertainty structure and (3) benchmarking their performance on OoD and failure detection across ten datasets that vary in image geometry and structure. We find that aggregators leveraging spatial structure yield stronger performance in both downstream tasks studied. However, the performance of individual aggregators depends heavily on dataset characteristics, so we (4) propose a meta-aggregator that integrates multiple aggregators and performs robustly across datasets.


Authors (10)

Summary

The paper introduces spatially-aware aggregation methods that enhance uncertainty representation by capturing localized structural features.
It benchmarks traditional intensity-based methods against a GMM-based meta-aggregator (GMM-All), which achieves higher AUROC for out-of-distribution detection.
The study provides actionable insights for improving failure detection and safety in segmentation applications through multi-feature aggregation.

Spatially-Aware Aggregation of Segmentation Uncertainty: Analysis and Benchmarking

Introduction and Motivation

Uncertainty Quantification (UQ) for image segmentation underpins model reliability in critical domains, such as medical image analysis and autonomous driving, where downstream decisions often rely on aggregated pixelwise uncertainties. The global average (AVG) is the default aggregation strategy (AggS) for condensing pixelwise uncertainties into the single scalar required by tasks such as out-of-distribution (OoD) and failure detection. However, AVG and similar methods are agnostic to the spatial and structural characteristics of uncertainty, and can suppress the informative localized or edge-centric uncertainty present in many real-world scenarios.

"Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance" (2603.29941) provides a formal theoretical and empirical analysis of existing and newly proposed aggregation strategies. The study benchmarks these strategies for OoD and failure detection across a broad spectrum of datasets and proposes a robust meta-aggregation approach that integrates both intensity-based and spatially-aware features.

Aggregation Strategies: Typology and Theoretical Properties

The paper categorizes AggSs into several families, with a focus on intensity-based (pixelwise) and prediction-based strategies, and introduces novel spatially-aware aggregators. The limitations of standard approaches are examined both conceptually and empirically. These include AVG’s inability to distinguish between spatial patterns of uncertainty, and the non-monotonicity and lack of proportion-invariance of various alternatives.

Figure 1: Illustration of aggregation strategies, their empirical performance, and characteristic limitations, including the insensitivity of AVG and non-monotonicity of threshold-based methods.

Intensity-based strategies (e.g., AVG, Above-Threshold Average ATA, Above-Quantile Average AQA, Patch-Level Maximum PLM) condense the pixelwise uncertainty map $U$ to a scalar via simple operations. Prediction-aware strategies leverage the predicted segmentation ($M$) to compute class-level or foreground-fraction-aware means (BCA, ICA, QFR), conferring improved proportion-invariance and sensitivity to relevant object regions.
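To make the families above concrete, the following is a minimal NumPy sketch of how such aggregators might look. The exact definitions are assumptions inferred from the strategy names, not taken from the paper: the threshold `tau`, the quantile `q`, the non-overlapping tiling in `plm`, and the helper `class_mean` (an unweighted mean of per-class means, standing in for the class-level idea rather than reproducing BCA or ICA) are all illustrative choices.

```python
import numpy as np

def avg(u):
    """Global average of the uncertainty map."""
    return float(u.mean())

def ata(u, tau=0.5):
    """Above-Threshold Average: mean of pixels with uncertainty > tau (0 if none)."""
    sel = u[u > tau]
    return float(sel.mean()) if sel.size else 0.0

def aqa(u, q=0.9):
    """Above-Quantile Average: mean of pixels at or above the q-quantile."""
    cut = np.quantile(u, q)
    return float(u[u >= cut].mean())

def plm(u, patch=2):
    """Patch-Level Maximum: largest mean over non-overlapping patch x patch tiles."""
    h, w = u.shape
    h, w = h - h % patch, w - w % patch  # crop so the tiles fit exactly
    tiles = u[:h, :w].reshape(h // patch, patch, w // patch, patch)
    return float(tiles.mean(axis=(1, 3)).max())

def class_mean(u, m):
    """Hypothetical class-level mean: average of per-class mean uncertainty,
    taken over the classes present in the predicted segmentation m."""
    return float(np.mean([u[m == c].mean() for c in np.unique(m)]))

# Toy map: a small, highly uncertain 2x2 region in an otherwise certain image.
u = np.zeros((4, 4))
u[:2, :2] = 0.8
m = (u > 0.5).astype(int)  # stand-in for a predicted foreground mask
```

On this toy map the global average is diluted by the large certain background, while the threshold-, quantile-, and patch-based variants report the intensity of the small uncertain region; the class-level mean sits in between because it weights the two classes equally regardless of their size.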

The authors highlight, for instance, the non-monotonicity of ATA—where uniformly increasing all uncertainties may result in a discontinuous decrease in the aggregated score as more pixels cross the threshold—thus undermining the fidelity of uncertainty ranking. Such properties are formally analyzed and visualized.

Figure 3: Demonstration of ATA's non-monotonicity: a uniform increment of uncertainty across all map pixels can paradoxically lower the aggregated score.
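The non-monotonic behaviour can be reproduced with a few numbers. The sketch below assumes ATA is the mean over pixels strictly above a threshold `tau` (an assumed definition, with an illustrative `tau` of 0.5) and adds the same increment to every pixel.

```python
import numpy as np

def ata(u, tau=0.5):
    """Assumed ATA: mean of the pixels whose uncertainty exceeds tau."""
    sel = u[u > tau]
    return float(sel.mean()) if sel.size else 0.0

# Three pixels sit just below the threshold, one is clearly above it.
u = np.array([0.45, 0.45, 0.45, 0.90])
shifted = u + 0.10  # every pixel becomes more uncertain

before = ata(u)        # only the 0.90 pixel is averaged
after = ata(shifted)   # the three 0.55 pixels now enter the average
print(before, after)
```

Before the shift only the single high pixel is averaged, giving 0.9. After the shift the three formerly sub-threshold pixels join the average and pull it down to roughly 0.66, even though the global mean of the map has increased.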

Spatial AggSs leverage established statistical measures for local structure:

Moran’s I (measuring spatial autocorrelation),
Edge Density Score (EDS) (density of edge-localized uncertainty), and
Shannon Entropy (local randomness).

The spatial mass ratio (SMR) formalism quantifies the fraction of uncertainty mass concentrated within high-structure regions, delivering interpretable values in $[0,1]$. These spatial measures are shown to be sensitive to clustered, edge-concentrated, or noisy uncertainty, enabling discrimination of distributional or semantic shifts often missed by intensity-only methods.

Figure 5: Spatial decomposition of uncertainty maps via Moran's I, highlighting the structural concentration of uncertainty mass.
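As an illustration of the spatial measures, the sketch below computes global Moran's I on a 2D uncertainty map using binary 4-neighbour (rook) weights, plus a hypothetical mass-ratio helper. The weighting scheme is an assumption, and `edge_mass_ratio` with its `grad_thresh` parameter is an illustrative stand-in for the idea of a mass ratio in $[0,1]$, not the paper's SMR or EDS definition.

```python
import numpy as np

def morans_i(u):
    """Global Moran's I with binary 4-neighbour weights (assumed weighting)."""
    z = u - u.mean()
    denom = (z ** 2).sum()
    if denom == 0:
        return 0.0  # constant map: statistic undefined, return 0 by convention
    # Sum of products over horizontally and vertically adjacent pixel pairs.
    pair_sum = (z[:, :-1] * z[:, 1:]).sum() + (z[:-1, :] * z[1:, :]).sum()
    n_pairs = z.shape[0] * (z.shape[1] - 1) + (z.shape[0] - 1) * z.shape[1]
    return float(z.size * pair_sum / (n_pairs * denom))

def edge_mass_ratio(u, grad_thresh=0.1):
    """Hypothetical mass ratio: share of total uncertainty lying on pixels
    whose gradient magnitude exceeds grad_thresh (u assumed non-negative)."""
    gy, gx = np.gradient(u)
    edge = np.hypot(gx, gy) > grad_thresh
    total = u.sum()
    return float(u[edge].sum() / total) if total > 0 else 0.0

# Clustered map: left half uncertain, right half certain.
clustered = np.zeros((4, 4))
clustered[:, :2] = 1.0

# Checkerboard map: same mean uncertainty, maximally alternating.
checker = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)
```

Both toy maps have the same global average of 0.5, yet Moran's I is positive for the clustered map and exactly -1 for the checkerboard, which is the kind of structural difference an average cannot express.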

Datasets and Empirical Diversity

The benchmark encompasses ten datasets spanning medical images, histopathology, urban street scenes, multispectral crop images, and microscopy. These are selected to maximize diversity in object geometry, texture, and distributional shift types (covariate, semantic). The spatial diversity of the resulting uncertainty maps is characterized using projections into the (MOR, EDS) plane, evidencing clear separability of datasets and underscoring the importance of spatial features for OoD detection.

Figure 6: Projection of dataset uncertainty maps into the spatial (MOR, EDS) space, revealing distinct structural clustering and edge-localized patterns in OoD versus in-distribution (IID) data.

Benchmarking Aggregation Strategies for Downstream Tasks

Out-of-Distribution Detection

Effectiveness is quantified via AUROC, which measures how well the aggregated scores separate IID from OoD samples. The study demonstrates that prediction-aware strategies (BCA, ICA) and the meta-aggregation approach (GMM-All) consistently and statistically significantly outperform traditional pixelwise averages, particularly on datasets where uncertainty is structurally or semantically localized.
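For reference, AUROC on image-level scores can be computed directly from its rank interpretation: the probability that a randomly chosen OoD image receives a higher score than a randomly chosen IID image. The sketch below is a plain NumPy version of that definition; the function name and the toy scores are illustrative, and the pairwise comparison is only suitable for modest sample sizes.

```python
import numpy as np

def auroc(scores_iid, scores_ood):
    """P(score_ood > score_iid) over all IID/OoD pairs; ties count as 0.5."""
    iid = np.asarray(scores_iid, dtype=float)[:, None]
    ood = np.asarray(scores_ood, dtype=float)[None, :]
    wins = (ood > iid).sum() + 0.5 * (ood == iid).sum()
    return float(wins / (iid.size * ood.size))

# Toy aggregated scores (made up for illustration).
separable = auroc([0.1, 0.2], [0.3, 0.4])    # every OoD score is higher
overlapping = auroc([0.1, 0.4], [0.2, 0.3])  # half of the pairs are ordered correctly
```

A value of 1.0 means the aggregator ranks every OoD image above every IID image, while 0.5 corresponds to the random-level performance mentioned in this section.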

Figure 7: Aggregation strategy AUROC rankings for OoD detection; GMM-based and prediction-aware strategies form the statistically top tier.

A key empirical finding is the marked failure of AVG-based approaches in 6/10 benchmark settings, at times approximating random performance. In contrast, spatially-sensitive AggSs and GMM-based meta-aggregation more reliably separate IID and OoD cases—especially when the OoD shift manifests in localized uncertainty structures.

Robustness and Meta-Aggregation via GMM

The authors introduce a meta-aggregation strategy, GMM-All, which fits a Gaussian Mixture Model over the combined feature space formed by several individual AggS outputs per sample. This mechanism captures multimodal structure in that space and preserves the complementarity between diverse aggregate features.
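A minimal sketch of this kind of meta-aggregation, using scikit-learn's `GaussianMixture`, might look as follows. Several details are assumptions rather than the paper's protocol: the mixture is fitted on IID reference features only, the features are standardized first, the image-level score is the negative log-likelihood, and the synthetic three-column feature matrices stand in for real aggregator outputs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical features: one row per image, one column per aggregator output.
# Column 0 plays the role of an average that does not change under the shift.
feats_iid = rng.normal(loc=[0.2, 0.5, 0.1], scale=0.05, size=(200, 3))
feats_ood = rng.normal(loc=[0.2, 0.8, 0.4], scale=0.05, size=(50, 3))

# Standardize with statistics from the reference (IID) data only.
mu, sd = feats_iid.mean(axis=0), feats_iid.std(axis=0)

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit((feats_iid - mu) / sd)

def gmm_score(feats):
    """Assumed meta-score: negative log-likelihood under the fitted mixture,
    so that unusual feature combinations receive high scores."""
    return -gmm.score_samples((feats - mu) / sd)

iid_scores = gmm_score(feats_iid)
ood_scores = gmm_score(feats_ood)
```

In this toy setup the first column alone cannot separate the two groups, but the joint density over all columns assigns much higher scores to the shifted samples. The number of components and the covariance type are tuning choices and would need to be checked against the amount of available reference data.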

Ablation studies reveal that utilizing all features provides equal or improved AUROC compared to reduced or individual-feature GMMs in the majority of benchmarks, except where a single feature dominates discriminative power. SHAP analysis of the GMM indicates dataset-specific dominating contributors (e.g., EDS for urban scene shifts), reinforcing the need for multi-feature aggregation.

Figure 2: Analysis of GMM meta-aggregation robustness, including individual feature ablations and SHAP value attribution for discriminative AggSs by dataset.

Failure Detection

For failure detection, a selective classification metric is used: the Excess Area Under the Risk-Coverage curve (E-AURC). Again, traditional AVG underperforms across most datasets, often ranking at the bottom. Prediction-aware AggSs (QFR, BCA) and GMM meta-aggregates achieve superior alignment between uncertainty and segmentation errors, minimizing excess risk when discarding high-uncertainty predictions.
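The risk-coverage idea can be sketched in a few lines. The version below assumes a per-image risk such as 1 - Dice (an assumption; any per-image error would work), evaluates selective risk at every coverage level k/n, and defines the excess as the gap to an oracle that ranks images by their true risk. The function names are illustrative.

```python
import numpy as np

def aurc(uncertainty, risk):
    """Area under the risk-coverage curve, averaged over coverages 1/n .. n/n."""
    order = np.argsort(uncertainty, kind="stable")  # most confident images first
    sorted_risk = np.asarray(risk, dtype=float)[order]
    # Selective risk at coverage k/n: mean risk of the k most confident images.
    selective = np.cumsum(sorted_risk) / np.arange(1, sorted_risk.size + 1)
    return float(selective.mean())

def e_aurc(uncertainty, risk):
    """Excess AURC: gap between the given ranking and the oracle ranking."""
    risk = np.asarray(risk, dtype=float)
    return aurc(uncertainty, risk) - aurc(risk, risk)

risk = [0.0, 0.1, 0.5]  # e.g. 1 - Dice per image (toy values)
perfect = e_aurc([0.1, 0.2, 0.9], risk)   # uncertainty ordered like the risk
inverted = e_aurc([0.9, 0.2, 0.1], risk)  # uncertainty ordered against the risk
```

An aggregator whose scores rank images exactly by their error reaches an excess of 0, and the excess grows as confident predictions turn out to be wrong, which is why lower values indicate better failure detection.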

Figure 4: Selective risk-coverage and E-AURC benchmarking for failure detection; QFR, BCA, and GMM-based AggSs consistently achieve lower risk.

Notably, object-size-independent thresholding (QFR) excels in targeting boundary-associated errors, outperforming class-proportion-weighted averages (ICA) which suffer in small-object-centric scenarios.

Broader Implications and Limitations

The comprehensive analysis demonstrates that no single AggS is universally preferable: the optimal aggregation function is highly task- and dataset-dependent, primarily reflecting the foreground structure and the spatial distribution of uncertainty. Reliance on AVG as a default is explicitly contraindicated by empirical evidence. Instead, incorporating spatially-aware and prediction-sensitive aggregation, or adopting meta-aggregation (GMM-All) in the absence of detailed dataset knowledge, is substantiated as a robust practical recommendation.

Practical implications include:

Enhanced reliability for safety-critical deployment of segmentation models by improved detection of risky (OoD or error-prone) predictions.
Generalizable, model-agnostic meta-aggregation pipelines that can flexibly accommodate heterogeneous uncertainty patterns.
Foundations for future development of automated aggregation-selection policies, and for interpretable, context-sensitive downstream use of segmentation uncertainty.

Limitations are acknowledged, such as the statistical constraints of GMMs in low-data or high-dimensional settings and the potential benefit of alternative copula-based meta-aggregation frameworks. Further, extensions to 3D segmentation, multi-modal data, and integration of additional uncertainty modalities (e.g., energy-based methods) constitute promising directions for subsequent research.

Conclusion

This work shows that aggregation strategies which account for the spatial structure of segmentation uncertainty generally outperform intensity-based baselines in both out-of-distribution and failure detection. The aggregation strategy should be selected to respect dataset-specific spatial structure; where this is infeasible, meta-aggregation across diverse features is empirically validated as a robust alternative. These findings provide actionable guidance for improving the reliability of uncertainty-informed segmentation systems and motivate continued development of both theoretical and meta-learning-based aggregation strategies for UQ in computer vision.
