UAGLNet: Uncertainty-Aggregated Global-Local Fusion Network with Cooperative CNN-Transformer for Building Extraction

Published 15 Dec 2025 in cs.CV | (2512.12941v1)

Abstract: Building extraction from remote sensing images is a challenging task due to the complex structure variations of the buildings. Existing methods employ convolutional or self-attention blocks to capture the multi-scale features in the segmentation models, while the inherent gap of the feature pyramids and insufficient global-local feature integration leads to inaccurate, ambiguous extraction results. To address this issue, in this paper, we present an Uncertainty-Aggregated Global-Local Fusion Network (UAGLNet), which is capable to exploit high-quality global-local visual semantics under the guidance of uncertainty modeling. Specifically, we propose a novel cooperative encoder, which adopts hybrid CNN and transformer layers at different stages to capture the local and global visual semantics, respectively. An intermediate cooperative interaction block (CIB) is designed to narrow the gap between the local and global features when the network becomes deeper. Afterwards, we propose a Global-Local Fusion (GLF) module to complementarily fuse the global and local representations. Moreover, to mitigate the segmentation ambiguity in uncertain regions, we propose an Uncertainty-Aggregated Decoder (UAD) to explicitly estimate the pixel-wise uncertainty to enhance the segmentation accuracy. Extensive experiments demonstrate that our method achieves superior performance to other state-of-the-art methods. Our code is available at https://github.com/Dstate/UAGLNet

Abstract PDF Upgrade to Chat

Summary

The paper introduces a cooperative encoder that synergizes CNN and transformer layers for enhanced global-local feature integration.
It leverages multi-kernel feature modulation and an uncertainty-aggregated decoder to refine pixel-level building segmentation.
Empirical results demonstrate improved IoU and F1 scores across key datasets, confirming its efficiency and robustness.

UAGLNet: Uncertainty-Aggregated Global-Local Fusion Network with Cooperative CNN-Transformer for Building Extraction

Introduction

The presented work introduces UAGLNet, an advanced deep learning architecture targeting building extraction from high-resolution remote sensing (RS) imagery. The challenge in building extraction stems from complex structure variations, occlusion, and inherent ambiguity in RS images. UAGLNet addresses deficiencies in prior CNN/ViT and hybrid approaches related to the feature pyramid gap and inadequate global-local feature integration by synergizing hybrid CNN and transformer layers, guided by uncertainty modeling. The proposal consists of three core innovations: (a) a Cooperative Encoder architecture employing hybrid CNN-transformer operators; (b) a Global-Local Fusion module facilitating hierarchical integration; and (c) an Uncertainty-Aggregated Decoder providing pixel-wise uncertainty-aware refinement.

Methodological Framework

Cooperative Encoder Architecture

UAGLNet's backbone employs a four-stage cooperative encoder (CE) that hierarchically partitions feature learning between CNN blocks (early stages) and transformer blocks (deep stages). This partitioning exploits multi-kernel feature modulation (MKFM) for spatially diverse local feature extraction. In intermediate stages, the Cooperative Interaction Block (CIB) enables global-local interaction, featuring stacked hybrid convolutional operations interleaved with multi-head self-attention mechanisms. Residual connections and depth-wise separable convolutions maintain computational tractability.

Figure 1: (a) Multi-kernel feature modulation for local information; (b) CIB structure, stacking CNN and transformer layers for global-local information interaction.

The pipeline refines feature pyramids, balancing fine-scale localization and broad contextual semantics, and narrows the global-local gap by dynamically merging multi-scale receptive fields.

Global-Local Fusion (GLF)

GLF constructs complementary global and local representations, leveraging both CIB outputs and terminal MHSA features. Hierarchical features are upsampled, convolved, and concatenated to enforce rich propagation pathways. This stratified fusion mitigates detail loss, sustains object boundary integrity, and provides the encoder-decoder bridge with maximally expressive features for subsequent uncertainty modeling.

Uncertainty-Aggregated Decoder (UAD)

UAD introduces explicit pixel-wise uncertainty estimation, decomposing outputs into local ( $\mathbf{U}_L$ ) and global ( $\mathbf{U}_G$ ) uncertainty maps. Through Gaussian variational sampling and reparameterization, the model quantifies prediction turbulence at the pixel level. These uncertainty maps act as attenuation weights for feature integration, effectively suppressing ambiguous foreground predictions and enhancing segmentation robustness in challenging regions. The loss function is a weighted sum of dice, binary cross-entropy, boundary, and Kullback-Leibler divergence terms, ensuring tight supervision over both segmentation and uncertainty fidelity.

Empirical Results and Analysis

Quantitative Performance

UAGLNet demonstrates superior numerical results across Massachusetts, Inria, and WHU building datasets, outperforming contemporary CNN, transformer, and hybrid models. On the Massachusetts dataset, it delivers an IoU of 76.97%, F1 score of 86.99%, precision of 88.28%, and recall of 85.73%, surpassing BuildFormer and UANet by measurable margins. On the Inria Aerial dataset, it achieves 83.74% IoU and 91.15% F1, requiring only 28.90G FLOPs and 15.34M parameters—marked reductions in computational demand relative to previous architectures.

Qualitative Improvements

Visualizations reveal that UAGLNet retains superior building boundary integrity, correctly segments occluded or ambiguous structures, and is less prone to false positives in noisy or low-resolution contexts. The integration of local features preserves granularity, while the uncertainty-guided fusion suppresses erroneous activations.

(Figure 2)

Figure 2: UAGLNet overview illustrating complementary local and global representations guided by uncertainty modeling.

Ablation and Architectural Analysis

Ablation studies confirm the contributions of MKFM, CIB, GLF, and UAD. Incorporation of CIB and GLF yields consistent gains in IoU and F1 metrics, with further improvements obtained via uncertainty aggregation in the decoder. Comparative analysis with parallel, sequential, and alternative hybrid architectures demonstrates that the proposed cooperative design offers the best global-local balancing, improving semantic gap closure.

Efficiency, Generalization, and Robustness

UAGLNet achieves real-time inference (27.53 FPS) and demonstrates robust generalization in cross-domain settings (e.g., training on Inria, testing on WHU, experiencing minimal performance degradation relative to baselines). High-resolution testing evidences substantial computational saving while maintaining accuracy. UAD enhances resilience to synthetic noise and resolution degradation, a critical property for deployment on diverse RS platforms.

Implications and Future Directions

Practically, UAGLNet sets a new standard for building extraction pipelines, providing more reliable, uncertainty-aware segmentation suited for urban planning, mapping, and GIS. Theoretically, the cooperative encoder-decoder paradigm and uncertainty aggregation could be generalized to other RS segmentation tasks, and adapted for joint learning in multi-modal, multi-task settings. Prospective advancements include integration with super-resolution modules, further model compression, and extension to semantic tasks beyond building extraction.

Conclusion

UAGLNet advances building extraction by integrating hybrid CNN-transformer operators for balanced global-local semantics, hierarchical fusion, and explicit uncertainty attenuation. The cooperation between multi-scale feature modulation and transformer attention, coupled with pixel-wise uncertainty refinement, yields state-of-the-art performance with efficiency and generalization, and establishes a scalable paradigm for robust RS image analysis.

Markdown Report Issue