Segmentation-Guided CXR Classification Pipeline
- The paper introduces a multi-stage pipeline where a semantic segmentation network isolates lung regions before a deep classifier assesses abnormalities.
- The method enhances interpretability by aligning AI-generated heatmaps with clinically relevant regions and reduces bias from extraneous image features.
- Empirical results demonstrate improved lesion localization and model generalization across diverse thoracic pathologies including COVID-19 and tuberculosis.
A segmentation-guided CXR (chest X-ray) classification pipeline integrates explicit anatomical region extraction—typically the lung fields—into the process of automated abnormality classification or disease screening. This paradigm aims to improve model robustness, interpretability, and generalization to diverse clinical and technical settings by enforcing anatomical priors at the image pre-processing or feature-extraction stages. Segmentation-guided methodology is characterized by a two-stage or multi-stage architecture: a dedicated semantic segmentation network first delineates the region(s) of interest, and downstream classifiers are subsequently trained on the cropped or masked region(s), rather than the full radiograph. Recent advances demonstrate that such pipelines improve lesion localization, reduce bias from spurious image features, and enable more reliable model explanations across a range of CXR-based diagnostic tasks (Teixeira et al., 2020, Liu et al., 2018, Miao et al., 28 Dec 2025, Zifei et al., 19 Dec 2025, Abdulah et al., 2021, Azimi et al., 2022).
1. General Pipeline Structure and Rationale
A typical segmentation-guided CXR classification pipeline is modular, comprising at least two distinct computational stages:
- Semantic Segmentation: A neural network, usually based on U-Net or its transformer/attention-based variants, is trained to produce high-fidelity masks for one or more anatomical structures—most frequently the lungs, but also heart and bones in multi-organ applications.
- Region Extraction: The mask is post-processed (e.g., morphological operations, bounding box calculation, ROI definition) to derive either a tight/loose crop, a masked image (zeroing non-lung regions), or a set of regional sub-images (e.g., upper/middle/lower thirds).
- Classification: The downstream convolutional neural network (CNN) or alternative backbone (e.g., MoCo encoder, DenseNet, ResNet, Inception) receives as input the segmented crop/mask, potentially concatenated with the original image, and is trained to detect target labels (disease presence, multi-label abnormality, etc.).
- Interpretability & Bias Mitigation (optional): Explainable AI techniques (LIME, Grad-CAM) are used to validate that model decisions are spatially and anatomically plausible, and source-bias ablation is assessed through cross-dataset or cross-source testing (Teixeira et al., 2020, Liu et al., 2018, Azimi et al., 2022).
The principal motivation is to “focus” the classifier on regions with genuine clinical relevance, reducing sensitivity to irrelevant confounders such as burned-in annotations, positioning artefacts, or non-pulmonary anatomy. This is corroborated by repeated empirical findings: segmentation reduces model reliance on spurious background cues and increases the congruence between AI-generated saliency maps and expert-defined relevant CXR regions (Teixeira et al., 2020, Azimi et al., 2022, Liu et al., 2018, Abdulah et al., 2021, Miao et al., 28 Dec 2025).
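The stages above can be sketched end-to-end. The following minimal NumPy example uses stand-in functions (`segment_lungs`, `extract_roi`, and `classify` are hypothetical placeholders for the trained networks, not components from any cited pipeline) to show how the mask gates what the classifier sees:

```python
import numpy as np

def segment_lungs(image):
    """Stand-in for a trained U-Net: here a simple intensity threshold
    produces a binary lung mask (1 = lung, 0 = background)."""
    return (image > image.mean()).astype(np.uint8)

def extract_roi(image, mask, margin=8):
    """Region extraction: bounding box of the mask, padded by a margin."""
    ys, xs = np.nonzero(mask)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, image.shape[0])
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, image.shape[1])
    return image[y0:y1, x0:x1]

def classify(roi):
    """Stand-in for the downstream CNN: returns a dummy abnormality score."""
    return float(roi.mean())

image = np.random.rand(256, 256)   # placeholder CXR
mask = segment_lungs(image)
score = classify(extract_roi(image, mask))
```

In a real pipeline the threshold would be a segmentation network and the scoring function a CNN; the control flow, however, is exactly this: segment, derive a region, classify the region rather than the full radiograph.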
2. Segmentation Module: Architectures, Preprocessing, and Evaluation
State-of-the-art CXR segmentation modules leverage U-Net backbones, enhancements with attention (XLSor with Criss-Cross Attention), transformers (MedT), or foundation models fine-tuned for medical imaging (MedSAM). Architectures such as nnU-Net and Res-CR-Net further accommodate scale, view, and anatomical variability (Miao et al., 28 Dec 2025, Abdulah et al., 2021, Capellán-Martín et al., 2023, Zifei et al., 19 Dec 2025).
- Architectural Details: The classic U-Net design consists of a contracting path (encoder) coupled with an expanding path (decoder), employing skip-connections to maintain high-resolution spatial features. Variants employ batch normalization, dropout, and architectural blocks such as separable atrous convolutions or criss-cross attention modules (Teixeira et al., 2020, Azimi et al., 2022, Abdulah et al., 2021, Capellán-Martín et al., 2023).
- Loss Functions: Training objectives typically combine binary cross-entropy (BCE) and Dice loss, though more specialized losses such as Tanimoto or mean-squared-error are used in some settings. Post-processing steps include morphological operations (erosion, dilation, opening) to refine mask boundaries and suppress noise (Teixeira et al., 2020, Miao et al., 28 Dec 2025, Abdulah et al., 2021).
- Data Augmentation and Synthesis: Domain randomization (AnyCXR), on-the-fly affine and intensity perturbations (shift-scale-rotate, brightness/contrast jitter) are standard to ensure segmentation robustness to real-world variability (Zifei et al., 19 Dec 2025, Teixeira et al., 2020, Abdulah et al., 2021).
- Evaluation Metrics: Segmentation fidelity is assessed using the Jaccard distance (1 − |A ∩ B| / |A ∪ B|), Dice coefficient (2|A ∩ B| / (|A| + |B|)), Intersection over Union (IoU), precision, recall, and average surface distance. Reported Dice scores for competent segmentation modules typically exceed 0.96 on held-out CXR datasets (Teixeira et al., 2020, Abdulah et al., 2021, Azimi et al., 2022, Capellán-Martín et al., 2023).
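As a concrete reference for these objectives and metrics, here is a minimal NumPy sketch of the Dice coefficient, IoU, and a combined BCE + soft-Dice loss (the formulas are the standard definitions; function names and the additive combination are illustrative, not taken from any cited paper):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) over binary masks."""
    inter = np.sum(pred * target)
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou(pred, target, eps=1e-7):
    """Intersection over Union (Jaccard index); Jaccard distance = 1 - IoU."""
    inter = np.sum(pred * target)
    union = pred.sum() + target.sum() - inter
    return (inter + eps) / (union + eps)

def bce_dice_loss(prob, target, eps=1e-7):
    """Combined BCE + soft-Dice training objective: BCE drives pixelwise
    calibration, while (1 - soft Dice) directly rewards region overlap."""
    prob = np.clip(prob, eps, 1 - eps)
    bce = -np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob))
    soft_dice = (2 * np.sum(prob * target) + eps) / (prob.sum() + target.sum() + eps)
    return bce + (1.0 - soft_dice)
```

In training frameworks these operate on predicted probability maps rather than hard masks; the epsilon terms guard against division by zero on empty masks.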
3. Incorporation of Segmentation into Classification
Segmentation can condition classification in several canonical ways:
- Cropped Inputs: The classifier is trained solely on the lung-cropped regions defined by the segmentation mask, achieving higher effective resolution and removing background distractions (Teixeira et al., 2020, Liu et al., 2018, Azimi et al., 2022).
- Mask Multiplication: Input images are pixelwise-multiplied by the mask, zeroing non-lung regions. Mask dilation/erosion allows "tight" or "loose" masking, trading off specificity of region focus and retention of contextual cues (Miao et al., 28 Dec 2025).
- Dual-Stream Fusion: Architectures such as SDFN incorporate both global (full image) and local (lung-cropped) pathways, concatenating their features prior to classification. Feature fusion consistently increases AUC across multiple diseases (Liu et al., 2018).
- Patch Decomposition: To improve localization, some workflows tile the lung region into overlapping patches, assigning patch-level labels for detailed region-specific training and interpretability (Tai et al., 2021).
- Mask Stacking: Multi-organ segmentation enables stacking multiple anatomical masks (lungs, heart, bones) as channels to inform the classifier of spatial priors and enhance multi-pathology detection (Zifei et al., 19 Dec 2025).
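The SDFN-style dual-stream idea reduces to feature concatenation ahead of a shared multi-label head. A minimal NumPy sketch, with crude summary statistics standing in for CNN features and an untrained linear head (all names, shapes, and the statistics themselves are illustrative, not from the cited work):

```python
import numpy as np

def branch_features(image):
    """Stand-in for one CNN stream's pooled feature vector
    (e.g. a DenseNet-121 global-pool layer); here, summary statistics."""
    return np.array([image.mean(), image.std(), np.percentile(image, 90)])

def fused_prediction(full_image, lung_crop, weights, bias):
    """Dual-stream fusion: concatenate global (full image) and local
    (lung crop) features, then apply a linear multi-label head."""
    feats = np.concatenate([branch_features(full_image),
                            branch_features(lung_crop)])
    logits = weights @ feats + bias
    return 1.0 / (1.0 + np.exp(-logits))  # per-disease sigmoid probabilities

rng = np.random.default_rng(0)
full = rng.random((256, 256))                   # placeholder CXR
crop = full[64:192, 48:208]                     # lung bounding box from a mask
W, b = rng.normal(size=(14, 6)), np.zeros(14)   # 14 labels, as in ChestX-ray14
probs = fused_prediction(full, crop, W, b)
```

The key design point is that the head sees both streams jointly, so gradients can trade off global context against high-resolution intra-lung detail.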
Segmentation granularity and strictness (tight/loose masking) are tunable hyperparameters: tight masking increases focus but can degrade performance for pathologies with peri-lung or mediastinal manifestations; loose masking can preserve relevant context and substantially enhance discrimination of "normal" (no finding) cases (Miao et al., 28 Dec 2025). Models with dual pathways or multi-channel fusion can capture both high-resolution intra-lung patterns and broader diagnostic context.
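Tight versus loose masking can be illustrated directly: dilating the lung mask before pixelwise multiplication retains peri-lung context. A NumPy-only sketch (the shift-based dilation ignores border wrap-around for brevity; `masked_input` and `dilate` are hypothetical helpers, not functions from the cited work):

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 cross structuring element via array
    shifts (edge wrap-around from np.roll is ignored for brevity)."""
    out = mask.astype(bool)
    for _ in range(iterations):
        shifts = [np.roll(out, s, axis=a) for a in (0, 1) for s in (-1, 1)]
        out = out | shifts[0] | shifts[1] | shifts[2] | shifts[3]
    return out

def masked_input(image, lung_mask, loose_iters=0):
    """Tight masking (loose_iters=0) zeroes everything outside the lungs;
    loose masking dilates the mask first to keep peri-lung context."""
    mask = dilate(lung_mask, loose_iters) if loose_iters else lung_mask.astype(bool)
    return image * mask
```

The number of dilation iterations is exactly the tight/loose knob discussed above: more iterations admit more mediastinal and pleural context at the cost of diluting the lung-focused prior.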
4. Empirical Results: Classification Performance and Ablation
Segmentation-guided pipelines have been quantitatively benchmarked across COVID-19 identification, multi-label thoracic disease screening, nodule detection, and pediatric tuberculosis region extraction:
- Multi-class and Binary COVID-19 Classification: U-Net + InceptionV3 pipelines achieve macro F1=0.88 on segmented lungs and F1(COVID)=0.83; non-segmented inputs yield slightly higher F1 (0.90 macro, 0.86 COVID), but segmentation significantly improves the anatomical focus of model explanations (Teixeira et al., 2020).
- Multi-label Disease Detection: SDFN fusion achieves mean AUC=0.815 over 14 NIH ChestX-ray14 pathologies, outperforming single-bbox or whole-image approaches at p<0.01. Notably, the highest AUC gains are seen in small-lesion pathologies such as nodule and mass detection (Liu et al., 2018).
- Patch-wise Nodule Detection: Segmentation followed by patch extraction and ResNet34 patch classification achieves sensitivity=0.78, specificity=0.79, AUROC=0.837 overall, with superior performance on standard (non-difficult) cases (Tai et al., 2021).
- Pediatric TB Region Standardization: High-quality segmentation (Dice≈0.97 for nnU-Net) enables standardized extraction of sub-lobar and mediastinal ROIs for downstream analysis, with direct impact on future region-guided TB detectors (Capellán-Martín et al., 2023).
- Anatomy-aware Classification: AnyCXR's multi-organ masks, when concatenated with original images, increase mean AUROC to 82.3% (vs. 80.26% for raw CXR baseline), with pathology-specific improvements exceeding 3% in atelectasis, pleural thickening, and cardiomegaly (Zifei et al., 19 Dec 2025).
Cross-dataset and ablation studies establish that segmentation-guided models show consistent out-of-domain robustness, lower misattribution to non-lung regions, and superior generalization compared to classifiers trained on original CXRs without mask-based focus (Azimi et al., 2022, Teixeira et al., 2020).
5. Model Interpretability and Dataset Bias
The integration of segmentation enhances the spatial validity of post-hoc explainable AI methods:
- Heatmap Attribution: Grad-CAM and LIME analyses reveal that models trained on full images often ascribe decisions to irrelevant regions (e.g., image borders, external labels). When segmentation is applied, these saliency maps are tightly colocalized with the pulmonary regions, congruent with expert annotations (Teixeira et al., 2020, Azimi et al., 2022, Liu et al., 2018, Abdulah et al., 2021).
- Bias Reduction: While segmentation reduces bias by suppressing dataset-specific cues (e.g., scanner vendor marks, text), experiments demonstrate that residual dataset bias may persist. Thus, segmentation must be accompanied by rigorous cross-dataset validation, and explainability analyses are critical for confirming that classifier focus corresponds to plausible clinical features (Teixeira et al., 2020, Miao et al., 28 Dec 2025).
- High-Resolution Localization: Designs with no spatial pooling (CXR-Net) allow fine-grained Grad-CAM heatmaps, and the masking step ensures that saliency is confined to lung parenchyma, providing both interpretability and biologically plausible lesion localization (Abdulah et al., 2021).
A plausible implication is that anatomical masking is a necessary—but not always sufficient—step toward mitigating spurious model reliance and dataset bias. Segmentation should be considered a component within a broader interpretability and external validation framework.
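The Grad-CAM weighting itself is simple once a framework's autodiff supplies a layer's activations and their gradients with respect to the class score; below is a NumPy sketch of that step, plus lung-restricted saliency in the spirit of the masked pipelines (function names are illustrative):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM: channel weights are the globally averaged gradients;
    the map is the ReLU of the weighted sum of activation channels.
    activations, gradients: arrays of shape (C, H, W)."""
    weights = gradients.mean(axis=(1, 2))             # alpha_c = GAP of dY/dA_c
    cam = np.tensordot(weights, activations, axes=1)  # sum_c alpha_c * A_c
    cam = np.maximum(cam, 0)                          # ReLU keeps positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1]
    return cam

def lung_restricted_cam(cam, lung_mask):
    """Confine saliency to the segmented lungs, as masked pipelines do."""
    return cam * lung_mask
```

In segmentation-guided models the masking is implicit (the network never sees non-lung pixels), but applying the lung mask to the heatmap makes the same anatomical restriction explicit for visual audit.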
6. Task-Dependent Design Decisions and Limitations
Recent research emphasizes the task-, architecture-, and dataset-dependent nature of segmentation-guided strategies:
- Mask Strictness: Masking tightly around the lungs may degrade performance on abnormalities with peri-hilar, pleural, or mediastinal manifestations, and can reduce abnormality-specific AUROC. Loose masking can recover much of this loss and improve "No Finding" discrimination, highlighting the need to select mask parameters in accordance with the clinical screening objective and CNN backbone (Miao et al., 28 Dec 2025).
- Multi-organ Segmentation and Generalizability: AnyCXR demonstrates that segmentation models trained entirely on domain-randomized synthetic data can still improve real-world CXR multi-label diagnosis by 2.04% mean AUROC, implying that anatomical priors are transferable and generalize across projection angle, hospital, and patient population (Zifei et al., 19 Dec 2025).
- Modular versus End-to-End Training: Most segmentation-guided pipelines train segmentation and classification sequentially. A future direction, enabled by architectures such as AnyCXR-CAR, is full end-to-end joint optimization, potentially improving downstream task metrics further (Zifei et al., 19 Dec 2025).
- Interpretability vs. Raw Metrics: In several studies, segmentation does not always yield higher F1/AUC compared to raw-image-trained baselines on in-domain data. However, segmentation consistently improves the fidelity of model explanations, spatial plausibility, and out-of-distribution robustness (Teixeira et al., 2020, Azimi et al., 2022, Liu et al., 2018).
7. Representative Pipelines and Quantitative Summary
| Pipeline / Study | Segmentation Backbone | Classification Backbone | Segmentation Dice (test) | Macro AUC/F1/AUROC | Remarks |
|---|---|---|---|---|---|
| (Teixeira et al., 2020) | U-Net, BCE loss, 400×400 | InceptionV3, ResNet50V2 | 0.982 | F1=0.88 (segm), 0.90 (orig) | Reduces non-lung heatmap focus |
| (Liu et al., 2018) – SDFN | U-Net (Pazhitnykh & Petsiuk), 256×256 | DenseNet-121 dual-stream | 0.98 (JSRT) | AUC=0.815 (fusion) | Fusion outperforms single-stream |
| (Miao et al., 28 Dec 2025) – MedSAM-based | MedSAM (finetuned), 1024×1024 | DenseNet121/ResNet50 | Dice=0.984 (val) | AUROC=0.837–0.879 | Masking regime matters |
| (Azimi et al., 2022) – XLSor | XLSor (U-Net+CCA), 512×512 | MoCo v2 (ResNet-50) | 0.968 (SH) | Acc=0.946/F1=0.939 (test) | Improves interpretability |
| (Tai et al., 2021) – Nodule detection | U-Net+DenseNet161 | ResNet-34 (patch-level) | IoU=0.9228 (val) | Sens=0.78/Spec=0.79/ROC=0.837 | Patch-based loc, fast inference |
| (Zifei et al., 19 Dec 2025) – AnyCXR | U-Net+ResNet-50 (synthetic DRRs) | ResNet-50 (4-channel) | PA Dice=0.951 (real CXR) | AUROC=82.3% (seg-guided) | Multi-organ, 2% AUROC gain |
A plausible implication is that pipelines integrating segmentation as spatial prior, especially via strategic mask or ROI selection and anatomy-aware fusion, can consistently improve spatial validity and generalization for CXR abnormality classification across diverse datasets, target pathologies, and clinical requirements. There remains a performance trade-off between context preservation and region focus; thus, mask design and integration strategy should be selected based on the intended diagnostic task and desired explainability.