CPF-CTE: Patch Fusion with Class Token Enhancement
- The paper introduces CPF-CTE, a novel framework that combines ViT-based patch tokenization, a Contextual-Fusion BiLSTM (CF-BiLSTM), and learnable class tokens to capture bidirectional spatial dependencies in weakly supervised semantic segmentation (WSSS).
- It employs a CF-BiLSTM module for bidirectional spatial propagation, improving local feature representation and generating high-quality pseudo-labels with minimal computational overhead.
- Ablation studies and benchmark comparisons on VOC and COCO datasets demonstrate CPF-CTE’s superior performance, achieving higher mIoU scores than prior methods.
Context Patch Fusion with Class Token Enhancement (CPF-CTE) is a framework for weakly supervised semantic segmentation (WSSS) that integrates patch-level context modeling with dynamic, class-specific feature enhancement. Designed to address the challenge of capturing complex spatial dependencies and semantic ambiguities in WSSS, CPF-CTE systematically combines a Vision Transformer (ViT) backbone, a Contextual-Fusion Bidirectional Long Short-Term Memory module (CF-BiLSTM), and learnable class token enhancement. This synthesis yields improved local representation, robust segmentation, and superior pseudo-label generation on challenging datasets, outperforming prior methods with minimal computational overhead (Fu et al., 21 Jan 2026).
1. Problem Definition and Motivation
Weakly supervised semantic segmentation aims to assign semantic class labels to each pixel in an image using only image-level labels as supervision signals. Conventional approaches emphasize inter-class discrimination and apply data augmentation to reduce spurious activations. However, they often yield incomplete segmentation due to neglected contextual dependencies among patches. This omission limits the granularity of local representations and segmentation accuracy.
CPF-CTE targets these deficiencies by:
- Explicitly modeling bidirectional spatial dependencies between image patches.
- Incorporating learnable class tokens that refine patch-level feature semantics dynamically.
- Jointly leveraging spatial and semantic cues to ameliorate ambiguous or incomplete activations, especially for small or occluded objects.
2. Architectural Overview and Data Flow
The CPF-CTE pipeline comprises four principal stages:
- Patch Tokenization and ViT Encoding
- The input RGB image, containing up to $C$ possible classes, is resized (384×384 at training, up to 960×960 at inference) and split into non-overlapping 16×16 patches, yielding $N = (H/16)\,(W/16)$ tokens (576 at 384×384).
- Each patch is embedded via a linear projection plus positional encoding, producing the token sequence $Z_0 \in \mathbb{R}^{N \times D}$.
- $Z_0$ is processed by a ViT-B/16 encoder to obtain context-enriched patch embeddings $Z \in \mathbb{R}^{N \times D}$.
- Class Token Enhancement and Context Patch Fusion
- CPF-CTE introduces $C$ learnable class tokens $\{t_c\}_{c=1}^{C}$, each token $t_c \in \mathbb{R}^{D_c}$.
- For each patch $i$ and class $c$, the ViT embedding $z_i$ is concatenated with the class token $t_c$ to yield the enhanced feature $\tilde{z}_{i,c} = [z_i; t_c]$.
- The enhanced sequence is input to CF-BiLSTM, which propagates bidirectional spatial context along both horizontal and vertical axes.
- Patch Classification and Pseudo-label Generation
- A linear classifier, followed by softmax, computes patch-level class scores $S \in \mathbb{R}^{N \times C}$.
- $S$ is upsampled to the original spatial grid using bilinear interpolation, generating the Baseline Pseudo Mask (BPM).
- A fully connected CRF sharpens boundaries and eliminates isolated spurious activations, producing final pseudo-labels.
- Fully Supervised Segmentation Training
- DeepLabv2 is trained using these pseudo-labels as ground truth.
- At inference, DeepLabv2 outputs the final high-resolution segmentation mask.
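The four-stage data flow above can be sketched in NumPy. This is an illustrative shape walkthrough, not the authors' implementation: a random linear map stands in for the ViT encoder, and the CF-BiLSTM and CRF stages are omitted; the dimensions follow the stated configuration (384×384 input, 16×16 patches, ViT-B/16).

```python
import numpy as np

# Hypothetical shapes illustrating the CPF-CTE data flow (not the authors' code).
H = W = 384          # training resolution
P = 16               # patch size (ViT-B/16)
D = 768              # ViT-B/16 embedding dimension
C = 21               # e.g. PASCAL VOC: 20 classes + background

image = np.random.rand(H, W, 3)

# 1. Patch tokenization: split into non-overlapping 16x16 patches.
n_side = H // P                      # 24 patches per side
patches = image.reshape(n_side, P, n_side, P, 3).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(n_side * n_side, P * P * 3)    # (576, 768)

# 2. A linear projection stands in for the ViT encoder in this sketch.
W_embed = np.random.rand(P * P * 3, D) * 0.01
Z = tokens @ W_embed                 # (576, D) patch embeddings

# 3. Patch classification into C class scores via softmax.
W_cls = np.random.rand(D, C) * 0.01
logits = Z @ W_cls
S = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# 4. Reshape scores back to the patch grid; bilinear upsampling to HxW
#    would then yield the Baseline Pseudo Mask (BPM).
score_map = S.reshape(n_side, n_side, C)
print(tokens.shape, Z.shape, score_map.shape)
```

In the full pipeline, the CF-BiLSTM and class-token stages would sit between steps 2 and 3, and a CRF would refine the upsampled mask.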
3. Contextual-Fusion BiLSTM (CF-BiLSTM) Module
CF-BiLSTM restores fine-grained spatial continuity lost during patch tokenization by explicitly modeling spatial dependencies:
- Mathematical Formulation (per sequence position $i$ with input $x_i$):
- Forward LSTM: $\overrightarrow{h}_i = \mathrm{LSTM}_{f}(x_i, \overrightarrow{h}_{i-1})$
- Backward LSTM: $\overleftarrow{h}_i = \mathrm{LSTM}_{b}(x_i, \overleftarrow{h}_{i+1})$
- Concatenation: $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i] \in \mathbb{R}^{2D_h}$
- Projection: $o_i = W_p h_i + b_p$, with $W_p \in \mathbb{R}^{D \times 2D_h}$, $b_p \in \mathbb{R}^{D}$
- Bidirectional Spatial Propagation:
Two BiLSTM passes are executed:
1. BiLSTM_H: a horizontal (row-major) pass models left-to-right and right-to-left context.
2. BiLSTM_V: a vertical (column-major) pass captures top-to-bottom and bottom-to-top context.
The two outputs are concatenated and projected, yielding globally context-rich patch features.
- Implementation Details:
Row-major and column-major sequencing ensures patch adjacency in the LSTM corresponds to true spatial adjacency, critical for effective fusion.
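The sequencing requirement above can be shown concretely. This NumPy sketch (grid size chosen for clarity, not taken from the paper) serializes a 2D patch grid both row-major and column-major, so that consecutive sequence positions correspond to true horizontal or vertical neighbors:

```python
import numpy as np

# Illustrative sketch of the row-major / column-major sequencing used by
# CF-BiLSTM's horizontal and vertical passes (grid size is an assumption).
h, w, d = 3, 4, 8                     # 3x4 grid of d-dim patch features
grid = np.arange(h * w * d, dtype=float).reshape(h, w, d)

# Horizontal pass (BiLSTM_H): serialize row by row, so sequence neighbors
# are left/right neighbors in the image.
seq_h = grid.reshape(h * w, d)

# Vertical pass (BiLSTM_V): serialize column by column, so sequence
# neighbors are top/bottom neighbors in the image.
seq_v = grid.transpose(1, 0, 2).reshape(h * w, d)

# Consecutive tokens in seq_h differ by one column; in seq_v by one row.
assert np.array_equal(seq_h[1], grid[0, 1])   # right neighbor of grid[0, 0]
assert np.array_equal(seq_v[1], grid[1, 0])   # bottom neighbor of grid[0, 0]

# After each BiLSTM pass, outputs are mapped back to the grid; the vertical
# pass must undo its transposition before the two passes are fused.
restored = seq_v.reshape(w, h, d).transpose(1, 0, 2)
assert np.array_equal(restored, grid)
```

The final assertion is the point: unless the vertical pass's transposition is undone before fusion, the horizontal and vertical features would be misaligned patch-for-patch.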
4. Learnable Class Token Enhancement (CTE)
Class tokens provide dynamic semantic conditioning for each patch:
- Initialization and Structure:
The $C$ class tokens are randomly initialized from a Gaussian distribution and optimized via backpropagation alongside the network weights.
- Mechanism:
For each patch $i$ and class $c$, the enhanced feature is $\tilde{z}_{i,c} = [z_i; t_c]$. The network retains all $C$ tokens simultaneously and predicts a per-class confidence for every patch.
- Semantic Enhancement:
By concatenating $t_c$ to each patch's feature, the system enables class-aware semantic anchoring, mapping ambiguous textures to class-disentangled representations. Over training, $t_c$ evolves into a semantic prototype for class $c$.
- Gradient Flow:
Backpropagation updates both the patch embeddings $z_i$ and the class tokens $t_c$, enhancing discriminative capability for underrepresented and overlapping classes.
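The concatenation mechanism can be sketched as a broadcast over patches and classes. The shapes here are illustrative assumptions (576 patches, 21 classes, token dimension matching the ViT embedding), not the paper's exact configuration:

```python
import numpy as np

# Sketch of class-token enhancement: each patch embedding z_i is paired
# with each learnable class token t_c by concatenation (assumed shapes).
N, D, C, Dc = 576, 768, 21, 768

rng = np.random.default_rng(0)
Z = rng.normal(size=(N, D))                  # ViT patch embeddings
class_tokens = rng.normal(size=(C, Dc))      # Gaussian-initialized, learnable

# Broadcast-concatenate: enhanced[i, c] = [z_i ; t_c]
Z_tiled = np.broadcast_to(Z[:, None, :], (N, C, D))
T_tiled = np.broadcast_to(class_tokens[None, :, :], (N, C, Dc))
enhanced = np.concatenate([Z_tiled, T_tiled], axis=-1)   # (N, C, D + Dc)

# During training, gradients flow into both Z (through the ViT) and
# class_tokens, so each t_c drifts toward a prototype for class c.
print(enhanced.shape)
```

Every (patch, class) pair thus carries both spatial appearance ($z_i$) and a class hypothesis ($t_c$), which is what lets the downstream classifier score each class per patch.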
5. Supervision, Loss Functions, and Training Strategy
- Patch-to-Image Supervision:
Patch class scores are aggregated into image-level scores using Top-$k$ pooling: $s_c = \frac{1}{k} \sum_{i \in \mathcal{T}_c} S_{i,c}$, where $\mathcal{T}_c$ indexes the $k$ highest-scoring patches for class $c$.
- Multi-class Entropy Loss:
$\mathcal{L}_{cls} = -\frac{1}{C} \sum_{c=1}^{C} \big[ y_c \log \sigma(s_c) + (1 - y_c) \log(1 - \sigma(s_c)) \big]$,
with $y_c \in \{0,1\}$ denoting image-level class presence.
- Regularization and Augmentation:
Standard random cropping and horizontal flipping are applied. Top-$k$ pooling (typically $k=4$) serves as implicit regularization by focusing the loss on the most confident regions.
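Top-$k$ pooling and a multi-label image-level loss can be sketched together. This assumes the standard WSSS formulation in which the image-level score for each class is the mean of its $k$ most confident patch scores; the shapes and label pattern are illustrative:

```python
import numpy as np

# Sketch of Top-k patch-to-image pooling plus a multi-label classification
# loss (assumed standard formulation, not the authors' exact code).
rng = np.random.default_rng(1)
N, C, k = 576, 21, 4
patch_logits = rng.normal(size=(N, C))       # per-patch class scores
y = np.zeros(C)
y[[0, 7]] = 1.0                              # image-level labels (example)

# Top-k pooling per class: average the k largest patch scores.
topk = np.sort(patch_logits, axis=0)[-k:]    # (k, C)
s = topk.mean(axis=0)                        # (C,) image-level logits

# Multi-label binary cross-entropy over class-presence targets.
sigma = 1.0 / (1.0 + np.exp(-s))
loss = -np.mean(y * np.log(sigma + 1e-8) + (1 - y) * np.log(1 - sigma + 1e-8))
print(s.shape, float(loss))
```

Because only the $k$ most confident patches contribute to $s_c$, the gradient concentrates on the strongest activations, which is the implicit-regularization effect noted above.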
- Two-Stage Paradigm:
- CPF-CTE is trained to minimize $\mathcal{L}_{cls}$, learning the ViT backbone, class tokens, CF-BiLSTM, and classifier head.
- Pseudo-labels are generated via CRF refinement of BPM.
- DeepLabv2 is trained in fully supervised fashion using these pseudo-labels, employing standard pixel-wise cross-entropy.
6. Empirical Results and Comparative Analysis
CPF-CTE has been evaluated extensively on PASCAL VOC 2012 and MS COCO 2014, with results summarized in the following tables.
Pseudo-label Quality on VOC2012 train (mIoU %): CPF-CTE attains 70.8, surpassing prior pseudo-labeling methods.
Final Segmentation on VOC2012 val (mIoU %):
| Method | mIoU |
|---|---|
| AFA | 63.8 |
| PGSD | 68.7 |
| CGM | 67.8 |
| CPF-CTE | 69.5 |
Segmentation on COCO2014 val (mIoU %):
| Method | mIoU |
|---|---|
| AFA | 38.9 |
| SAS | 44.5 |
| FPR | 43.9 |
| ToCo | 42.3 |
| PGSD | 43.5 |
| CGM | 40.1 |
| CPF-CTE | 45.4 |
- Ablation on VOC2012 val (mIoU %):
- ViT only: 65.2
- + class tokens: 67.1
- + context enhancement: 67.8
- Full CPF-CTE: 69.5
- Pooling Strategy Comparison (VOC val mIoU %):
- Average: 67.7
- Max: 68.1
- Top-k (k=4): 69.5
CPF-CTE consistently achieves more complete activations, sharper boundaries, and enhanced performance on small or occluded classes. Gains are additive with both CF-BiLSTM and class token modules, as demonstrated by ablation results.
7. Synthesis and Significance
CPF-CTE integrates (1) a ViT backbone for rich global context, (2) a lightweight CF-BiLSTM for bidirectional spatial continuity, and (3) learnable class tokens for dynamic class-discriminative semantics. This architectural synthesis supports end-to-end weakly supervised segmentation and delivers state-of-the-art pseudo-labeling (70.8% mIoU on VOC train) and strong final segmentation (69.5% VOC val, 45.4% COCO val). Both quantitative and qualitative results attest to its ability to address key limitations of traditional patch-based and class-agnostic approaches, demonstrating efficacy across varying object sizes and occlusion settings. The ablation studies confirm the complementary and significant contributions of context fusion and class token enhancement (Fu et al., 21 Jan 2026).