CPF-CTE: Patch Fusion with Class Token Enhancement

Updated 28 January 2026
  • The paper introduces CPF-CTE, a novel framework that combines ViT-based patch tokenization, CF-BiLSTM, and learnable class tokens to capture bidirectional spatial dependencies in WSSS.
  • It employs a CF-BiLSTM module for bidirectional spatial propagation, improving local feature representation and generating high-quality pseudo-labels with minimal computational overhead.
  • Ablation studies and benchmark comparisons on VOC and COCO datasets demonstrate CPF-CTE’s superior performance, achieving higher mIoU scores than prior methods.

Context Patch Fusion with Class Token Enhancement (CPF-CTE) is a framework for weakly supervised semantic segmentation (WSSS) that integrates patch-level context modeling with dynamic, class-specific feature enhancement. Designed to address the challenge of capturing complex spatial dependencies and semantic ambiguities in WSSS, CPF-CTE systematically combines a Vision Transformer (ViT) backbone, a Contextual-Fusion Bidirectional Long Short-Term Memory module (CF-BiLSTM), and learnable class token enhancement. This synthesis yields improved local representation, robust segmentation, and superior pseudo-label generation on challenging datasets, outperforming prior methods with minimal computational overhead (Fu et al., 21 Jan 2026).

1. Problem Definition and Motivation

Weakly supervised semantic segmentation aims to assign semantic class labels to each pixel in an image using only image-level labels as supervision signals. Conventional approaches emphasize inter-class discrimination and apply data augmentation to reduce spurious activations. However, they often yield incomplete segmentation due to neglected contextual dependencies among patches. This omission limits the granularity of local representations and segmentation accuracy.

CPF-CTE targets these deficiencies by:

  • Explicitly modeling bidirectional spatial dependencies between image patches.
  • Incorporating learnable class tokens that refine patch-level feature semantics dynamically.
  • Jointly leveraging spatial and semantic cues to ameliorate ambiguous or incomplete activations, especially for small or occluded objects.

2. Architectural Overview and Data Flow

The CPF-CTE pipeline comprises four principal stages:

  1. Patch Tokenization and ViT Encoding
    • The input RGB image $I$ with $C$ possible classes is resized ($384 \times 384$ at training, up to $960 \times 960$ at inference) and split into non-overlapping $16 \times 16$ patches, yielding $s = (384/16)^2 = 576$ tokens.
    • Each patch is embedded via a linear projection with positional encoding, producing $P \in \mathbb{R}^{s \times e}$.
    • $P$ is processed by a ViT-B/16 encoder to obtain context-enriched embeddings $P^{vit} \in \mathbb{R}^{s \times e}$.
  2. Class Token Enhancement and Context Patch Fusion
    • CPF-CTE introduces $C$ learnable class tokens $T = \{t_1, \dots, t_C\} \in \mathbb{R}^{C \times H}$, each token $t_c \in \mathbb{R}^H$.
    • For each patch, its ViT embedding $p_i^{vit}$ is concatenated with its associated class token $t_c$ to yield $f_i^{in} = \mathrm{Concat}(p_i^{vit}, t_c) \in \mathbb{R}^{e+H}$.
    • The sequence $F^{in} = [f_1^{in}, \ldots, f_s^{in}]$ is input to CF-BiLSTM, which propagates bidirectional spatial context across both horizontal and vertical axes.
  3. Patch Classification and Pseudo-label Generation
    • A linear classifier $W \in \mathbb{R}^{(e+H) \times C}$, followed by softmax, computes patch-level class scores $Z = \mathrm{softmax}(F^{out} W) \in \mathbb{R}^{s \times C}$.
    • $Z$ is upsampled to the original spatial grid using bilinear interpolation, generating the Baseline Pseudo Mask (BPM).
    • A fully connected CRF sharpens boundaries and eliminates isolated spurious activations, producing final pseudo-labels.
  4. Fully Supervised Segmentation Training
    • DeepLabv2 is trained using these pseudo-labels as ground truth.
    • At inference, DeepLabv2 outputs the final high-resolution segmentation mask.
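The shape flow through the four stages above can be traced with a minimal bookkeeping sketch. The values $e = H = 768$ (the standard ViT-B/16 width) and $C = 21$ (VOC's 20 classes plus background) are assumptions for illustration; the paper only states that commonly $H = e$:

```python
# Minimal shape-bookkeeping sketch of the CPF-CTE pipeline (not the paper's code).
# e = H = 768 and C = 21 are illustrative assumptions, not values from the paper.

def cpf_cte_shapes(img_size=384, patch=16, e=768, H=768, C=21):
    g = img_size // patch                 # patches per side
    s = g * g                             # number of patch tokens
    return {
        "s": s,
        "P": (s, e),                      # patch embeddings after linear projection
        "P_vit": (s, e),                  # ViT-B/16 encoder output
        "F_in": (s, e + H),               # patch embedding concatenated with class token
        "Z": (s, C),                      # per-patch class scores after softmax
    }

shapes = cpf_cte_shapes()
print(shapes["s"], shapes["F_in"])        # 576 (576, 1536)
```

At the larger inference resolution the token count grows accordingly: `cpf_cte_shapes(img_size=960)["s"]` gives $(960/16)^2 = 3600$ tokens.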

3. Contextual-Fusion BiLSTM (CF-BiLSTM) Module

CF-BiLSTM restores fine-grained spatial continuity lost during patch tokenization by explicitly modeling spatial dependencies:

  • Mathematical Formulation:
    • Forward LSTM: $\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(f_i^{in}, \overrightarrow{h}_{i-1})$
    • Backward LSTM: $\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(f_i^{in}, \overleftarrow{h}_{i+1})$
    • Concatenation: $h_i = \mathrm{Concat}(\overrightarrow{h_i}, \overleftarrow{h_i}) \in \mathbb{R}^{2d}$
    • Projection: $f_i^{out} = W_c h_i + b_c$, with $W_c \in \mathbb{R}^{D \times 2d}$, $b_c \in \mathbb{R}^D$
  • Bidirectional Spatial Propagation:

Two BiLSTM passes are executed:

  1. BiLSTM_H: a horizontal (row-major) pass models left-to-right and right-to-left context.
  2. BiLSTM_V: a vertical (column-major) pass captures top-to-bottom and bottom-to-top context.

The outputs of the two passes are concatenated and projected, yielding globally context-rich patch features.

  • Implementation Details:

Row-major and column-major sequencing ensures patch adjacency in the LSTM corresponds to true spatial adjacency, critical for effective fusion.
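The two orderings can be made concrete with a short sketch. For a $g \times g$ patch grid indexed row by row, the function names below are illustrative, not from the paper's code:

```python
# Sketch of the two sequencing orders fed to the horizontal and vertical
# BiLSTM passes over a g x g patch grid (function names are illustrative).

def row_major_order(g):
    # BiLSTM_H ordering: left-to-right within each row, rows top-to-bottom
    return [r * g + c for r in range(g) for c in range(g)]

def col_major_order(g):
    # BiLSTM_V ordering: top-to-bottom within each column, columns left-to-right
    return [r * g + c for c in range(g) for r in range(g)]

g = 3
print(row_major_order(g))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(col_major_order(g))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```

In the column-major sequence, consecutive indices differ by $g$, so LSTM neighbors are vertical neighbors in the image, which is exactly the adjacency property the module relies on.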

4. Learnable Class Token Enhancement (CTE)

Class tokens provide dynamic semantic conditioning for each patch:

  • Initialization and Structure:

$C$ tokens $T \in \mathbb{R}^{C \times H}$ are randomly initialized from a Gaussian distribution and optimized via backpropagation. Commonly, $H = e$.

  • Mechanism:

For each patch and class, $f_{i,c}^{in} = \mathrm{Concat}(p_i^{vit}, t_c) \in \mathbb{R}^{e+H}$. The network retains all $C$ tokens simultaneously and predicts per-class patch confidence.

  • Semantic Enhancement:

By concatenating $t_c$ to each patch's feature, the system enables class-aware semantic anchoring, mapping ambiguous textures to class-disentangled representations. Over training, $t_c$ evolves into a semantic prototype for class $c$.

  • Gradient Flow:

Backpropagation updates both the patch embeddings $p_i^{vit}$ and the class tokens $t_c$, enhancing discriminative capability for underrepresented and overlapping classes.
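The concatenation mechanism itself is simple enough to show directly. The following toy sketch uses illustrative dimensions ($e = 4$, $H = 2$, $C = 2$), not the paper's actual values:

```python
# Toy illustration of class token enhancement: each patch embedding p_i^vit
# is concatenated with a learnable class token t_c to form f_{i,c}^in.
# Dimensions (e = 4, H = 2, C = 2) are illustrative, not from the paper.

def enhance_patch(p_vit, class_tokens):
    # one concatenated feature per class: f_{i,c}^in = Concat(p_i^vit, t_c)
    return [p_vit + t_c for t_c in class_tokens]

p_i = [0.1, 0.2, 0.3, 0.4]            # patch embedding, e = 4
tokens = [[1.0, -1.0], [0.5, 0.5]]    # C = 2 class tokens, H = 2
features = enhance_patch(p_i, tokens)
f_in = features[0]                    # f_{i,1}^in, length e + H = 6
print(len(f_in))                      # 6
```

In the full model these tokens are trainable parameters updated by backpropagation; here they are fixed lists purely to show the per-class feature construction.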

5. Supervision, Loss Functions, and Training Strategy

  • Patch-to-Image Supervision:

Patch class scores are aggregated for image-level label compatibility using Top-$k$ pooling:

$$p_c = \frac{1}{k} \sum_{j=1}^{k} \mathrm{TopK}(Z_{:,c})[j]$$

  • Multi-class Entropy Loss:

$$\mathcal{L}_{MCE} = -\frac{1}{C} \sum_{c=1}^{C} \bigl[ y_c \log p_c + (1 - y_c) \log(1 - p_c) \bigr]$$

with $y_c \in \{0,1\}$ denoting image-level class presence.

  • Regularization and Augmentation:

Standard random cropping and horizontal flipping are applied. Top-$k$ pooling (typically $k = 4$) serves as implicit regularization by focusing the loss on the most confident regions.

  • Two-Stage Paradigm:
  1. CPF-CTE is trained to minimize $\mathcal{L}_{MCE}$, learning the ViT backbone, class tokens, CF-BiLSTM, and classifier head.
  2. Pseudo-labels are generated via CRF refinement of BPM.
  3. DeepLabv2 is trained in fully supervised fashion using these pseudo-labels, employing standard pixel-wise cross-entropy.
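The Top-$k$ pooling and multi-class entropy loss defined above can be sketched in plain Python. This is a minimal illustration with toy scores, not the paper's implementation:

```python
import math

def topk_pool(scores, k=4):
    # mean of the k largest patch scores for one class (Top-k pooling)
    return sum(sorted(scores, reverse=True)[:k]) / k

def mce_loss(p, y, eps=1e-7):
    # multi-class entropy loss over C classes, with image-level labels y_c in {0, 1}
    C = len(p)
    return -sum(yc * math.log(pc + eps) + (1 - yc) * math.log(1 - pc + eps)
                for pc, yc in zip(p, y)) / C

# toy example: s = 6 patches, C = 2 classes (Z rows are per-patch class scores)
Z = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.2, 0.1], [0.6, 0.4], [0.5, 0.2]]
p = [topk_pool([row[c] for row in Z], k=4) for c in range(2)]
print(p[0])  # mean of top-4 {0.9, 0.8, 0.7, 0.6}, approximately 0.75
loss = mce_loss(p, y=[1, 0])
```

Because only the $k$ most confident patches contribute to $p_c$, low-scoring background patches do not dilute the image-level prediction, which is the regularization effect noted above.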

6. Empirical Results and Comparative Analysis

CPF-CTE has been evaluated extensively on PASCAL VOC 2012 and MS COCO 2014, with results summarized in the following tables.

Pseudo-label Quality on VOC2012 train (mIoU %):

| Method | mIoU |
|---|---|
| SEAM | 63.6 |
| AFA | 66.0 |
| PGSD | 68.7 |
| CGM | 68.1 |
| CPF-CTE w/o CRF | 67.1 |
| CPF-CTE | 70.8 |

Final Segmentation on VOC2012 val (mIoU %):

| Method | mIoU |
|---|---|
| AFA | 63.8 |
| PGSD | 68.7 |
| CGM | 67.8 |
| CPF-CTE | 69.5 |

Segmentation on COCO2014 val (mIoU %):

| Method | mIoU |
|---|---|
| AFA | 38.9 |
| SAS | 44.5 |
| FPR | 43.9 |
| ToCo | 42.3 |
| PGSD | 43.5 |
| CGM | 40.1 |
| CPF-CTE | 45.4 |
  • Ablation on VOC2012 val (mIoU %):
    • ViT only: 65.2
    • + class tokens: 67.1
    • + context enhancement: 67.8
    • Full CPF-CTE: 69.5
  • Pooling Strategy Comparison (VOC val mIoU %):
    • Average: 67.7
    • Max: 68.1
    • Top-k (k=4): 69.5

CPF-CTE consistently achieves more complete activations, sharper boundaries, and enhanced performance on small or occluded classes. Gains are additive with both CF-BiLSTM and class token modules, as demonstrated by ablation results.

7. Synthesis and Significance

CPF-CTE integrates (1) a ViT backbone for rich global context, (2) a lightweight CF-BiLSTM for bidirectional spatial continuity, and (3) learnable class tokens for dynamic class-discriminative semantics. This architectural synthesis supports end-to-end weakly supervised segmentation and delivers state-of-the-art pseudo-labeling (70.8% mIoU on VOC train) and strong final segmentation (69.5% VOC val, 45.4% COCO val). Both quantitative and qualitative results attest to its ability to address key limitations of traditional patch-based and class-agnostic approaches, demonstrating efficacy across varying object sizes and occlusion settings. The ablation studies confirm the complementary and significant contributions of context fusion and class token enhancement (Fu et al., 21 Jan 2026).
