CPF-CTE: Patch Fusion with Class Token Enhancement
- The paper introduces CPF-CTE, a novel framework that combines ViT-based patch tokenization, a Contextual-Fusion BiLSTM (CF-BiLSTM), and learnable class tokens to capture bidirectional spatial dependencies in weakly supervised semantic segmentation (WSSS).
- It employs a CF-BiLSTM module for bidirectional spatial propagation, improving local feature representation and generating high-quality pseudo-labels with minimal computational overhead.
- Ablation studies and benchmark comparisons on VOC and COCO datasets demonstrate CPF-CTE’s superior performance, achieving higher mIoU scores than prior methods.
Context Patch Fusion with Class Token Enhancement (CPF-CTE) is a framework for weakly supervised semantic segmentation (WSSS) that integrates patch-level context modeling with dynamic, class-specific feature enhancement. Designed to address the challenge of capturing complex spatial dependencies and semantic ambiguities in WSSS, CPF-CTE systematically combines a Vision Transformer (ViT) backbone, a Contextual-Fusion Bidirectional Long Short-Term Memory module (CF-BiLSTM), and learnable class token enhancement. This synthesis yields improved local representation, robust segmentation, and superior pseudo-label generation on challenging datasets, outperforming prior methods with minimal computational overhead (Fu et al., 21 Jan 2026).
1. Problem Definition and Motivation
Weakly supervised semantic segmentation aims to assign semantic class labels to each pixel in an image using only image-level labels as supervision signals. Conventional approaches emphasize inter-class discrimination and apply data augmentation to reduce spurious activations. However, they often yield incomplete segmentation due to neglected contextual dependencies among patches. This omission limits the granularity of local representations and segmentation accuracy.
CPF-CTE targets these deficiencies by:
- Explicitly modeling bidirectional spatial dependencies between image patches.
- Incorporating learnable class tokens that refine patch-level feature semantics dynamically.
- Jointly leveraging spatial and semantic cues to ameliorate ambiguous or incomplete activations, especially for small or occluded objects.
2. Architectural Overview and Data Flow
The CPF-CTE pipeline comprises four principal stages:
- Patch Tokenization and ViT Encoding
- The input RGB image, containing up to $C$ possible classes, is resized (384×384 at training, up to 960×960 at inference) and split into non-overlapping 16×16 patches, yielding $N = (H/16)\,(W/16)$ tokens (576 at 384×384).
- Each patch is embedded via a linear projection plus positional encoding, producing the token sequence $Z_0 \in \mathbb{R}^{N \times D}$.
- $Z_0$ is processed by a ViT-B/16 encoder to obtain context-enriched patch embeddings $Z \in \mathbb{R}^{N \times D}$.
- Class Token Enhancement and Context Patch Fusion
- CPF-CTE introduces $C$ learnable class tokens $\{t_c\}_{c=1}^{C}$, each token $t_c \in \mathbb{R}^{D_c}$.
- For each patch $i$ and class $c$, the ViT embedding $z_i$ is concatenated with the class token $t_c$ to yield the enhanced feature $\tilde{z}_{i,c} = [z_i; t_c]$.
- The enhanced sequence is input to CF-BiLSTM, which propagates bidirectional spatial context along both horizontal and vertical axes.
- Patch Classification and Pseudo-label Generation
- A linear classifier, followed by softmax, computes patch-level class scores $S \in \mathbb{R}^{N \times C}$.
- $S$ is upsampled to the original spatial grid using bilinear interpolation, generating the Baseline Pseudo Mask (BPM).
- A fully connected CRF sharpens boundaries and eliminates isolated spurious activations, producing final pseudo-labels.
- Fully Supervised Segmentation Training
- DeepLabv2 is trained using these pseudo-labels as ground truth.
- At inference, DeepLabv2 outputs the final high-resolution segmentation mask.
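The four-stage data flow above can be sketched in NumPy. This is an illustrative shape walkthrough, not the authors' implementation: a random linear map stands in for the ViT encoder, and the CF-BiLSTM and CRF stages are omitted; the dimensions follow the stated configuration (384×384 input, 16×16 patches, ViT-B/16).

```python
import numpy as np

# Hypothetical shapes illustrating the CPF-CTE data flow (not the authors' code).
H = W = 384          # training resolution
P = 16               # patch size (ViT-B/16)
D = 768              # ViT-B/16 embedding dimension
C = 21               # e.g. PASCAL VOC: 20 classes + background

image = np.random.rand(H, W, 3)

# 1. Patch tokenization: split into non-overlapping 16x16 patches.
n_side = H // P                      # 24 patches per side
patches = image.reshape(n_side, P, n_side, P, 3).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(n_side * n_side, P * P * 3)    # (576, 768)

# 2. A linear projection stands in for the ViT encoder in this sketch.
W_embed = np.random.rand(P * P * 3, D) * 0.01
Z = tokens @ W_embed                 # (576, D) patch embeddings

# 3. Patch classification into C class scores via softmax.
W_cls = np.random.rand(D, C) * 0.01
logits = Z @ W_cls
S = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# 4. Reshape scores back to the patch grid; bilinear upsampling to HxW
#    would then yield the Baseline Pseudo Mask (BPM).
score_map = S.reshape(n_side, n_side, C)
print(tokens.shape, Z.shape, score_map.shape)
```

In the full pipeline, the CF-BiLSTM and class-token stages would sit between steps 2 and 3, and a CRF would refine the upsampled mask.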
3. Contextual-Fusion BiLSTM (CF-BiLSTM) Module
CF-BiLSTM restores fine-grained spatial continuity lost during patch tokenization by explicitly modeling spatial dependencies:
- Mathematical Formulation (per sequence position $i$ with input $x_i$):
- Forward LSTM: $\overrightarrow{h}_i = \mathrm{LSTM}_{f}(x_i, \overrightarrow{h}_{i-1})$
- Backward LSTM: $\overleftarrow{h}_i = \mathrm{LSTM}_{b}(x_i, \overleftarrow{h}_{i+1})$
- Concatenation: $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i] \in \mathbb{R}^{2D_h}$
- Projection: $o_i = W_p h_i + b_p$, with $W_p \in \mathbb{R}^{D \times 2D_h}$, $b_p \in \mathbb{R}^{D}$
- Bidirectional Spatial Propagation:
Two BiLSTM passes are executed:
1. BiLSTM_H: a horizontal (row-major) pass models left-to-right and right-to-left context.
2. BiLSTM_V: a vertical (column-major) pass captures top-to-bottom and bottom-to-top context.
The two outputs are concatenated and projected, yielding globally context-rich patch features.
- Implementation Details:
Row-major and column-major sequencing ensures patch adjacency in the LSTM corresponds to true spatial adjacency, critical for effective fusion.
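The sequencing requirement above can be shown concretely. This NumPy sketch (grid size chosen for clarity, not taken from the paper) serializes a 2D patch grid both row-major and column-major, so that consecutive sequence positions correspond to true horizontal or vertical neighbors:

```python
import numpy as np

# Illustrative sketch of the row-major / column-major sequencing used by
# CF-BiLSTM's horizontal and vertical passes (grid size is an assumption).
h, w, d = 3, 4, 8                     # 3x4 grid of d-dim patch features
grid = np.arange(h * w * d, dtype=float).reshape(h, w, d)

# Horizontal pass (BiLSTM_H): serialize row by row, so sequence neighbors
# are left/right neighbors in the image.
seq_h = grid.reshape(h * w, d)

# Vertical pass (BiLSTM_V): serialize column by column, so sequence
# neighbors are top/bottom neighbors in the image.
seq_v = grid.transpose(1, 0, 2).reshape(h * w, d)

# Consecutive tokens in seq_h differ by one column; in seq_v by one row.
assert np.array_equal(seq_h[1], grid[0, 1])   # right neighbor of grid[0, 0]
assert np.array_equal(seq_v[1], grid[1, 0])   # bottom neighbor of grid[0, 0]

# After each BiLSTM pass, outputs are mapped back to the grid; the vertical
# pass must undo its transposition before the two passes are fused.
restored = seq_v.reshape(w, h, d).transpose(1, 0, 2)
assert np.array_equal(restored, grid)
```

The final assertion is the point: unless the vertical pass's transposition is undone before fusion, the horizontal and vertical features would be misaligned patch-for-patch.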
4. Learnable Class Token Enhancement (CTE)
Class tokens provide dynamic semantic conditioning for each patch:
- Initialization and Structure:
The $C$ class tokens are randomly initialized from a Gaussian distribution and optimized via backpropagation alongside the network weights.
- Mechanism:
For each patch $i$ and class $c$, the enhanced feature is $\tilde{z}_{i,c} = [z_i; t_c]$. The network retains all $C$ tokens simultaneously and predicts a per-class confidence for every patch.
- Semantic Enhancement:
By concatenating $t_c$ to each patch's feature, the system enables class-aware semantic anchoring, mapping ambiguous textures to class-disentangled representations. Over training, $t_c$ evolves into a semantic prototype for class $c$.
- Gradient Flow:
Backpropagation updates both the patch embeddings $z_i$ and the class tokens $t_c$, enhancing discriminative capability for underrepresented and overlapping classes.
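The concatenation mechanism can be sketched as a broadcast over patches and classes. The shapes here are illustrative assumptions (576 patches, 21 classes, token dimension matching the ViT embedding), not the paper's exact configuration:

```python
import numpy as np

# Sketch of class-token enhancement: each patch embedding z_i is paired
# with each learnable class token t_c by concatenation (assumed shapes).
N, D, C, Dc = 576, 768, 21, 768

rng = np.random.default_rng(0)
Z = rng.normal(size=(N, D))                  # ViT patch embeddings
class_tokens = rng.normal(size=(C, Dc))      # Gaussian-initialized, learnable

# Broadcast-concatenate: enhanced[i, c] = [z_i ; t_c]
Z_tiled = np.broadcast_to(Z[:, None, :], (N, C, D))
T_tiled = np.broadcast_to(class_tokens[None, :, :], (N, C, Dc))
enhanced = np.concatenate([Z_tiled, T_tiled], axis=-1)   # (N, C, D + Dc)

# During training, gradients flow into both Z (through the ViT) and
# class_tokens, so each t_c drifts toward a prototype for class c.
print(enhanced.shape)
```

Every (patch, class) pair thus carries both spatial appearance ($z_i$) and a class hypothesis ($t_c$), which is what lets the downstream classifier score each class per patch.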
5. Supervision, Loss Functions, and Training Strategy
- Patch-to-Image Supervision:
Patch class scores are aggregated into image-level scores using Top-$k$ pooling: $s_c = \frac{1}{k} \sum_{i \in \mathcal{T}_c} S_{i,c}$, where $\mathcal{T}_c$ indexes the $k$ highest-scoring patches for class $c$.
- Multi-class Entropy Loss:
$\mathcal{L}_{cls} = -\frac{1}{C} \sum_{c=1}^{C} \big[ y_c \log \sigma(s_c) + (1 - y_c) \log(1 - \sigma(s_c)) \big]$,
with $y_c \in \{0,1\}$ denoting image-level class presence.
- Regularization and Augmentation:
Standard random cropping and horizontal flipping are applied. Top-$k$ pooling (typically $k=4$) serves as implicit regularization by focusing the loss on the most confident regions.
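Top-$k$ pooling and a multi-label image-level loss can be sketched together. This assumes the standard WSSS formulation in which the image-level score for each class is the mean of its $k$ most confident patch scores; the shapes and label pattern are illustrative:

```python
import numpy as np

# Sketch of Top-k patch-to-image pooling plus a multi-label classification
# loss (assumed standard formulation, not the authors' exact code).
rng = np.random.default_rng(1)
N, C, k = 576, 21, 4
patch_logits = rng.normal(size=(N, C))       # per-patch class scores
y = np.zeros(C)
y[[0, 7]] = 1.0                              # image-level labels (example)

# Top-k pooling per class: average the k largest patch scores.
topk = np.sort(patch_logits, axis=0)[-k:]    # (k, C)
s = topk.mean(axis=0)                        # (C,) image-level logits

# Multi-label binary cross-entropy over class-presence targets.
sigma = 1.0 / (1.0 + np.exp(-s))
loss = -np.mean(y * np.log(sigma + 1e-8) + (1 - y) * np.log(1 - sigma + 1e-8))
print(s.shape, float(loss))
```

Because only the $k$ most confident patches contribute to $s_c$, the gradient concentrates on the strongest activations, which is the implicit-regularization effect noted above.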
- Two-Stage Paradigm:
- CPF-CTE is trained to minimize $\mathcal{L}_{cls}$, learning the ViT backbone, class tokens, CF-BiLSTM, and classifier head.
- Pseudo-labels are generated via CRF refinement of BPM.
- DeepLabv2 is trained in fully supervised fashion using these pseudo-labels, employing standard pixel-wise cross-entropy.
6. Empirical Results and Comparative Analysis
CPF-CTE has been evaluated extensively on PASCAL VOC 2012 and MS COCO 2014, with results summarized in the following tables.
Pseudo-label Quality on VOC2012 train (mIoU %): CPF-CTE attains 70.8, surpassing prior pseudo-labeling methods.
Final Segmentation on VOC2012 val (mIoU %):
| Method | mIoU |
|---|---|
| AFA | 63.8 |
| PGSD | 68.7 |
| CGM | 67.8 |
| CPF-CTE | 69.5 |
Segmentation on COCO2014 val (mIoU %):
| Method | mIoU |
|---|---|
| AFA | 38.9 |
| SAS | 44.5 |
| FPR | 43.9 |
| ToCo | 42.3 |
| PGSD | 43.5 |
| CGM | 40.1 |
| CPF-CTE | 45.4 |
- Ablation on VOC2012 val (mIoU %):
- ViT only: 65.2
- + class tokens: 67.1
- + context enhancement: 67.8
- Full CPF-CTE: 69.5
- Pooling Strategy Comparison (VOC val mIoU %):
- Average: 67.7
- Max: 68.1
- Top-k (k=4): 69.5
CPF-CTE consistently achieves more complete activations, sharper boundaries, and enhanced performance on small or occluded classes. Gains are additive with both CF-BiLSTM and class token modules, as demonstrated by ablation results.
7. Synthesis and Significance
CPF-CTE integrates (1) a ViT backbone for rich global context, (2) a lightweight CF-BiLSTM for bidirectional spatial continuity, and (3) learnable class tokens for dynamic class-discriminative semantics. This architectural synthesis supports end-to-end weakly supervised segmentation and delivers state-of-the-art pseudo-labeling (70.8% mIoU on VOC train) and strong final segmentation (69.5% VOC val, 45.4% COCO val). Both quantitative and qualitative results attest to its ability to address key limitations of traditional patch-based and class-agnostic approaches, demonstrating efficacy across varying object sizes and occlusion settings. The ablation studies confirm the complementary and significant contributions of context fusion and class token enhancement (Fu et al., 21 Jan 2026).