Colonoscopic Polyp Re-Identification

Updated 1 January 2026
  • Colonoscopic polyp re-identification is a process that identifies and tracks the same polyp across multiple frames and sessions using advanced imaging techniques.
  • It leverages self-supervised contrastive learning, meta-learning, and multimodal fusion to overcome challenges such as abrupt camera motion, non-rigid deformation, and domain shift.
  • By enabling accurate polyp counting and improving metrics like ADR and PPC, this approach supports automated clinical workflows and reliable reporting.

Colonoscopic polyp re-identification (polyp ReID) is a technical task within computer-aided diagnosis (CADx) that seeks to match instances of the same physical polyp across different colonoscopy frames, views, cameras, and even endoscopic sessions. This process is critical for precise reporting, longitudinal tracking, and improved optical biopsy. Rigorous polyp ReID supports reliable polyp counting, enhances clinical workflow automation, and underpins metrics such as Adenoma Detection Rate (ADR) and Polyps Per Colonoscopy (PPC). Recent advances leverage self-supervised and multimodal deep learning strategies, adapting methods from generic object/person ReID while addressing domain-specific constraints of medical imaging.

1. Core Problem Formulation and Technical Challenges

Polyp ReID operates over sets of polyp "tracklets": temporal groupings of bounding-box detections from colonoscopy videos. The task is to determine whether two tracklets T^a and T^b correspond to the same polyp identity. Unique challenges include:

  • Abrupt camera motion and frequent “re-entry”: Polyps leave and re-enter the field of view due to navigation.
  • Non-rigid deformation and occlusion: The tissue’s flexibility and endoscope tools introduce visual changes.
  • High intra-class variation vs. low inter-class variation: Polyps within a patient may have similar appearance; a single polyp may look different over time, illumination, and angle (Intrator et al., 2023).
  • Scarcity of annotated polyp-pairing labels: Manual assignment is labor-intensive and ethically constrained (Xiang et al., 2023).
  • Domain shift: Differing imaging devices and patient populations cause style discrepancies.
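At its core, the tracklet-matching decision reduces to a similarity test between pooled tracklet embeddings. The sketch below is a minimal, hypothetical version of that decision rule; mean pooling and the fixed cosine threshold of 0.8 are illustrative choices, not parameters prescribed by the cited methods:

```python
import numpy as np

def embed_tracklet(frame_embeddings):
    """Pool per-frame embeddings of shape (n_frames, d) into one unit vector."""
    v = np.asarray(frame_embeddings).mean(axis=0)
    return v / np.linalg.norm(v)

def same_polyp(tracklet_a, tracklet_b, threshold=0.8):
    """Decide whether two tracklets show the same polyp via cosine similarity
    of their pooled embeddings (threshold is an illustrative assumption)."""
    return float(embed_tracklet(tracklet_a) @ embed_tracklet(tracklet_b)) >= threshold
```

Real systems replace mean pooling with learned fusion (see the multi-view encoder below) and calibrate the threshold on validation pairs, but the same-identity test keeps this shape.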

Polyp ReID underpins clinical computations (polyp counting, ADR/PPC), downstream CADx (optical diagnosis, reporting), and explainable real-time retrieval (Yang et al., 2024, Yang et al., 23 Jul 2025).

2. Representation Learning Paradigms for Polyp ReID

Modern polyp ReID draws on a range of representation learning strategies, from unimodal self-supervised pipelines to sophisticated multimodal meta-learning approaches.

2.1 Self-Supervised Contrastive Learning

Self-supervised paradigms eschew explicit polyp-pair annotation by exploiting intra-tracklet redundancy. Representative methods include:

  • Single-frame encoder (SFE): Each cropped polyp frame is mapped via a ResNet backbone with an MLP head to a d-dimensional embedding. A contrastive loss is applied over random frame pairs from the same tracklet, treating frames from other tracklets as negatives (Intrator et al., 2023).
  • Multi-view encoder (MVE): A transformer stack ingests multiple frame embeddings, fusing temporal data. The CLS token embedding contextualizes cross-frame features, outperforming simple mean/max pooling (Intrator et al., 2023, Parolari et al., 14 Feb 2025).
  • Masked autoencoder (MAE) + contrastive fusion: ViT-based polyp-aware encoders merge masked reconstruction loss—foreground-focused via segmentation masks—with InfoNCE (contrastive) and entropy-based repulsion, optimizing discriminative embedding spaces (Yang et al., 2024, Yang et al., 23 Jul 2025).
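The InfoNCE objective used across these pipelines can be sketched in a few lines. This is the generic in-batch formulation, assuming row i of `positives` holds the same-tracklet (or augmented) counterpart of anchor i; the temperature of 0.07 is a common illustrative value, not one taken from the cited papers:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """In-batch InfoNCE: row i of `positives` is anchor i's positive;
    every other row in the batch serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # cross-entropy on the diagonal
```

Minimizing this loss pulls same-tracklet embeddings together while pushing apart embeddings from other tracklets in the batch, which is exactly the discriminative structure polyp ReID needs.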

2.2 Meta-Learning Approaches

Meta-learning addresses domain gaps and intra-class discrepancy through episodic training:

  • Episodic paradigm: Randomly split batches into meta-train (support) and meta-test (query) sets; inner loop optimizes on support, outer loop computes gradients over query (Xiang et al., 2023).
  • Meta-Learning Regulation (MLR): Style mixing via sampled batch-norm statistics perturbs query features, increasing robustness to domain shift (Xiang et al., 2023).
  • Combined cross-entropy and hard triplet losses: Anchor-positive-negative triplet mining within episodic batches boosts discriminative separation (Xiang et al., 2023).
  • Colo-ReID: Demonstrates state-of-the-art results on the Colo-Pair benchmark, with mAP=74.4%, Rank-1=77.5% (Xiang et al., 2023).
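The episodic batch split driving this paradigm can be sketched as below. The 50/50 support/query partition is an illustrative assumption; the actual ratios and samplers used in Colo-ReID may differ:

```python
import numpy as np

def episodic_split(batch_indices, rng, support_fraction=0.5):
    """Randomly partition a batch into a meta-train (support) set, used by the
    inner-loop update, and a meta-test (query) set, used for the outer-loop
    gradient, as in episodic ReID training."""
    idx = rng.permutation(np.asarray(batch_indices))
    cut = int(len(idx) * support_fraction)
    return idx[:cut], idx[cut:]  # (support, query)
```

In a full training loop, the model would take one or more gradient steps on the support set, then backpropagate the query-set loss through those steps, with style-mixed batch-norm statistics perturbing the query features as in MLR.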

2.3 Multimodal and Fusion Models

Recent work exploits complementary information, notably from textual modalities. Representative models include DMCL, which applies multimodal attention across visual and text features, and GPF-Net, which fuses the two streams through gated multimodal fusion (Xiang et al., 2024, Xiang et al., 25 Dec 2025).

3. Database Retrieval, Indexing, and Decision Mechanisms

Polyp ReID ultimately translates to retrieval in a database of historic polyp representations:

  • Hash-based indexing: Feature embedding vectors (from encoder/output fusion blocks) are quantized to binary codes for ball-tree (Hamming distance) or cosine-space (raw vector) nearest-neighbor queries, supporting sub-millisecond lookup even at 10⁵–10⁶ images (Yang et al., 2024, Yang et al., 23 Jul 2025).
  • Majority voting over nearest neighbors: For diagnostic classification, retrieved cases’ clinical labels inform the query decision via argmax over label counts—a “digital twin” paradigm (Yang et al., 2024).
  • Clustering for counting: Affinity propagation or agglomerative clustering groups tracklets into distinct polyp entities, with fragmentation rate (clusters/polyps) as a key metric (Parolari et al., 14 Feb 2025).
  • Scene representation transformer fusion: Multiple implicit 3D views are fused into a single latent representation, discretized via a hashing layer (Yang et al., 23 Jul 2025).
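The hashing and majority-vote steps above can be sketched as follows. Sign thresholding, the 32-bit code length, and k=3 are illustrative assumptions; a production system would use a learned hashing layer and a ball-tree index as described in the cited work:

```python
import numpy as np
from collections import Counter

def binarize(embeddings):
    """Quantize real-valued embeddings into binary codes by sign thresholding."""
    return (np.asarray(embeddings) > 0).astype(np.uint8)

def hamming_knn(query_code, db_codes, k=5):
    """Indices of the k database codes nearest to the query in Hamming distance."""
    dists = (db_codes != query_code).sum(axis=1)
    return np.argsort(dists, kind="stable")[:k]

def majority_vote(neighbor_labels):
    """Query diagnosis = most frequent clinical label among retrieved cases."""
    return Counter(neighbor_labels).most_common(1)[0][0]
```

Because Hamming distance over packed binary codes is a cheap XOR-and-popcount operation, this lookup pattern is what makes sub-millisecond retrieval over 10⁵–10⁶ images feasible.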

4. Quantitative Benchmarks and Technical Outcomes

Multiple benchmarks and metrics rigorously characterize method performance:

| Method | Benchmark | mAP | Rank-1 | Rank-5 | Notable Outcome |
|---|---|---|---|---|---|
| EndoFinder-Raw/Hash | Polyp-Twin | 0.695 | 0.693 | 0.495 | Competitive with supervised models |
| GPF-Net | Colo-Pair | 68.9% | 80.2% | 89.6% | Multimodal gated fusion SOTA |
| DMCL | Colo-Pair | 46.4% | 54.3% | 57.9% | Multimodal attention |
| VT-ReID | Colo-Pair | 37.9% | 23.4% | 44.5% | Semantic clustering effect |
| Colo-ReID (meta-learning) | Colo-Pair | 74.4% | 77.5% | 85.8% | Meta-learning with regulation |
| MVE + Affinity Prop. | REAL-Colon | — | — | — | 6.30 fragmentation rate (lowest; counting) |
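The fragmentation rate reported for counting, defined earlier as clusters per true polyp, reduces to a one-line computation over cluster assignments (a minimal sketch; the benchmark's exact evaluation protocol may differ):

```python
def fragmentation_rate(cluster_ids, polyp_ids):
    """Clusters per true polyp: 1.0 means each polyp was grouped into exactly
    one cluster; higher values indicate over-fragmented counting."""
    return len(set(cluster_ids)) / len(set(polyp_ids))
```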

All methods report improvements over prior unimodal or pooled approaches. Early fusion (multi-frame embedding via transformer attention) consistently exceeds late fusion (mean/max pooling). Multimodal pipelines increase discriminative power especially in challenging clinical scenarios (Xiang et al., 2024, Xiang et al., 25 Dec 2025).

5. Explainability, Clinical Integration, and Practical Impact

Case-based reasoning—retrieving visually and pathologically annotated “digital twins”—enhances endoscopist trust and provides interpretable support for second-look decision-making (Yang et al., 2024, Yang et al., 23 Jul 2025).

  • Real-time performance: Hash-based retrieval achieves 108 FPS (EndoFinder-Hash), compatible with live video (Yang et al., 2024).
  • Scalability: Systems scale efficiently to hundreds of thousands of images; continual indexing integrates newly annotated cases (Yang et al., 2024).
  • Explainability: Digital twin retrieval displays annotated historical samples, in contrast to black-box classification models (Yang et al., 23 Jul 2025).
  • Integration: On-premise deployment without dependence on network/cloud, facilitating dynamic database updates and clinical adoption (Yang et al., 2024).

6. Current Limitations and Future Research Directions

Limitations across the literature include:

  • Database coverage limitations: Rare, out-of-distribution, or ambiguous polyps may lack adequate historical twins (Yang et al., 2024).
  • Majority-vote inference fragility: Fails in highly ambiguous cases or when erroneous retrieval occurs (Yang et al., 2024).
  • Text annotation dependence: Multimodal models require textual metadata, which may be missing or poorly structured (Xiang et al., 25 Dec 2025, Xiang et al., 2024).
  • Still-image restriction: Few approaches fully exploit temporal video-sequence context; ongoing work seeks to extend to video-level retrieval (Yang et al., 2024).
  • Segmentation mask requirement: Mask errors propagate into self-supervised pretraining (Yang et al., 23 Jul 2025).
  • Attention-map interpretability: Fusion models require deeper analysis of gating/attention weights to understand modality reliance (Xiang et al., 25 Dec 2025).

Future enhancements proposed include adaptive K selection or weighting in retrieval, multimodal inclusion (location, clinical metadata, chromoendoscopy), continual indexing, unstructured text mining, metric-learning variants (e.g., quadruplet loss), and weakly supervised segmentation (Yang et al., 2024, Xiang et al., 25 Dec 2025, Yang et al., 23 Jul 2025).

7. Perspectives and Methodological Connections

Colonoscopic polyp re-identification now draws on and extends state-of-the-art methods from person/object ReID, contrastive pretraining, transformer-based fusion, meta-learning, and clustering. Domain-robust and explainable retrieval increasingly overtakes supervised classification in technical viability, translational deployment, and clinical impact (Yang et al., 2024, Intrator et al., 2023).

A plausible implication is that polyp ReID workflows and frameworks—such as EndoFinder, DMCL, GPF-Net, and Colo-ReID—constitute foundational infrastructure for next-generation automated reporting, quality control, and precision endoscopic oncology. The ongoing fusion of visual, textual, and potentially geometric modalities continues to drive progress in this technically complex, clinically urgent field.
