- The paper introduces CDFormer, a transformer-based framework that mitigates object-background and object-object feature confusion in few-shot object detection.
- It leverages a learnable background token and contrastive learning, achieving up to 12.9 mAP gains in challenging cross-domain settings.
- The architecture supports flexible meta-learning and practical deployment by efficiently adapting to diverse visual domains without extensive fine-tuning.
CDFormer: Architecture and Implications for Cross-Domain Few-Shot Object Detection
The paper "CDFormer: Cross-Domain Few-Shot Object Detection Transformer Against Feature Confusion" (2505.00938) targets the fundamental problem of feature confusion in cross-domain few-shot object detection (CD-FSOD). The authors introduce an end-to-end transformer architecture, CDFormer, which systematically addresses the two main types of feature confusion (object-background and object-object) using two dedicated modules. This work not only advances the state of the art in CD-FSOD but also suggests generalizable strategies for meta-learning and transformer-based detection pipelines.
CD-FSOD entails detecting previously unseen object categories in new domains given only a handful of labeled samples per class. The compound difficulties arise from limited data per class (few-shot) and distributional shift (cross-domain). Existing few-shot detectors, while effective on in-domain benchmarks, degrade substantially when deployed on target domains with visual styles divergent from the training set. The authors identify two primary sources of error:
- Object-Background Confusion: The model fails to separate ambiguous object boundaries from the background, a problem especially prominent in domains with atypical scenery (e.g., underwater imagery or artwork datasets).
- Object-Object Confusion: Semantically similar but distinct classes are not adequately separated, leading to misclassification.
Previous solutions (e.g., CD-ViTO) tackled these issues via hand-crafted feature reweighting and direct feature editing, approaches that lack adaptability and can decrease semantic alignment between support and query features.
CDFormer is designed as a single-stage, transformer-based detection framework with two novel and orthogonal contributions:
- Object-Background Distinguishing (OBD) Module: Introduces a learnable background token that refines feature representations to explicitly separate object and background signals. It comprises two units:
  - Object Feature Enhancement (OFE) Unit: Applied to both the support and query branches. The support branch uses the background token to segregate class features from background features; the query branch uses it to strengthen alignment with the true object classes.
  - Background Feature Learning (BFL) Unit: Enforces explicit supervision of the background embeddings, employing zero vectors as targets for non-object regions and including background predictions in the detection head's outputs.
  Together, these units enable the model to decouple relevant object regions from diverse, domain-specific backgrounds during both training and inference.
- Object-Object Distinguishing (OOD) Module: Applies contrastive learning at the detection head, using an InfoNCE objective between learned class embeddings and support-set features. By pulling matched (positive) pairs together and pushing negative pairs apart in the embedding space, the OOD module increases inter-class distance and reduces misclassifications caused by the feature proximity of similar classes.
The integration of these modules within a DETR/Deformable-DETR style backbone affords the method architectural simplicity and position-agnostic matching capabilities, which are critical for few-shot generalization.
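To make the OBD idea concrete, here is a minimal NumPy sketch of background suppression via a learnable background token. The projection step and the function name `obd_refine` are illustrative stand-ins, not the paper's actual implementation, which learns the refinement end-to-end inside the transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # illustrative feature dimension

# Stand-in for the learnable background token (trained end-to-end in the paper).
bg_token = rng.normal(size=D)

def obd_refine(features: np.ndarray, bg_token: np.ndarray) -> np.ndarray:
    """Suppress the background component of each feature by projecting
    out the direction of the background token."""
    bg_dir = bg_token / np.linalg.norm(bg_token)
    # Subtract the background-aligned component from every feature vector.
    return features - np.outer(features @ bg_dir, bg_dir)

feats = rng.normal(size=(4, D))        # e.g. four support/query patch features
refined = obd_refine(feats, bg_token)
```

After this refinement the features carry no component along the background direction, which is the intuition behind separating object signals from domain-specific backgrounds.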
Redefining the Detection Head
CDFormer's detection head includes background placeholders so the number of classes need not be fixed in advance, and it outputs class probabilities for each query proposal through a sigmoid activation. This design supports arbitrary-shot, multi-way detection, even when test-time class sets differ notably from those seen during training.
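A minimal sketch of such a head, assuming class prototypes derived from the support set plus one background placeholder (the prototype construction and names here are hypothetical simplifications of the paper's head):

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def detection_head(query_emb: np.ndarray,
                   class_protos: np.ndarray,
                   bg_proto: np.ndarray) -> np.ndarray:
    """Score each query proposal against N class prototypes plus one
    background placeholder; sigmoid yields independent per-class
    probabilities, so N can change freely at test time."""
    protos = np.vstack([class_protos, bg_proto[None, :]])  # (N+1, D)
    logits = query_emb @ protos.T                          # (Q, N+1)
    return sigmoid(logits)

rng = np.random.default_rng(1)
Q, N, D = 5, 3, 8                      # proposals, support classes, feature dim
probs = detection_head(rng.normal(size=(Q, D)),
                       rng.normal(size=(N, D)),
                       rng.normal(size=D))
```

Because each class score is an independent sigmoid rather than a softmax over a fixed vocabulary, swapping in a different support set (and hence a different N) at inference requires no architectural change.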
Implementation and Practical Considerations
Key practical aspects for real-world use include:
- Pretraining and Adaptation: The model is pretrained on COCO and fine-tuned on each target domain using k-shot episodes. The framework is agnostic to the support set composition and number of target classes, adjustable at runtime via the support input.
- Computation: The class-agnostic transformers and attention modules introduce additional memory overhead, determined primarily by the number of support examples and query region patches. However, the single-stage design eschews unreliable cross-domain region proposals, a consistent weakness of RPN-based methods.
- Loss Functions: The OOD module is trained with an InfoNCE loss; the OBD module uses standard classification and localization losses, augmented with an explicit background class.
- Generalization: Ablation studies reveal that OBD and OOD independently and jointly improve detection, particularly in cases of severe feature confusion (NEU-DET, UODD). Without fine-tuning, CDFormer substantially outperforms existing non-fine-tuning baselines.
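The InfoNCE objective mentioned above can be sketched for a single anchor as follows; the temperature value and the use of cosine similarity are common defaults assumed here, not values confirmed by the paper:

```python
import numpy as np

def info_nce(query: np.ndarray,
             positive: np.ndarray,
             negatives: np.ndarray,
             temperature: float = 0.07) -> float:
    """InfoNCE loss for one anchor: negative log-softmax of the positive
    similarity among 1 + K temperature-scaled cosine similarities."""
    q = query / np.linalg.norm(query)
    p = positive / np.linalg.norm(positive)
    n = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    logits = np.concatenate(([q @ p], n @ q)) / temperature
    logits -= logits.max()              # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

rng = np.random.default_rng(2)
anchor = rng.normal(size=8)
negs = rng.normal(size=(16, 8))
loss_aligned = info_nce(anchor, anchor, negs)            # positive == anchor
loss_random = info_nce(anchor, rng.normal(size=8), negs)
```

Minimizing this loss pulls each class embedding toward its matching support features and away from the other classes, which is exactly the inter-class separation the OOD module targets.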
Empirical Results
CDFormer achieves notable improvements of 12.9, 11.0, and 10.4 mAP over the previous state of the art in 1-shot, 5-shot, and 10-shot cross-domain settings, respectively. Gains are most pronounced on difficult out-of-domain datasets where previous methods fail to resolve background ambiguities or errors induced by class similarity, as highlighted by confusion matrices and qualitative visualizations.
Theoretical and Practical Implications
- Generalizability: By employing learnable background representations and contrastive objectives for inter-class separation, CDFormer operates robustly even on domains far from the source training set, demonstrating the utility of modular latent representations over hand-crafted or directly manipulated features.
- Deployment: The method’s single-stage design, reliance on meta-input (support set), and its runtime flexibility (variable classes, shots) make it well-suited for real-world deployments—particularly for robotics, remote sensing, or any context where labeled data is scarce and domain shifts are natural.
- Module Transferability: The OBD and OOD modules embody general principles (explicit background modeling, contrastive inter-class separation) that can be ported to other detection architectures, suggesting new directions for domain-agnostic perception models.
Directions for Future Research
- Unsupervised Domain Adaptation: Extending the OBD/OOD paradigm to fully unsupervised settings where not even few-shot labels are available per target class.
- Non-Visual Modalities: Applying similar mechanisms to multi-modal perception systems (e.g., vision-language or sensor fusion), leveraging learnable tokens for modality-specific noise or background signals.
- Efficient Support Set Selection: Dynamic selection and weighting of support samples to further enhance domain adaptation in resource-limited scenarios.
- Scalability: Investigation of scalability to higher-way, higher-shot regimes and extremely large-scale open-vocabulary detection.
Conclusion
CDFormer provides an effective and efficient solution for cross-domain few-shot detection by directly targeting the sources of feature confusion with architectural and loss-based innovations. Its superiority in both fine-tuning and non-fine-tuning settings, and its applicability to diverse target domains, position it as a reference point for future meta- and cross-domain detection research, as well as a practical model for challenging real-world detection problems.