
CDFormer: Cross-Domain Few-Shot Object Detection Transformer Against Feature Confusion

Published 2 May 2025 in cs.CV and cs.AI | (2505.00938v1)

Abstract: Cross-domain few-shot object detection (CD-FSOD) aims to detect novel objects across different domains with limited class instances. Feature confusion, including object-background confusion and object-object confusion, presents significant challenges in both cross-domain and few-shot settings. In this work, we introduce CDFormer, a cross-domain few-shot object detection transformer against feature confusion, to address these challenges. The method specifically tackles feature confusion through two key modules: object-background distinguishing (OBD) and object-object distinguishing (OOD). The OBD module leverages a learnable background token to differentiate between objects and background, while the OOD module enhances the distinction between objects of different classes. Experimental results demonstrate that CDFormer outperforms previous state-of-the-art approaches, achieving 12.9% mAP, 11.0% mAP, and 10.4% mAP improvements under the 1/5/10 shot settings, respectively, when fine-tuned.

Summary

  • The paper introduces CDFormer, a transformer-based framework that mitigates object-background and object-object feature confusion in few-shot object detection.
  • It leverages a learnable background token and contrastive learning, achieving up to 12.9 mAP gains in challenging cross-domain settings.
  • The architecture supports flexible meta-learning and practical deployment by efficiently adapting to diverse visual domains without extensive fine-tuning.

CDFormer: Architecture and Implications for Cross-Domain Few-Shot Object Detection

The paper "CDFormer: Cross-Domain Few-Shot Object Detection Transformer Against Feature Confusion" (2505.00938) targets the fundamental problem of feature confusion in cross-domain few-shot object detection (CD-FSOD). The authors introduce an end-to-end transformer architecture, CDFormer, which systematically addresses the two main types of feature confusion—object-background and object-object—using two dedicated modules. This work not only advances the state of the art in CD-FSOD but also suggests generalizable strategies for meta-learning and transformer-based detection pipelines.

Problem Formulation and Background

CD-FSOD entails detecting previously unseen object categories in new domains given only a handful of labeled samples per class. The compound difficulties arise from limited data per class (few-shot) and distributional shift (cross-domain). Existing few-shot detectors, while effective on in-domain benchmarks, degrade substantially when deployed on target domains with visual styles divergent from the training set. The authors identify two primary sources of error:

  • Object-Background Confusion: The model fails to separate ambiguous object boundaries from the background, a problem especially prominent in domains with atypical scenery (e.g., underwater or artwork datasets).
  • Object-Object Confusion: Semantically similar but distinct classes are not adequately separated in feature space, leading to misclassification.

Previous solutions (e.g., CD-ViTO) tackled these issues via hand-crafted feature reweighting and direct feature editing, approaches that lack adaptability and can decrease semantic alignment between support and query features.

CDFormer Architecture

CDFormer is designed as a single-stage, transformer-based detection framework with two novel and orthogonal contributions:

  1. Object-Background Distinguishing (OBD) Module: Introduces a learnable background token that refines feature representations to explicitly separate object and background signals.
  • Object Feature Enhancement (OFE) Unit is applied to both support and query branches. The support set leverages the background token to segregate class and background features; the query branch uses it to accentuate alignment with true object classes.
  • Background Feature Learning (BFL) Unit enforces explicit supervision of the background embeddings, employing zero vectors for non-object regions and including background predictions in the detection head's outputs.

This mechanism enables the model to decouple relevant object regions from diverse, domain-specific backgrounds during both training and inference.
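The background-token idea can be sketched in a few lines. The snippet below is a minimal, illustrative variant only: the paper does not specify this exact layer layout, and the module name, head count, and cross-attention formulation are assumptions made here for intuition.

```python
import torch
import torch.nn as nn

class BackgroundTokenOFE(nn.Module):
    """Illustrative sketch of an object-feature-enhancement step that uses a
    learnable background token to separate object and background signals.
    A minimal cross-attention variant, not the paper's exact architecture."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One learnable token trained to absorb background content.
        self.bg_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches, dim) support or query features.
        b = feats.size(0)
        bg = self.bg_token.expand(b, -1, -1)
        # Every patch attends over [background token; patches]; content
        # explained by the background token is separated from object signal.
        kv = torch.cat([bg, feats], dim=1)
        enhanced, _ = self.attn(feats, kv, kv)
        return self.norm(feats + enhanced)

x = torch.randn(2, 49, 256)
out = BackgroundTokenOFE(256)(x)
print(out.shape)  # torch.Size([2, 49, 256])
```

In a real pipeline this unit would be applied to both the support and query branches, as the OFE description above indicates.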

  2. Object-Object Distinguishing (OOD) Module: Applies contrastive learning at the detection head, leveraging an InfoNCE objective between learned class embeddings and support-set features. By maximizing the mutual information between true pairs and minimizing it for negative pairs, the OOD module increases inter-class distance and reduces misclassification due to feature proximity of similar classes.
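A common formulation of the InfoNCE objective described for the OOD module is cross-entropy over cosine similarities between features and class embeddings. The sketch below uses that standard formulation; the function name, tensor shapes, and temperature value are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(class_embed, support_feats, labels, tau=0.07):
    """Hedged sketch of an InfoNCE loss for the OOD module: pull each
    support feature toward its own class embedding and push it away from
    the embeddings of other classes.

    class_embed:   (num_classes, dim) learned class embeddings
    support_feats: (n, dim) support-set features
    labels:        (n,) class index of each support feature
    """
    z = F.normalize(support_feats, dim=-1)
    c = F.normalize(class_embed, dim=-1)
    logits = z @ c.t() / tau  # (n, num_classes) similarity scores
    # Cross-entropy over similarities == InfoNCE with in-batch negatives.
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(5, 128), torch.randn(20, 128),
                torch.randint(0, 5, (20,)))
print(float(loss) >= 0)  # True
```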

The integration of these modules within a DETR/Deformable-DETR style backbone affords the method architectural simplicity and position-agnostic matching capabilities, which are critical for few-shot generalization.

Redefining the Detection Head

CDFormer incorporates background placeholders for unknown class cardinalities and outputs class probabilities for each query proposal via a sigmoid-based detection head. This mapping supports arbitrary-shot, multi-way detection—even when test-time class sets differ notably from training.
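A sigmoid-based head that scores proposals against a runtime-variable set of class prototypes (plus a background placeholder) might look like the following. This is a minimal sketch under assumed names and shapes; the paper's actual head may differ in detail.

```python
import torch
import torch.nn as nn

class SigmoidDetectionHead(nn.Module):
    """Minimal sketch of a sigmoid-based detection head that scores each
    query proposal against a variable number of class prototypes, with an
    extra background slot. Layer names and shapes are illustrative."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.bbox = nn.Linear(dim, 4)  # box regression per proposal

    def forward(self, queries, class_protos):
        # queries:      (batch, num_queries, dim) decoder outputs
        # class_protos: (num_classes + 1, dim), last row = background slot
        q = self.proj(queries)
        logits = q @ class_protos.t()          # (batch, num_queries, C + 1)
        probs = logits.sigmoid()               # independent per-class scores
        boxes = self.bbox(queries).sigmoid()   # normalized box coordinates
        return probs, boxes

head = SigmoidDetectionHead(256)
probs, boxes = head(torch.randn(2, 100, 256), torch.randn(6, 256))
print(probs.shape, boxes.shape)  # (2, 100, 6) (2, 100, 4)
```

Because the class dimension comes from the prototype tensor rather than a fixed classifier weight, the same head handles test-time class sets that differ from training, which is the property the section above describes.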

Implementation and Practical Considerations

Key practical aspects for real-world use include:

  • Pretraining and Adaptation: The model is pretrained on COCO and fine-tuned on each target domain using k-shot episodes. The framework is agnostic to the support set composition and number of target classes, adjustable at runtime via the support input.
  • Computation: The use of class-agnostic transformers and attention modules introduces additional memory overhead, primarily determined by the number of support examples and query region patches. However, the single-stage design eschews the need for unreliable cross-domain region proposals, a consistent weakness of RPN-based methods.
  • Loss Functions: An InfoNCE loss drives the OOD module; standard classification and localization losses (including an explicit background class) supervise OBD.
  • Generalization: Ablation studies reveal that OBD and OOD independently and jointly improve detection, particularly in cases of severe feature confusion (NEU-DET, UODD). Without fine-tuning, CDFormer substantially outperforms existing non-fine-tuning baselines.
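The runtime-adjustable support input described above amounts to sampling a k-shot episode per target domain. A toy version of that sampling step is sketched below; the annotation format and function name are illustrative assumptions, not the paper's data pipeline.

```python
import random
from collections import defaultdict

def sample_episode(annotations, k=5, num_classes=None, seed=None):
    """Toy sketch of building a k-shot support set from target-domain
    annotations. `annotations` is a list of (image_id, class_name) pairs;
    this structure is illustrative only."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for img_id, cls in annotations:
        by_class[cls].append(img_id)
    classes = list(by_class)
    if num_classes is not None:
        classes = rng.sample(classes, num_classes)
    # Pick up to k support instances per sampled class.
    return {c: rng.sample(by_class[c], min(k, len(by_class[c])))
            for c in classes}

anns = [(i, f"class_{i % 3}") for i in range(30)]
support = sample_episode(anns, k=5, seed=0)
print(all(len(v) == 5 for v in support.values()))  # True
```

Since both the class set and k are arguments, the same sampler covers 1/5/10-shot settings and arbitrary numbers of target classes.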

Empirical Results

CDFormer achieves notable improvements of 12.9, 11.0, and 10.4 mAP over the previous state of the art in 1-shot, 5-shot, and 10-shot cross-domain settings, respectively. Gains are most pronounced in difficult out-of-domain datasets where previous methods fail to resolve background ambiguities or class similarity-induced errors—highlighted by confusion matrices and qualitative visualizations.

Theoretical and Practical Implications

  • Generalizability: By employing learnable background representations and contrastive objectives for inter-class separation, CDFormer operates robustly even on domains far from the source training set, demonstrating the utility of modular latent representations over hand-crafted or directly manipulated features.
  • Deployment: The method’s single-stage design, reliance on meta-input (support set), and its runtime flexibility (variable classes, shots) make it well-suited for real-world deployments—particularly for robotics, remote sensing, or any context where labeled data is scarce and domain shifts are natural.
  • Module Transferability: The OBD and OOD modules embody general principles (explicit background modeling, contrastive inter-class separation) that can be ported to other detection architectures, suggesting new directions for domain-agnostic perception models.

Directions for Future Research

  • Unsupervised Domain Adaptation: Extending the OBD/OOD paradigm to fully unsupervised settings where not even few-shot labels are available per target class.
  • Non-Visual Modalities: Applying similar mechanisms to multi-modal perception systems (e.g., vision-language or sensor fusion), leveraging learnable tokens for modality-specific noise or background signals.
  • Efficient Support Set Selection: Dynamic selection and weighting of support samples to further enhance domain adaptation in resource-limited scenarios.
  • Scalability: Investigation of scalability to higher-way, higher-shot regimes and extremely large-scale open-vocabulary detection.

Conclusion

CDFormer provides an effective and efficient solution for cross-domain few-shot detection by directly targeting the sources of feature confusion with architectural and loss-based innovations. Its superiority in both fine-tuning and non-fine-tuning settings, and its applicability to diverse target domains, position it as a reference point for future meta- and cross-domain detection research, as well as a practical model for challenging real-world detection problems.
