Towards Data-Efficient Detection Transformers
This paper addresses the critical issue of data efficiency in detection transformers, which perform well on large datasets like COCO but suffer significant performance drops on smaller datasets such as Cityscapes. Through a step-by-step transformation from Sparse R-CNN, known for its data efficiency, to the representative DETR model, the authors empirically identify the key factors behind the data inefficiency of detection transformers. Their findings suggest that sparse feature sampling from local image regions is crucial to mitigating the data-hungry nature of detection transformers.
The authors propose a straightforward yet impactful modification to existing detection transformers: altering how the key and value sequences are constructed in the cross-attention layers of the transformer decoder, which requires only minor changes to the original models. Alongside this, they introduce a novel label augmentation method that provides richer supervision and further improves data efficiency.
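To make the key/value modification concrete, here is a minimal PyTorch sketch of sampling decoder key/value tokens from local image regions instead of the dense feature map. It bilinearly samples a small grid of features inside each predicted box via `grid_sample`; the function name, box format, and grid size are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def sample_box_features(feats, boxes, grid=4):
    """Sample a grid x grid patch of features inside each predicted box.

    feats: (B, C, H, W) backbone feature map.
    boxes: (B, N, 4) normalized (x1, y1, x2, y2) boxes in [0, 1].
    Returns (B, N, grid*grid, C) tokens, which can replace the dense H*W
    key/value sequence in the decoder's cross-attention (illustrative sketch).
    """
    B, C, _, _ = feats.shape
    N = boxes.shape[1]
    # Unit grid of sampling-point centers inside a box
    t = torch.linspace(0.5 / grid, 1 - 0.5 / grid, grid, device=feats.device)
    gy, gx = torch.meshgrid(t, t, indexing="ij")                 # (grid, grid)
    x1, y1, x2, y2 = boxes.unbind(-1)                            # each (B, N)
    # Map the unit grid into each box's extent
    px = x1[..., None, None] + gx * (x2 - x1)[..., None, None]   # (B, N, grid, grid)
    py = y1[..., None, None] + gy * (y2 - y1)[..., None, None]
    # grid_sample expects coordinates in [-1, 1], ordered (x, y)
    coords = torch.stack([px * 2 - 1, py * 2 - 1], dim=-1)
    coords = coords.view(B, N * grid, grid, 2)
    sampled = F.grid_sample(feats, coords, align_corners=False)  # (B, C, N*grid, grid)
    return sampled.view(B, C, N, grid * grid).permute(0, 2, 3, 1)
```

With N queries this shrinks the cross-attention key/value length from H*W tokens to N * grid**2, which is the sparsity the paper identifies as important.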
Key Findings and Contributions
Data Efficiency Problem Identification: The authors document a stark performance contrast between detection transformers and CNN-based object detectors like Faster R-CNN on small datasets. They highlight that existing detection transformers are generally data-hungry, which is detrimental given the resources required to curate large datasets.
Empirical Analysis through Model Transition: By incrementally transforming Sparse R-CNN into DETR, the study isolates the factors that affect data efficiency:
- Sparse feature sampling from local image regions.
- The utilization of multi-scale features made feasible through sparse sampling.
- Making predictions relative to initial spatial priors, which reduces how much locality the model must learn from data.
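The third factor, predicting relative to spatial priors, can be sketched as refining a reference box with small offsets in logit space, in the style popularized by Deformable DETR. The function and argument names below are illustrative assumptions, not the paper's API.

```python
import torch

def refine_boxes(reference_points, deltas, eps=1e-5):
    """Predict boxes as offsets from initial spatial priors (illustrative sketch).

    reference_points: (N, 4) normalized (cx, cy, w, h) priors in (0, 1).
    deltas:           (N, 4) unbounded offsets from a prediction head.
    Returns refined boxes, still normalized to (0, 1).
    """
    ref = reference_points.clamp(eps, 1 - eps)
    # Apply the offset in logit (inverse-sigmoid) space so the refined box
    # stays in (0, 1); the head only learns small corrections to the prior.
    inv_sig = torch.log(ref / (1 - ref))
    return torch.sigmoid(inv_sig + deltas)
```

Because the head predicts residuals rather than absolute coordinates, zero output already yields a sensible box, which is one plausible reason this design needs less data to learn localization.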
Proposed Solutions:
- Sparse Feature Sampling: The paper presents a method that samples features based on predicted bounding boxes and integrates these within the decoder, allowing for enriched data context while maintaining minimal model alterations.
- Multi-scale Feature Incorporation: By sampling from multi-scale features, the modified detection transformers leverage additional context without excessive computational cost.
- Label Augmentation Strategy: The authors enhance supervision by repeating positive labels, thereby enriching the training signal for detection transformers.
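The label augmentation idea above can be sketched as simply repeating each ground-truth annotation before bipartite matching, so that several queries can be matched as positives per object. This is a minimal illustration under that assumption; the paper's exact repetition scheme may differ.

```python
import torch

def augment_labels(gt_labels, gt_boxes, repeats=2):
    """Repeat each ground-truth (label, box) pair `repeats` times.

    gt_labels: (M,) class indices; gt_boxes: (M, 4) boxes.
    Returns tensors of length M * repeats, enlarging the positive set
    that Hungarian matching can assign to queries (illustrative sketch).
    """
    aug_labels = gt_labels.repeat_interleave(repeats, dim=0)
    aug_boxes = gt_boxes.repeat_interleave(repeats, dim=0)
    return aug_labels, aug_boxes
```

With one-to-one matching, repeating targets means more queries receive a positive training signal per image, which is the richer supervision the authors aim for.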
Experimental Validation: Extensive experiments demonstrate that the proposed modifications significantly enhance the performance of detection transformers on small datasets while maintaining, if not improving, performance on larger datasets like COCO. Specifically, the proposed methods achieve substantial gains on the Cityscapes dataset and improved data efficiency on sub-sampled versions of COCO.
Implications and Future Directions
The implications of this research are manifold. Practically, the ability to reduce the data demands of detection transformers expands their usability in real-world applications where data is scarce or costly to annotate. Theoretically, this work furthers our understanding of transformer architectures in the vision domain, providing insights into the integration of inductive biases commonly used in CNNs into transformer-based models.
Looking forward, the exploration of data-efficient architectures holds promise for transforming diverse applications across AI. Future studies could explore the extension of these principles to other vision tasks, such as segmentation or 3D object detection, potentially leading to a broader paradigm shift in how transformers are designed and trained. Additionally, these findings may spark interest in devising new architecture designs that inherently account for data efficiency without relying on extensive pre-training.