Omni-DETR: Unified Omni-Supervised Object Detection
- Omni-DETR is a unified object detection framework that integrates fully labeled, weakly labeled, and unlabeled data to enhance performance.
- It uses a dual-network student-teacher transformer architecture with EMA updates and a novel bipartite matching-based pseudo-label filtering mechanism.
- Experimental results demonstrate significant mAP improvements on benchmarks like COCO and VOC while optimizing the annotation cost-accuracy trade-off.
Omni-DETR is a unified end-to-end object detection framework that enables omni-supervised learning, leveraging not only fully labeled data but also a spectrum of weak annotations—including image tags, counts, points, and various forms of bounding boxes—and unlabeled data, within a transformer-based architecture. It integrates advances in end-to-end detection with transformers (specifically, Deformable DETR) and mean-teacher-style semi-supervised learning, using a novel bipartite matching-based pseudo-labeling mechanism. This approach enables state-of-the-art detection performance and optimizes the accuracy–annotation cost trade-off across diverse dataset regimes (Wang et al., 2022).
1. Unified Model Architecture
Omni-DETR is structured as a dual-network system comprising a “student” detector and a “teacher” detector, both instantiated as Deformable DETR variants with a ResNet-50 backbone, a transformer encoder and decoder, and a fixed set of object queries. Key architectural details include:
- Backbone and Feature Extraction: Input images are encoded via a ResNet-50 pretrained on ImageNet. The resultant feature maps, augmented with positional encodings, are input to a multi-head transformer encoder.
- Transformer Decoder and Query Mechanism: A fixed set of learned object queries cross-attends with the encoded features to produce per-query embeddings, each yielding a per-class classification logit vector and a 4D bounding box regression output.
- Student–Teacher Framework: The student network is trained via stochastic gradient descent (SGD) using both ground truth and pseudo-labeled examples. The teacher network is updated as an exponential moving average (EMA) of the student’s weights with momentum $\alpha$, and generates pseudo-labels on weakly-augmented images to supervise the student, which receives strongly-augmented inputs.
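The EMA teacher update can be sketched as follows. This is a minimal illustration with parameters held as a dict of numpy arrays rather than a full detector; the momentum value 0.999 is a typical mean-teacher default, not necessarily the paper's exact setting.

```python
import numpy as np


def ema_update(teacher_params, student_params, alpha=0.999):
    """In-place EMA: teacher <- alpha * teacher + (1 - alpha) * student,
    applied per parameter tensor. The teacher receives no gradients; it
    evolves only through this update after each student step."""
    for name, s_w in student_params.items():
        teacher_params[name] = alpha * teacher_params[name] + (1.0 - alpha) * s_w
    return teacher_params
```

With a large `alpha`, the teacher changes slowly, which stabilizes the pseudo-labels it produces for the student.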
2. Omni-Supervised Learning Paradigm
Omni-DETR generalizes the supervision signal to accept a mixture of:
- Fully labeled examples, carrying standard bounding box and class label annotations.
- Omni-labeled (weakly labeled) examples, whose weak annotation can be:
- None (no annotation), supporting standard semi-supervised detection.
- TagsU (image-level class tags), TagsK (tags with counts), PointsU (points), PointsK (points with class), BoxesU (unlabeled boxes), BoxesEC (extreme-click boxes).
The workflow involves producing two augmentations for each omni-labeled image: weak (flip) for the teacher; strong (flip + crop + color) for the student. The teacher generates detection hypotheses which are filtered, given weak labels, to produce pseudo-labels. The student is supervised to match these pseudo-labels on the strong view.
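The weak-view/strong-view workflow above can be sketched as a single schematic training step. All callables here are illustrative placeholders (this is not the paper's API): `teacher_predict`, `filter_pseudo`, and `student_loss` stand in for the teacher forward pass, the matching-based filter of Section 3, and the student's detection loss.

```python
def omni_training_step(image, weak_label, teacher_predict, filter_pseudo,
                       student_loss, weak_aug, strong_aug):
    """One schematic omni-supervised step.

    The teacher sees a weakly augmented view (flip only) and its detections
    are filtered against the weak label into pseudo-labels; the student is
    then supervised on a strongly augmented view (flip + crop + color).
    """
    weak_view = weak_aug(image)        # teacher input: flip only
    strong_view = strong_aug(image)    # student input: flip + crop + color
    hypotheses = teacher_predict(weak_view)         # teacher detections
    pseudo = filter_pseudo(hypotheses, weak_label)  # label-type-aware filter
    return student_loss(strong_view, pseudo)        # loss to backpropagate
```

In practice this loss is accumulated alongside the fully supervised loss, and the EMA update of the teacher runs after each optimizer step.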
3. Bipartite Matching-Based Pseudo-Label Filtering
Selection of pseudo-labels from the teacher’s detections is formulated as a global bipartite matching (Hungarian algorithm) between the weak ground truth “targets” and the teacher predictions. This mechanism enables a principled and label-type-agnostic filtering of teacher outputs, outperforming hand-crafted heuristics. The assignment problem is defined as:

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{C}_{\text{match}}\big(y_i, \hat{y}_{\sigma(i)}\big),$$

where $\hat{y} = (\hat{p}, \hat{b})$ contains predicted class probabilities and bounding boxes. The matching cost $\mathcal{C}_{\text{match}}$ is adapted to the weak-label type:
- None (Unlabeled): Filter detections by a confidence threshold $\tau$.
- TagsU, TagsK: Cost depends on the predicted class probabilities $\hat{p}$, selecting the highest-confidence boxes per class tag (with or without known counts).
- PointsU, PointsK: Cost blends normalized Euclidean distance, prediction confidence, and class information.
- BoxesU, BoxesEC: Cost leverages the DETR box loss with class inferred from teacher probabilities.
After assignment, pseudo-labels are constructed, utilizing either the ground truth class or the teacher’s prediction depending on label type.
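A simplified version of the point-based matching cost can be made concrete as below. This is a toy sketch, not the paper's exact cost: the cost blends point-to-box-center distance with (negated) prediction confidence, and brute-force enumeration over permutations stands in for the Hungarian algorithm, which is only viable for small numbers of targets.

```python
from itertools import permutations

import numpy as np


def match_points_to_boxes(points, boxes, scores, dist_weight=1.0,
                          conf_weight=1.0):
    """Toy bipartite matching between annotated points and teacher boxes.

    cost[i, j] = dist_weight * dist(point_i, center of box_j)
               - conf_weight * confidence of box_j
    Returns, for each point i, the index of its matched box; that box then
    becomes the pseudo-label for point i.
    """
    centers = np.array([[(x1 + x2) / 2, (y1 + y2) / 2]
                        for x1, y1, x2, y2 in boxes])
    pts = np.asarray(points, dtype=float)
    dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=-1)
    cost = dist_weight * dists - conf_weight * np.asarray(scores)[None, :]
    best, best_cost = None, np.inf
    # Brute-force assignment (stand-in for the Hungarian algorithm).
    for perm in permutations(range(len(boxes)), len(points)):
        c = sum(cost[i, j] for i, j in enumerate(perm))
        if c < best_cost:
            best, best_cost = perm, c
    return list(best)
```

For realistic query counts (e.g., 300 in Deformable DETR), `scipy.optimize.linear_sum_assignment` would replace the permutation loop.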
4. Supervision, Losses, and Optimization
The student is optimized to minimize the sum of detection losses over three supervision streams (fully labeled, weakly labeled, and unlabeled data):

$$\mathcal{L} = \mathcal{L}_{\text{det}}^{\text{full}} + \mathcal{L}_{\text{det}}^{\text{weak}} + \mathcal{L}_{\text{det}}^{\text{unlab}}.$$

Each detection loss decomposes as:

$$\mathcal{L}_{\text{det}} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{box}},$$

with $\mathcal{L}_{\text{cls}}$ being the focal classification loss and $\mathcal{L}_{\text{box}}$ consisting of gIoU plus $\ell_1$ terms, in line with DETR conventions. No additional consistency or distillation term is used; the teacher evolves solely via EMA.
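The loss components can be sketched for a single box/label pair. The focal loss and gIoU follow their standard definitions; the box-term weights 5.0 ($\ell_1$) and 2.0 (gIoU) mirror common DETR defaults and are an assumption here, not values stated in this summary.

```python
import numpy as np


def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal classification loss on per-query logits; targets are 0/1."""
    p = 1.0 / (1.0 + np.exp(-logits))
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return np.mean(alpha_t * (1 - p_t) ** gamma * -np.log(p_t))


def giou(a, b):
    """Generalized IoU for two (x1, y1, x2, y2) boxes; 1.0 for identical."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Smallest axis-aligned box enclosing both.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area


def box_loss(pred, gt, w_l1=5.0, w_giou=2.0):
    """DETR-style box term: weighted L1 on coordinates plus (1 - gIoU)."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return w_l1 * l1 + w_giou * (1.0 - giou(pred, gt))
```

The full per-query loss is then the sum of `sigmoid_focal_loss` on the classification logits and `box_loss` on the matched box.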
5. Experimental Findings and Performance Metrics
Empirical validation demonstrates the efficacy of omni-supervised training:
- COCO-standard with 10% labeled images: Baseline supervised mAP: 28.0. Addition of 90% unlabeled (SSOD) yields +4.4 mAP, while incorporating weak labels yields further gains:
- TagsU: 34.7 (+6.7)
- TagsK: 35.2 (+7.2)
- PointsU: 34.1 (+6.1)
- PointsK: 35.7 (+7.7)
- BoxesEC: 36.4 (+8.4)
- BoxesU: 36.8 (+8.8)
- Comparison to semi-supervised SOTA: On COCO with 5% labels, Omni-DETR achieves 30.2 mAP, exceeding Unbiased Teacher’s 28.3; on VOC0712, 53.4 AP vs 48.7.
- Weakly supervised object detection (tags/points): With 35K COCO tag-only images, Omni-DETR achieves a +5.1 mAP gain (34.3 → 39.4) vs UFO²’s +0.3. Adding points on top of tags further increases gains.
Annotation–cost analysis across datasets (Bees, CrowdHuman, COCO, etc.) shows that mixtures of weak labels (e.g., extreme-click boxes and points) plus a small number of full labels consistently outperform using the same budget for only full annotation. For example, a 52% mAP on Bees is achieved at lower annotation effort by mixing 10% full, 46% TagsK, and 44% BoxesEC labels, compared to full annotation.
6. Principal Insights and Implications
Key findings include:
- Any form of weak annotation distinctly improves detection over pure unlabeled SSOD, even with a strong Deformable DETR baseline.
- High-quality full box labels (BoxesU) yield the maximum performance increase; extreme-click boxes (BoxesEC) achieve nearly equivalent accuracy (about 0.3 mAP lower) at roughly one-fifth of the annotation cost.
- Tags with counts (TagsK) outperform plain tags (TagsU); similarly, PointsK surpass PointsU.
- The bipartite matching pseudo-label filter yields consistent gains compared to ad-hoc filtering per label type.
- Mixtures of minimal full annotation with abundant weak labels yield strictly superior accuracy–cost trade-offs compared to full annotation alone.
- The ideal mix is dataset-dependent: points are notably effective and cost-efficient for crowded small-object datasets; image tags are less beneficial as class vocabulary size increases.
Omni-DETR illustrates the feasibility and effectiveness of unifying diverse annotation types in a single transformer-based detector, leveraging a principled global matching strategy for pseudo-labeling, and optimizing both detection accuracy and annotation efficiency across benchmarks (Wang et al., 2022).