Anchor DETR: Interpretable Object Detection
- The paper introduces an explicit anchor-point query mechanism that replaces learned queries in DETR, enhancing optimization and interpretability.
- The method uses Row-Column Decoupled Attention to reduce memory complexity while maintaining competitive accuracy and fast inference speeds.
- Experimental results on the COCO dataset demonstrate that combining anchor-based queries with RCDA yields significant improvements in average precision and training efficiency.
Anchor DETR is a transformer-based object detection framework that introduces an interpretable, anchor-point-driven query formulation for the DETR family of models. By replacing the opaque, learned object queries of the original DETR with explicit anchor point queries—paralleling strategies from CNN-based detectors—Anchor DETR achieves enhanced optimization, faster convergence, and competitive detection accuracy with reduced computational overhead. This approach establishes a bridge between anchor-based detection paradigms and end-to-end transformer detectors, and has influenced subsequent generations of DETR-like models.
1. Anchor-Based Query Design in DETR
Anchor DETR reformulates the object query design in the transformer decoder to embed explicit spatial priors, so that each query attends to a physically defined region of the image. Each anchor point is a normalized two-dimensional coordinate $\mathrm{Pos}_q \in [0,1]^2$, $q = 1, \dots, N_A$, where $N_A$ denotes the number of anchor points (placed on a uniform grid or learned). Each anchor point is encoded to a query position embedding via a two-layer MLP applied to the standard sine–cosine encoding, $Q_p = \mathrm{MLP}\big(g_{\sin}(\mathrm{Pos}_q)\big) \in \mathbb{R}^{N_A \times D}$, where $D$ is the model dimension. To support detection of multiple objects at a single location ("one region, multiple objects"), $N_p$ learnable pattern embeddings $Q^{\mathrm{pattern}}_i \in \mathbb{R}^{D}$ ($N_p = 3$ in the paper) are introduced and tiled across all anchors, giving $N_q = N_p N_A$ queries in total. The initial decoder queries are the sum $Q^{\mathrm{init}} = Q^{\mathrm{pattern}} + Q_p$ (Wang et al., 2021).
This formulation ensures that each object query has an explicit spatial focus, improving optimization by imparting a clearer division of labor to the prediction slots and more direct interpretability.
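The query construction above can be sketched in NumPy. This is an illustrative toy, not the paper's implementation (which is in PyTorch): the MLP weights here are random placeholders standing in for learned parameters, and `sine_encode` is a generic sine–cosine positional encoding assumed to approximate the one used in DETR-family code.

```python
import numpy as np

def sine_encode(pos, num_freqs=64, temperature=10000.0):
    """Sine-cosine encoding of normalized 2D coords: (N, 2) -> (N, 4*num_freqs)."""
    dim_t = temperature ** (np.arange(num_freqs) / num_freqs)        # (num_freqs,)
    scaled = pos[..., None] * 2 * np.pi / dim_t                      # (N, 2, num_freqs)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (N, 2, 2*num_freqs)
    return enc.reshape(pos.shape[0], -1)

rng = np.random.default_rng(0)
D, N_p = 256, 3                        # model dim; number of pattern embeddings
side = 10                              # 10x10 uniform grid -> N_A = 100 anchor points
xs, ys = np.meshgrid(np.linspace(0.05, 0.95, side), np.linspace(0.05, 0.95, side))
anchors = np.stack([xs.ravel(), ys.ravel()], axis=-1)    # (N_A, 2), normalized coords
N_A = anchors.shape[0]

# Two-layer MLP (random placeholder weights) maps the encoding to Q_p.
enc = sine_encode(anchors)                               # (N_A, 256)
W1 = rng.normal(size=(enc.shape[1], D))
W2 = rng.normal(size=(D, D))
Q_p = np.maximum(enc @ W1, 0.0) @ W2                     # (N_A, D), ReLU between layers

# N_p pattern embeddings tiled across all anchors: N_q = N_p * N_A initial queries.
patterns = rng.normal(size=(N_p, D))
Q_init = (np.repeat(Q_p[None], N_p, axis=0) + patterns[:, None]).reshape(N_p * N_A, D)
print(Q_init.shape)  # -> (300, 256)
```

Each of the 300 resulting queries is tied to one of the 100 anchor points, which is what makes the prediction slots spatially interpretable.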
2. Row-Column Decoupled Attention (RCDA)
Anchor DETR introduces Row-Column Decoupled Attention (RCDA) as an efficient variant of standard multi-head attention to address the memory bottleneck in high-resolution feature processing. RCDA factorizes attention computation into separate row and column interactions:
- Given a feature map $F \in \mathbb{R}^{H \times W \times C}$, 1D global average pooling along the vertical and horizontal axes yields the row key $K_x \in \mathbb{R}^{W \times C}$ and the column key $K_y \in \mathbb{R}^{H \times C}$.
- Row attention: $A_x = \mathrm{softmax}\!\big(Q_x K_x^{\top} / \sqrt{d_k}\big) \in \mathbb{R}^{N_q \times W}$.
- Intermediate result: $Z = \mathrm{weighted\_sum}_W(A_x, V) \in \mathbb{R}^{N_q \times H \times d_v}$.
- Column attention: $A_y = \mathrm{softmax}\!\big(Q_y K_y^{\top} / \sqrt{d_k}\big) \in \mathbb{R}^{N_q \times H}$; the final output is $\mathrm{Out} = \mathrm{weighted\_sum}_H(A_y, Z) \in \mathbb{R}^{N_q \times d_v}$.
RCDA reduces the dominant memory cost from $O(N_q H W M)$ for the attention weight maps of standard attention with $M$ heads to roughly $O(N_q H C)$ for the intermediate result $Z$ (assuming $H \ge W$), substantially saving memory at high feature-map resolutions (Wang et al., 2021).
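The factorization above can be made concrete with a single-head NumPy sketch. This is a simplified illustration under assumed shapes (queries already projected to the key dimension, one head, no output projection), not the multi-head implementation from the paper's codebase.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rcda(Q_x, Q_y, F, d_k):
    """Single-head Row-Column Decoupled Attention (sketch).

    Q_x, Q_y: (N_q, d_k) row/column queries.
    F: (H, W, d) feature map, used both for pooled keys and as values.
    """
    K_x = F.mean(axis=0)                                 # (W, d): pool over height -> row key
    K_y = F.mean(axis=1)                                 # (H, d): pool over width  -> column key
    A_x = softmax(Q_x @ K_x.T / np.sqrt(d_k))            # (N_q, W) row attention
    Z = np.einsum('qw,hwd->qhd', A_x, F)                 # (N_q, H, d) intermediate result
    A_y = softmax(Q_y @ K_y.T / np.sqrt(d_k))            # (N_q, H) column attention
    return np.einsum('qh,qhd->qd', A_y, Z)               # (N_q, d) final output

rng = np.random.default_rng(1)
H, W, d, N_q = 32, 48, 64, 300
F = rng.normal(size=(H, W, d))
out = rcda(rng.normal(size=(N_q, d)), rng.normal(size=(N_q, d)), F, d_k=d)
print(out.shape)  # -> (300, 64)
```

Note that no $N_q \times H \times W$ attention map is ever materialized: the largest intermediate tensor is $Z$ of shape $(N_q, H, d)$, which is the source of the memory saving.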
3. Loss Functions and Training Regime
Anchor DETR adopts the same set-based Hungarian matching paradigm as the original DETR for bipartite assignment between predictions and ground-truth targets. The matching cost and total loss combine a focal classification loss, an $L_1$ box-regression loss, and a GIoU loss:

$\mathcal{L} = \lambda_{\mathrm{cls}}\,\mathcal{L}_{\mathrm{focal}} + \lambda_{L_1}\,\mathcal{L}_{L_1} + \lambda_{\mathrm{GIoU}}\,\mathcal{L}_{\mathrm{GIoU}}$

Typical weights are $\lambda_{\mathrm{cls}} = 2$, $\lambda_{L_1} = 5$, $\lambda_{\mathrm{GIoU}} = 2$ (Wang et al., 2021).
Training is performed for 50 epochs on COCO with AdamW and ResNet-50-DC5 as backbone, using standard data augmentations and learning rate scheduling.
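The box-regression terms of the loss can be sketched as follows. This is a minimal NumPy illustration for already-matched prediction/target pairs, using the conventional weights $\lambda_{L_1}=5$, $\lambda_{\mathrm{GIoU}}=2$; the classification (focal) term and the Hungarian matching step are omitted for brevity.

```python
import numpy as np

def giou(boxes1, boxes2):
    """Generalized IoU for matched box pairs in (x1, y1, x2, y2) format."""
    # intersection
    x1 = np.maximum(boxes1[:, 0], boxes2[:, 0]); y1 = np.maximum(boxes1[:, 1], boxes2[:, 1])
    x2 = np.minimum(boxes1[:, 2], boxes2[:, 2]); y2 = np.minimum(boxes1[:, 3], boxes2[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    a1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    a2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    union = a1 + a2 - inter
    iou = inter / union
    # smallest enclosing box
    ex1 = np.minimum(boxes1[:, 0], boxes2[:, 0]); ey1 = np.minimum(boxes1[:, 1], boxes2[:, 1])
    ex2 = np.maximum(boxes1[:, 2], boxes2[:, 2]); ey2 = np.maximum(boxes1[:, 3], boxes2[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return iou - (enclose - union) / enclose

def box_loss(pred, target, lam_l1=5.0, lam_giou=2.0):
    """Weighted L1 + GIoU loss over matched pairs (classification term omitted)."""
    l1 = np.abs(pred - target).sum(axis=1).mean()
    g = (1.0 - giou(pred, target)).mean()
    return lam_l1 * l1 + lam_giou * g

pred   = np.array([[0.1, 0.1, 0.5, 0.5], [0.4, 0.4, 0.9, 0.9]])
target = np.array([[0.1, 0.1, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0]])
print(box_loss(pred, target))  # ~1.585: first pair matches exactly and contributes 0
```

A perfectly matched pair has $L_1 = 0$ and $\mathrm{GIoU} = 1$, so it contributes nothing to the loss; the GIoU term additionally penalizes non-overlapping boxes through the enclosing-box area.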
4. Experimental Performance and Ablation Studies
On the COCO dataset, Anchor DETR attains strong metrics while training with 10× fewer epochs compared to DETR. For single-scale setups with ResNet-50-DC5:
| Method | AP | AP50 | AP75 | FPS |
|---|---|---|---|---|
| DETR (500 ep) | 43.3 | 63.1 | 45.9 | 12 |
| Deformable DETR (50 ep) | 43.8 | 62.6 | 47.7 | 15 |
| SMCA (50 ep) | 43.7 | 63.6 | 47.2 | 10 |
| Anchor DETR (50 ep) | 44.2 | 64.7 | 47.5 | 19 |
Ablations reveal:
- Anchor-based queries alone improve AP over DETR's learned queries;
- RCDA matches the accuracy of standard attention while substantially reducing memory;
- Using both yields the highest AP and best speed. Grid vs. learned anchor initialization yields nearly identical AP (44.1 vs. 44.2) (Wang et al., 2021).
5. Comparative Analysis and Influence on Successors
Anchor DETR marks a pivotal shift by rendering the queries interpretable and physically grounded. Its anchor point formulation and RCDA were adopted and extended in subsequent work:
- DAB-DETR generalizes to dynamic box queries, updating anchor box parameters layer by layer while interpreting their cross-attention as soft ROI pooling (Liu et al., 2022).
- Conditional DETR V2 formalizes "box queries" as embeddings of a reference point and learned box transform, explicitly drawing a parallel to anchor box refinement in Faster R-CNN (Chen et al., 2022).
- Box-DETR further replaces box centers in conditional queries with head-specific agent points encompassing full box information for each cross-attention head, accelerating convergence and improving AP (Liu et al., 2023).
These descendant models empirically show further improvements in both training efficiency and detection AP, substantiating the value of the anchor-inspired query mechanism introduced by Anchor DETR.
6. Practical Impact and Implementation Considerations
Anchor DETR retains DETR's end-to-end inference: despite the anchor-point queries, no non-maximum suppression or hand-crafted region-assignment modules are required at test time. The inference pipeline remains as in DETR: all predictions are directly interpretable as detection candidates, assigned via bipartite matching during training, and scored by a unified loss. Implementation requires only minimal deviation from standard DETR, with public code and trained models available at https://github.com/megvii-research/AnchorDETR (Wang et al., 2021).
Anchor DETR demonstrates that explicit anchor-point queries can reconcile the optimization advantages of anchor-based CNN detectors with the global, permutation-invariant processing of transformers, resulting in faster training, interpretable queries, and competitive transformer detection performance at lower computational and memory cost.