RT-DETRv2-S: Efficient Real-Time Detection Transformer
- The paper introduces RT-DETRv2-S, a compact real-time object detection model that achieves a strong accuracy–speed trade-off using scale-specific sampling and flexible training strategies.
- It employs a ResNet-18 backbone with a hybrid encoder and multi-scale deformable attention, delivering 217 FPS and competitive AP on the COCO dataset.
- The design incorporates a novel discrete sampling operator that reduces CUDA dependency for deployment on resource-constrained devices with only a negligible AP penalty.
RT-DETRv2-S is the "small" variant of the second-generation Real-Time Detection Transformer (RT-DETRv2), designed for high-throughput object detection with strong accuracy–speed trade-offs and deployability enhancements. RT-DETRv2-S employs a ResNet-18 backbone, a hybrid encoder for efficient intra- and inter-scale feature interactions, and a multi-scale deformable attention–based transformer decoder. Improvements in RT-DETRv2-S center on scale-specific sampling, flexible training strategies, and an optional discrete sampling operator that facilitates efficient deployment on resource-constrained devices (Lv et al., 2024).
1. Model Architecture
RT-DETRv2-S comprises four principal stages: a compact convolutional backbone, a hybrid feature encoder, a transformer-based decoder with multi-scale deformable attention, and dual prediction heads for classification and bounding box regression.
- Backbone: ResNet-18, operating on 640×640 input images. Outputs four feature maps, each projected to a common hidden dimension:
- $C_2$: 320×320, 64 ch
- $C_3$: 160×160, 128 ch
- $C_4$: 80×80, 256 ch
- $C_5$: 40×40, 512 ch
- Encoder: Applies local self-attention within each feature scale and fuses cross-scale information via convolutions and resampling, producing refined feature tensors at all four scales.
- Decoder: 6 layers, with $N_q = 100$ object queries, hidden dimension $d = 256$, and $H = 8$ attention heads. Each layer applies multi-scale deformable attention over all four scales, followed by a two-layer feed-forward network.
- Heads:
- Classification: 81-way linear (80 COCO classes plus 'no-object')
- Bounding box: 3-layer MLP ({256, 256, 4})
- Parameters: 20M
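The head shapes above can be sketched in a few lines of numpy. This is an illustrative sketch, not the reference implementation: weights are random, biases are omitted, and the sigmoid box normalization is an assumption carried over from DETR-style detectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_queries, num_classes = 256, 100, 81  # 80 COCO classes + 'no-object'

queries = rng.standard_normal((num_queries, d))  # decoder output queries

# Classification head: a single 81-way linear layer.
W_cls = rng.standard_normal((d, num_classes)) * 0.02
cls_logits = queries @ W_cls

# Box head: 3-layer MLP with layer sizes {256, 256, 4}, ReLU between layers;
# sigmoid on the output (assumed) keeps boxes as normalized (cx, cy, w, h).
W1 = rng.standard_normal((d, 256)) * 0.02
W2 = rng.standard_normal((256, 256)) * 0.02
W3 = rng.standard_normal((256, 4)) * 0.02
h = np.maximum(queries @ W1, 0.0)
h = np.maximum(h @ W2, 0.0)
boxes = 1.0 / (1.0 + np.exp(-(h @ W3)))  # sigmoid -> values in (0, 1)

print(cls_logits.shape, boxes.shape)  # (100, 81) (100, 4)
```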
2. Multi-Scale Deformable Attention
The multi-scale deformable attention module is a core component facilitating efficient global context aggregation while controlling computational cost.
- For each query $z_i$ at a decoder layer, and for each head $h$, scale $l$, and sampling point $k = 1, \dots, K_l$, the module predicts:
- Sampling offsets: $\Delta p_{ihlk} = W^{p}_{hlk}\, z_i$
- Attention logits: $\alpha_{ihlk} = W^{a}_{hlk}\, z_i$
- Normalized attention weights: $a_{ihlk} = \dfrac{\exp(\alpha_{ihlk})}{\sum_{l'} \sum_{k'} \exp(\alpha_{ihl'k'})}$
- Sampled positions: $p_{ihlk} = \phi_l(\hat{p}_i) + \Delta p_{ihlk}$, where $\hat{p}_i$ is the query's normalized reference point and $\phi_l$ rescales it to the resolution of scale $l$
- Output aggregation: $\mathrm{MSDeformAttn}(z_i) = W_o \sum_{h=1}^{H} \Big( \sum_{l} \sum_{k=1}^{K_l} a_{ihlk}\, W_v\, x_l(p_{ihlk}) \Big)$,
where $W_o$ projects back to query space.
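The aggregation step can be sketched in numpy for a single query and head. Shapes and values are toy assumptions; nearest-pixel reads stand in for the bilinear sampling discussed in Section 3, and the learned value/output projections are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
scales = [rng.standard_normal((8, 8, d)),   # toy feature maps x_l (H_l, W_l, d)
          rng.standard_normal((4, 4, d))]
K = 4                                        # sampling points per scale

ref = np.array([0.5, 0.5])                   # normalized reference point
offsets = rng.standard_normal((len(scales), K, 2)) * 0.05  # predicted offsets
logits = rng.standard_normal((len(scales), K))             # attention logits

# Softmax normalizes the weights jointly over all (scale, point) pairs.
w = np.exp(logits - logits.max())
w = w / w.sum()

out = np.zeros(d)
for l, x in enumerate(scales):
    H, W = x.shape[:2]
    for k in range(K):
        p = ref + offsets[l, k]              # shifted normalized position
        iy = int(np.clip(np.round(p[1] * (H - 1)), 0, H - 1))
        ix = int(np.clip(np.round(p[0] * (W - 1)), 0, W - 1))
        out += w[l, k] * x[iy, ix]           # weighted value read

print(out.shape)  # (256,)
```

The result would then be projected back to query space by $W_o$ in the full module.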
- RT-DETRv2-S enables a scale-specific number of sampling points $K_l$ for each feature scale, allowing the accuracy–compute balance to be tuned. The overall attention load is proportional to the total sampling-point count, i.e., to $\sum_l K_l$ per query, head, and decoder layer.
Experimental ablations reveal that this total can be reduced (e.g., from 86,400 to 21,600 points) with only minor AP degradation (≲0.6).
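A back-of-the-envelope helper makes the budget arithmetic concrete. The decomposition below (queries × heads × layers × per-scale points) and the hyperparameter values plugged in are illustrative assumptions; the exact breakdown behind the 86,400-point figure is not reproduced here.

```python
# Total sampling-point budget of the deformable-attention decoder
# under the assumed decomposition queries x heads x layers x sum(K_l).
def sampling_budget(num_queries, num_heads, num_layers, points_per_scale):
    return num_queries * num_heads * num_layers * sum(points_per_scale)

full = sampling_budget(100, 8, 6, [4, 4, 4, 4])     # uniform K_l = 4
quarter = sampling_budget(100, 8, 6, [1, 1, 1, 1])  # scale-specific K_l = 1

# Dropping every K_l from 4 to 1 cuts the budget 4x, mirroring the
# 86,400 -> 21,600 reduction reported in the ablations.
print(full, quarter, full // quarter)  # 76800 19200 4
```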
3. Discrete Sampling Operator
A significant innovation is the replaceability of the grid_sample–based bilinear interpolation (a CUDA-dependent operator) with a discrete_sample operator.
- Conventional sampling: bilinear interpolation over the four nearest pixels, $x_l(p) = \sum_{q \in \mathcal{N}(p)} w_q\, x_l(q)$, with $\mathcal{N}(p)$ the four integer grid neighbours of $p$ and $w_q$ the bilinear weights determined by the fractional part of $p$.
- Discrete sampling: $x_l(\mathrm{round}(p))$, i.e., the feature value at the grid point nearest to $p$ is read directly. This removes the CUDA grid_sample requirement and further simplifies deployment.
- During fine-tuning and inference, gradients w.r.t. the sampling offsets are stopped in the discrete_sample pathway, since rounding is non-differentiable.
- Empirical results demonstrate a negligible AP penalty (≈0.5 AP) for this substitution.
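The two operators can be contrasted in a small numpy sketch. It is illustrative only: a pixel-coordinate convention is assumed, and the production operators work on batched tensors.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Interpolate over the four nearest pixels (differentiable w.r.t. p)."""
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = min(y0 + 1, x.shape[0] - 1), min(x0 + 1, x.shape[1] - 1)
    wy, wx = py - y0, px - x0
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])

def discrete_sample(x, py, px):
    """Read the single nearest pixel; rounding blocks gradients to offsets."""
    iy = int(np.clip(round(py), 0, x.shape[0] - 1))
    ix = int(np.clip(round(px), 0, x.shape[1] - 1))
    return x[iy, ix]

x = np.arange(16, dtype=float).reshape(4, 4)  # stand-in for one feature map
print(bilinear_sample(x, 1.4, 2.6))  # ~8.2: weighted mix of 6, 7, 10, 11
print(discrete_sample(x, 1.4, 2.6))  # 7.0: rounds to pixel (1, 3)
```

The discrete variant trades sub-pixel precision for a plain indexed read, which is what makes it portable to runtimes without a grid_sample kernel.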
4. Training Strategy
RT-DETRv2-S adopts a staged training regime with scale-adaptive optimization.
- Optimizer: AdamW, batch size 16, weight EMA decay=0.9999
- Learning rates: The backbone and detector share the same learning rate in RT-DETRv2-S.
- Augmentation: Uses the full RT-DETR augmentation pipeline (RandomPhotometricDistort, RandomZoomOut, RandomIoUCrop, MultiScaleInput) during all but the last two epochs; the augmentations are disabled for the final two epochs.
- Discrete sampling: The initial 6× training schedule uses grid_sample; a final 1× fine-tuning stage switches to the discrete_sample operator.
- Loss Function: Standard DETR set-based loss with Hungarian matching for optimal query–ground-truth assignment: $\mathcal{L} = \lambda_{\mathrm{cls}}\, \mathcal{L}_{\mathrm{cls}} + \lambda_{L_1}\, \| \hat{b} - b \|_1 + \lambda_{\mathrm{GIoU}}\, \mathcal{L}_{\mathrm{GIoU}}$, with scalar weights $\lambda$ balancing the classification, L1 box regression, and generalized-IoU terms.
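A toy sketch of the set-matching step follows. Brute-force permutation search stands in for the Hungarian algorithm (workable only for tiny sets), the GIoU term is omitted, and the λ weights are illustrative assumptions, not the paper's values.

```python
import numpy as np
from itertools import permutations

lam_cls, lam_l1 = 1.0, 5.0  # assumed loss weights, for illustration only

def match(cls_prob, pred_boxes, gt_labels, gt_boxes):
    """Find the query->ground-truth assignment with minimal total cost."""
    Q, G = cls_prob.shape[0], gt_boxes.shape[0]
    cost = np.zeros((Q, G))
    for i in range(Q):
        for j in range(G):
            # Cost = negative class probability + weighted L1 box distance.
            cost[i, j] = (-lam_cls * cls_prob[i, gt_labels[j]]
                          + lam_l1 * np.abs(pred_boxes[i] - gt_boxes[j]).sum())
    best, best_cost = None, np.inf
    for perm in permutations(range(Q), G):   # query perm[j] matched to gt j
        c = sum(cost[perm[j], j] for j in range(G))
        if c < best_cost:
            best, best_cost = perm, c
    return best, best_cost

cls_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
pred_boxes = np.array([[0.1, 0.1, 0.2, 0.2], [0.6, 0.6, 0.2, 0.2],
                       [0.4, 0.4, 0.3, 0.3]])
gt_labels = np.array([0, 1])
gt_boxes = np.array([[0.1, 0.1, 0.2, 0.2], [0.6, 0.6, 0.2, 0.2]])

assignment, total = match(cls_prob, pred_boxes, gt_labels, gt_boxes)
print(assignment)  # (0, 1): query 0 <-> gt 0, query 1 <-> gt 1
```

Unmatched queries are supervised toward the 'no-object' class; the matched pairs receive the full classification-plus-box loss.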
5. Quantitative Performance
RT-DETRv2-S demonstrates strong real-time performance and accuracy, benchmarked on COCO test-dev at input 640×640, batch size 1, TensorRT FP16, on a T4 GPU.
| Model | Params | FPS | AP | AP₅₀ |
|---|---|---|---|---|
| RT-DETRv2-S | 20M | 217 | 47.9 | 64.9 |
| RT-DETR-S | 20M | 217 | 46.5 | 63.8 |
- Ablations (total sampling points, grid_sample):
- 86,400 pts: AP=47.9, AP₅₀=64.9
- 64,800 pts: AP=47.8, AP₅₀=64.8
- 43,200 pts: AP=47.7, AP₅₀=64.7
- 21,600 pts: AP=47.3, AP₅₀=64.3
- Ablations (discrete_sample):
- Fine-tuning 1×: AP=47.4, AP₅₀=64.8
The results indicate that both scale-specific sampling and discrete sampling induce negligible loss in AP and AP₅₀, while substantially improving hardware compatibility and operational flexibility.
6. Practical Deployment and Recommendations
RT-DETRv2-S is engineered for minimal latency and broad platform compatibility.
- Inference: Discrete sampling is recommended for deployment (it removes the CUDA dependency), with an expected AP loss below 0.5.
- Sampling allocation: For further compute gains, reduce the total number of sampled points; halving the attention cost results in an ≈0.2 drop in AP.
- Throughput: 217 FPS at 640×640, on par with YOLOv5-S.
- Recommended pipeline:
1. Train for a 6× schedule with grid_sample.
2. Fine-tune for 1× with discrete_sample.
3. Use discrete_sample for production inference on 640×640 inputs with TensorRT FP16.
This design supports efficient real-time object detection on a wide range of hardware, particularly resource-constrained edge devices, while achieving accuracy comparable to heavier architectures. (Lv et al., 2024)