RT-DETRv2-S: Efficient Real-Time Detection Transformer
- The paper introduces RT-DETRv2-S, a compact real-time object detection model that achieves a strong accuracy–speed trade-off using scale-specific sampling and flexible training strategies.
- It employs a ResNet-18 backbone with a hybrid encoder and multi-scale deformable attention, delivering 217 FPS and competitive AP on the COCO dataset.
- The design incorporates a novel discrete sampling operator that reduces CUDA dependency for deployment on resource-constrained devices with only a negligible AP penalty.
RT-DETRv2-S is the "small" variant of the second-generation Real-Time Detection Transformer (RT-DETRv2), designed for high-throughput object detection with strong accuracy–speed trade-offs and deployability enhancements. RT-DETRv2-S employs a ResNet-18 backbone, a hybrid encoder for efficient intra- and inter-scale feature interactions, and a multi-scale deformable attention–based transformer decoder. Improvements in RT-DETRv2-S center on scale-specific sampling, flexible training strategies, and an optional discrete sampling operator that facilitates efficient deployment on resource-constrained devices (Lv et al., 2024).
1. Model Architecture
RT-DETRv2-S comprises four principal stages: a compact convolutional backbone, a hybrid feature encoder, a transformer-based decoder with multi-scale deformable attention, and dual prediction heads for classification and bounding box regression.
- Backbone: ResNet-18, operating on 640×640 input images. Outputs four feature maps, each projected to a common hidden dimension:
- $C_2$: 320×320, 64 ch
- $C_3$: 160×160, 128 ch
- $C_4$: 80×80, 256 ch
- $C_5$: 40×40, 512 ch
- Encoder: Applies local self-attention within each feature scale and fuses cross-scale information via convolutions and resampling, producing refined feature tensors at all four scales.
- Decoder: 6 layers, with $N_q = 100$ object queries, hidden dimension $d = 256$, and $H = 8$ attention heads. Each layer applies multi-scale deformable attention over all four scales, followed by a two-layer feed-forward network.
- Heads:
- Classification: 81-way linear (80 COCO classes plus 'no-object')
- Bounding box: 3-layer MLP ({256, 256, 4})
- Parameters: 20M
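The head shapes above can be sketched in a few lines of numpy. This is an illustrative sketch, not the reference implementation: weights are random, biases are omitted, and the sigmoid box normalization is an assumption carried over from DETR-style detectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_queries, num_classes = 256, 100, 81  # 80 COCO classes + 'no-object'

queries = rng.standard_normal((num_queries, d))  # decoder output queries

# Classification head: a single 81-way linear layer.
W_cls = rng.standard_normal((d, num_classes)) * 0.02
cls_logits = queries @ W_cls

# Box head: 3-layer MLP with layer sizes {256, 256, 4}, ReLU between layers;
# sigmoid on the output (assumed) keeps boxes as normalized (cx, cy, w, h).
W1 = rng.standard_normal((d, 256)) * 0.02
W2 = rng.standard_normal((256, 256)) * 0.02
W3 = rng.standard_normal((256, 4)) * 0.02
h = np.maximum(queries @ W1, 0.0)
h = np.maximum(h @ W2, 0.0)
boxes = 1.0 / (1.0 + np.exp(-(h @ W3)))  # sigmoid -> values in (0, 1)

print(cls_logits.shape, boxes.shape)  # (100, 81) (100, 4)
```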
2. Multi-Scale Deformable Attention
The multi-scale deformable attention module is a core component facilitating efficient global context aggregation while controlling computational cost.
- For each query $z_i$ at a decoder layer, and for each head $h$, scale $l$, and sampling point $k = 1, \dots, K_l$, the module predicts:
- Sampling offsets: $\Delta p_{ihlk} = W^{p}_{hlk}\, z_i$
- Attention logits: $\alpha_{ihlk} = W^{a}_{hlk}\, z_i$
- Normalized attention weights: $a_{ihlk} = \dfrac{\exp(\alpha_{ihlk})}{\sum_{l'} \sum_{k'} \exp(\alpha_{ihl'k'})}$
- Sampled positions: $p_{ihlk} = \phi_l(\hat{p}_i) + \Delta p_{ihlk}$, where $\hat{p}_i$ is the query's normalized reference point and $\phi_l$ rescales it to the resolution of scale $l$
- Output aggregation: $\mathrm{MSDeformAttn}(z_i) = W_o \sum_{h=1}^{H} \Big( \sum_{l} \sum_{k=1}^{K_l} a_{ihlk}\, W_v\, x_l(p_{ihlk}) \Big)$,
where $W_o$ projects back to query space.
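The aggregation step can be sketched in numpy for a single query and head. Shapes and values are toy assumptions; nearest-pixel reads stand in for the bilinear sampling discussed in Section 3, and the learned value/output projections are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
scales = [rng.standard_normal((8, 8, d)),   # toy feature maps x_l (H_l, W_l, d)
          rng.standard_normal((4, 4, d))]
K = 4                                        # sampling points per scale

ref = np.array([0.5, 0.5])                   # normalized reference point
offsets = rng.standard_normal((len(scales), K, 2)) * 0.05  # predicted offsets
logits = rng.standard_normal((len(scales), K))             # attention logits

# Softmax normalizes the weights jointly over all (scale, point) pairs.
w = np.exp(logits - logits.max())
w = w / w.sum()

out = np.zeros(d)
for l, x in enumerate(scales):
    H, W = x.shape[:2]
    for k in range(K):
        p = ref + offsets[l, k]              # shifted normalized position
        iy = int(np.clip(np.round(p[1] * (H - 1)), 0, H - 1))
        ix = int(np.clip(np.round(p[0] * (W - 1)), 0, W - 1))
        out += w[l, k] * x[iy, ix]           # weighted value read

print(out.shape)  # (256,)
```

The result would then be projected back to query space by $W_o$ in the full module.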
- RT-DETRv2-S enables a scale-specific number of sampling points $K_l$ for each feature scale, allowing the accuracy–compute balance to be tuned. The overall attention load is proportional to the total sampling-point count, i.e., to $\sum_l K_l$ per query, head, and decoder layer.
Experimental ablations reveal that this total can be reduced (e.g., from 86,400 to 21,600 points) with only minor AP degradation (≲0.6).
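A back-of-the-envelope helper makes the budget arithmetic concrete. The decomposition below (queries × heads × layers × per-scale points) and the hyperparameter values plugged in are illustrative assumptions; the exact breakdown behind the 86,400-point figure is not reproduced here.

```python
# Total sampling-point budget of the deformable-attention decoder
# under the assumed decomposition queries x heads x layers x sum(K_l).
def sampling_budget(num_queries, num_heads, num_layers, points_per_scale):
    return num_queries * num_heads * num_layers * sum(points_per_scale)

full = sampling_budget(100, 8, 6, [4, 4, 4, 4])     # uniform K_l = 4
quarter = sampling_budget(100, 8, 6, [1, 1, 1, 1])  # scale-specific K_l = 1

# Dropping every K_l from 4 to 1 cuts the budget 4x, mirroring the
# 86,400 -> 21,600 reduction reported in the ablations.
print(full, quarter, full // quarter)  # 76800 19200 4
```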
3. Discrete Sampling Operator
A significant innovation is the replaceability of the grid_sample–based bilinear interpolation (a CUDA-dependent operator) with a discrete_sample operator.
- Conventional sampling: bilinear interpolation over the four nearest pixels, $x_l(p) = \sum_{q \in \mathcal{N}(p)} w_q\, x_l(q)$, with $\mathcal{N}(p)$ the four integer grid neighbours of $p$ and $w_q$ the bilinear weights determined by the fractional part of $p$.
- Discrete sampling: $x_l(\mathrm{round}(p))$, i.e., the feature value at the grid point nearest to $p$ is read directly. This removes the CUDA grid_sample requirement and further simplifies deployment.
- During fine-tuning and inference, gradients w.r.t. the sampling offsets are stopped in the discrete_sample pathway, since rounding is non-differentiable.
- Empirical results demonstrate a negligible AP penalty (≈0.5 AP) for this substitution.
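The two operators can be contrasted in a small numpy sketch. It is illustrative only: a pixel-coordinate convention is assumed, and the production operators work on batched tensors.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Interpolate over the four nearest pixels (differentiable w.r.t. p)."""
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = min(y0 + 1, x.shape[0] - 1), min(x0 + 1, x.shape[1] - 1)
    wy, wx = py - y0, px - x0
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])

def discrete_sample(x, py, px):
    """Read the single nearest pixel; rounding blocks gradients to offsets."""
    iy = int(np.clip(round(py), 0, x.shape[0] - 1))
    ix = int(np.clip(round(px), 0, x.shape[1] - 1))
    return x[iy, ix]

x = np.arange(16, dtype=float).reshape(4, 4)  # stand-in for one feature map
print(bilinear_sample(x, 1.4, 2.6))  # ~8.2: weighted mix of 6, 7, 10, 11
print(discrete_sample(x, 1.4, 2.6))  # 7.0: rounds to pixel (1, 3)
```

The discrete variant trades sub-pixel precision for a plain indexed read, which is what makes it portable to runtimes without a grid_sample kernel.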
4. Training Strategy
RT-DETRv2-S adopts a staged training regime with scale-adaptive optimization.
- Optimizer: AdamW, batch size 16, weight EMA decay=0.9999
- Learning rates: The backbone and detector share the same learning rate in RT-DETRv2-S.
- Augmentation: Uses the full RT-DETR augmentation pipeline (RandomPhotometricDistort, RandomZoomOut, RandomIoUCrop, MultiScaleInput) during all but the last two epochs; the augmentations are disabled for the final two epochs.
- Discrete sampling: The initial 6× training schedule uses grid_sample; a final 1× fine-tuning stage switches to the discrete_sample operator.
- Loss Function: Standard DETR set-based loss with Hungarian matching for optimal query–ground-truth assignment: $\mathcal{L} = \lambda_{\mathrm{cls}}\, \mathcal{L}_{\mathrm{cls}} + \lambda_{L_1}\, \| \hat{b} - b \|_1 + \lambda_{\mathrm{GIoU}}\, \mathcal{L}_{\mathrm{GIoU}}$, with scalar weights $\lambda$ balancing the classification, L1 box regression, and generalized-IoU terms.
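A toy sketch of the set-matching step follows. Brute-force permutation search stands in for the Hungarian algorithm (workable only for tiny sets), the GIoU term is omitted, and the λ weights are illustrative assumptions, not the paper's values.

```python
import numpy as np
from itertools import permutations

lam_cls, lam_l1 = 1.0, 5.0  # assumed loss weights, for illustration only

def match(cls_prob, pred_boxes, gt_labels, gt_boxes):
    """Find the query->ground-truth assignment with minimal total cost."""
    Q, G = cls_prob.shape[0], gt_boxes.shape[0]
    cost = np.zeros((Q, G))
    for i in range(Q):
        for j in range(G):
            # Cost = negative class probability + weighted L1 box distance.
            cost[i, j] = (-lam_cls * cls_prob[i, gt_labels[j]]
                          + lam_l1 * np.abs(pred_boxes[i] - gt_boxes[j]).sum())
    best, best_cost = None, np.inf
    for perm in permutations(range(Q), G):   # query perm[j] matched to gt j
        c = sum(cost[perm[j], j] for j in range(G))
        if c < best_cost:
            best, best_cost = perm, c
    return best, best_cost

cls_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
pred_boxes = np.array([[0.1, 0.1, 0.2, 0.2], [0.6, 0.6, 0.2, 0.2],
                       [0.4, 0.4, 0.3, 0.3]])
gt_labels = np.array([0, 1])
gt_boxes = np.array([[0.1, 0.1, 0.2, 0.2], [0.6, 0.6, 0.2, 0.2]])

assignment, total = match(cls_prob, pred_boxes, gt_labels, gt_boxes)
print(assignment)  # (0, 1): query 0 <-> gt 0, query 1 <-> gt 1
```

Unmatched queries are supervised toward the 'no-object' class; the matched pairs receive the full classification-plus-box loss.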
5. Quantitative Performance
RT-DETRv2-S demonstrates strong real-time performance and accuracy, benchmarked on COCO test-dev at input 640×640, batch size 1, TensorRT FP16, on a T4 GPU.
| Model | Params | FPS | AP | AP₅₀ |
|---|---|---|---|---|
| RT-DETRv2-S | 20M | 217 | 47.9 | 64.9 |
| RT-DETR-S | 20M | 217 | 46.5 | 63.8 |
- Ablations (total sampling points, grid_sample):
- 86,400 pts: AP=47.9, AP₅₀=64.9
- 64,800 pts: AP=47.8, AP₅₀=64.8
- 43,200 pts: AP=47.7, AP₅₀=64.7
- 21,600 pts: AP=47.3, AP₅₀=64.3
- Ablations (discrete_sample):
- Fine-tuning 1×: AP=47.4, AP₅₀=64.8
The results indicate that both scale-specific sampling and discrete sampling induce negligible loss in AP and AP₅₀, while substantially improving hardware compatibility and operational flexibility.
6. Practical Deployment and Recommendations
RT-DETRv2-S is engineered for minimal latency and broad platform compatibility.
- Inference: Discrete sampling is recommended for deployment (it removes the CUDA dependency), with an expected AP loss below 0.5.
- Sampling allocation: For further compute gains, reduce the total number of sampled points; halving the attention cost results in an ≈0.2 drop in AP.
- Throughput: 217 FPS at 640×640, on par with YOLOv5-S.
- Recommended pipeline:
1. Train for a 6× schedule with grid_sample.
2. Fine-tune for 1× with discrete_sample.
3. Use discrete_sample for production inference on 640×640 inputs with TensorRT FP16.
This design supports efficient real-time object detection on a wide range of hardware, particularly resource-constrained edge devices, while achieving accuracy comparable to heavier architectures. (Lv et al., 2024)