
RT-DETRv2-S: Efficient Real-Time Detection Transformer

Updated 27 January 2026
  • The paper introduces RT-DETRv2-S, a compact real-time object detection model that achieves a strong accuracy–speed trade-off using scale-specific sampling and flexible training strategies.
  • It employs a ResNet-18 backbone with a hybrid encoder and multi-scale deformable attention, delivering 217 FPS and competitive AP on the COCO dataset.
  • The design incorporates a novel discrete sampling operator that reduces CUDA dependency for deployment on resource-constrained devices with only a negligible AP penalty.

RT-DETRv2-S is the "small" variant of the second-generation Real-Time Detection Transformer (RT-DETRv2), designed for high-throughput object detection with strong accuracy–speed trade-offs and deployability enhancements. RT-DETRv2-S employs a ResNet-18 backbone, a hybrid encoder for efficient intra- and inter-scale feature interactions, and a multi-scale deformable attention–based transformer decoder. Improvements in RT-DETRv2-S center on scale-specific sampling, flexible training strategies, and an optional discrete sampling operator that facilitates efficient deployment on resource-constrained devices (Lv et al., 2024).

1. Model Architecture

RT-DETRv2-S comprises four principal stages: a compact convolutional backbone, a hybrid feature encoder, a transformer-based decoder with multi-scale deformable attention, and dual prediction heads for classification and bounding box regression.

  • Backbone: ResNet-18, operating on 640×640 input images. Outputs four feature maps \{C_2, C_3, C_4, C_5\}, each projected to a common hidden dimension D = 256.
    • C_2: 320×320, 64 channels
    • C_3: 160×160, 128 channels
    • C_4: 80×80, 256 channels
    • C_5: 40×40, 512 channels
  • Encoder: Applies local self-attention within each feature scale and fuses cross-scale information via 1×1 convolutions and resampling, producing tensors F_1, \ldots, F_4.
  • Decoder: 6 layers, each with 100 object queries (N_q = 100), hidden dimension D = 256, and 8 attention heads (H = 8). Each layer applies multi-scale deformable attention over all four scales, followed by a two-layer feed-forward network.
  • Heads:
    • Classification: 81-way linear (80 COCO classes plus 'no-object')
    • Bounding box: 3-layer MLP ({256, 256, 4})
  • Parameters: 20M
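
The shape bookkeeping above can be sketched in a few lines of plain Python (a minimal sketch using the feature-map sizes and hidden dimension stated above; the function names are illustrative and no actual network is built):

```python
# Feature-map bookkeeping for RT-DETRv2-S at 640x640 input, per the list above.
# Only shapes are computed; no real layers are instantiated.

HIDDEN_DIM = 256  # common hidden dimension D after the 1x1 projection

# (name, spatial size, backbone channels) for C2..C5, as listed above
BACKBONE_MAPS = [
    ("C2", 320, 64),
    ("C3", 160, 128),
    ("C4", 80, 256),
    ("C5", 40, 512),
]

def projected_shapes(maps, hidden_dim=HIDDEN_DIM):
    """Per-scale (channels, height, width) after projection to the hidden dim."""
    return {name: (hidden_dim, size, size) for name, size, _ in maps}

def total_encoder_tokens(maps):
    """Total spatial tokens if all four scales are flattened for attention."""
    return sum(size * size for _, size, _ in maps)

print(projected_shapes(BACKBONE_MAPS)["C5"])   # (256, 40, 40)
print(total_encoder_tokens(BACKBONE_MAPS))     # 136000
```

The token count makes the motivation for deformable (rather than dense) attention concrete: attending densely over all 136,000 multi-scale positions per query would dominate the compute budget.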

2. Multi-Scale Deformable Attention

The multi-scale deformable attention module is a core component facilitating efficient global context aggregation while controlling computational cost.

  • For each query q, decoder layer t, head h, and scale l, the module predicts:
    • Sampling offsets \Delta p_{q,h,l,k}^{(t)} \in \mathbb{R}^2
    • Attention logits a_{q,h,l,k}^{(t)} \in \mathbb{R}
    • Normalized attention weights:

      A_{q,h,l,k}^{(t)} = \frac{\exp\big(a_{q,h,l,k}^{(t)}\big)}{\sum_{l'=1}^{4} \sum_{k'=1}^{K_{l'}} \exp\big(a_{q,h,l',k'}^{(t)}\big)}

    • Sampled positions: s_{q,h,l,k}^{(t)} = p_q^{(t)} + \Delta p_{q,h,l,k}^{(t)}

  • Output aggregation:

    \mathrm{MSDAttn}(q) = \sum_{h=1}^{H} \sum_{l=1}^{4} \sum_{k=1}^{K_l} A_{q,h,l,k}^{(t)} \, W_v \, F_l\big(s_{q,h,l,k}^{(t)}\big)

    where W_v projects the sampled features back to the query space.

  • RT-DETRv2-S enables a scale-specific K_l for each feature scale, which allows tuning the balance between accuracy and compute. The overall attention load is:

    \text{Total points} = \Big(\sum_{l=1}^{4} K_l\Big) \times H \times N_q \times L_{\text{dec}}

    Experimental ablations reveal that this total can be reduced (e.g., from 86,400 to 21,600) with only minor mAP degradation (≲0.6 AP).
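
The weight normalization and the point-count arithmetic can be made concrete in pure Python. The per-scale allocation K_l = [6, 6, 4, 2] below is an assumption chosen only so that the total matches the 86,400-point configuration reported above; the paper's actual allocation may differ:

```python
import math

def attention_weights(logits_per_scale):
    """Softmax over all (scale, point) logits for one query and head,
    i.e. the normalization A = exp(a) / sum exp(a) described above."""
    flat = [a for scale in logits_per_scale for a in scale]
    m = max(flat)                        # subtract max for numerical stability
    exps = [math.exp(a - m) for a in flat]
    z = sum(exps)
    return [e / z for e in exps]

def total_sampling_points(K, H, N_q, L_dec):
    """(sum_l K_l) * H * N_q * L_dec, the attention-load formula above."""
    return sum(K) * H * N_q * L_dec

# Hypothetical per-scale allocation summing to 18 (an illustrative assumption)
K = [6, 6, 4, 2]
print(total_sampling_points(K, H=8, N_q=100, L_dec=6))  # 86400

w = attention_weights([[0.0, 1.0], [2.0]])
print(round(sum(w), 6))  # 1.0 -- weights are normalized across scales and points
```

Note that the normalization runs jointly over scales and points, so shrinking K_l on one scale automatically redistributes attention mass to the remaining sampling points.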

3. Discrete Sampling Operator

A significant innovation is that the grid_sample–based bilinear interpolation, which relies on CUDA-specific kernels, can be replaced with a "discrete_sample" operator.

  • Conventional sampling: F_l(x, y) is obtained by bilinear interpolation over the four nearest pixels:

    F_l(x, y) = \sum_{i \in \{\lfloor x \rfloor, \lceil x \rceil\}} \sum_{j \in \{\lfloor y \rfloor, \lceil y \rceil\}} w_{i,j}(x, y) \, F_l(i, j)

    with w_{i,j}(x, y) = (1 - |x - i|)(1 - |y - j|).

  • Discrete sampling: F_l^{\mathrm{disc}}(x, y) = F_l(\mathrm{round}(x), \mathrm{round}(y)). This removes the CUDA-dependent grid_sample operator and further simplifies deployment.
  • During fine-tuning and inference, gradients with respect to the sampling offsets are stopped in the "discrete_sample" pathway.
  • Empirical results demonstrate a negligible AP penalty (≤ 0.5) for this substitution.
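
The two sampling modes can be contrasted in a few lines of Python (a minimal single-channel sketch; real implementations operate on batched tensors and handle out-of-bounds coordinates differently):

```python
import math

def bilinear_sample(F, x, y):
    """grid_sample-style bilinear interpolation: a weighted sum over the
    four neighboring pixels with weights w = (1 - |x - i|)(1 - |y - j|)."""
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, len(F[0]) - 1), min(y0 + 1, len(F) - 1)
    wx, wy = x - x0, y - y0
    return (F[y0][x0] * (1 - wx) * (1 - wy)
            + F[y0][x1] * wx * (1 - wy)
            + F[y1][x0] * (1 - wx) * wy
            + F[y1][x1] * wx * wy)

def discrete_sample(F, x, y):
    """Nearest-pixel lookup F(round(x), round(y)); no interpolation weights,
    hence no fractional-coordinate kernel is needed at inference time."""
    i = min(round(x), len(F[0]) - 1)
    j = min(round(y), len(F) - 1)
    return F[j][i]

F = [[0.0, 1.0],
     [2.0, 3.0]]
print(bilinear_sample(F, 0.5, 0.5))  # 1.5 (average of the four pixels)
print(discrete_sample(F, 0.5, 0.5))  # 0.0 (Python rounds 0.5 down to 0)
print(bilinear_sample(F, 1, 1))      # 3.0 -- both agree at integer coordinates
```

At integer coordinates the two operators agree exactly, which is why a short fine-tuning stage is enough to absorb the rounding error introduced at fractional positions.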

4. Training Strategy

RT-DETRv2-S adopts a staged training regime with scale-adaptive optimization.

  • Optimizer: AdamW, batch size 16, weight EMA decay=0.9999
  • Learning rates: Both backbone and detector are set to 1e-4 for RT-DETRv2-S.
  • Augmentation: Uses full RT-DETR augmentations during all but the last two epochs (RandomPhotometricDistort, RandomZoomOut, RandomIoUCrop, MultiScaleInput). In the final two epochs, these are disabled.
  • Discrete sampling: Initial 6× training schedule uses grid_sample; final 1× epoch fine-tunes with discrete_sample operator.
  • Loss Function: Standard DETR set-based loss with Hungarian matching for optimal query–ground-truth assignment:

    \mathcal{L} = \sum_{i \in M} \Big[ -\log p_i(c_i^*) + \lambda_{\text{L1}} \|b_i - b_i^*\|_1 + \lambda_{\text{GIoU}} \big(1 - \mathrm{GIoU}(b_i, b_i^*)\big) \Big]

    with \lambda_{\text{L1}} = 5 and \lambda_{\text{GIoU}} = 2.
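
The per-pair loss term can be sketched in pure Python (a minimal sketch assuming axis-aligned boxes in (x1, y1, x2, y2) form; the Hungarian matching step is omitted and a matched query–ground-truth pair is given directly):

```python
import math

def giou(a, b):
    """Generalized IoU for boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box, used by the GIoU penalty term
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area

def pair_loss(p_true, box_pred, box_gt, lam_l1=5.0, lam_giou=2.0):
    """-log p(c*) + lam_L1 * |b - b*|_1 + lam_GIoU * (1 - GIoU(b, b*))."""
    l1 = sum(abs(p - g) for p, g in zip(box_pred, box_gt))
    return -math.log(p_true) + lam_l1 * l1 + lam_giou * (1.0 - giou(box_pred, box_gt))

# Perfect box prediction: L1 and GIoU terms vanish, leaving only -log p(c*)
print(pair_loss(0.9, (0, 0, 1, 1), (0, 0, 1, 1)))  # ~0.105 (= -ln 0.9)
```

With the weights λ_L1 = 5 and λ_GIoU = 2 above, the box terms dominate the classification term for all but near-perfect localizations, which biases optimization toward tight boxes.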

5. Quantitative Performance

RT-DETRv2-S demonstrates strong real-time performance and accuracy, benchmarked on COCO test-dev at input 640×640, batch size 1, TensorRT FP16, on a T4 GPU.

Model         Params  FPS  AP    AP₅₀
RT-DETRv2-S   20M     217  47.9  64.9
RT-DETR-S     20M     217  46.5  63.8
  • Ablations (total sampling points, grid_sample):
    • 86,400 pts: AP=47.9, AP₅₀=64.9
    • 64,800 pts: AP=47.8, AP₅₀=64.8
    • 43,200 pts: AP=47.7, AP₅₀=64.7
    • 21,600 pts: AP=47.3, AP₅₀=64.3
  • Ablations (discrete_sample):
    • Fine-tuning 1×: AP=47.4, AP₅₀=64.8

The results indicate that both scale-specific sampling and discrete sampling induce negligible loss in AP and AP₅₀, while substantially improving hardware compatibility and operational flexibility.

6. Practical Deployment and Recommendations

RT-DETRv2-S is engineered for minimal latency and broad platform compatibility.

  • Inference: Discrete sampling recommended for deployment (CUDA-independence), with predicted AP loss below 0.5.
  • Sampling allocation: For further compute gains, reduce the total number of sampled points; halving the attention cost results in a ≈0.2 drop in mAP.
  • Throughput: 217 FPS at 640×640, on par with YOLOv5-S.
  • Recommended pipeline:

1. Train for the 6× schedule with grid_sample.
2. Fine-tune for 1× with discrete_sample.
3. Use discrete_sample for production inference on 640×640 inputs with TensorRT FP16.
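
The three-stage recipe can be encoded as a small configuration sketch (stage names and schedule multipliers follow the list above; the `EPOCHS_PER_X` value is an illustrative assumption, not a figure from the paper):

```python
# Staged training/inference recipe, expressed as data for a pipeline driver.
# EPOCHS_PER_X (the length of one "1x" schedule) is an assumed value.

EPOCHS_PER_X = 12

STAGES = [
    {"stage": "train",     "schedule_x": 6, "sampling": "grid_sample"},
    {"stage": "fine-tune", "schedule_x": 1, "sampling": "discrete_sample"},
    {"stage": "inference", "schedule_x": 0, "sampling": "discrete_sample"},
]

def total_training_epochs(stages, epochs_per_x=EPOCHS_PER_X):
    """Total epochs across training stages (the inference stage adds none)."""
    return sum(s["schedule_x"] * epochs_per_x for s in stages)

print(total_training_epochs(STAGES))  # 84
print(STAGES[-1]["sampling"])         # discrete_sample
```

Keeping the sampling operator as a per-stage field makes the grid_sample → discrete_sample handoff explicit, so the exported inference graph never contains the CUDA-dependent operator.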

This design supports efficient real-time object detection on a wide range of hardware, particularly resource-constrained edge devices, while achieving accuracy comparable to heavier architectures. (Lv et al., 2024)
