DETRs Beat YOLOs on Real-time Object Detection

Published 17 Apr 2023 in cs.CV | (2304.08069v3)

Abstract: The YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy. However, we observe that the speed and accuracy of YOLOs are negatively affected by the NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative to eliminating NMS. Nevertheless, the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS. In this paper, we propose the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector to our best knowledge that addresses the above dilemma. We build RT-DETR in two steps, drawing on the advanced DETR: first we focus on maintaining accuracy while improving speed, followed by maintaining speed while improving accuracy. Specifically, we design an efficient hybrid encoder to expeditiously process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed. Then, we propose the uncertainty-minimal query selection to provide high-quality initial queries to the decoder, thereby improving accuracy. In addition, RT-DETR supports flexible speed tuning by adjusting the number of decoder layers to adapt to various scenarios without retraining. Our RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on T4 GPU, outperforming previously advanced YOLOs in both speed and accuracy. We also develop scaled RT-DETRs that outperform the lighter YOLO detectors (S and M models). Furthermore, RT-DETR-R50 outperforms DINO-R50 by 2.2% AP in accuracy and about 21 times in FPS. After pre-training with Objects365, RT-DETR-R50 / R101 achieves 55.3% / 56.2% AP. The project page: https://zhao-yian.github.io/RTDETR.

Summary

The paper introduces RT-DETR, a real-time end-to-end detector that pairs an efficient hybrid encoder with uncertainty-minimal query selection to bring DETR inference down to real-time speeds.
It compares RT-DETR with YOLO detectors on COCO: RT-DETR-R50 / R101 reach 53.1% / 54.3% AP at 108 / 74 FPS on a T4 GPU, ahead of previously advanced YOLOs in both speed and accuracy.
The results suggest that NMS-free DETR-based models could benefit latency-sensitive applications such as autonomous driving and surveillance.

The paper "DETRs Beat YOLOs on Real-time Object Detection," authored by Wenyu Lv et al. from Baidu Inc., proposes the Real-Time DEtection TRansformer (RT-DETR) and compares it with the You Only Look Once (YOLO) family of models and with earlier Detection Transformers (DETRs) on real-time object detection.

Abstract and Introduction

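The authors begin by noting that YOLO models dominate real-time object detection because they offer a reasonable trade-off between speed and accuracy. They observe, however, that the non-maximum suppression (NMS) post-processing step these detectors rely on hurts both speed and accuracy. End-to-end DETRs remove NMS, but their high computational cost has kept them from being practical in real time. The paper sets out to close this gap with RT-DETR, which the authors describe as, to their best knowledge, the first real-time end-to-end object detector.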
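To make the NMS step concrete, here is a minimal sketch of classic greedy NMS in plain Python. It is not taken from the paper or from any YOLO codebase; the function names and the default thresholds are assumptions chosen for illustration. It shows why the step adds cost (every surviving candidate is compared against the boxes already kept) and why it introduces two hand-tuned thresholds.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed box format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,    # illustrative default, not from the paper
    score_threshold: float = 0.0,  # illustrative default, not from the paper
) -> List[int]:
    """Return indices of the boxes that survive, highest score first."""
    # Drop low-confidence candidates, then visit the rest in descending score order.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # the second box overlaps the first and is suppressed
```

An end-to-end detector in the DETR style outputs its final set of boxes directly, so neither this loop nor its thresholds are needed at inference time.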

The related work section reviews both YOLO-based models and the relatively newer DETR models. The YOLO models, known for rapid inference, have been widely adopted in real-time applications. Conversely, DETR models, built on the transformer architecture, are end-to-end and need no NMS post-processing, but they carry a higher computational cost. This section sets the stage for the authors' argument by highlighting the strengths and weaknesses of each approach.

Speed Considerations

One of the crucial sections of the paper examines the speed-accuracy trade-off between DETRs and YOLOs. To cut the encoder's cost, the authors design an efficient hybrid encoder that processes multi-scale features by decoupling intra-scale interaction from cross-scale fusion. Inference speed can also be tuned without retraining by changing the number of decoder layers. With these changes, RT-DETR-R50 / R101 run at 108 / 74 FPS on a T4 GPU, outperforming previously advanced YOLOs in both speed and accuracy.
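The abstract states that RT-DETR supports speed tuning by adjusting the number of decoder layers without retraining. The control flow behind that idea can be sketched as follows. This is a toy illustration, not the authors' implementation: `ToyDecoder` and `make_toy_layer` are hypothetical names, and each "layer" is a stand-in that simply moves a query value halfway toward a target, whereas a real decoder layer applies attention over image features.

```python
from typing import Callable, List, Optional, Sequence

Queries = List[float]
Layer = Callable[[Queries], Queries]


def make_toy_layer(target: float) -> Layer:
    """Hypothetical stand-in for a decoder layer: halves each query's distance to `target`."""
    def layer(queries: Queries) -> Queries:
        return [q + 0.5 * (target - q) for q in queries]
    return layer


class ToyDecoder:
    """A stack of refinement layers that can be truncated at inference time."""

    def __init__(self, layers: Sequence[Layer]) -> None:
        self.layers = list(layers)

    def forward(self, queries: Queries, num_layers: Optional[int] = None) -> Queries:
        # Use all layers by default; otherwise run only the first `num_layers`.
        n = len(self.layers) if num_layers is None else num_layers
        if not 1 <= n <= len(self.layers):
            raise ValueError(f"num_layers must be between 1 and {len(self.layers)}")
        for layer in self.layers[:n]:
            queries = layer(queries)
        return queries


decoder = ToyDecoder([make_toy_layer(1.0) for _ in range(6)])
print(decoder.forward([0.0]))                # all six layers: closest to the target
print(decoder.forward([0.0], num_layers=3))  # first three layers: cheaper, coarser
```

The point of the sketch is that the same trained stack serves several speed settings: skipping the last layers reduces compute, and the output of an earlier layer is used as the prediction.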

Methodology

The methodology section describes the experimental setup used to evaluate the models. Accuracy is reported as AP on COCO and speed as FPS on a T4 GPU, with an additional setting in which the models are pre-trained on Objects365. The authors also detail the configurations of the DETR and YOLO models, the hyperparameters, and the training protocols so that the comparison is fair and can be reproduced.

Experimental Results

The experimental results form the core of the paper's evidence. On COCO, RT-DETR-R50 / R101 achieve 53.1% / 54.3% AP at 108 / 74 FPS on a T4 GPU, outperforming previously advanced YOLOs in both speed and accuracy, and scaled-down RT-DETRs outperform the lighter YOLO detectors (S and M models). Against other DETRs, RT-DETR-R50 beats DINO-R50 by 2.2% AP and is about 21 times faster in FPS. After pre-training on Objects365, RT-DETR-R50 / R101 reach 55.3% / 56.2% AP.
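The FPS figures in the abstract are easier to compare with latency budgets once converted to milliseconds per image. The short calculation below does that conversion, assuming images are processed one at a time, and also works out the DINO-R50 throughput implied by the "about 21 times" claim. The implied figure is derived here from the abstract's ratio and is only approximate; it is not a number quoted from the source.

```python
def latency_ms(fps: float) -> float:
    """Per-image latency in milliseconds for a given throughput in frames per second."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# FPS values as reported in the abstract (T4 GPU).
reported_fps = {"RT-DETR-R50": 108.0, "RT-DETR-R101": 74.0}

for name, fps in reported_fps.items():
    print(f"{name}: {fps:.0f} FPS is about {latency_ms(fps):.2f} ms per image")

# "About 21 times in FPS" relative to DINO-R50 implies roughly this throughput.
implied_dino_fps = reported_fps["RT-DETR-R50"] / 21.0
print(f"Implied DINO-R50 throughput: about {implied_dino_fps:.1f} FPS")
```

At 108 FPS the per-image cost is a little over 9 ms, which fits inside the roughly 33 ms budget of a 30 FPS video stream.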

Conclusions and Implications

In the conclusion section, the authors summarize their findings, asserting that optimized DETR models present a viable alternative to YOLOs for real-time object detection tasks. They discuss the practical implications of their research, suggesting that industries reliant on real-time object detection might consider transitioning to DETR-based models to leverage their enhanced accuracy and robustness. The paper also hints at potential future work, such as further optimization techniques for DETR models and exploring their applicability in other real-time computer vision tasks.

Theoretical and Practical Implications

From a theoretical perspective, the paper's findings challenge the prevailing notion that transformer-based models are unsuitable for real-time applications due to their computational complexity. It opens avenues for further research into optimizing transformers for speed without compromising their accuracy benefits. Practically, this research could influence the design of next-generation real-time detection systems, potentially leading to more accurate and reliable applications in fields such as autonomous driving, surveillance, and robotics.

Future Developments

Future developments following this research might include deeper investigations into more efficient transformer architectures, the integration of hardware accelerators to further reduce inference times, and broader evaluations across different real-time scenarios to validate the generalizability of the findings.

In conclusion, Wenyu Lv et al.'s paper makes a compelling case for the adoption of DETR models in real-time object detection, provided that appropriate optimizations are implemented. This work is a significant step towards bridging the gap between the high accuracy of transformer models and the speed requirements of real-time applications, offering promising directions for both research and practical implementations in AI-driven object detection.