
DEIM: DETR with Improved Matching for Fast Convergence

Published 5 Dec 2024 in cs.CV and cs.AI | (2412.04234v3)

Abstract: We introduce DEIM, an innovative and efficient training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR). To mitigate the sparse supervision inherent in one-to-one (O2O) matching in DETR models, DEIM employs a Dense O2O matching strategy. This approach increases the number of positive samples per image by incorporating additional targets, using standard data augmentation techniques. While Dense O2O matching speeds up convergence, it also introduces numerous low-quality matches that could affect performance. To address this, we propose the Matchability-Aware Loss (MAL), a novel loss function that optimizes matches across various quality levels, enhancing the effectiveness of Dense O2O. Extensive experiments on the COCO dataset validate the efficacy of DEIM. When integrated with RT-DETR and D-FINE, it consistently boosts performance while reducing training time by 50%. Notably, paired with RT-DETRv2, DEIM achieves 53.2% AP in a single day of training on an NVIDIA 4090 GPU. Additionally, DEIM-trained real-time models outperform leading real-time object detectors, with DEIM-D-FINE-L and DEIM-D-FINE-X achieving 54.7% and 56.5% AP at 124 and 78 FPS on an NVIDIA T4 GPU, respectively, without the need for additional data. We believe DEIM sets a new baseline for advancements in real-time object detection. Our code and pre-trained models are available at https://github.com/ShihuaHuang95/DEIM.

Summary

  • The paper proposes DEIM, a training framework that densifies the supervision of one-to-one (O2O) matching ("Dense O2O") to accelerate DETR convergence.
  • It introduces a Matchability-Aware Loss (MAL) that optimizes matches across quality levels, countering the low-quality matches dense supervision produces.
  • Experiments on COCO show DEIM halves training time while improving AP when combined with RT-DETR and D-FINE, without changing the end-to-end inference pipeline.

Overview

The paper "DEIM: DETR with Improved Matching for Fast Convergence" proposes enhancements to the Detection Transformer (DETR) family aimed at faster training convergence. It addresses a persistent weakness of DETR models: slow convergence and efficiency bottlenecks that remain despite numerous advances since the architecture's inception. The authors introduce techniques that improve detection performance and computational efficiency while preserving DETR's end-to-end design.

Methodology

The authors focus on the matching process between object queries and ground-truth objects. In DETR, one-to-one bipartite matching computed with the Hungarian algorithm yields sparse supervision: each image contributes only a handful of positive samples per training step. DEIM keeps this end-to-end one-to-one scheme but densifies its supervision (Dense O2O) and reweights low-quality matches through a dedicated loss (MAL), which together streamline convergence.
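For context, the one-to-one matching that DEIM builds on can be sketched with SciPy's Hungarian solver. This is a simplified illustration (DETR's real cost also includes GIoU and uses logits over classes; `hungarian_match` and its toy costs are hypothetical):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cls_cost, box_cost, w_cls=1.0, w_box=1.0):
    """One-to-one assignment of queries to ground-truth objects,
    in the spirit of DETR's bipartite matching (simplified costs).
    cls_cost, box_cost: [num_queries, num_gt] cost matrices."""
    cost = w_cls * cls_cost + w_box * box_cost
    q_idx, gt_idx = linear_sum_assignment(cost)  # minimizes total cost
    return q_idx, gt_idx

# Toy example: 4 queries, 2 ground-truth objects.
cls_cost = np.array([[0.9, 0.2],
                     [0.1, 0.8],
                     [0.5, 0.5],
                     [0.7, 0.3]])
box_cost = np.zeros((4, 2))
q, g = hungarian_match(cls_cost, box_cost)
# Only 2 of 4 queries become positives: this sparsity is exactly
# what Dense O2O is designed to alleviate.
```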

Improved Matching Strategy

The core improvement, Dense O2O matching, increases the number of positive samples per image by incorporating additional targets through standard data augmentation, such as compositing several images (and their annotations) into one training sample. Because the matching itself remains one-to-one, the end-to-end pipeline and inference cost are unchanged; the denser supervision simply lets output queries align with targets much earlier in training, mitigating the inefficiencies behind prolonged convergence times.
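The Dense O2O idea can be illustrated with a minimal sketch: stitch two images side by side so that one training sample carries both sets of targets, while the matcher stays strictly one-to-one. The two-image stitch and the helper `dense_o2o_pair` are illustrative stand-ins, not the paper's exact augmentation recipe:

```python
import numpy as np

def dense_o2o_pair(img_a, boxes_a, img_b, boxes_b):
    """Sketch of Dense O2O supervision: combine two images into one
    sample so it contains the targets of both. Boxes are [x1, y1, x2, y2];
    image-b boxes are shifted right by image-a's width."""
    h = max(img_a.shape[0], img_b.shape[0])
    wa, wb = img_a.shape[1], img_b.shape[1]
    canvas = np.zeros((h, wa + wb, 3), dtype=img_a.dtype)
    canvas[:img_a.shape[0], :wa] = img_a
    canvas[:img_b.shape[0], wa:] = img_b
    shifted_b = boxes_b.copy()
    shifted_b[:, [0, 2]] += wa  # shift x1, x2 into the right half
    return canvas, np.concatenate([boxes_a, shifted_b], axis=0)

img_a = np.zeros((4, 4, 3), dtype=np.uint8)
img_b = np.ones((4, 4, 3), dtype=np.uint8)
boxes_a = np.array([[0.0, 0.0, 2.0, 2.0]])
boxes_b = np.array([[1.0, 1.0, 3.0, 3.0]])
canvas, boxes = dense_o2o_pair(img_a, boxes_a, img_b, boxes_b)
```

One-to-one matching on the composite sample now produces twice as many positives per training step without any change to the matcher itself.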

Training Convergence Enhancement

Dense supervision comes at a price: it introduces many low-quality matches. To keep these from degrading accuracy, the authors propose the Matchability-Aware Loss (MAL), a loss function that optimizes matches across quality levels so that even imperfect matches contribute usefully to training. The combination lets the model converge substantially faster without sacrificing detection accuracy.
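A minimal sketch of a quality-aware classification loss in this spirit is shown below. The IoU-scaled soft target `iou**gamma` and the value of `gamma` are illustrative assumptions, not the paper's exact MAL formulation; the point is only that low-quality matches receive softer targets and thus gentler gradients:

```python
import numpy as np

def matchability_aware_loss(p, iou, gamma=1.5, eps=1e-9):
    """Simplified sketch of a matchability-aware classification loss:
    the soft target for each matched query is scaled by match quality
    (here IoU**gamma), so low-quality matches pull the predicted score
    toward a low target instead of a hard 1.
    p:   predicted scores in (0, 1), shape [num_matches]
    iou: match quality in [0, 1], same shape."""
    q = iou ** gamma  # quality-scaled soft target
    loss = -(q * np.log(p + eps) + (1 - q) * np.log(1 - p + eps))
    return loss.mean()
```

A confident prediction is penalized much more when its match quality is low, which is the behavior needed to absorb the noisy matches that Dense O2O adds.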

Results

The experimental evaluations on COCO demonstrate substantial improvements in training speed and convergence behavior. Integrated with RT-DETR and D-FINE, DEIM consistently improves average precision while cutting training time by 50%. Paired with RT-DETRv2, it reaches 53.2% AP after a single day of training on one NVIDIA 4090 GPU, and DEIM-D-FINE-L and DEIM-D-FINE-X achieve 54.7% and 56.5% AP at 124 and 78 FPS on an NVIDIA T4 GPU, respectively, without additional data. These results affirm the proposed matching strategy as a practical enhancement for transformer-based detectors.

Implications

The innovations presented have practical and theoretical implications. Practically, they provide a pathway to deploying DETR-based models in real-time detection scenarios where computational efficiency is paramount. Theoretically, the paper opens avenues for further exploration into algorithmic optimization of assignment problems in transformer networks, with potential applications extending beyond computer vision to fields such as natural language processing.

Conclusions

The paper "DEIM: DETR with Improved Matching for Fast Convergence" makes a substantive contribution to transformer-based object detection: by densifying one-to-one supervision and reweighting matches by quality, it markedly improves training efficiency. This advancement is poised to ease the adoption of DETR-style detectors in resource-constrained environments, while setting a precedent for future research into efficient convergence for complex networks. Its refinement of the matching process also suggests cross-disciplinary potential for assignment-based optimization in other AI models.
