ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Published 8 Oct 2021 in cs.CV and cs.LG | (2110.03921v2)

Abstract: Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and achieves 49.2AP owing to its high scalability for large models. We will release the code and trained models at https://github.com/naver-ai/vidt

Summary

The paper introduces ViDT, which combines Vision and Detection Transformers via a reconfigured attention module to enhance both efficiency and accuracy.
It employs an encoder-free neck and token matching knowledge distillation to streamline object detection while maintaining competitive average precision.
Experimental results on the COCO benchmark show that ViDT achieves 49.2 AP with a favorable latency trade-off compared to existing detectors.

ViDT: An Efficient and Effective Fully Transformer-based Object Detector

The paper "ViDT: An Efficient and Effective Fully Transformer-based Object Detector" (2110.03921) introduces ViDT, an object detection architecture that integrates Vision Transformers (ViT) and Detection Transformers (DETR) to obtain the best AP and latency trade-off among fully transformer-based object detectors. The core innovation is the Reconfigured Attention Module (RAM), which enables ViT variants, specifically the Swin Transformer, to serve as standalone object detectors. RAM is coupled with a lightweight, encoder-free neck architecture and a token matching knowledge distillation technique that further improve performance and efficiency. Together, these components balance accuracy and speed, which makes ViDT a candidate for practical object detection applications.

Key Components and Implementation

ViDT's architecture comprises three main components: a reconfigured ViT backbone, an encoder-free neck, and a prediction head.

Reconfigured Attention Module (RAM)

The RAM is the cornerstone of ViDT, enabling the integration of ViTs, particularly Swin Transformers, into a sequence-to-sequence object detection framework. RAM decomposes the attention mechanism into three distinct operations:

$[\mathtt{PATCH}]\times[\mathtt{PATCH}]$ Attention: This applies the Swin Transformer's local attention with shifted windows to the image patches ($[\mathtt{PATCH}]$ tokens), enabling hierarchical feature extraction with complexity linear in the number of patches. Because the window partition shifts between successive blocks, information still propagates across windows, which preserves the Swin Transformer's ability to capture global information.
$[\mathtt{DET}]\times[\mathtt{DET}]$ Attention: This performs global self-attention on a set of learnable detection tokens ($[\mathtt{DET}]$ tokens). These tokens represent the objects to be detected and maintain a fixed scale throughout the network, which allows the model to capture relationships between different object detections.
$[\mathtt{DET}]\times[\mathtt{PATCH}]$ Attention: This cross-attention mechanism aggregates key content from the image patches into object embeddings for each detection token, effectively associating each $[\mathtt{DET}]$ token with a specific object in the image.

Figure 1: Pipelines of fully transformer-based object detectors. DETR (ViT) means Detection Transformer that uses ViT as its body. The proposed ViDT synergizes DETR (ViT) and YOLOS and achieves the best AP and latency trade-off among fully transformer-based object detectors.
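
To make the linear-complexity claim concrete, the sketch below compares the cost estimates given in the Swin Transformer paper for global self-attention and for window attention. The function names and the default window size are illustrative assumptions, not values taken from ViDT.

```python
def global_attention_cost(h, w, c):
    """Estimated cost of global self-attention on an h x w token map with
    c channels: 4*h*w*c^2 + 2*(h*w)^2*c, quadratic in the number of tokens."""
    n = h * w
    return 4 * n * c * c + 2 * n * n * c


def window_attention_cost(h, w, c, window=7):
    """Estimated cost of attention restricted to window x window partitions:
    4*h*w*c^2 + 2*window^2*h*w*c, linear in the number of tokens."""
    n = h * w
    return 4 * n * c * c + 2 * window * window * n * c
```

Doubling each side of the token map quadruples the window-attention estimate exactly, while the global-attention estimate grows faster because of its squared term.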

RAM reuses the projection layers of the Swin Transformer for both $[\mathtt{DET}]$ and $[\mathtt{PATCH}]$ tokens. Positional encodings are handled differently for each attention type: relative position bias for $[\mathtt{PATCH}]\times[\mathtt{PATCH}]$ attention, learnable positional encoding for $[\mathtt{DET}]\times[\mathtt{DET}]$ attention, and sinusoidal spatial positional encoding for $[\mathtt{DET}]\times[\mathtt{PATCH}]$ attention. The spatial positional encoding is added to the $[\mathtt{PATCH}]$ tokens before the projection layer. To minimize computational overhead, the $[\mathtt{DET}]\times[\mathtt{PATCH}]$ cross-attention is applied only at the last stage of the Swin Transformer.

import math

import torch


def swin_transformer_attention(query, key, value):
    # Placeholder (assumption): plain scaled dot-product attention standing in
    # for Swin's shifted-window attention so that this sketch runs on its own.
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
    return torch.matmul(torch.softmax(scores, dim=-1), value)


def reconfigured_attention(query_patches, key_patches, value_patches,
                           query_det, key_det, value_det):
    """
    Reconfigured Attention Module (RAM) for ViDT, as a simplified sketch.

    Args:
        query_patches: Query embeddings for image patches.
        key_patches: Key embeddings for image patches.
        value_patches: Value embeddings for image patches.
        query_det: Query embeddings for detection tokens.
        key_det: Key embeddings for detection tokens.
        value_det: Value embeddings for detection tokens.

    Returns:
        Updated patch and detection token embeddings.
    """
    scale = math.sqrt(query_det.size(-1))

    # Patch self-attention (local attention using shifted windows in Swin;
    # the placeholder above is used here)
    updated_patches = swin_transformer_attention(query_patches, key_patches, value_patches)

    # Detection token self-attention (global self-attention)
    attention_det = torch.matmul(query_det, key_det.transpose(-2, -1)) / scale
    attention_det = torch.softmax(attention_det, dim=-1)
    updated_det_det = torch.matmul(attention_det, value_det)

    # Detection token cross-attention (attention over patches)
    attention_cross = torch.matmul(query_det, key_patches.transpose(-2, -1)) / scale
    attention_cross = torch.softmax(attention_cross, dim=-1)
    updated_det_cross = torch.matmul(attention_cross, value_patches)

    # Combine updated detection tokens. Simplification: the two results are
    # summed here; an implementation may instead attend once over the
    # concatenated [DET] and [PATCH] keys and values.
    updated_det = updated_det_det + updated_det_cross

    return updated_patches, updated_det

Figure 2: Reconfigured Attention Module (Q: query, K: key, V: value). The skip connection and feedforward networks following the attention operation are omitted for ease of exposition.
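
The sinusoidal spatial positional encoding used for the $[\mathtt{DET}]\times[\mathtt{PATCH}]$ attention can be sketched in plain Python as follows. This is a minimal DETR-style 2D sine encoding written for illustration. The function name, the temperature of 10000, and the even split of channels between the row and column axes are assumptions, not details confirmed by the paper summary.

```python
import math


def sine_position_encoding(height, width, dim, temperature=10000.0):
    """Return a height x width grid of dim-dimensional encodings.

    The first dim/2 channels encode the row index and the last dim/2 the
    column index; within each half, even channels use sin and odd use cos.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2

    def encode(pos):
        out = []
        for i in range(half):
            # Channels 2j and 2j+1 share one frequency.
            freq = temperature ** (2 * (i // 2) / half)
            angle = pos / freq
            out.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        return out

    return [[encode(y) + encode(x) for x in range(width)] for y in range(height)]
```

Under these assumptions, the encoding at grid position (y, x) would be added to the matching $[\mathtt{PATCH}]$ token before the projection layer.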

Encoder-Free Neck Structure

ViDT employs an encoder-free neck structure, comprising a decoder of multi-layer deformable transformers. This design choice is motivated by the RAM's ability to directly extract fine-grained features suitable for object detection, eliminating the need for a transformer encoder. The decoder receives multi-scale feature maps $\{\bm x^{l}\}_{l=1}^{L}$ from each stage of the Swin Transformer and $[\mathtt{DET}]$ tokens from the last stage. Each deformable transformer layer performs $[\mathtt{DET}]\times[\mathtt{DET}]$ attention followed by multi-scale deformable attention:

${\rm MSDeformAttn}([\mathtt{DET}], \{\bm{x}^{l}\}_{l=1}^{L}) = \sum_{m=1}^{M} \bm{W}_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlk} \cdot \bm{W}_m^{\prime} \bm{x}^{l}\big(\phi_{l}(\bm{p}) + \Delta \bm{p}_{mlk}\big) \Big]$

Here, $m$ indexes the attention head, $K$ is the number of sampled keys, $\phi_{l}(\bm{p})$ is the reference point $\bm{p}$ of the $[\mathtt{DET}]$ token rescaled to the $l$-th feature map, $\Delta\bm{p}_{mlk}$ is the sampling offset, $A_{mlk}$ is the attention weight of each sampled key, and $\bm{W}_m$ and $\bm{W}_m^{\prime}$ are the projection matrices of head $m$. Auxiliary decoding loss and iterative box refinement are incorporated to further enhance detection performance.
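
A minimal single-head sketch of the sampling and weighting step in the formula above, in plain Python. It assumes identity projections in place of $\bm{W}_m$ and $\bm{W}_m^{\prime}$, takes the offsets and attention weights as inputs instead of predicting them from the $[\mathtt{DET}]$ token, and uses hypothetical helper names (`bilinear_sample`, `ms_deform_attn_single_head`). It illustrates the idea and is not the authors' implementation.

```python
import math


def bilinear_sample(feature, x, y):
    """Sample an H x W x C nested list at continuous pixel coordinates (x, y).
    Neighbours that fall outside the map contribute zeros."""
    h, w, c = len(feature), len(feature[0]), len(feature[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    out = [0.0] * c
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < h and 0 <= xi < w:
                for ch in range(c):
                    out[ch] += wy * wx * feature[yi][xi][ch]
    return out


def ms_deform_attn_single_head(features, ref_point, offsets, weights):
    """Weighted sum of K sampled points on each of L feature maps.

    features:  list of L maps, each an H_l x W_l x C nested list
    ref_point: (x, y) normalized to [0, 1]
    offsets:   offsets[l][k] = (dx, dy) in pixels of level l
    weights:   weights[l][k], expected to sum to 1 over all (l, k)
    """
    c = len(features[0][0][0])
    out = [0.0] * c
    for l, feat in enumerate(features):
        h, w = len(feat), len(feat[0])
        # phi_l: rescale the normalized reference point to level-l pixels.
        px, py = ref_point[0] * (w - 1), ref_point[1] * (h - 1)
        for k, (dx, dy) in enumerate(offsets[l]):
            sampled = bilinear_sample(feat, px + dx, py + dy)
            for ch in range(c):
                out[ch] += weights[l][k] * sampled[ch]
    return out
```

Each query touches only L x K sampled locations, so the cost does not grow with the resolution of the feature maps.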

Token Matching Knowledge Distillation

The paper introduces a knowledge distillation approach using token matching to transfer knowledge from a large, pre-trained ViDT model (teacher) to a smaller model (student). This leverages the fixed number of $[\mathtt{PATCH}]$ and $[\mathtt{DET}]$ tokens across different ViDT models. The distillation loss is formulated as:

$\ell_{dis}(\mathcal{P}_{s}, \mathcal{D}_{s}, \mathcal{P}_{t}, \mathcal{D}_{t}) = \lambda_{dis} \Big( \frac{1}{|\mathcal{P}_s|} \sum_{i=1}^{|\mathcal{P}_s|} \Big\lVert \mathcal{P}_s[i] - \mathcal{P}_t[i] \Big\rVert_{2} + \frac{1}{|\mathcal{D}_s|} \sum_{i=1}^{|\mathcal{D}_s|} \Big\lVert \mathcal{D}_s[i] - \mathcal{D}_t[i] \Big\rVert_{2} \Big)$.

Only tokens contributing the most to prediction are matched. $\mathcal{P}$ and $\mathcal{D}$ represent the sets of $[\mathtt{PATCH}]$ and $[\mathtt{DET}]$ tokens, respectively, with subscripts $s$ and $t$ denoting the student and teacher models.
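
The distillation loss above can be sketched in plain Python as follows. The function names are hypothetical, the default `lambda_dis=1.0` is a placeholder and not a value from the paper, and the sketch assumes that student and teacher tokens have already been selected and share the same dimensionality.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two token vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def token_matching_loss(patch_s, det_s, patch_t, det_t, lambda_dis=1.0):
    """Mean L2 distance between matched student and teacher tokens,
    summed over the [PATCH] and [DET] token sets and scaled by lambda_dis."""
    if len(patch_s) != len(patch_t) or len(det_s) != len(det_t):
        raise ValueError("student and teacher must have the same number of tokens")
    patch_term = sum(l2_distance(s, t) for s, t in zip(patch_s, patch_t)) / len(patch_s)
    det_term = sum(l2_distance(s, t) for s, t in zip(det_s, det_t)) / len(det_s)
    return lambda_dis * (patch_term + det_term)
```

Because tokens are matched by index, the loss relies on the student and teacher producing the same number of $[\mathtt{PATCH}]$ and $[\mathtt{DET}]$ tokens, which is the property noted above.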

Experimental Results

ViDT demonstrates superior performance on the Microsoft COCO benchmark, achieving the best AP and latency trade-off compared to existing fully transformer-based object detectors. Ablation studies validate the effectiveness of the RAM, spatial positional encoding, and auxiliary techniques. Knowledge distillation with token matching further improves performance without compromising efficiency. The paper also shows that decoding layer dropping can expedite inference with minimal impact on accuracy. ViDT achieves 49.2 AP with the Swin-base backbone.

Figure 3: Visualization of the attention map for cross-attention with ViDT (Swin-nano).

Implications and Future Directions

ViDT's architecture offers significant advantages in terms of scalability, flexibility, and efficiency for object detection. The use of RAM allows for the seamless integration of various ViT backbones, while the encoder-free neck reduces computational overhead. The token matching knowledge distillation provides a simple yet effective way to transfer knowledge and improve the performance of smaller models. Further research could explore the integration of ViDT with other efficient vision transformer architectures. Future work could also focus on optimizing the design of the neck decoder and exploring alternative knowledge distillation strategies. The exploration of ViDT's applicability to other vision tasks, such as instance segmentation and object tracking, represents another promising avenue for future research.

Conclusion

ViDT presents a compelling approach to object detection by effectively integrating vision and detection transformers. The reconfigured attention module, encoder-free neck, and token matching knowledge distillation contribute to its high accuracy, efficiency, and scalability. ViDT's performance on the COCO benchmark highlights the potential of fully transformer-based models for complex computer vision tasks.