
Defect Transformer: An Efficient Hybrid Transformer Architecture for Surface Defect Detection

Published 17 Jul 2022 in cs.CV | (2207.08319v1)

Abstract: Surface defect detection is an extremely crucial step to ensure the quality of industrial products. Nowadays, convolutional neural networks (CNNs) based on encoder-decoder architecture have achieved tremendous success in various defect detection tasks. However, due to the intrinsic locality of convolution, they commonly exhibit a limitation in explicitly modeling long-range interactions, critical for pixel-wise defect detection in complex cases, e.g., cluttered background and illegible pseudo-defects. Recent transformers are especially skilled at learning global image dependencies but capture limited local structural information necessary for detailed defect localization. To overcome the above limitations, we propose an efficient hybrid transformer architecture, termed Defect Transformer (DefT), for surface defect detection, which incorporates CNN and transformer into a unified model to capture local and non-local relationships collaboratively. Specifically, in the encoder module, a convolutional stem block is first adopted to retain more detailed spatial information. Then, patch aggregation blocks are used to generate a multi-scale representation with four hierarchies, each of which is followed by a series of DefT blocks, which respectively include a locally position-aware block for local position encoding, a lightweight multi-pooling self-attention to model multi-scale global contextual relationships with good computational efficiency, and a convolutional feed-forward network for feature transformation and further location information learning. Finally, a simple but effective decoder module is proposed to gradually recover spatial details from the skip connections in the encoder. Extensive experiments on three datasets demonstrate the superiority and efficiency of our method compared with other CNN- and transformer-based networks.

Citations (45)

Summary

  • The paper introduces DefT, a novel hybrid architecture combining CNNs and transformers to achieve efficient pixel-wise defect segmentation.
  • It leverages innovative components like LPB, LMPS, and CFFN to balance computational efficiency with high detection accuracy in challenging datasets.
  • Experimental results demonstrate improved F1-score, FNR, and other metrics, highlighting DefT’s robust performance for industrial defect detection even in low-data scenarios.


Overview

The paper introduces Defect Transformer (DefT), a hybrid architecture integrating CNNs and transformers for surface defect detection in industrial products. The goal is to exploit the strengths of CNNs in local feature representation and transformers in global contextual modeling. The DefT network adopts an encoder-decoder framework with novel components designed for efficient and precise pixel-wise defect segmentation.

Architecture Design

The DefT network is built upon a UNet-like encoder-decoder architecture that balances computational efficiency and detection performance. The encoder combines a convolutional stem block and patch aggregation blocks with a series of DefT blocks, each consisting of a locally position-aware block (LPB), a lightweight multi-pooling self-attention (LMPS), and a convolutional feed-forward network (CFFN).

  • Locally Position-Aware Block (LPB): Enhances spatial detail retention through convolutional operations, implicitly embedding local positional information that is pivotal for fine segmentation tasks (Figure 1).

Figure 1: The structure of main components in the DefT block: (a) locally position-aware block; (b) lightweight multi-pooling self-attention; (c) convolutional feed-forward network.

  • Lightweight Multi-Pooling Self-Attention (LMPS): Implements self-attention on multi-scale pooled features to efficiently model global dependencies with reduced computational complexity.
  • Convolutional Feed-Forward Network (CFFN): Integrates convolution operations within the feed-forward structure to further encode spatial locality.
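The efficiency argument behind LMPS can be illustrated with a back-of-the-envelope cost comparison: full self-attention scales quadratically in token count, while attending to keys/values drawn from pooled feature maps scales with the (much smaller) total number of pooled tokens. The pooling ratios below are illustrative assumptions, not the paper's exact configuration.

```python
# Rough multiply-accumulate (MAC) comparison of full self-attention vs a
# multi-pooling variant where keys/values come from pooled feature maps.
# Pooling ratios (4, 8, 16) are assumptions for illustration only.

def attention_macs(n_query: int, n_key: int, dim: int) -> int:
    """MACs for QK^T plus the attention-weighted sum over V: 2 * Nq * Nk * d."""
    return 2 * n_query * n_key * dim

def full_self_attention(h: int, w: int, dim: int) -> int:
    n = h * w
    return attention_macs(n, n, dim)

def multi_pooling_attention(h: int, w: int, dim: int, ratios=(4, 8, 16)) -> int:
    n = h * w
    # Keys/values are taken from maps pooled by each ratio, then
    # concatenated along the token axis.
    n_kv = sum((h // r) * (w // r) for r in ratios)
    return attention_macs(n, n_kv, dim)

full = full_self_attention(64, 64, 64)        # 4096 tokens attend to 4096
pooled = multi_pooling_attention(64, 64, 64)  # 4096 tokens attend to 336
print(full // pooled)  # pooled attention is roughly an order of magnitude cheaper
```

The gap widens at higher resolutions, since the query count grows with H*W while the pooled key/value count grows much more slowly.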

The decoder merges skip-layer features via interpolation and convolution operations to reconstruct detailed defect masks; it is structurally simple yet effective.
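To make the merge step concrete, the following is a hypothetical shape-bookkeeping sketch of one generic decoder stage: the coarse feature is upsampled 2x to the skip's resolution, concatenated along channels, and fused by a convolution that projects back down to the skip's channel width. The channel counts below are illustrative assumptions, not the paper's exact configuration.

```python
# Shape bookkeeping for a generic skip-merging decoder stage (assumed design):
# upsample the coarse map to the skip's resolution, concatenate along
# channels, then fuse with a convolution that restores the skip's width.

def decoder_stage(coarse, skip):
    """coarse, skip: (channels, height, width) shape tuples."""
    c_coarse, h_coarse, w_coarse = coarse
    c_skip, h_skip, w_skip = skip
    # 2x interpolation brings the coarse map to the skip's resolution.
    assert (h_coarse * 2, w_coarse * 2) == (h_skip, w_skip)
    concat_channels = c_coarse + c_skip
    # The fusing conv projects the concatenation down to the skip's width.
    return (c_skip, h_skip, w_skip), concat_channels

# Walking the assumed four-level hierarchy from coarse to fine:
stages = [((512, 8, 8), (256, 16, 16)),
          ((256, 16, 16), (128, 32, 32)),
          ((128, 32, 32), (64, 64, 64))]
for coarse, skip in stages:
    out, cat = decoder_stage(coarse, skip)
    print(out, "after fusing", cat, "concatenated channels")
```

Each stage doubles spatial resolution while halving channel width, so the decoder's cost stays small relative to the encoder.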

  • Patch Aggregation Block (PAB): Uses overlapping convolution for spatial downsampling, preserving neighborhood relations.

The network architecture integrates overlapping patch aggregation for refined spatial representation, in contrast to the non-overlapping linear patch embeddings used in transformers like ViT (Figure 2).
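The difference between overlapping and non-overlapping downsampling comes down to the standard strided-convolution output formula. The hyperparameters below (kernel 3, stride 2, padding 1 for the overlapping case) are assumed typical values, not necessarily the paper's exact settings.

```python
# Output size of a strided convolution: floor((n + 2p - k) / s) + 1.
# With kernel 3, stride 2, padding 1, neighbouring windows overlap and the
# map halves in size; a ViT-style patch embedding (k = s = 4, p = 0) tiles
# non-overlapping patches instead. Exact values here are assumptions.

def conv_out(n: int, kernel: int, stride: int, padding: int) -> int:
    return (n + 2 * padding - kernel) // stride + 1

print(conv_out(224, kernel=3, stride=2, padding=1))  # 112: overlapping halving
print(conv_out(224, kernel=4, stride=4, padding=0))  # 56: non-overlapping patchify
```

Because adjacent windows share pixels in the overlapping case, neighborhood relations survive the downsampling, which is the property the patch aggregation block exploits.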

Figure 2: The overall architecture of the proposed DefT network, which consists of two main modules: the encoder (lower), responsible for gradually reducing spatial resolution and learning feature transformations, and the decoder (upper), which progressively recovers finer details by merging features from the corresponding encoder stages.

Experimentation and Results

Extensive experiments were conducted on challenging datasets including SD-saliency-900, fabric defects, and NRSD-MN, demonstrating the superior performance of DefT in terms of F1-score, false negative rate (FNR), accuracy (ACC), and mean absolute error (MAE) compared to both CNN-based and transformer-based models.
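For reference, the reported metrics can be written down from pixel-wise confusion counts (tp, fp, fn, tn) and per-pixel absolute error. These are the standard definitions; the paper may average or weight them differently across images.

```python
# Standard definitions of the reported metrics, from pixel-wise
# confusion counts. Example numbers below are illustrative only.

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def fnr(tp, fn):
    # False negative rate: fraction of defect pixels the model misses.
    return fn / (fn + tp)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def mae(pred, gt):
    # Mean absolute error between predicted map and ground truth.
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)

print(f1(90, 10, 20))             # ~0.857 (precision 0.9, recall ~0.818)
print(fnr(90, 20))                # ~0.182: 20 of 110 defect pixels missed
print(accuracy(90, 10, 20, 880))  # 0.97 on 1000 pixels
```

Note that under heavy class imbalance, typical of defect images where defect pixels are rare, ACC is dominated by the background, which is why F1, FNR, and MAE carry most of the comparison.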

  • SD-saliency-900 Dataset: Achieved substantial improvements in defect recognition, particularly in cluttered-background scenarios, outperforming competing methods in precision and recall as visualized through PR curves (Figure 3).

    Figure 3: Visual comparison of different methods on SD-saliency-900 dataset. (a) Original image. (b) UNet. (c) MedT. (d) SETR. (e) Swin-unet. (f) SegFormer. (g) TransUNet. (h) MCnet. (i) EDRNet. (j) Ours. (k) Ground truth.

  • Encoder Improvements: Highlighted the importance of convolutional design in integrating inductive biases and enhancing local feature extraction, reflected in faster convergence and better data efficiency (Figure 4).


Figure 4: Performance evaluation of different configurations in terms of F-measure (left) and PR (right) curves.

Discussion and Implications

The DefT model introduces a pragmatic approach to balance the trade-offs between precision, computational cost, and inference speed. By embedding CNN-like properties in transformer architectures, DefT leverages inductive bias for enhanced training and data efficiency, particularly advantageous in industrial applications where large datasets are challenging to obtain.

  • Inductive Bias and Efficiency: Demonstrated that the explicit convolution-based inductive bias significantly aids training stability and robustness in low-data scenarios, outperforming models reliant on extensive pre-training (Figure 5).

    Figure 5: Comparison of training efficiency (left) and data efficiency (right) of DefT and Swin-Unet on NRSD-MN dataset. * means the model initialized by pre-trained weights on ImageNet.

Conclusion

DefT exemplifies the integration of CNN and transformer for defect detection, setting a precedent for future improvements in pixel-wise segmentation tasks. Its efficiency in computational resource usage and capability to maintain high segmentation accuracy even with limited data showcase its potential for broader application in real-time industrial defect detection systems. The robust architecture paves the way for innovative hybrid models capable of addressing complex machine vision challenges.
