- The paper demonstrates that a few strategic modifications to CNN architectures can make CNNs match, and in some cases outperform, Transformers on out-of-distribution image benchmarks.
- It borrows Transformer-inspired design choices: patchifying input images, enlarging convolutional kernels, and reducing the number of activation and normalization layers, the last of which also improves training efficiency.
- Empirical results reveal up to a 4.0% improvement over DeiT-S on datasets like ImageNet-R, highlighting practical advances in robust image recognition.
The paper "Can CNNs Be More Robust Than Transformers?" examines the ongoing debate between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) over robustness, particularly on out-of-distribution samples in image recognition tasks. The authors challenge the prevailing notion that the robustness of Transformers stems primarily from their self-attention architecture. They introduce architectural modifications that raise CNN robustness to a level comparable with, and in some cases exceeding, that of Transformers.
Architectural Innovations for Robust CNNs
The study identifies three architectural changes to CNNs that improve their robustness significantly:
- Patchifying Input Images: Adopting a patchify stem similar to that of ViTs, in which the input image is divided into non-overlapping patches by a strided convolution, increased robustness; larger patch sizes notably improved performance across the robustness benchmarks.
- Enlarging Convolutional Kernel Size: Departing from the standard practice of small 3×3 kernels, the authors found that enlarging kernels to sizes as large as 11×11 significantly improved robustness on out-of-distribution samples.
- Reducing Activation and Normalization Layers: Inspired by Transformer blocks, which use comparatively few such layers, the authors removed redundant activation and normalization layers, yielding considerable robustness gains while also speeding up training.
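The three modifications above can be sketched in PyTorch. This is a minimal illustration in the spirit of the paper's recipe, not the authors' exact Robust-ResNet configuration: the layer names, channel widths, patch size, and block layout here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchifyStem(nn.Module):
    """Modification 1: replace the classic conv + max-pool stem with a
    ViT-style patchify stem, i.e. one non-overlapping strided convolution.
    (embed_dim and patch_size are illustrative choices.)"""
    def __init__(self, in_ch=3, embed_dim=96, patch_size=8):
        super().__init__()
        # stride == kernel_size => non-overlapping patches
        self.proj = nn.Conv2d(in_ch, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        return self.proj(x)

class LargeKernelBlock(nn.Module):
    """Modifications 2 and 3: a residual block with an enlarged 11x11
    depthwise kernel instead of 3x3, and deliberately only one
    normalization and one activation layer per block."""
    def __init__(self, dim, kernel_size=11):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)   # single norm per block
        self.pw1 = nn.Conv2d(dim, 4 * dim, 1)
        self.act = nn.ReLU()              # single activation per block
        self.pw2 = nn.Conv2d(4 * dim, dim, 1)

    def forward(self, x):
        y = self.norm(self.dwconv(x))
        y = self.pw2(self.act(self.pw1(y)))
        return x + y                      # residual connection

stem = PatchifyStem()
block = LargeKernelBlock(96)
out = block(stem(torch.randn(1, 3, 224, 224)))
print(tuple(out.shape))  # (1, 96, 28, 28): 224/8 = 28 patches per side
```

Note how the block uses a depthwise convolution (`groups=dim`) for the large kernel, which keeps the parameter cost of going from 3×3 to 11×11 modest.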
Empirical Results and Implications
The authors conducted comprehensive experiments across robustness benchmarks including Stylized-ImageNet, ImageNet-C, ImageNet-R, and ImageNet-Sketch. Their enhanced architecture, dubbed Robust-ResNet, consistently outperformed its CNN counterparts when all proposed modifications were applied, and surpassed Transformers such as DeiT-S on several out-of-distribution robustness tests. For instance, the enhanced ResNet improved on DeiT-S by 4.0% on ImageNet-R and 3.9% on ImageNet-Sketch.
These results suggest that Transformers' perceived superiority in robustness may stem in part from particular architectural elements rather than from the self-attention mechanism alone. Consequently, with appropriate architectural choices, CNNs can achieve similar levels of robustness.
Theoretical and Practical Contributions
The findings provide important insights into neural architecture designs that enhance robustness in image recognition systems. The architectural recommendations are simple and computationally efficient, thus easily adoptable for current CNN-based systems without significant overhead. The potential to scale these designs for larger model architectures also holds promise for future developments in AI applications, where robustness is critical.
Future Directions
Future work could explore how well these architectural modifications scale to vision tasks beyond image recognition. In addition, integrating self-attention mechanisms into this redesigned CNN framework might yield further performance gains, retaining computational efficiency while benefiting from the strengths of both paradigms.
In conclusion, this paper re-evaluates the robustness potential of CNNs and offers compelling alternatives to widely accepted architectural designs, urging the research community to reconsider prevailing assumptions about CNN and Transformer architectures. As AI systems continue to evolve, synthesizing the advantageous aspects of both CNNs and Transformers will likely remain an active area of innovation.