Achieving 3D Attention via Triplet Squeeze and Excitation Block
The paper introduces a novel approach to 3D attention in computer vision models by integrating Triplet Attention with the Squeeze-and-Excitation mechanism into a composite block termed TripSE. This fusion aims to improve feature extraction and representation in vision tasks, particularly facial expression recognition, by addressing spatial and channel attention in tandem.
Research Context
Convolutional Neural Networks (CNNs) have been pivotal in advancing computer vision thanks to their ability to capture complex patterns effectively. Architectural developments over the years, such as ResNet, DenseNet, and more recently ConvNeXt, have showcased the robustness of CNNs. Transformers briefly shifted the field's focus away from CNNs by enhancing learning through self-attention layers. Nonetheless, CNN-based attention mechanisms like Squeeze-and-Excitation (SE) and Triplet Attention (TA) have shown compelling results in refining feature representations within CNN structures: SE focuses on channel significance, while TA captures cross-dimensional interactions. The emergence of ConvNeXt reaffirmed CNNs by introducing architectural modifications that rival transformer models on various benchmarks.
Methodology
The TripSE block proposed in this study integrates TA and SE attentions to achieve 3D attention. TA operates by transforming input tensors to capture inter-dimensional relationships across width, height, and channels. SE, on the other hand, emphasizes inter-channel dependencies by producing a scaling vector for convolutional feature maps. The TripSE block inherently combines these attention mechanisms:
- Triplet Attention: Captures spatial and cross-dimensional relations by rotating the input tensor across three branches, addressing "where to pay attention."
- Squeeze-and-Excitation: Focuses on channel importance, answering "what to pay attention to."
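To make the SE half concrete, the following is a minimal NumPy sketch of channel reweighting; the bottleneck weights w1 and w2 are illustrative placeholders, not parameters from the paper:

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a (C, H, W) feature map:
    squeeze each channel by global average pooling, excite through a
    two-layer bottleneck, then rescale every channel by its weight."""
    s = x.mean(axis=(1, 2))                 # squeeze: (C,)
    h = np.maximum(0.0, w1 @ s)             # reduce + ReLU: (C//r,)
    w = 1.0 / (1.0 + np.exp(-(w2 @ h)))     # expand + sigmoid: (C,) in (0, 1)
    return x * w[:, None, None]             # channel-wise rescaling

# Example with C=8 channels and reduction ratio r=2 (arbitrary values).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((4, 8)) * 0.1
w2 = rng.standard_normal((8, 4)) * 0.1
out = se_block(x, w1, w2)
```

Because the sigmoid gate lies in (0, 1), the output has the same shape as the input but with every channel attenuated according to its learned importance.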
Several variants of the TripSE block are explored, in which the rotation, pooling, and normalization operations from TA are combined with SE's channel-wise weighting. The goal is to reconcile global channel attention with spatial feature extraction, producing attention maps that better discriminate facial expressions.
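The general composition pattern can be sketched as follows. This is a hypothetical arrangement for illustration only: it applies one TA-style spatial gate (the paper's TA uses three rotated branches and a learned 7x7 convolution, elided here) followed by SE channel reweighting; it is not the exact published variant, and w1/w2 are placeholder bottleneck weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tripse_sketch(x, w1, w2):
    """Illustrative TripSE-style block on a (C, H, W) feature map.

    Spatial half (TA-style 'where'): Z-pool across channels (stacked max
    and mean), then gate each spatial location. The real TA repeats this
    on rotated views (C<->H, C<->W) and averages the branches.

    Channel half (SE-style 'what'): squeeze, bottleneck, excite.
    """
    # --- spatial gate ---
    zpool = np.stack([x.max(axis=0), x.mean(axis=0)])   # (2, H, W)
    x = x * sigmoid(zpool.mean(axis=0))[None, :, :]     # (C, H, W)
    # --- channel gate ---
    s = x.mean(axis=(1, 2))                             # squeeze: (C,)
    w = sigmoid(w2 @ np.maximum(0.0, w1 @ s))           # excite: (C,)
    return x * w[:, None, None]

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((4, 8)) * 0.1
w2 = rng.standard_normal((8, 4)) * 0.1
refined = tripse_sketch(feat, w1, w2)
```

The two gates compose multiplicatively, so each output value is modulated by both a spatial weight and a channel weight, which is the sense in which the block approximates full 3D attention.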
Experimental Evaluation
The paper evaluates TripSE by integrating it into prominent architectures (ResNet18, DenseNet, and ConvNeXt) and testing on benchmark datasets including CIFAR100, ImageNet, FER2013, and AffectNet. ConvNeXt augmented with TripSE surpassed its baseline and set a new benchmark accuracy of 78.27% on FER2013. This illustrates the utility of 3D attention for facial expression recognition, a challenging domain where spatial and channel features must be carefully balanced.
Implications and Future Directions
These findings underscore the value of augmenting CNN architectures with novel attention mechanisms, which are instrumental in tasks demanding fine-grained feature analysis. TripSE's ability to integrate spatial and channel attention offers potential theoretical advances in 3D attention modeling. Practically, it paves the way for more effective models for real-time facial expression recognition and similar applications.
Future work may explore further tuning of TripSE configurations and application to different domains, assessing its applicability in diverse vision tasks. Additionally, investigating optimizations that minimize computational overhead can guide deployment for resource-constrained applications. Research may also delve into integrating attention mechanisms with evolving CNN architectures, balancing complexity with performance gains.