SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision

Published 24 Apr 2024 in cs.CV and cs.AI | (2404.15721v2)

Abstract: Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose SPARO, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using SPARO with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same corresponding concepts. Using SPARO, we demonstrate improvements on downstream recognition, robustness, retrieval, and compositionality benchmarks with CLIP (up to +14% for ImageNet, +4% for SugarCrepe), and on nearest neighbors and linear probe for ImageNet with DINO (+3% each). We also showcase a powerful ability to intervene and select individual SPARO concepts to further improve downstream task performance (up from +4% to +9% for SugarCrepe) and use this ability to study the robustness of SPARO's representation structure. Finally, we provide insights through ablation experiments and visualization of learned concepts.

Summary

  • The paper presents Sparo, a method that modifies transformer outputs by partitioning encodings into distinct concept slots for enhanced robustness and generalization.
  • It demonstrates up to a 14% zero-shot recognition boost on ImageNet with CLIP, a 4% improvement on the SugarCrepe compositionality benchmark, and +3% gains each on ImageNet nearest-neighbor and linear-probe evaluations with DINO.
  • The architecture yields interpretable representations by aligning with human selective attention, paving the way for more flexible and fine-tunable vision models.

Enhancing Vision Transformer Robustness and Compositional Generalization through Selective Attention with Sparo

Introduction to Sparo

Selectively directing attention to relevant stimuli while filtering out extraneous ones is a crucial ability of human perception. This paper applies selective attention principles to improve the robustness and compositional generalization of Transformer-based models for vision tasks. The proposed method, dubbed Sparo (Separate-head attention read-out), restructures the transformer's output so that individual concepts within an image or text are attended to separately, inspired by human cognitive processes. Added to representation learners such as CLIP and DINO, Sparo yields consistent improvements across a range of benchmarks, demonstrating its effectiveness in real-world scenarios.

Sparo Architecture

Sparo modifies the typical Transformer architecture by replacing the standard read-out (e.g., the pooled or CLS-token encoding) with a module that partitions the encoding into separately-attended slots. Each slot corresponds to a distinct concept and is produced by a single attention head. This structure induces a robust partition of concepts, allowing for a more organized and interpretable internal representation. The Sparo-enhanced model possesses several advantages:

  1. Enhanced Generalization: By focusing on separate concepts, Sparo allows the model to better generalize across different tasks and datasets.
  2. Robustness: Attending to each concept separately makes the model less sensitive to distractions and distribution shifts in the input data.
  3. Improved Compositionality: The decomposed nature of the output makes it easier for the model to handle compositional tasks, where the relationship between different parts of the input is crucial.
  4. Flexibility: The method can be applied to any Transformer model and is compatible with various tasks and modalities.
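The separate-head read-out described above can be sketched in a few lines of NumPy: each slot has its own learned query and its own single attention head over the backbone's token sequence, and the final encoding is the collection of slot outputs. This is a minimal illustration under assumed shapes; the class and parameter names (`SparoReadout`, `Wk`, `Wv`) are hypothetical, not taken from the paper's released code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SparoReadout:
    """Sketch of a separate-head attention read-out: each of n_slots
    slots attends over the backbone's tokens with its own single
    attention head, using a learned per-slot query vector."""
    def __init__(self, d_model, n_slots, d_slot, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        self.q = rng.normal(0, s, (n_slots, d_model))          # learned per-slot queries
        self.Wk = rng.normal(0, s, (n_slots, d_model, d_model))  # per-slot key projections
        self.Wv = rng.normal(0, s, (n_slots, d_model, d_slot))   # per-slot value projections

    def __call__(self, tokens):
        # tokens: (seq_len, d_model) output of the transformer backbone
        slots = []
        for q, Wk, Wv in zip(self.q, self.Wk, self.Wv):
            k = tokens @ Wk                                    # (seq_len, d_model)
            v = tokens @ Wv                                    # (seq_len, d_slot)
            attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))     # (seq_len,) attention weights
            slots.append(attn @ v)                             # (d_slot,) slot encoding
        return np.stack(slots)                                 # (n_slots, d_slot)

tokens = np.random.default_rng(1).normal(size=(16, 32))
enc = SparoReadout(d_model=32, n_slots=4, d_slot=8)(tokens)
```

Because each head sees only its own key/value projections and query, the slots are structurally separated: no slot's output mixes attention weights from another slot.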

Performance Evaluation

Sparo was evaluated across multiple benchmarks, including zero-shot recognition on ImageNet, robustness tests across altered ImageNet datasets, and compositionality through tasks like SugarCrepe. The results demonstrate considerable improvements:

  • On ImageNet, Sparo-equipped CLIP achieved up to a 14% improvement in zero-shot recognition, with corresponding gains on distribution-shifted ImageNet variants.
  • With DINO, Sparo improved ImageNet nearest-neighbor and linear-probe accuracy by 3% each.
  • On compositional benchmarks like SugarCrepe, improvements started at 4% and rose to 9% when individual concept slots were selected post hoc, highlighting Sparo’s ability to handle complex input relationships effectively.
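For concreteness, the zero-shot recognition protocol used with CLIP-style models can be sketched as follows: flatten the slot-structured image encoding, L2-normalize it, and pick the class whose (similarly normalized) text encoding is most similar. This is the generic CLIP zero-shot recipe, shown here as an illustration, not the paper's exact evaluation code.

```python
import numpy as np

def zero_shot_predict(image_enc, text_encs):
    """CLIP-style zero-shot classification sketch: flatten the
    slot-structured encodings, L2-normalize, and return the index
    of the most similar class-text encoding."""
    img = image_enc.ravel()
    img = img / (np.linalg.norm(img) + 1e-9)
    txt = text_encs.reshape(len(text_encs), -1)
    txt = txt / (np.linalg.norm(txt, axis=1, keepdims=True) + 1e-9)
    return int(np.argmax(txt @ img))

# Toy usage: 3 classes, encodings with 4 slots of dimension 8.
rng = np.random.default_rng(0)
texts = rng.normal(size=(3, 4, 8))
pred = zero_shot_predict(texts[1].copy(), texts)  # image encoding equals class 1's text encoding
```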

Ablation Studies and Insight

Further ablation studies and interventions were conducted to understand the necessity of each architectural choice within Sparo:

  • Single-Head Attention: Focusing on single-head attention ensured that each slot captures distinct, non-overlapping conceptual information, which is crucial for the robust compositional learning observed.
  • Slot Contribution Analysis: By manipulating which slots to consider in outputs, researchers could gain insights into what parts of the data the model deems crucial for specific tasks. This post-hoc analysis allowed for fine-grained understanding and further tuning of the model for specialized tasks.
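The slot-selection intervention described above amounts to masking out all but a chosen subset of concept slots before comparing encodings. A minimal sketch of that manipulation, with hypothetical names and toy random encodings standing in for real model outputs:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_slots(encoding, keep):
    """Keep only the chosen concept slots; zero out the rest.
    encoding: (n_slots, d_slot); keep: iterable of slot indices."""
    mask = np.zeros(encoding.shape[0], dtype=bool)
    mask[list(keep)] = True
    out = encoding.copy()
    out[~mask] = 0.0
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))  # toy image encoding: 4 slots of dimension 8
txt = rng.normal(size=(4, 8))  # toy text encoding
full = cosine(img.ravel(), txt.ravel())
subset = cosine(select_slots(img, [0, 2]).ravel(), select_slots(txt, [0, 2]).ravel())
```

Comparing `full` against `subset` similarities across a benchmark is the kind of post-hoc analysis that lets one identify which slots a task actually relies on.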

Theoretical and Practical Implications

The advancements proposed with Sparo suggest considerable theoretical implications for the design of machine learning systems:

  1. Understanding Attention: Sparo provides a practical application of cognitive science principles, specifically selective attention, in a computational context, offering a bridge between human cognitive processes and machine learning models.
  2. Compositional Learning: By achieving significant gains in compositional benchmarks, Sparo pushes the frontier in how machines handle and understand compositional information—an area traditionally challenging for AI systems.

Future Directions

Exploring Sparo in different contexts and for varied tasks opens avenues for research, particularly in how selective attention can be further refined and utilized in AI. Integrating Sparo with different types of data, refining slot mechanisms, and exploring dynamic slot allocation based on input complexity are potential areas for further investigation.

In conclusion, Sparo introduces a refined framework for implementing selective attention in transformers, enhancing their robustness, generalization, and ability to handle compositional data in vision tasks. The ability to selectively attend to specific, non-overlapping components within the data invites not only improved performance but also more interpretable and fine-tunable models aligned with human cognitive abilities.
