- The paper introduces sparse autoencoders to decompose polysemantic neurons, revealing clearer, monosemantic features in InceptionV1.
- The methodology oversamples large activations and performs dictionary learning on specific convolutional branches of InceptionV1, using synthetic stimuli to isolate and validate distinct curve detectors.
- The results improve mechanistic interpretability in CNNs, paving the way for advanced feature detection and analysis in vision models.
Sparse Autoencoders and Interpretability in InceptionV1
The paper "The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision" examines the mechanistic interpretability of convolutional neural networks (CNNs), with a specific focus on InceptionV1. It applies sparse autoencoders (SAEs) to isolate interpretable features, a methodology that addresses the long-standing problem of polysemantic neurons, widely attributed to superposition.
Background and Motivation
The resurgence of mechanistic interpretability research has been stimulated by challenges in deciphering the complex behavior of neural networks. At the forefront of these challenges is the presence of polysemantic neurons, which respond to multiple unrelated stimuli. This polysemanticity is often attributed to superposition, in which a network represents more features than it has neurons, so that individual neurons encode combinations of features rather than isolated responses.
The seminal work on InceptionV1 recognized polysemantic neurons as a significant impediment to neuron-based analysis. This study revitalizes interest in the interpretability of CNNs by applying SAEs to InceptionV1, particularly its early vision components. SAEs, previously successful in extracting interpretable features from large language models (LLMs), offer a promising avenue for dissecting neural responses within image processing contexts.
Methodological Approach
The core methodological innovation lies in applying SAEs to the activation vectors produced by InceptionV1 as it processes images from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset. Two choices shape the training setup: large activations are oversampled so that significant stimulus responses are prioritized, and dictionary learning is performed separately on specific convolutional branches of InceptionV1 to stay within memory constraints.
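The training setup described above can be sketched as follows. This is a minimal NumPy illustration of the two core ideas, a norm-weighted sampling of activation vectors and an overcomplete autoencoder with an L1 sparsity penalty; the dimensions, coefficients, and random stand-in activations are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for InceptionV1 activation vectors: n samples of dimension d.
d, n_feats, n = 64, 256, 1000
acts = rng.normal(size=(n, d))

# Oversample large activations: draw training batches with probability
# proportional to activation norm, so strong stimulus responses dominate.
norms = np.linalg.norm(acts, axis=1)
batch_idx = rng.choice(n, size=128, p=norms / norms.sum())
batch = acts[batch_idx]

# Overcomplete sparse autoencoder: linear encoder, ReLU, linear decoder.
W_enc = rng.normal(scale=0.1, size=(d, n_feats))
b_enc = np.zeros(n_feats)
W_dec = rng.normal(scale=0.1, size=(n_feats, d))
b_dec = np.zeros(d)

def sae_forward(x):
    f = np.maximum(0.0, x @ W_enc + b_enc)   # sparse feature activations
    x_hat = f @ W_dec + b_dec                # reconstruction
    return f, x_hat

def sae_loss(x, l1_coef=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity.
    f, x_hat = sae_forward(x)
    recon = ((x - x_hat) ** 2).sum(axis=1).mean()
    sparsity = l1_coef * np.abs(f).sum(axis=1).mean()
    return recon + sparsity

loss = sae_loss(batch)
```

In practice the weights would be optimized with gradient descent over many such batches; the sketch stops at the objective itself.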
The paper describes how features are extracted by the SAE's decomposition of each activation vector into a sparse combination of learned feature directions. This approach enables the identification of monosemantic features within previously polysemantic neuron groups. Additionally, modifications to the SAE training regime, such as avoiding dead neurons through unconstrained decoder norms, are discussed as a way to ensure high fidelity in feature discovery.
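The decomposition step can be made concrete with a short sketch. Given (hypothetical) trained SAE weights, an activation vector splits into per-feature contributions, one decoder direction scaled by each active coefficient, and with unconstrained decoder norms a feature's overall importance folds the decoder row norm into its coefficient. The random weights here stand in for a trained model purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-ins for trained SAE weights (random for the sketch).
d, n_feats = 64, 256
W_enc = rng.normal(scale=0.1, size=(d, n_feats))
W_dec = rng.normal(scale=0.1, size=(n_feats, d))
b_dec = np.zeros(d)

x = rng.normal(size=d)                      # one activation vector
f = np.maximum(0.0, x @ W_enc)              # sparse feature coefficients

# Each active feature contributes f[i] * W_dec[i] to the reconstruction,
# so one polysemantic activation splits into separable feature parts.
contributions = f[:, None] * W_dec          # shape (n_feats, d)
x_hat = b_dec + contributions.sum(axis=0)   # reconstruction of x

# With unconstrained decoder norms, a feature's effective strength is its
# coefficient times the norm of its decoder direction.
importance = f * np.linalg.norm(W_dec, axis=1)
top = np.argsort(importance)[::-1][:5]      # most influential features for x
```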
Results and Analysis
The results of this research underscore several critical findings:
- Discovery of New Features: SAEs surface interpretable features that are not apparent when activations are examined neuron by neuron. This suggests that some neuron activity previously viewed as uninterpretable in fact contributes to distinct, interpretable features.
- Uncovering Missing Curve Detectors: Using synthetic data approaches, the study highlights how SAEs reveal additional curve detectors that fill acknowledged gaps in previously identified neuron families. This discovery is visually reinforced through radial tuning curves and response plots showing distinct orientation and radius preferences for different curve detectors.
- Decomposition of Polysemantic Neurons: The decomposition of polysemantic neurons into more coherent, monosemantic features is substantiated, with specific examples such as "double curve" detectors being bifurcated into distinct curve-oriented features under different SAE configurations.
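The synthetic-stimulus methodology behind the curve-detector results can be illustrated in miniature: render curve segments that sweep over orientation, record a detector's response to each, and read off the preferred orientation from the resulting tuning curve. The arc renderer and the template-matching "detector" below are simplified hypothetical stand-ins for the paper's stimuli and SAE features.

```python
import numpy as np

def arc_image(orientation, radius=8.0, size=40, width=1.5, span=1.0):
    """Render a curve segment bulging toward `orientation` (radians)."""
    yy, xx = np.mgrid[:size, :size] - (size - 1) / 2.0
    # Circle center sits `radius` away from the image center, opposite
    # the orientation, so the arc passes through the center of the image.
    cx = -radius * np.cos(orientation)
    cy = -radius * np.sin(orientation)
    dx, dy = xx - cx, yy - cy
    on_circle = np.abs(np.hypot(dx, dy) - radius) < width
    # Keep only the angular slice of the circle facing `orientation`.
    dang = np.angle(np.exp(1j * (np.arctan2(dy, dx) - orientation)))
    return (on_circle & (np.abs(dang) < span)).astype(float)

# Hypothetical curve detector: a matched template for one orientation.
preferred = np.pi / 4
template = arc_image(preferred)
template /= np.linalg.norm(template)

# Tuning curve: detector response as stimulus orientation sweeps 0..2pi.
orientations = np.linspace(0, 2 * np.pi, 72, endpoint=False)
responses = np.array([(arc_image(o) * template).sum() for o in orientations])
best = orientations[np.argmax(responses)]
```

The same sweep over radius instead of orientation would produce the radial tuning plots the paper uses to characterize each curve detector's radius preference.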
Implications and Future Directions
These findings present significant implications for the field of machine learning interpretability. By separating the feature space more cleanly than individual neurons do, SAEs enable further circuit analysis and enhance understanding of neural networks' decision-making processes. This study posits that such understanding is not confined to LLMs and applies equally to vision models like InceptionV1.
The systematic application of SAEs to convolutional branches indicates the potential for specialized feature detection within network subsets, a promising area for future exploration. Moreover, the observation of cross-branch superposition encourages in-depth analyses into the specialization phenomena within CNN architectures.
In conclusion, this paper contributes to the refinement of mechanistic interpretability methodologies, offering a robust framework for enhancing the interpretability of sophisticated neural network architectures. The insights it provides pave the way for greater accountability and understanding of AI models in diverse applications such as autonomous vision systems and complex feature recognition tasks.