
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Published 22 Sep 2024 in cs.CL and cs.AI (arXiv:2409.14507v5)

Abstract: Sparse Autoencoders (SAEs) aim to decompose the activation space of LLMs into human-interpretable latent directions or features. As we increase the number of features in the SAE, hierarchical features tend to split into finer features ("math" may split into "algebra", "geometry", etc.), a phenomenon referred to as feature splitting. However, we show that sparse decomposition and splitting of hierarchical features is not robust. Specifically, we show that seemingly monosemantic features fail to fire where they should, and instead get "absorbed" into their children features. We coin this phenomenon feature absorption, and show that it is caused by optimizing for sparsity in SAEs whenever the underlying features form a hierarchy. We introduce a metric to detect absorption in SAEs, and validate our findings empirically on hundreds of LLM SAEs. Our investigation suggests that varying SAE sizes or sparsity is insufficient to solve this issue. We discuss the implications of feature absorption in SAEs and some potential approaches to solve the fundamental theoretical issues before SAEs can be used for interpreting LLMs robustly and at scale.

Summary

  • The paper demonstrates that sparse autoencoders struggle to balance precision and recall when extracting monosemantic latents from LLM activations.
  • It identifies feature absorption as a key limitation where intended features fail to activate reliably, compromising model interpretability.
  • Extensive probing reveals that increased sparsity and width exacerbate feature splitting and absorption, posing challenges for LLM interpretability.

Introduction

The paper "A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders" investigates Sparse Autoencoders (SAEs) in the context of LLMs, examining how these models decompose dense activations into human-interpretable latents. The research poses two central questions: to what extent can SAEs extract monosemantic, interpretable latents from LLM activations, and how do changes in SAE sparsity or size affect monosemanticity and interpretability? The paper introduces the notion of "feature absorption," a problematic form of feature splitting in which seemingly monosemantic latents fail to fire where they should.

SAE Performance on First Letter Identification

The study uses a first-letter identification task to scrutinize the interpretability of SAE latents. The researchers train logistic regression (LR) probes to establish a baseline, then compare them against SAE latents aligned to the same task. Evaluating precision and recall shows that SAEs generally underperform the linear probes. Results, detailed in Figure 1, indicate that varying the sparsity (L0 norm) and width (number of latents) of the SAEs does not substantially close this gap: the best SAE configurations achieve high precision at low recall, or vice versa, but fail to balance both.
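The precision/recall comparison above can be illustrated with a small sketch. This is synthetic toy data, not the paper's code: the probe and the "absorbed" SAE latent are simulated so that the latent stays silent on a fraction of positive tokens, which depresses recall while leaving precision high.

```python
import numpy as np

# Illustrative sketch (synthetic data, not the paper's evaluation code):
# compare a probe score and an SAE latent as binary classifiers for
# "token starts with S", measuring precision and recall for each.

rng = np.random.default_rng(0)
n = 1000
starts_with_s = rng.random(n) < 0.1               # ground-truth labels

# A well-calibrated probe scores nearly all positives highly.
probe_score = starts_with_s * 1.0 + rng.normal(0, 0.1, n)

# An SAE latent with "absorption": it stays silent on ~30% of positives
# (those tokens are covered by a different, absorbing latent instead).
absorbed = starts_with_s & (rng.random(n) < 0.3)
sae_act = np.where(starts_with_s & ~absorbed, 1.0, 0.0)

def precision_recall(score, labels, thresh=0.5):
    pred = score > thresh
    tp = np.sum(pred & labels)
    precision = tp / max(np.sum(pred), 1)
    recall = tp / max(np.sum(labels), 1)
    return float(precision), float(recall)

print(precision_recall(probe_score, starts_with_s))  # high precision, high recall
print(precision_recall(sae_act, starts_with_s))      # high precision, low recall
```

The latent never fires spuriously here, so its precision stays perfect; the recall gap is entirely due to the absorbed positives, mirroring the precision/recall imbalance the paper reports.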

Feature Absorption: Concept and Case Study

A central finding of this paper is "feature absorption," defined as cases where an SAE latent appears monosemantic but fails to activate on tokens it should cover, with token-aligned latents absorbing the feature instead. A detailed case study of the latent tracking the "starts with S" feature in a specific SAE configuration provides compelling evidence: the latent activates as expected on tokens like "sample" or "stone" but stays silent on "short", where a different, token-aligned latent fires instead. Ablation studies, visualized in Figure 2, corroborate the causal role of these absorbing latents, showing that they carry a substantial share of the feature's effect on model behavior.
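The ablation logic can be sketched minimally. This is a toy construction under stated assumptions (a hypothetical decoder matrix `W_dec` in which an "absorbing" latent's direction also carries the feature direction), not the paper's experiment: ablating the absorbing latent removes the feature-direction component of the reconstruction.

```python
import numpy as np

# Minimal sketch of the ablation idea (synthetic, not the paper's code):
# reconstruct an activation from SAE latents, zero ("ablate") the
# suspected absorbing latent, and measure how much of the feature-direction
# component disappears. `W_dec` and `latent_acts` are illustrative stand-ins.

rng = np.random.default_rng(0)
d_model, n_latents = 16, 8

feature_dir = rng.normal(size=d_model)
feature_dir /= np.linalg.norm(feature_dir)         # probe direction for "starts with S"

W_dec = 0.1 * rng.normal(size=(n_latents, d_model))
W_dec[3] = feature_dir                              # the nominal "starts with S" latent
W_dec[5] += 0.9 * feature_dir                       # absorbing latent also carries the feature

# On a token like "short", the nominal latent stays silent and only the
# absorbing latent fires.
latent_acts = np.zeros(n_latents)
latent_acts[5] = 2.0

def probe_component(x):
    return float(x @ feature_dir)

before = probe_component(latent_acts @ W_dec)
ablated = latent_acts.copy()
ablated[5] = 0.0                                    # ablate the absorbing latent
after = probe_component(ablated @ W_dec)

# Ablating the absorbing latent removes essentially all of the
# feature-direction signal: evidence of its causal role.
print(before, after)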

Quantifying Feature Splitting and Absorption

The researchers conducted comprehensive probing to quantify the prevalence of feature splitting and feature absorption across SAE configurations. Feature splitting is identified through k-sparse probing, revealing that wider and sparser SAEs exhibit higher rates of splitting. Figure 3 illustrates that while sparser SAEs tend to decompose general features into more specific ones, the resulting splits do not align consistently with interpretable concepts, so interpretability does not inherently improve.
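A sketch of the k-sparse probing idea, on synthetic data (not the paper's code): a "split" feature is simulated as three child latents that each cover a disjoint third of the positive examples. Probing with k=1 captures only one child, while larger k recovers the parent feature.

```python
import numpy as np

# Sketch of k-sparse probing (illustrative, synthetic data): pick the k
# SAE latents whose activations best separate the classes, and check
# classification accuracy using only those k latents.

rng = np.random.default_rng(0)
n = 2000
label = rng.random(n) < 0.5

# Simulate a "split" feature: three child latents each cover a disjoint
# third of the positive examples; three extra latents are unrelated noise.
group = rng.integers(0, 3, n)
acts = np.zeros((n, 6))
for child in range(3):
    acts[:, child] = (label & (group == child)).astype(float)
acts[:, 3:] = rng.normal(0, 0.1, (n, 3))

def k_sparse_accuracy(acts, label, k):
    # Score each latent by the mean activation difference between classes,
    # keep the top k, and classify by thresholding their summed activation.
    score = acts[label].mean(0) - acts[~label].mean(0)
    top = np.argsort(-np.abs(score))[:k]
    pred = acts[:, top].sum(1) > 0.5
    return float((pred == label).mean())

for k in (1, 2, 3):
    print(k, k_sparse_accuracy(acts, label, k))
```

With k=1 only one child latent is used, so roughly a third of the positives are detected; by k=3 all children are selected and the parent feature is recovered, which is how k-sparse probing surfaces splitting.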

Feature absorption is measured by identifying failure points where the intended latent does not activate but a token-aligned latent does. The phenomenon increases with SAE sparsity and width, suggesting a trade-off between sparsity and reliable feature alignment. Figures 4a and 4b show that feature absorption is widespread, calling into question the reliability of SAEs for precise interpretability tasks.
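An absorption check in this spirit can be sketched as follows. The thresholds, decoder matrix, and latent indices are illustrative assumptions, not the paper's exact metric: a positive token is flagged as "absorbed" when the main feature latent stays silent but some other firing latent's decoder direction aligns strongly with the feature's probe direction.

```python
import numpy as np

# Sketch of an absorption flag (synthetic; thresholds and names are
# illustrative): on a positive token, if the main latent does not fire
# but another firing latent's decoder direction has high cosine
# similarity with the probe direction, count the token as absorbed.

rng = np.random.default_rng(0)
d_model, n_latents = 32, 10
probe_dir = rng.normal(size=d_model)
probe_dir /= np.linalg.norm(probe_dir)

W_dec = rng.normal(size=(n_latents, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
W_dec[0] = probe_dir                           # main "starts with S" latent
W_dec[7] = 0.95 * probe_dir + 0.05 * rng.normal(size=d_model)
W_dec[7] /= np.linalg.norm(W_dec[7])           # absorbing latent, nearly aligned

def is_absorbed(latent_acts, main=0, cos_thresh=0.7, act_thresh=0.0):
    if latent_acts[main] > act_thresh:
        return False                           # main latent fired: not absorbed
    cos = W_dec @ probe_dir                    # alignment of every decoder direction
    firing = latent_acts > act_thresh
    firing[main] = False
    return bool(np.any(firing & (cos > cos_thresh)))

acts = np.zeros(n_latents)
acts[7] = 1.3                                  # only the aligned latent fires
print(is_absorbed(acts))                       # absorption case: main silent, aligned latent firing
```

Aggregating this flag over all positive tokens yields an absorption rate per feature, which is the kind of quantity plotted against SAE sparsity and width.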

Implications and Future Directions

The implications of these findings are significant for AI interpretability and safety-critical applications. Feature absorption poses a serious challenge for methods that rely on circuit analysis or sparse feature combinations. If left unaddressed, it compromises the reliability of SAE latents as indicators of internal model behavior, such as detecting bias or identifying deceptive behavior.

Future work should validate these results across different model architectures and on tasks beyond first-letter identification. Meta-SAEs, which may decompose absorbing latents, are a promising avenue, as are alternative methods such as attribution dictionary learning. Establishing a standardized framework for evaluating and mitigating phenomena like feature absorption is crucial for advancing interpretability methodologies in LLMs.

Conclusion

This paper contributes notably to understanding the limitations of using SAEs for interpretability in LLMs. By highlighting feature absorption and its impact on our ability to read model behavior from latents, it identifies critical areas for future research. As LLMs are applied to increasingly complex tasks, robust methods for understanding and interpreting them will be indispensable.
