
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality

Published 31 Mar 2025 in cs.LG and cs.AI | (2503.24277v2)

Abstract: Sparse autoencoders (SAEs) are widely used in mechanistic interpretability research for LLMs; however, the state-of-the-art method of using $k$-sparse autoencoders lacks a theoretical grounding for selecting the hyperparameter $k$ that represents the number of nonzero activations, often denoted by $\ell_0$. In this paper, we reveal a theoretical link that the $\ell_2$-norm of the sparse feature vector can be approximated with the $\ell_2$-norm of the dense vector with a closed-form error, which allows sparse autoencoders to be trained without the need to manually determine $\ell_0$. Specifically, we validate two applications of our theoretical findings. First, we introduce a new methodology that can assess the feature activations of pre-trained SAEs by computing the theoretically expected value from the input embedding, which has been overlooked by existing SAE evaluation methods and loss functions. Second, we introduce a novel activation function, top-AFA, which builds upon our formulation of approximate feature activation (AFA). This function enables top-$k$ style activation without requiring a constant hyperparameter $k$ to be tuned, dynamically determining the number of activated features for each input. By training SAEs on three intermediate layers to reconstruct GPT2 hidden embeddings for over 80 million tokens from the OpenWebText dataset, we demonstrate the empirical merits of this approach and compare it with current state-of-the-art $k$-sparse autoencoders. Our code is available at: https://github.com/SewoongLee/top-afa-sae.

Summary

  • The paper introduces Approximate Feature Activation (AFA) to dynamically adapt feature sparsity, eliminating the fixed-k hyperparameter.
  • It employs ZF plots and ε-quasi-orthogonality constraints to diagnose over- and under-activation with provable error bounds.
  • Empirical results demonstrate that the top-AFA approach outperforms fixed-k methods in reconstructing GPT-2 embeddings for enhanced interpretability.


Sparse autoencoders (SAEs) have been integral in enhancing the interpretability of LLMs, primarily by decomposing dense embeddings into sparse, interpretable feature vectors. Despite their utility, state-of-the-art $k$-sparse autoencoders lack a theoretical grounding for selecting the hyperparameter $k$, which represents the number of nonzero activations. This paper introduces a theoretically grounded approach that approximates the $\ell_2$-norm of sparse feature vectors using the dense input vector's norm, eliminating the need to set $k$ manually. The authors explore applications of this theoretical link in evaluating and designing SAEs.

Theoretical Contributions

The primary theoretical contribution is Approximate Feature Activation (AFA), which provides a closed-form approximation of the $\ell_2$-norm of sparse feature vectors, with provable error bounds. The authors also introduce a method for diagnosing over- or under-activation of features through ZF plots. Additionally, the paper formalizes $\varepsilon$-quasi-orthogonality as a constraint related to the superposition hypothesis, forming the basis for evaluating SAEs with a novel metric, $\varepsilon_{\text{LBO}}$, the lower bound of quasi-orthogonality (Figure 1).

Figure 1: Comparison of SAE evaluation approaches and activation selection.
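The $\varepsilon$-quasi-orthogonality constraint can be probed numerically. The sketch below is an illustration, not the paper's implementation; the function name and NumPy formulation are my own. It computes the smallest $\varepsilon$ for which a decoder's unit-normalized columns are $\varepsilon$-quasi-orthogonal, i.e. the largest absolute cosine similarity between distinct feature directions:

```python
import numpy as np

def quasi_orthogonality_eps(decoder: np.ndarray) -> float:
    """Smallest eps such that all distinct decoder directions satisfy
    |<w_i, w_j>| <= eps after unit normalization (eps-quasi-orthogonality).

    decoder: (d_model, n_features) matrix whose columns are feature directions.
    """
    cols = decoder / np.linalg.norm(decoder, axis=0, keepdims=True)
    gram = np.abs(cols.T @ cols)   # pairwise absolute cosine similarities
    np.fill_diagonal(gram, 0.0)    # ignore self-similarity
    return float(gram.max())
```

For a truly orthogonal decoder (e.g. the identity) this returns 0; packing more features than dimensions, as the superposition hypothesis posits, forces $\varepsilon > 0$.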

Methodological Advancements

The novel activation function, top-AFA, is a significant methodological advancement. It dynamically selects the number of active features by aligning activation norms with input norms, removing the dependency on a fixed hyperparameter $k$. This adaptive approach contrasts with fixed-$k$ methodologies, enabling SAEs to better match their feature activations with input embeddings (Figure 2).


Figure 2: Relationship between dense embedding vector norm and learned feature activation norm in ZF plots.
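This summary does not spell out top-AFA's exact selection rule, but the idea of "aligning activation norms with input norms" can be sketched as follows. This is a hypothetical NumPy illustration, assuming nonnegative pre-activations and the dense input's $\ell_2$-norm as the target: keep the largest activations until their cumulative $\ell_2$-norm first reaches the target.

```python
import numpy as np

def top_afa(pre_acts: np.ndarray, target_norm: float) -> np.ndarray:
    """Keep the largest pre-activations until their cumulative l2-norm
    first reaches target_norm; zero out the rest.

    The number of surviving features varies per input, unlike fixed top-k.
    """
    order = np.argsort(-pre_acts)               # feature indices, descending
    cum_sq = np.cumsum(pre_acts[order] ** 2)    # running squared norm
    k = int(np.searchsorted(cum_sq, target_norm ** 2)) + 1
    k = min(k, pre_acts.size)                   # cap if target is unreachable
    out = np.zeros_like(pre_acts)
    out[order[:k]] = pre_acts[order[:k]]
    return out
```

Note how the effective sparsity level falls out of the norm-matching condition rather than being tuned in advance, which is the behavior the paper attributes to top-AFA.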

The paper validates these concepts by training SAEs on intermediate layers to reconstruct hidden embeddings from GPT-2, achieving superior performance over $k$-sparse autoencoders. This empirical validation showcases the practical applicability of the theoretical framework.

Empirical Evaluation

Empirical results demonstrate that top-AFA outperforms top-$k$ and batch top-$k$ activation functions across multiple layers of the GPT-2 architecture. The paper presents compelling evidence that adaptively selecting the number of activations based on input norms can exceed the performance boundaries inherent in fixed-$k$ methods (Figure 3).

Figure 3: Geometrical intuition behind ZF plots and activation norm mismatch.
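The over- or under-activation diagnostic behind ZF plots can be approximated by comparing norms per example. The following is a minimal sketch in my own formulation, assuming the paper's finding that the feature-activation norm should track the input norm:

```python
import numpy as np

def norm_ratio(dense: np.ndarray, sparse: np.ndarray) -> np.ndarray:
    """Per-example ratio ||z||_2 / ||x||_2 of feature-activation norm to
    input-embedding norm. Under the paper's approximation these should be
    close to 1; values well above 1 suggest over-activation, well below 1
    under-activation."""
    return np.linalg.norm(sparse, axis=1) / np.linalg.norm(dense, axis=1)
```

Plotting these ratios (or the two norms against each other, as in a ZF plot) gives a quick visual check of how closely a trained SAE respects the norm-matching relationship.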

Implications and Future Directions

The implications of this research are significant for both the theoretical and practical application of SAEs in LLM interpretability. By providing a theoretically justified method to determine sparsity without a fixed hyperparameter, this work opens pathways for more adaptive architectures in other neural network applications focused on sparse representations.

Future research could explore extending the AFA framework to more complex network structures, such as multi-layer models or those incorporating non-linear transformation layers, potentially enhancing the adaptability and performance of SAEs in increasingly complex tasks.

Conclusion

The paper delivers a significant theoretical and practical contribution to the field of sparse autoencoders by introducing a framework that does not rely on a fixed sparsity level but instead adapts dynamically to input characteristics. This advancement addresses core limitations in current models and sets the stage for more robust, theoretically grounded autoencoders that can be leveraged across a variety of applications within the field of LLM interpretability and beyond.

In summary, the introduction of the Approximate Feature Activation and its empirical validation represent a notable step forward in the design and evaluation of Sparse Autoencoders, enhancing their interpretability, adaptability, and potential application range.
