
Learning Multi-Level Features with Matryoshka Sparse Autoencoders

Published 21 Mar 2025 in cs.LG and cs.AI | (2503.17547v1)

Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting neural networks by extracting the concepts represented in their activations. However, choosing the size of the SAE dictionary (i.e. number of learned concepts) creates a tension: as dictionary size increases to capture more relevant concepts, sparsity incentivizes features to be split or absorbed into more specific features, leaving high-level features missing or warped. We introduce Matryoshka SAEs, a novel variant that addresses these issues by simultaneously training multiple nested dictionaries of increasing size, forcing the smaller dictionaries to independently reconstruct the inputs without using the larger dictionaries. This organizes features hierarchically - the smaller dictionaries learn general concepts, while the larger dictionaries learn more specific concepts, without incentive to absorb the high-level features. We train Matryoshka SAEs on Gemma-2-2B and TinyStories and find superior performance on sparse probing and targeted concept erasure tasks, more disentangled concept representations, and reduced feature absorption. While there is a minor tradeoff with reconstruction performance, we believe Matryoshka SAEs are a superior alternative for practical tasks, as they enable training arbitrarily large SAEs while retaining interpretable features at different levels of abstraction.

Summary

  • The paper presents a novel hierarchical design with Matryoshka SAEs to overcome limitations like feature splitting and absorption in traditional sparse autoencoders.
  • It employs nested dictionaries that enable multi-level feature learning, balancing broad generalizations with fine-grained details.
  • Empirical results demonstrate improved latent disentanglement and model performance on tasks such as sparse probing and concept erasure across various datasets.


Introduction

The paper "Learning Multi-Level Features with Matryoshka Sparse Autoencoders" (2503.17547) addresses the challenges of interpretability in neural networks using Sparse Autoencoders (SAEs). Traditional SAEs, while effective in teasing apart meaningful features from neural network activations, are often thwarted by issues like feature splitting, absorption, and composition when scaling, due to their flat sparsity constraints. The novel approach proposed in the paper, Matryoshka SAEs, introduces a hierarchical structure to SAEs by training multiple nested dictionaries of increasing sizes. This approach aims to balance the tension between retaining high-level general features and capturing more specific, fine-grained details.

Methodology

Matryoshka SAEs integrate hierarchical feature learning in the spirit of Matryoshka dolls: each nested prefix of the latent dictionary must independently reconstruct the input, building a multi-level hierarchy of abstraction. Training these nested SAEs simultaneously encourages the earliest latents to learn broad, high-level concepts while later latents capture more specialized features, without any incentive to cannibalize the broader ones.
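The nested-prefix objective described above can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the function names, the top-k activation rule, and the equal weighting of prefix losses are all assumptions made for clarity.

```python
import numpy as np

def matryoshka_sae_loss(x, W_enc, b_enc, W_dec, b_dec, prefixes, k=4):
    """Toy Matryoshka SAE objective: every nested prefix of the
    dictionary must reconstruct the input on its own.

    x: (d,) input activation; W_enc: (d, m); W_dec: (m, d);
    prefixes: increasing dictionary sizes, e.g. [4, 8, 16], the last
    equal to the full dictionary size m. The top-k sparsity rule and
    equal per-prefix weighting are illustrative assumptions.
    """
    pre = x @ W_enc + b_enc  # encoder pre-activations, shape (m,)
    # Top-k sparsity: keep the k largest pre-activations (ReLU'd), zero the rest.
    threshold = np.sort(pre)[-k]
    z = np.where(pre >= threshold, np.maximum(pre, 0.0), 0.0)
    loss = 0.0
    for m_i in prefixes:
        z_prefix = np.zeros_like(z)
        z_prefix[:m_i] = z[:m_i]          # only the first m_i latents may fire
        x_hat = z_prefix @ W_dec + b_dec  # each prefix reconstructs alone
        loss += np.sum((x - x_hat) ** 2)  # summed squared error over levels
    return loss
```

Because the smallest prefix must reconstruct the input by itself, its latents are pushed toward general, high-level features; later latents only need to explain the residual detail.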

This method introduces modest computational overhead from the multiple reconstruction objectives. Unlike conventional sparse autoencoders, which often fixate on minimizing the number of active latents even when this yields pathological feature representations, Matryoshka SAEs foster compact yet rich feature spaces that retain high-level abstractions.

Empirical Validation

The research demonstrates the efficacy of Matryoshka SAEs across settings ranging from synthetic toy models and TinyStories to the larger Gemma-2-2B model. The results show clear improvements in disentangling latent representations, reducing feature absorption, and performing sparse probing and targeted concept erasure. This evidence underscores the model's ability to learn interpretable features at multiple levels of abstraction, bridging a critical gap in interpretability research without succumbing to the trade-offs typically observed in large SAEs.
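To make the sparse-probing evaluation concrete, here is a simplified stand-in for that kind of protocol: select the few latents that best separate a binary concept, then classify using those latents alone. The function name, the mean-difference selection rule, and the nearest-class-mean classifier are assumptions for illustration; the paper's actual probing setup may differ.

```python
import numpy as np

def sparse_probe_accuracy(latents, labels, n_features=1):
    """Toy sparse probe: pick the n_features latents whose mean activation
    differs most between the two classes, then classify each example by
    which class mean is nearer on those latents alone.

    latents: (n_examples, n_latents) SAE activations
    labels:  (n_examples,) binary concept labels (0 or 1)
    """
    mu1 = latents[labels == 1].mean(axis=0)
    mu0 = latents[labels == 0].mean(axis=0)
    # Latents with the largest class-mean gap are the most concept-aligned.
    top = np.argsort(np.abs(mu1 - mu0))[-n_features:]
    sel = latents[:, top]
    d1 = np.linalg.norm(sel - mu1[top], axis=1)  # distance to class-1 mean
    d0 = np.linalg.norm(sel - mu0[top], axis=1)  # distance to class-0 mean
    preds = (d1 < d0).astype(int)
    return (preds == labels).mean()
```

An SAE whose individual latents cleanly track concepts (rather than splitting or absorbing them) scores well under probes restricted to very few latents, which is why the metric rewards the hierarchy Matryoshka SAEs learn.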

Implications and Future Directions

Matryoshka SAEs represent a promising direction for scaling interpretability mechanisms in neural networks, showing how hierarchical constructs can address common pitfalls in sparse representation learning. The proposal not only alleviates current limitations in interpretability research but also sets a precedent for exploring further hierarchical or nested structures within deep learning models.

The ability to scale while maintaining interpretability in increasingly complex architectures could give researchers and practitioners tools for better understanding and steering neural networks. Future work could optimize Matryoshka SAE configurations, test the approach across different architectural paradigms, and assess the human interpretability of the extracted latents.

Conclusion

This study marks substantial progress on the interpretability challenges posed by traditional sparse autoencoders. By leveraging hierarchical learning, Matryoshka SAEs balance the preservation of general and specific features across scales, directly benefiting interpretability-focused model analysis. With improved downstream task performance and reduced risk of feature distortion, they could inform several aspects of AI development, particularly the understanding and steering of LLMs.
