Scale sparse autoencoder decomposition of contribution vectors to LLM-scale feature spaces

Develop scalable training procedures and architectures to apply sparse autoencoder decomposition of contribution vectors at large language model scales, enabling CODEC to operate on high-dimensional LLM features for causal interpretability.

Background

The authors benchmark contribution computation on a 5.44B-parameter LLM (Gemma-3n), showing that contribution calculation is tractable across layers and sequence lengths. However, they note that the subsequent sparse autoencoder (SAE) training step is separate and has not yet been scaled to LLM-sized feature spaces.

Thus, while contribution computation generalizes to LLMs, scaling the decomposition step to handle the dimensionality of LLM features remains unresolved.
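To make the unresolved step concrete, the sketch below shows the core computation an SAE performs on a batch of contribution vectors: an overcomplete ReLU encoder, a linear decoder, and a reconstruction-plus-L1 loss. All names and dimensions (`d_model`, `d_dict`, `lambda_l1`) are illustrative assumptions, not taken from the paper; at LLM scale, `d_model` and `d_dict` grow by orders of magnitude, which is exactly the scaling challenge left open.

```python
import numpy as np

# Toy sparse autoencoder over contribution vectors (illustrative
# dimensions only; the paper does not specify these hyperparameters).
rng = np.random.default_rng(0)

d_model, d_dict = 64, 256          # feature dim, overcomplete dictionary size
W_enc = rng.normal(0, 0.02, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.02, (d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode with ReLU (sparse codes), then decode back to feature space."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)   # non-negative sparse codes
    x_hat = z @ W_dec + b_dec                # linear reconstruction
    return z, x_hat

def sae_loss(x, lambda_l1=1e-3):
    """Mean-squared reconstruction error plus an L1 sparsity penalty."""
    z, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = lambda_l1 * np.mean(np.abs(z))
    return recon + sparsity

# A batch of (hypothetical) contribution vectors.
x = rng.normal(size=(8, d_model))
loss = sae_loss(x)
```

Training minimizes `sae_loss` by gradient descent over the weight matrices; the open question is doing this efficiently when the dictionary is sized for LLM feature spaces rather than a toy `d_dict` of 256.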

References

"We note that SAE training is a separate computational step, independent of contribution computation, and scaling it to LLM-scale features is left to future work."

Causal Interpretation of Neural Network Computations with Contribution Decomposition  (2603.06557 - Melander et al., 6 Mar 2026) in Supplemental Material, Section “Runtime measurements and Complexity,” final paragraph