A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
Abstract: We respond to the paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause "interpretability illusions". We first review Makelov et al. (2023)'s technical notion of what an "interpretability illusion" is, and then we show that even intuitive and desirable explanations can qualify as illusions in this sense. As a result, their method of discovering "illusions" can reject explanations they consider "non-illusory". We then argue that the illusions Makelov et al. (2023) see in practice are artifacts of their training and evaluation paradigms. We close by emphasizing that, though we disagree with their core characterization, Makelov et al. (2023)'s examples and discussion have undoubtedly pushed the field of interpretability forward.
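To make the object of this debate concrete, below is a minimal sketch of a subspace interchange intervention, the operation at the heart of DAS (Geiger et al. 2023). This is an illustrative reconstruction, not code from either paper: the function name, the use of PyTorch, and the toy dimensions are all our assumptions. The idea is to rotate activations into a learned orthogonal basis, overwrite the first k coordinates of the base activation with those of a source activation, and rotate back.

```python
# Illustrative sketch of a subspace interchange intervention (not code from
# Makelov et al. 2023 or Geiger et al. 2023; all names are hypothetical).
import torch

def subspace_interchange(h_base, h_source, R, k):
    """Swap a k-dimensional learned subspace of h_source into h_base.

    h_base, h_source: activation vectors of shape (d,)
    R: orthogonal (d, d) rotation matrix; in DAS, R is learned by
       optimizing the intervened model's counterfactual behavior
    k: dimensionality of the candidate subspace
    """
    z_base = R @ h_base        # express both activations in the rotated basis
    z_source = R @ h_source
    z_base[:k] = z_source[:k]  # interchange the candidate subspace
    return R.T @ z_base        # rotate back to the model's original basis

# Toy usage with a random orthogonal basis over a 16-dimensional activation.
d, k = 16, 4
R = torch.linalg.qr(torch.randn(d, d)).Q
h_new = subspace_interchange(torch.randn(d), torch.randn(d), R, k)
```

In DAS, R is trained so that patching this subspace from a source input into a base input moves the model's output toward the counterfactual predicted by a hypothesized causal variable; Makelov et al. (2023)'s "illusion" claim, and our reply, concern what success at this optimization does and does not establish.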
References
- CEBaB: Estimating the causal effects of real-world concepts on NLP model behavior. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2205.14140.
- Language models can explain neurons in language models. 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html.
- Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- Causal scrubbing: A method for rigorously testing interpretability hypotheses. Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.
- Are neural nets modular? Inspecting functional modularity through differentiable weight masks. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=7uVcpu-gMD.
- Sparse interventions in language models with differentiable masking. In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2022. URL https://aclanthology.org/2022.blackboxnlp-1.2.
- Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics (TACL), 2021. URL https://arxiv.org/abs/2006.00995.
- Toy models of superposition, 2022. URL https://arxiv.org/abs/2209.10652.
- CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, 2021. URL https://doi.org/10.1162/coli_a_00404.
- Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the 2020 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2020. URL https://arxiv.org/abs/2004.14623.
- Causal abstractions of neural networks. In Advances in Neural Information Processing Systems, volume 34, pp. 9574–9586, 2021. URL https://papers.nips.cc/paper/2021/hash/4f5c422f4d49a5a807eda27434231040-Abstract.html.
- Inducing causal structure for interpretable neural networks. In International Conference on Machine Learning (ICML), 2022. URL https://proceedings.mlr.press/v162/geiger22a.html.
- Finding alignments between interpretable causal variables and distributed neural representations, 2023. URL https://arxiv.org/abs/2303.02536.
- Rigorously assessing natural language explanations of neurons. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023. URL https://arxiv.org/abs/2309.10312.
- Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. URL https://arxiv.org/abs/1412.6980.
- An interpretability illusion for activation patching of arbitrary subspaces. LessWrong, 2023. URL https://www.lesswrong.com/posts/RFtkRXHebkwxygDe2/an-interpretability-illusion-for-activation-patching-of#comments.
- Is this the subspace you are looking for? An interpretability illusion for subspace activation patching. arXiv preprint arXiv:2311.17030, 2023. URL https://arxiv.org/abs/2311.17030.
- Parallel Distributed Processing. Volume 2: Psychological and Biological Models. MIT Press, Cambridge, MA, 1986.
- Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2013. URL https://aclanthology.org/N13-1090/.
- Emergent linear representations in world models of self-supervised sequence models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023. URL https://aclanthology.org/2023.blackboxnlp-1.2.
- Zoom in: An introduction to circuits. Distill, 2020. URL https://distill.pub/2020/circuits/zoom-in.
- The linear representation hypothesis and the geometry of large language models, 2023. URL https://arxiv.org/abs/2311.03658.
- Null it out: Guarding protected attributes by iterative nullspace projection. In Association for Computational Linguistics (ACL), 2020. URL https://doi.org/10.18653/v1/2020.acl-main.647.
- Parallel Distributed Processing. Volume 1: Foundations. MIT Press, Cambridge, MA, 1986.
- Neural and conceptual interpretation of PDP models. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2: Psychological and Biological Models. MIT Press, Cambridge, MA, 1986.
- Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems (NeurIPS), 2020. URL https://arxiv.org/abs/2004.12265.
- Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2211.00593.
- Interpretability at scale: Identifying causal mechanisms in Alpaca. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2305.08809.