
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments

Published 23 Jan 2024 in cs.LG, cs.AI, and cs.CL | (2401.12631v1)

Abstract: We respond to the paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause "interpretability illusions". We first review Makelov et al. (2023)'s technical notion of what an "interpretability illusion" is, and then we show that even intuitive and desirable explanations can qualify as illusions in this sense. As a result, their method of discovering "illusions" can reject explanations they consider "non-illusory". We then argue that the illusions Makelov et al. (2023) see in practice are artifacts of their training and evaluation paradigms. We close by emphasizing that, though we disagree with their core characterization, Makelov et al. (2023)'s examples and discussion have undoubtedly pushed the field of interpretability forward.

References (28)
  1. CEBaB: Estimating the causal effects of real-world concepts on NLP model behavior. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2205.14140.
  2. Language models can explain neurons in language models. 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html.
  3. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.
  4. Causal scrubbing: a method for rigorously testing interpretability hypotheses, 2022. URL https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.
  5. Are neural nets modular? inspecting functional modularity through differentiable weight masks. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=7uVcpu-gMD.
  6. Sparse interventions in language models with differentiable masking. In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2022. URL https://aclanthology.org/2022.blackboxnlp-1.2.
  7. Amnesic probing: Behavioral explanation with amnesic counterfactuals. In Proceedings of the 2020 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2020. URL https://arxiv.org/abs/2006.00995.
  8. Toy models of superposition, 2022. URL https://arxiv.org/abs/2209.10652.
  9. CausaLM: Causal Model Explanation Through Counterfactual Language Models. In Computational Linguistics, 2021. URL https://doi.org/10.1162/coli_a_00404.
  10. Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the 2020 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2020. URL https://arxiv.org/abs/2004.14623.
  11. Causal abstractions of neural networks. In Advances in Neural Information Processing Systems, volume 34, pp.  9574–9586, 2021. URL https://papers.nips.cc/paper/2021/hash/4f5c422f4d49a5a807eda27434231040-Abstract.html.
  12. Inducing causal structure for interpretable neural networks. In International Conference on Machine Learning (ICML), 2022. URL https://proceedings.mlr.press/v162/geiger22a.html.
  13. Finding alignments between interpretable causal variables and distributed neural representations, 2023. URL https://arxiv.org/abs/2303.02536.
  14. Rigorously assessing natural language explanations of neurons. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023. URL https://arxiv.org/abs/2309.10312.
  15. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. URL https://arxiv.org/abs/1412.6980.
  16. An interpretability illusion for activation patching of arbitrary subspaces. LessWrong, 2023. URL https://www.lesswrong.com/posts/RFtkRXHebkwxygDe2/an-interpretability-illusion-for-activation-patching-of#comments.
  17. Is this the subspace you are looking for? An interpretability illusion for subspace activation patching. arXiv preprint arXiv:2311.17030, 2023. URL https://arxiv.org/abs/2311.17030.
  18. Parallel Distributed Processing. Volume 2: Psychological and Biological Models. MIT Press, Cambridge, MA, 1986.
  19. Linguistic regularities in continuous space word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2013. URL https://aclanthology.org/N13-1090/.
  20. Emergent linear representations in world models of self-supervised sequence models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023. URL https://aclanthology.org/2023.blackboxnlp-1.2.
  21. Zoom in: An introduction to circuits. 2020. URL https://distill.pub/2020/circuits/zoom-in.
  22. The linear representation hypothesis and the geometry of large language models, 2023. URL https://arxiv.org/abs/2311.03658.
  23. Null it out: Guarding protected attributes by Iterative Nullspace Projection. In Association for Computational Linguistics (ACL), 2020. URL https://doi.org/10.18653/v1/2020.acl-main.647.
  24. Parallel Distributed Processing. Volume 1: Foundations. MIT Press, Cambridge, MA, 1986.
  25. P. Smolensky. Neural and conceptual interpretation of PDP models. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2: Psychological and Biological Models. MIT Press, Cambridge, MA, USA, 1986.
  26. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems (NeurIPS), 2020. URL https://arxiv.org/abs/2004.12265.
  27. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2211.00593.
  28. Interpretability at scale: Identifying causal mechanisms in Alpaca. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2305.08809.

Summary

  • The paper reinterprets 'interpretability illusions' as legitimate phenomena inherent to neural subspace operations rather than measurement artifacts.
  • It deconstructs a linear network toy example to reveal that even supposedly non-illusory vectors can exhibit measurable projection effects.
  • Empirical findings from indirect object identification tasks underscore that nullspace variations naturally arise from model input changes.

Introduction

The classical view in neural network interpretability holds that individual neurons play discrete causal roles, a premise that underlies classical interchange intervention methods such as activation patching. This paper responds to recent critiques by Makelov et al. (2023) of distributed alignment search (DAS), an approach that instead intervenes on neural subspaces rather than individual units. Makelov et al. claim that such subspace methods can yield "interpretability illusions"; the authors of the present paper challenge this assertion, arguing that what Makelov et al. term "illusions" are in fact legitimate facets of network behavior.
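The core operation under discussion can be sketched in a few lines. The following is a minimal illustration of a one-dimensional subspace interchange intervention, with hypothetical activations and direction (not the paper's actual models): the component of a "base" activation along a chosen direction is replaced with the corresponding component from a "source" activation, and the rest of the representation is left untouched.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def interchange(base_act, source_act, direction):
    """Swap the projection onto `direction` (assumed unit-norm) from source into base."""
    coeff_base = dot(base_act, direction)
    coeff_source = dot(source_act, direction)
    return [b + (coeff_source - coeff_base) * d
            for b, d in zip(base_act, direction)]

# Toy 2-D activations and an assumed unit direction along the first axis.
base = [1.0, 2.0]
source = [5.0, -3.0]
direction = [1.0, 0.0]

patched = interchange(base, source, direction)
print(patched)  # [5.0, 2.0]: first coordinate taken from source, second kept from base
```

In DAS the direction is learned rather than fixed in advance, but the intervention applied at each candidate subspace has this same swap-the-projection form.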

Defining "Interpretability Illusion"

Makelov et al.'s notion of an "interpretability illusion" rests on the idea that variations along certain neural subspaces should have no causal effect on a model's outputs. On their account, if interventions along such subspaces do appear to influence predictions, the resulting interpretations are illusory. The critique stems from the observation that nullspace projections of these subspaces, which are theoretically inert with respect to downstream computation, nevertheless show causal efficacy. The response articulates a counter-argument: these findings are not illusions but reflections of how the networks in question actually operate. The authors argue that strict orthogonality between the nullspace and the representation subspaces is neither a practical nor a theoretically necessary condition.
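The nullspace intuition behind this definition can be made concrete with toy numbers (hypothetical, not drawn from either paper): if a direction lies in the nullspace of the downstream weights, moving the activation along it cannot change the output, so any apparent causal effect attributed to that direction alone would count as an "illusion" in Makelov et al.'s sense.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Downstream linear readout: a single weight row [1.0, 1.0].
W = [[1.0, 1.0]]

# n = (1, -1)/sqrt(2) lies in W's nullspace: W @ n == 0.
n = [1.0 / math.sqrt(2), -1.0 / math.sqrt(2)]

act = [0.5, 2.0]
perturbed = [a + 3.0 * ni for a, ni in zip(act, n)]  # move along the null direction

print(matvec(W, act))                                # [2.5]
print([round(y, 6) for y in matvec(W, perturbed)])   # [2.5]: output unchanged
```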

Revisiting Makelov et al.'s Toy Example

A key technical observation is that even vectors Makelov et al. consider non-illusory can produce "illusion" effects under Makelov et al.'s own definition. Deconstructing the simple linear network example provided by Makelov et al., the authors show that a vector identified as non-illusory can still have a non-zero projection onto the nullspace, and therefore registers a measured "illusion effect." This exposes a conceptual flaw in Makelov et al.'s measurement process: the supposedly non-illusory vector would itself be classified as producing an illusion, undermining the validity of the proposed "illusion" criterion.
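The reply's point can be sketched numerically (with hypothetical weights, not the paper's exact toy network): a direction that is causally effective can still have a nonzero projection onto the downstream nullspace, so a test that flags any nonzero nullspace component would label it "illusory" even though it drives the output.

```python
import math

W_row = [1.0, 1.0]   # downstream readout weights
v = [1.0, 0.0]       # a causally effective intervention direction

# Unit vectors spanning the row space and nullspace of W_row.
norm = math.sqrt(sum(w * w for w in W_row))
r = [w / norm for w in W_row]   # (1, 1)/sqrt(2): the direction that moves the output
n = [r[1], -r[0]]               # (1, -1)/sqrt(2): satisfies W_row @ n == 0

proj_row = sum(a * b for a, b in zip(v, r))   # component that changes the output
proj_null = sum(a * b for a, b in zip(v, n))  # component the test treats as "illusory"

print(round(proj_row, 4), round(proj_null, 4))  # 0.7071 0.7071: both nonzero
```

Because `proj_row` is nonzero, interventions along `v` genuinely change the output; because `proj_null` is also nonzero, the same vector fails the proposed illusion check.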

Experimental Observations

In an empirical investigation of networks trained on tasks such as indirect object identification (IOI), the paper shows that variation along null directions of a subspace arises naturally as the model's inputs vary. The authors argue that the "illusions" Makelov et al. identify may therefore be commonplace, but that this does not diminish the relevance or validity of the associated interpretations. Far from being dismissible as illusions, these effects are inherent to how networks encode and process information.
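A toy illustration of this empirical point, using a hypothetical two-layer linear "model" rather than anything from the IOI experiments: as the input changes, the hidden activation's coordinate along the readout's null direction varies too, so movement in null directions is an ordinary consequence of how a network encodes its inputs.

```python
import math

W_in = [[1.0, 0.0], [0.0, 1.0]]   # hidden = W_in @ x (identity, for clarity)
W_out = [1.0, 1.0]                # readout weights
n = [1.0 / math.sqrt(2), -1.0 / math.sqrt(2)]  # null direction of W_out

def hidden(x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_in]

def null_coord(h):
    """Coordinate of a hidden activation along the readout's null direction."""
    return sum(a * b for a, b in zip(h, n))

# Different inputs land at different nullspace coordinates.
for x in ([1.0, 0.0], [0.0, 1.0], [2.0, -1.0]):
    print(round(null_coord(hidden(x)), 4))  # 0.7071, -0.7071, 2.1213
```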

Conclusion

The authors conclude that while their analysis diverges from Makelov et al.’s framing of interpretability illusions, the debate has been fruitful, driving advancements in understanding interpretability. The discourse highlights the importance of recognizing and accounting for the nuanced geometrical relationships within neural subspaces, fostering further discussion and inquiry within the field of AI interpretability.
