A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
Abstract: We respond to the paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause "interpretability illusions". We first review Makelov et al. (2023)'s technical notion of what an "interpretability illusion" is, and then we show that even intuitive and desirable explanations can qualify as illusions in this sense. As a result, their method of discovering "illusions" can reject explanations they consider "non-illusory". We then argue that the illusions Makelov et al. (2023) see in practice are artifacts of their training and evaluation paradigms. We close by emphasizing that, though we disagree with their core characterization, Makelov et al. (2023)'s examples and discussion have undoubtedly pushed the field of interpretability forward.
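To make the object of this debate concrete, below is a minimal sketch of a subspace interchange intervention, the operation at the heart of DAS (Geiger et al. 2023). This is an illustrative reconstruction, not code from either paper: the function name, the use of PyTorch, and the toy dimensions are all our assumptions. The idea is to rotate activations into a learned orthogonal basis, overwrite the first k coordinates of the base activation with those of a source activation, and rotate back.

```python
# Illustrative sketch of a subspace interchange intervention (not code from
# Makelov et al. 2023 or Geiger et al. 2023; all names are hypothetical).
import torch

def subspace_interchange(h_base, h_source, R, k):
    """Swap a k-dimensional learned subspace of h_source into h_base.

    h_base, h_source: activation vectors of shape (d,)
    R: orthogonal (d, d) rotation matrix; in DAS, R is learned by
       optimizing the intervened model's counterfactual behavior
    k: dimensionality of the candidate subspace
    """
    z_base = R @ h_base        # express both activations in the rotated basis
    z_source = R @ h_source
    z_base[:k] = z_source[:k]  # interchange the candidate subspace
    return R.T @ z_base        # rotate back to the model's original basis

# Toy usage with a random orthogonal basis over a 16-dimensional activation.
d, k = 16, 4
R = torch.linalg.qr(torch.randn(d, d)).Q
h_new = subspace_interchange(torch.randn(d), torch.randn(d), R, k)
```

In DAS, R is trained so that patching this subspace from a source input into a base input moves the model's output toward the counterfactual predicted by a hypothesized causal variable; Makelov et al. (2023)'s "illusion" claim, and our reply, concern what success at this optimization does and does not establish.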
References
- CEBaB: Estimating the causal effects of real-world concepts on NLP model behavior. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2205.14140.
- Language models can explain neurons in language models. 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html.
- Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- Causal scrubbing: A method for rigorously testing interpretability hypotheses. Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.
- Are neural nets modular? Inspecting functional modularity through differentiable weight masks. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=7uVcpu-gMD.
- Sparse interventions in language models with differentiable masking. In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2022. URL https://aclanthology.org/2022.blackboxnlp-1.2.
- Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics (TACL), 2021. URL https://arxiv.org/abs/2006.00995.
- Toy models of superposition, 2022. URL https://arxiv.org/abs/2209.10652.
- CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, 2021. URL https://doi.org/10.1162/coli_a_00404.
- Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the 2020 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2020. URL https://arxiv.org/abs/2004.14623.
- Causal abstractions of neural networks. In Advances in Neural Information Processing Systems, volume 34, pp. 9574–9586, 2021. URL https://papers.nips.cc/paper/2021/hash/4f5c422f4d49a5a807eda27434231040-Abstract.html.
- Inducing causal structure for interpretable neural networks. In International Conference on Machine Learning (ICML), 2022. URL https://proceedings.mlr.press/v162/geiger22a.html.
- Finding alignments between interpretable causal variables and distributed neural representations, 2023. URL https://arxiv.org/abs/2303.02536.
- Rigorously assessing natural language explanations of neurons. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023. URL https://arxiv.org/abs/2309.10312.
- Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. URL https://arxiv.org/abs/1412.6980.
- An interpretability illusion for activation patching of arbitrary subspaces. LessWrong, 2023. URL https://www.lesswrong.com/posts/RFtkRXHebkwxygDe2/an-interpretability-illusion-for-activation-patching-of#comments.
- Is this the subspace you are looking for? An interpretability illusion for subspace activation patching. arXiv preprint arXiv:2311.17030, 2023. URL https://arxiv.org/abs/2311.17030.
- Parallel Distributed Processing. Volume 2: Psychological and Biological Models. MIT Press, Cambridge, MA, 1986.
- Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2013. URL https://aclanthology.org/N13-1090/.
- Emergent linear representations in world models of self-supervised sequence models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023. URL https://aclanthology.org/2023.blackboxnlp-1.2.
- Zoom in: An introduction to circuits. Distill, 2020. URL https://distill.pub/2020/circuits/zoom-in.
- The linear representation hypothesis and the geometry of large language models, 2023. URL https://arxiv.org/abs/2311.03658.
- Null it out: Guarding protected attributes by iterative nullspace projection. In Association for Computational Linguistics (ACL), 2020. URL https://doi.org/10.18653/v1/2020.acl-main.647.
- Parallel Distributed Processing. Volume 1: Foundations. MIT Press, Cambridge, MA, 1986.
- Neural and conceptual interpretation of PDP models. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2: Psychological and Biological Models. MIT Press, Cambridge, MA, 1986.
- Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems (NeurIPS), 2020. URL https://arxiv.org/abs/2004.12265.
- Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2211.00593.
- Interpretability at scale: Identifying causal mechanisms in Alpaca. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2305.08809.