Removing Spurious Correlation from Neural Network Interpretations
Published 3 Dec 2024 in cs.CL, cs.AI, cs.LG, stat.AP, and stat.ME | (2412.02893v1)
Abstract: Existing algorithms for identifying the neurons responsible for undesired and harmful behaviors do not consider the effects of confounders such as the topic of the conversation. In this work, we show that confounders can create spurious correlations, and we propose a new causal mediation approach that controls for the impact of the topic. In experiments with two LLMs, we study the localization hypothesis and show that, after adjusting for the effect of conversation topic, toxicity becomes less localized.