Decomposing and Editing Predictions by Modeling Model Computation
Abstract: How does the internal computation of a machine learning model transform inputs into predictions? In this paper, we introduce a task called component modeling that aims to address this question. The goal of component modeling is to decompose an ML model's prediction in terms of its components: simple functions (e.g., convolution filters, attention heads) that are the "building blocks" of model computation. We focus on a special case of this task, component attribution, where the goal is to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions, and demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that component attributions estimated with COAR directly enable model editing across five tasks, namely: fixing model errors, "forgetting" specific classes, boosting subpopulation robustness, localizing backdoor attacks, and improving robustness to typographic attacks. We provide code for COAR at https://github.com/MadryLab/modelcomponents.
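The core idea behind component attribution can be illustrated with a toy sketch: ablate random subsets of components, record how the model's output changes, and fit a linear model whose coefficients estimate each component's counterfactual effect. The setup below is hypothetical (a synthetic "model" whose output is exactly linear in its active components, with made-up names like `model_output` and `true_effects`); it is meant only to convey the regression-based attribution idea, not the full COAR algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model with k components; here the "prediction" is,
# by construction, a linear function of which components remain active.
k = 20
true_effects = rng.normal(size=k)   # ground-truth per-component effects
base_output = 1.5

def model_output(active_mask):
    # Output of the toy model when only components with mask == 1 are kept.
    return base_output + active_mask @ true_effects

# Component attribution: ablate random subsets, then regress the
# observed outputs on the binary keep/ablate masks.
n_samples = 500
masks = (rng.random((n_samples, k)) > 0.1).astype(float)  # keep ~90% each time
outputs = np.array([model_output(m) for m in masks])

# Least-squares fit: attributions[i] estimates the counterfactual effect
# of keeping component i active (an intercept column absorbs the base output).
X = np.hstack([masks, np.ones((n_samples, 1))])
coef, *_ = np.linalg.lstsq(X, outputs, rcond=None)
attributions, intercept = coef[:k], coef[k]
```

Because the toy model is exactly linear in the mask, the regression recovers the per-component effects; for a real network, the same regression yields a linear approximation of each component's counterfactual impact on the prediction.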