
Decomposing and Editing Predictions by Modeling Model Computation

Published 17 Apr 2024 in cs.LG, cs.AI, and stat.ML (arXiv:2404.11534v1)

Abstract: How does the internal computation of a machine learning model transform inputs into predictions? In this paper, we introduce a task called component modeling that aims to address this question. The goal of component modeling is to decompose an ML model's prediction in terms of its components -- simple functions (e.g., convolution filters, attention heads) that are the "building blocks" of model computation. We focus on a special case of this task, component attribution, where the goal is to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions; we demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that component attributions estimated with COAR directly enable model editing across five tasks, namely: fixing model errors, "forgetting" specific classes, boosting subpopulation robustness, localizing backdoor attacks, and improving robustness to typographic attacks. We provide code for COAR at https://github.com/MadryLab/modelcomponents.


Summary

  • The paper introduces COAR, a scalable regression-based algorithm that assigns attribution scores to individual model components.
  • COAR outperforms baselines such as Neuron Conductance across diverse architectures and datasets, including ResNets and Vision Transformers.
  • The resulting attributions enable practical model editing: precise interventions in model behavior without retraining.

Unveiling Model Predictions via Component Attribution with COAR

Introduction to Component Modeling

The internal computations of machine learning models, especially large ones, are opaque, which makes their predictions hard to understand and interpret. To address this, the paper introduces component modeling: the task of decomposing a model's predictions in terms of its individual components, such as attention heads or convolution filters.

Task Definition and the COAR Framework

Component modeling aims to build simple, interpretable estimators that predict a model's output after hypothetical modifications (ablations) to its components. This serves two purposes: understanding how individual components shape model behavior, and leveraging that understanding for practical applications such as model editing.

A pivotal special case of component modeling is component attribution. Here, each component is assigned a score, and the effect of ablating (modifying or removing) any set of components is predicted as the sum of the ablated components' scores. COAR, the key contribution of this work, is a scalable algorithm for estimating these attributions across architectures and data modalities.
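To make this concrete, here is a minimal sketch of how a component attribution is used at prediction time. Everything in it is illustrative rather than the paper's exact parameterization: the names are invented, and we assume the convention that a component's score is the predicted additive change in the output when that component is ablated.

```python
import numpy as np

# Illustrative setup: a model with C components and a learned
# attribution vector `scores` (one entry per component), where
# scores[i] is the predicted additive change in the model's output
# on a fixed example when component i is ablated.
C = 1000
rng = np.random.default_rng(0)
scores = rng.normal(scale=0.01, size=C)   # stand-in for learned attributions
unablated_output = 2.5                    # model's output with no ablation

def predict_output_after_ablation(ablated: np.ndarray) -> float:
    """Predict the counterfactual output after ablating the components
    flagged True in the boolean mask `ablated`: the unablated output
    plus the sum of the ablated components' scores."""
    return unablated_output + scores[ablated].sum()

# Example: predicted effect of ablating 25 random components.
mask = np.zeros(C, dtype=bool)
mask[rng.choice(C, size=25, replace=False)] = True
print(predict_output_after_ablation(mask))
```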

Methodology of COAR

COAR stands for COmponent Attribution via Regression. It casts attribution estimation as a regression problem: the contribution of each component to the model's output is learned directly from data. To generate training data, COAR ablates random subsets of components and records the resulting change in the model's output, yielding a dataset of (ablation set, outcome) pairs from which the attributions are fit.
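The following is a minimal, self-contained sketch of this regression setup, not the reference implementation (which is at the GitHub link above). For a real model, `output_under_ablation` would run a forward pass with the masked components (e.g., conv filters or attention heads) zeroed out; here it is stubbed with a synthetic linear ground truth so the sketch runs end to end.

```python
import numpy as np
from sklearn.linear_model import Ridge

C = 500                                    # number of components
rng = np.random.default_rng(0)
true_effects = rng.normal(size=C)          # synthetic per-component effects
full_output = 10.0                         # synthetic unablated output

def output_under_ablation(mask: np.ndarray) -> float:
    # Stand-in for a forward pass with components where mask is True
    # zeroed out; a real implementation would hook into the network.
    return full_output - true_effects[mask].sum() + rng.normal(scale=0.1)

# 1) Sample random ablation sets and record the resulting outputs.
num_trials, ablate_frac = 2000, 0.05
masks = rng.random((num_trials, C)) < ablate_frac
outputs = np.array([output_under_ablation(m) for m in masks])

# 2) Regress outputs on ablation masks: each coefficient is the
#    estimated effect of ablating the corresponding component.
reg = Ridge(alpha=1.0).fit(masks.astype(float), outputs)
attributions = reg.coef_      # recovers roughly -true_effects
baseline = reg.intercept_     # recovers roughly full_output
```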

Empirical Validation

The effectiveness of COAR is demonstrated across datasets (such as ImageNet and CIFAR-10) and architectures (such as ResNets and Vision Transformers). Results indicate that:

  • COAR attributions accurately predict the counterfactual impact of component ablations (a minimal evaluation sketch follows this list).
  • Component attributions offer insight into how specific parts of the model contribute to its predictions.
  • COAR consistently outperforms existing methods such as Neuron Conductance and Internal Influence at predicting these impacts, while remaining scalable.
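Continuing the sketch above (and reusing its names), one way to evaluate attribution quality, in the spirit of the paper's evaluation, is to correlate predicted and actual outputs on held-out random ablation sets:

```python
# Held-out random ablation sets, disjoint from the training trials.
test_masks = rng.random((500, C)) < ablate_frac
actual = np.array([output_under_ablation(m) for m in test_masks])
predicted = reg.predict(test_masks.astype(float))

# Correlation between predicted and observed ablation outcomes.
r = np.corrcoef(predicted, actual)[0, 1]
print(f"predicted-vs-actual correlation: {r:.3f}")
```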

Practical Applications and Model Editing

Moving beyond interpretation, COAR's attributions directly enable model editing: targeted interventions in a model's behavior without retraining. The paper demonstrates this across five tasks: fixing specific model errors, "forgetting" specific classes, boosting subpopulation robustness, localizing backdoor attacks, and improving robustness to typographic attacks. In each case, the attributions indicate which components to modify, enabling precise, targeted edits.
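As a hypothetical illustration of how an edit could use these scores (the paper's five editing tasks each have their own setup), suppose the regression target were the correct-class margin on a set of mispredicted examples. Under the sign convention from the earlier sketches, editing then reduces to ablating the components whose scores predict the largest gain:

```python
import numpy as np

def components_to_ablate(attributions: np.ndarray, k: int = 10) -> np.ndarray:
    """Pick the k components whose ablation is predicted to most
    increase the target quantity (e.g., correct-class margin on
    mispredicted examples), given attributions where attributions[i]
    is the predicted change when component i is ablated."""
    return np.argsort(attributions)[-k:][::-1]   # largest scores first

# Usage with stand-in scores; a hypothetical ablate(model, idxs) helper
# would then zero out those components' weights -- no retraining needed.
rng = np.random.default_rng(0)
attributions = rng.normal(size=1000)
print(components_to_ablate(attributions, k=5))
```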

Concluding Thoughts

The combination of component modeling and the COAR algorithm represents a significant advance in interpretable machine learning. By offering a window into the otherwise opaque process by which models form predictions, and a tool for deliberately modifying those predictions, this approach paves the way for more understandable, fair, and controllable AI systems. Future work might extend the linear, regression-based attributions to non-linear component models for greater accuracy, and investigate forms of component intervention beyond zeroing out weights, potentially allowing even finer control over model behavior.
