Causal Interpretation of Neural Network Computations with Contribution Decomposition

Published 6 Mar 2026 in cs.LG and q-bio.NC | (2603.06557v1)

Abstract: Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct approach to examine how hidden neurons act to drive network outputs. We introduce CODEC (Contribution Decomposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that they progressively decorrelate positive and negative effects on network outputs. We further show that decomposing contributions into sparse modes enables greater control and interpretation of intermediate layers, supporting both causal manipulations of network output and human-interpretable visualizations of distinct image components that combine to drive that output. Finally, by analyzing state-of-the-art models of neural activity in the vertebrate retina, we demonstrate that CODEC uncovers combinatorial actions of model interneurons and identifies the sources of dynamic receptive fields. Overall, CODEC provides a rich and interpretable framework for understanding how nonlinear computations evolve across hierarchical layers, establishing contribution modes as an informative unit of analysis for mechanistic insights into artificial neural networks.


Authors (5)

Summary

The paper introduces CODEC, a method leveraging sparse autoencoders to extract sparse contribution modes that reveal the causal drivers of neural network outputs.
It employs gradient-based attribution and PCA to quantify hidden units’ causal effects, demonstrating superior interpretability over traditional activation-based methods.
The framework enables precise output control via targeted ablation experiments and extends its applicability to both artificial and biological neural networks.

Causal Contribution Decomposition: Mechanistic Analysis of Neural Network Computation

Introduction

The paper "Causal Interpretation of Neural Network Computations with Contribution Decomposition" (2603.06557) establishes a principled framework for dissecting the causal roles of hidden units in both artificial and biological neural networks. The proposed method, CODEC (Contribution Decomposition), leverages sparse autoencoders to extract sparse computational motifs—termed "contribution modes"—from hidden-layer contributions, surpassing traditional activation-based interpretability methods in identifying units that are necessary and sufficient for specific outputs. CODEC elucidates the transformation of input to output through cascading nonlinearities, offering both causal and human-interpretable decompositions.

Theoretical Framework: From Receptive and Projective Fields to Contribution Analysis

The paper reframes network interpretability using concepts adapted from systems neuroscience: each hidden component’s action is viewed as a composition of its receptive field (input sensitivity) and its projective field (output influence). This compositional perspective is operationalized via contribution analysis, which quantifies each unit’s effect on outputs, distinguishing causal drivers from irrelevant representational activity.

Figure 1: Illustration of receptive-projective field composition in biological and artificial circuits, defining contributions of intermediates to computation.

Most prior interpretability techniques (e.g., saliency mapping, concept activation vector analysis) quantify network representations but do not address the causal integration of features across units. CODEC advances attribution by capturing how groups of hidden neurons act in concert to construct outputs, targeting scalar functions of the output (e.g., top logit, entropy) to enable tractable decomposition.

Quantification of Hidden Contributions: Gradient-Based Mechanisms

Contribution computation uses adaptations of Integrated Gradients, ActGrad, and SmoothGrad to attribute output changes to hidden units. The paper standardizes the approach by summing spatially over convolutional feature maps, yielding per-channel causal contributions. Integrated Gradients is favored for its completeness guarantee: contributions sum to the scalar output (measured relative to the baseline), allowing rigorous assignment of causal effects.
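The completeness property can be checked on a toy example. The following is a minimal sketch, not the paper's implementation: it assumes a hypothetical scalar readout `f` of a single hidden layer with a known analytic gradient, an all-zeros baseline, and a midpoint Riemann sum for the path integral. The names `f`, `grad_f`, and `channel_contributions` are illustrative only.

```python
import numpy as np

def f(h, w):
    """Toy scalar readout of a hidden layer (stand-in for, e.g., the top logit)."""
    return float(np.sum(w * np.tanh(h)))

def grad_f(h, w):
    """Analytic gradient of f with respect to the hidden activations h."""
    return w * (1.0 - np.tanh(h) ** 2)

def integrated_gradients(h, w, baseline=None, steps=256):
    """Integrated Gradients at the hidden layer along a straight path from the baseline."""
    if baseline is None:
        baseline = np.zeros_like(h)  # assumption: zero baseline
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule on [0, 1]
    avg_grad = np.zeros_like(h)
    for a in alphas:
        avg_grad += grad_f(baseline + a * (h - baseline), w)
    avg_grad /= steps
    return (h - baseline) * avg_grad

def channel_contributions(h, w, **kwargs):
    """Sum the per-position contributions over the spatial axes: (C, H, W) -> (C,)."""
    return integrated_gradients(h, w, **kwargs).sum(axis=(1, 2))

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4, 4))  # hypothetical feature map: 8 channels, 4x4 positions
w = rng.normal(size=(8, 4, 4))
contrib = channel_contributions(h, w)

# Completeness: contributions sum to f(h) - f(baseline), up to integration error.
completeness_gap = abs(contrib.sum() - (f(h, w) - f(np.zeros_like(h), w)))
```

In a real network the gradient would come from automatic differentiation rather than a closed form, and the choice of baseline and number of integration steps would need to be validated for the model at hand.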

Figure 2: Pipeline for contribution computation in ResNet-50, spatial maps of channel activations vs. contributions, and summary of positive/negative/net channel effects.

Spatially aggregated channel contributions across datasets generate matrices amenable to further analysis, such as PCA for dimensionality assessment and autoencoder-based population motif discovery.

Structural Evolution of Contributions in Deep Networks

Layerwise analysis reveals several key emergent properties:

Sparsity: Hidden-unit contributions are consistently sparser than activations (quantified via the Hoyer index), especially in deeper layers. This indicates that, for each input, computation is supported by a small subset of relevant units.
Sign Decorrelation: Positive and negative contributions within individual channels are highly correlated in early layers but become increasingly decorrelated deeper in the network. Thus, units transition from mixed excitatory/inhibitory function to specialized causal roles.
Dimensionality: Contributions display higher principal component dimensionality than activations, implying greater combinatorial diversity in causal effects.

Figure 3: Evolution of sparsity, sign decorrelation, and dimensionality of channel contributions across ResNet-50 layers.
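The three layerwise statistics above can be sketched in a few lines. The Hoyer index below follows its standard definition. The other two measures are assumptions made for illustration, since this summary does not state the exact formulas: dimensionality is computed as the participation ratio of the PCA eigenvalues, and sign correlation as the Pearson correlation between a channel's positive total and the magnitude of its negative total across inputs.

```python
import numpy as np

def hoyer_sparsity(x):
    """Hoyer index: 0 for a uniform vector, 1 for a one-hot vector."""
    x = np.abs(np.asarray(x, dtype=float)).ravel()
    n = x.size
    l2 = np.linalg.norm(x)
    if l2 == 0.0 or n < 2:
        return 0.0
    return float((np.sqrt(n) - x.sum() / l2) / (np.sqrt(n) - 1.0))

def participation_ratio(X):
    """Effective dimensionality (sum(eig))^2 / sum(eig^2) of an (n_samples, n_channels) matrix.

    Assumption: one common PCA-based dimensionality measure, not necessarily the paper's.
    """
    X = np.asarray(X, dtype=float)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return float(eig.sum() ** 2 / np.sum(eig ** 2))

def sign_correlation(pos_total, neg_total):
    """Correlation across inputs between one channel's positive and |negative| totals."""
    return float(np.corrcoef(pos_total, np.abs(neg_total))[0, 1])

one_hot = hoyer_sparsity([0, 0, 1, 0])   # 1.0: a single unit carries everything
uniform = hoyer_sparsity([1, 1, 1, 1])   # 0.0: all units contribute equally

# Rank-1 data varies along a single direction, so its participation ratio is about 1.
rank_one = np.outer(np.arange(10.0), np.ones(3))
dim = participation_ratio(rank_one)
```

Applying these functions to the contribution matrix and to the activation matrix of the same layer gives the kind of side-by-side comparison described in the list.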

Sparse Autoencoder Decomposition: Extraction of Contribution Modes

CODEC applies sparse autoencoders to matrices of channel contributions, identifying high-fidelity modes that reconstruct network output with $R^2 \sim 0.85$. Each mode corresponds to a motif of coordinated channel activity causally driving specific outputs. Correlation analysis demonstrates that contribution modes are more tightly aligned with class semantics than activation modes, especially at intermediate layers.
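To make the decomposition step concrete, here is a minimal sketch of a sparse autoencoder trained on a samples-by-channels contribution matrix. It is a generic formulation (ReLU encoder, linear decoder, L1 penalty on the codes, plain gradient descent) run on synthetic data. The architecture, penalty weight, learning rate, and the function name `train_sparse_autoencoder` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def train_sparse_autoencoder(X, n_modes, l1=1e-3, lr=1e-2, steps=2000, seed=0):
    """Fit X ~ relu(X @ W_enc + b_enc) @ W_dec + b_dec with an L1 penalty on the codes."""
    rng = np.random.default_rng(seed)
    n, c = X.shape
    W_enc = 0.1 * rng.normal(size=(c, n_modes))
    b_enc = np.zeros(n_modes)
    W_dec = 0.1 * rng.normal(size=(n_modes, c))  # each row is one mode's channel loadings
    b_dec = np.zeros(c)
    history = []  # mean squared reconstruction error per step
    for _ in range(steps):
        pre = X @ W_enc + b_enc
        z = np.maximum(pre, 0.0)          # non-negative sparse codes
        err = z @ W_dec + b_dec - X
        history.append(float(np.mean(np.sum(err ** 2, axis=1))))
        # Gradients of mean reconstruction error plus l1 * mean code magnitude.
        d_xhat = 2.0 * err / n
        d_pre = (d_xhat @ W_dec.T + l1 / n) * (pre > 0)
        W_dec -= lr * (z.T @ d_xhat)
        b_dec -= lr * d_xhat.sum(axis=0)
        W_enc -= lr * (X.T @ d_pre)
        b_enc -= lr * d_pre.sum(axis=0)
    return (W_enc, b_enc, W_dec, b_dec), history

# Synthetic "contribution matrix": sparse non-negative codes mixed through 8 modes.
rng = np.random.default_rng(1)
true_codes = rng.uniform(size=(200, 8)) * (rng.uniform(size=(200, 8)) < 0.25)
true_modes = rng.normal(size=(8, 16))
X = true_codes @ true_modes

params, history = train_sparse_autoencoder(X, n_modes=8)
W_enc, b_enc, W_dec, b_dec = params
codes = np.maximum(X @ W_enc + b_enc, 0.0)
X_hat = codes @ W_dec + b_dec
r2 = 1.0 - np.sum((X - X_hat) ** 2) / np.sum((X - X.mean(axis=0)) ** 2)
```

Here the rows of `W_dec` play the role of contribution modes and `codes` gives each input's mode loadings. The `r2` value measures reconstruction of the contribution matrix itself in this toy setting and should not be read as reproducing the figure reported above.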

Figure 4: Schematic and example of sparse mode loadings, showing strong class correlation in contribution modes.

Modes are robust to autoencoder hyperparameters and generalize across architectures. The loadings of these motifs are highly correlated with semantic classes in ImageNet, capturing the necessary dimensions for output-specific control.

Figure 5: Histograms and statistics quantifying mode/class correlations—contribution modes outperform activation modes and individual channels.

Causal Manipulation and Network Control via Modes

The identification of contribution modes enables direct perturbation experiments. Targeted ablation or preservation of channels corresponding to a specific mode for a class produces precise changes in classification performance. This yields strong specificity: ablation of only 2% of salient channels from the top modes eliminates target-class accuracy without affecting off-target classes, outperforming activation-based strategies.
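The ablation and preservation protocol can be sketched as a channel-masking operation. The selection rule used here, ranking channels by the absolute value of a mode's loading and keeping the top fraction, is an assumption made for illustration. The summary does not specify how salient channels are chosen, and the helper names are hypothetical.

```python
import numpy as np

def top_channels(mode_loading, fraction=0.02):
    """Indices of the channels with the largest absolute loading in one mode."""
    k = max(1, int(round(fraction * mode_loading.size)))
    return np.argsort(-np.abs(mode_loading))[:k]

def ablate(h, channels):
    """Zero the selected channels of a (C, H, W) feature map (necessity test)."""
    out = h.copy()
    out[channels] = 0.0
    return out

def preserve(h, channels):
    """Zero every channel except the selected ones (sufficiency test)."""
    out = np.zeros_like(h)
    out[channels] = h[channels]
    return out

# Hypothetical mode over 100 channels with two strongly loaded channels.
mode = np.zeros(100)
mode[[3, 70]] = [5.0, -4.0]
selected = top_channels(mode, fraction=0.02)  # 2% of 100 channels -> 2 channels

h = np.ones((100, 2, 2))
h_ablated = ablate(h, selected)
h_preserved = preserve(h, selected)
```

In an actual experiment the masked feature map would be passed through the remaining layers, and accuracy on the target class would be compared with accuracy on off-target classes.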

Figure 6: Mode-guided ablation and preservation experiments establish necessity/sufficiency of identified channels for target class output.

Ablation specificity increases with depth, indicating a transition to semantically localized causal mechanisms. The same mode-based protocol manipulates superordinate taxonomies (e.g., "dog" classes), showing general applicability to higher-level representation control.

Visualization: Input Mapping through Contribution Modes

The method extends traditional saliency analysis by mapping input regions driving mode contributions, decomposing the standard input-output gradient into mode-specific pathways. Contribution maps reveal interpretable visual features (e.g., parts, textures) causally linked to classification output, aspects often missed in activation-based explanations.
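The idea of splitting an input gradient into pathways follows from the chain rule, and a toy example makes it explicit. This sketch assumes a one-hidden-layer ReLU network and treats each mode as a hard group of channels. Real contribution modes have graded loadings over channels, so the grouping below is a simplification for illustration and not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 12, 6
W1 = rng.normal(size=(n_hidden, n_in))  # input -> hidden channels
w2 = rng.normal(size=n_hidden)          # hidden channels -> scalar output
x = rng.normal(size=n_in)

pre = W1 @ x
gate = (pre > 0).astype(float)          # ReLU derivative at this input

# Full input gradient of y = w2 . relu(W1 x), by the chain rule.
full_grad = (w2 * gate) @ W1

# Hypothetical assignment of channels to two modes (illustration only).
groups = {"mode_a": [0, 1, 2], "mode_b": [3, 4, 5]}

def pathway_gradient(channels):
    """Input gradient routed only through the given hidden channels."""
    mask = np.zeros(n_hidden)
    mask[channels] = 1.0
    return (w2 * gate * mask) @ W1

maps = {name: pathway_gradient(ch) for name, ch in groups.items()}

# Because the groups partition the channels, the pathway maps sum to the full gradient.
recombined = sum(maps.values())
```

Each entry of `maps` is a saliency map restricted to one pathway, which is the sense in which a single input-output gradient can be read as a sum of mode-specific components.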

Figure 7: Visualization of input regions contributing through modes across classes, demonstrating compositional, interpretable features (e.g., shiny wood, hands, etc.).

Application to Biological Neural Networks

CODEC is applied to convolutional models fit to retinal ganglion cell (RGC) responses under natural stimuli, using surprisal as a target. Modes correspond to combinatorial activity of interneurons—mirroring biological mechanisms for dynamic receptive field generation. Clustering of RGCs in mode space recovers known functional classes, and analysis of instantaneous receptive fields under mode activation recapitulates experimentally observed diversity (from center-surround to oriented textures).

Figure 8: Decomposition of retinal CNN shows that sparse contribution modes generate dynamic receptive fields and drive cell clusters mirroring biological interneuron population functions.

CODEC Generalization to Vision Transformers

Application to ViT-B demonstrates that contributions, unlike activations, remain sparse across token features, MLP layers, and attention layers. Contribution modes again yield greater specificity in output manipulation via ablation, despite differences in computational strategy (transformers lack a spatially equivariant inductive bias). Attention-head analysis finds minimal specialization, with causal contributions distributed at a fine scale across heads.

Implications, Limitations, and Future Directions

CODEC establishes sparse, causally interpretable units—contribution modes—as foundational for network analysis, bridging the gap between human conceptual reasoning and network mechanistic function. The framework supports compositionally explicit manipulation in both artificial and biological systems, providing a pathway to principled explainable AI and neuroscientific hypothesis generation. It directly augments prior concept-based explainability [Kim et al., TCAV] and dictionary learning approaches for monosemantic decomposition [Bricken et al., 2023], but critically emphasizes causal necessity/sufficiency, not just representation.

Practically, the mechanism enables targeted control of model outputs (for safe AI design, robustness, and transfer), and theoretically, suggests that deep networks internally organize computation into sparse, high-dimensional motifs corresponding to output behaviors.

Limitations include a focus on certain architectures (primarily CNNs, where attribution completeness fully holds), only partial analysis of sequence models and LLMs, and the computational cost of contribution analysis in large-scale models. The framework is architecture-agnostic in principle, but further research is needed to find optimal reduction strategies for transformer architectures. Extending CODEC to LLMs and multitask systems could facilitate modular compositionality in neural computation design.

Conclusion

Contribution Decomposition via CODEC represents a rigorous step toward mechanistic causal interpretability in neural networks. By focusing on contributions rather than activations and extracting sparse modes representing coordinated causal actions, the framework facilitates control, visualization, and biological relevance beyond what is accessible with standard representation-based interpretability. This approach lays the groundwork for principled manipulation and understanding of complex hierarchical neural computations across domains.
