Determine the optimal spatial reduction strategy for applying CODEC to Vision Transformers

Determine the optimal spatial reduction strategy for applying Contribution Decomposition (CODEC) to Vision Transformers (ViTs). In particular, determine how token-level information should be aggregated into pseudo-channels for contribution computation and sparse autoencoder decomposition, going beyond the current heuristic of treating tokens as spatial positions and summing over them.

Background

In extending CODEC to ViT-B, the authors take a straightforward approach: they treat tokens as the spatial dimension and sum over tokens to obtain pseudo-channels analogous to CNN channels. They observe that while contribution modes remain more causally informative than activation modes, overall perturbation efficacy is weaker than in CNNs.

They attribute this, in part, to the lack of an explicit spatially equivariant bias in ViTs, implying that their simple spatial reduction may be suboptimal and that a different aggregation scheme could better capture causal organization in ViTs.
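As a rough illustration of the baseline reduction described above, the sketch below collapses a grid of per-token values into one value per embedding dimension by summing over tokens. It is a minimal example under stated assumptions, not the authors' implementation: the function name `sum_over_tokens`, the nested-list layout (one row per token, one column per embedding dimension), and the toy numbers are all hypothetical, and details such as how a class token is handled are not specified in the text.

```python
from typing import List


def sum_over_tokens(token_values: List[List[float]]) -> List[float]:
    """Collapse a (num_tokens, embed_dim) grid into embed_dim pseudo-channels.

    Hypothetical helper: each row is one token, each column one embedding
    dimension. Tokens play the role of spatial positions and are summed out.
    """
    if not token_values:
        raise ValueError("expected at least one token")
    embed_dim = len(token_values[0])
    if any(len(row) != embed_dim for row in token_values):
        raise ValueError("all tokens must share the same embedding dimension")
    # zip(*rows) yields one column per embedding dimension,
    # each column holding that dimension's value for every token.
    return [sum(column) for column in zip(*token_values)]


# Toy input (made-up values): 3 tokens, 2 embedding dimensions.
toy = [
    [1.0, -2.0],
    [0.5, 0.0],
    [2.5, 4.0],
]
pseudo_channels = sum_over_tokens(toy)
print(pseudo_channels)
```

With an array library the same reduction would be a single sum along the token axis. Any alternative aggregation scheme would replace the plain sum inside this function while keeping the same input and output shapes.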

References

"We leave an exploration of the optimal spatial reduction strategy for ViTs to future work."

Causal Interpretation of Neural Network Computations with Contribution Decomposition (2603.06557 - Melander et al., 6 Mar 2026), in Supplemental Material, Section “CODEC on ViTs,” concluding sentence of the overview preceding the Sparsity/Correlation/Perturbation analyses