Metrics balancing sparsity, fidelity, and mechanistic completeness

Develop evaluation metrics for mechanistic interpretability that jointly balance sparsity, fidelity, and mechanistic completeness, accounting for the trade-off between interpretable sparse feature decompositions and complete representation of a model's genuine mechanisms.

Background

Sparse autoencoders (SAEs) and related methods promote monosemantic, interpretable features, but risk missing the distributed components of a model's true mechanisms. Conversely, dense representations may capture mechanisms more completely but hinder interpretability.
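Two of the three desiderata are routinely measured in practice: sparsity via the mean number of active features per sample (L0), and fidelity via the fraction of variance unexplained (FVU) of the reconstruction. The sketch below illustrates both on toy data; the activations, dictionary codes, and linear decoder are hypothetical stand-ins, not any particular SAE implementation, and mechanistic completeness has no comparably standard metric, which is precisely the gap this problem highlights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: model activations X (n_samples x d_model) and an
# SAE-style decomposition into sparse codes Z (n_samples x d_dict) with a
# linear decoder W (d_dict x d_model).
n, d_model, d_dict = 256, 32, 128
X = rng.normal(size=(n, d_model))
W = rng.normal(size=(d_dict, d_model)) / np.sqrt(d_model)
# Thresholded noise gives mostly-zero, ReLU-like codes.
Z = np.maximum(rng.normal(size=(n, d_dict)) - 1.5, 0.0)
X_hat = Z @ W

# Sparsity: mean L0, i.e. average count of active features per sample.
l0 = float((Z > 0).sum(axis=1).mean())

# Fidelity: fraction of variance unexplained (FVU); lower is better.
fvu = float(((X - X_hat) ** 2).sum() / ((X - X.mean(axis=0)) ** 2).sum())

print(f"mean L0 = {l0:.1f}, FVU = {fvu:.3f}")
```

In real evaluations, X would be activations from a target layer and Z, W the trained SAE's codes and decoder; the open question is how to fold a completeness term (does the decomposition recover the full causal mechanism?) into a single comparable score alongside L0 and FVU.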

The authors explicitly call out the need for metrics that quantify and balance these desiderata, enabling fair evaluation and method selection across different interpretability approaches.

References

Accounting for this trade-off, and developing evaluation metrics that balance sparsity, fidelity, and mechanistic completeness, remains an open challenge for MI.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models (2601.14004 - Zhang et al., 20 Jan 2026) in Section "Challenges and Future Directions"