Unsupervised correction for lexically-adjusted polysemanticity metrics

Develop an unsupervised version of the lexically-adjusted polysemanticity score. The goal is to identify and discount lexical-identity contributions to neuron-level polysemanticity in transformer MLP activations without using sense annotations, so that the correction can be applied beyond controlled, sense-labeled evaluations.

Background

The paper introduces a lexically-adjusted polysemanticity score that subtracts a layer-specific estimate of lexical inflation from standard polysemanticity metrics, but the current implementation relies on sense labels to quantify the lexical contribution.
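To make the subtraction concrete, the following is a minimal sketch of a per-layer adjustment, not the paper's implementation. The function name, argument names, and data layout are hypothetical. The sketch assumes the lexical-inflation estimate is a single number per layer that is subtracted directly from each neuron's raw score; how the paper computes that estimate from sense labels is not shown here.

```python
from typing import List, Mapping, Sequence


def lexically_adjusted_scores(
    raw_scores: Sequence[float],
    layer_ids: Sequence[int],
    lexical_inflation: Mapping[int, float],
) -> List[float]:
    """Subtract a layer-specific inflation estimate from each raw score.

    All names are hypothetical. `raw_scores[i]` is a standard
    polysemanticity score for neuron i, `layer_ids[i]` is that neuron's
    layer, and `lexical_inflation[layer]` is the assumed per-layer
    estimate of lexical inflation.
    """
    if len(raw_scores) != len(layer_ids):
        raise ValueError("raw_scores and layer_ids must have the same length")
    # Assumption: plain subtraction, with no clipping at zero.
    return [
        score - lexical_inflation[layer]
        for score, layer in zip(raw_scores, layer_ids)
    ]
```

For instance, with made-up values, `lexically_adjusted_scores([0.75, 0.5, 1.0], [0, 0, 1], {0: 0.25, 1: 0.5})` subtracts 0.25 from the two layer-0 neurons and 0.5 from the layer-1 neuron. In the supervised setting the `lexical_inflation` values come from sense labels; the open problem is to estimate them without such labels.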

Because this reliance on sense labels restricts the correction to controlled evaluations, the authors explicitly identify an unsupervised extension as an open problem. Solving it would make the adjustment usable on data without annotated senses.

References

The current implementation requires sense labels, limiting it to controlled evaluations; extending this to an unsupervised correction is an open problem (see the Discussion section).

Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics (2604.00443 - Hou et al., 1 Apr 2026) in Section 6 (Results), Subsection "The confound is correctable"