Finding representations in language models

Develop principled and reliable methods for finding latent representations that correspond to linguistic abstractions in deep neural language models (LMs).

Background

The paper surveys limitations of prior approaches for uncovering representations in neural LMs, including correlational probing methods and mechanistic-interpretability techniques such as distributed alignment search (DAS). These approaches either impose restrictive assumptions (e.g., linearity) that can miss nonlinear structure, or are expressive enough to risk finding spurious alignments, creating a dilemma for interpretability.

Motivated by this long-standing challenge, the authors propose perturbation, a data-driven method that traces how fine-tuning on a single adversarial example generalizes to related cases, as an alternative route to identifying representations without strong geometric assumptions.
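The paper applies this idea to real LMs; as a loose, hypothetical analogy only, the trace-generalization logic can be sketched with a toy logistic classifier: fine-tune on one relabeled ("adversarial") example, then measure which other inputs' predictions move. Inputs that share the example's underlying feature shift together, which is the sense in which the patch's generalization pattern points at a shared representation. Every name, dimension, and hyperparameter below is an assumption of the sketch, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, w=None, lr=0.1, steps=500):
    # Full-batch gradient descent on logistic loss (toy stand-in for fine-tuning).
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(X)
    return w

# Toy "model": a logistic classifier over 4-D activations whose label depends
# only on dimension 0 (illustrative data, not an actual LM).
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0).astype(float)
Xb = np.hstack([X, np.ones((400, 1))])  # append a bias column
w0 = fit_logreg(Xb, y)

# One "adversarial" example: weak class signal (dim 0) plus a distinctive
# extra feature (dim 1), relabeled to the opposite class and trained on alone.
x_adv = np.array([0.5, 3.0, 0.0, 0.0, 1.0])
w1 = fit_logreg(x_adv[None, :], np.array([0.0]), w=w0.copy(), lr=0.5, steps=50)

# Trace how the single-example patch generalizes: inputs sharing the
# distinctive dim-1 feature ("related") versus same-class inputs without it.
related = np.hstack([x_adv[:4] + 0.3 * rng.normal(size=(100, 4)),
                     np.ones((100, 1))])
unrelated = np.hstack([np.array([0.5, 0.0, 0.0, 0.0]) + 0.3 * rng.normal(size=(100, 4)),
                       np.ones((100, 1))])
shift_related = np.abs(sigmoid(related @ w1) - sigmoid(related @ w0)).mean()
shift_unrelated = np.abs(sigmoid(unrelated @ w1) - sigmoid(unrelated @ w0)).mean()
```

In this sketch the fine-tuned patch flips the adversarial example and drags along the inputs that share its dim-1 feature, while leaving other same-class inputs comparatively unchanged; reading off which inputs moved is the toy analogue of tracing a representation.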

References

However, finding representations in LMs remains an open problem.

Perturbation: A simple and efficient adversarial tracer for representation learning in language models (2603.23821 - Rozner et al., 25 Mar 2026) in Section 1: Introduction