The Origins of Representation Manifolds in Large Language Models

Published 23 May 2025 in cs.LG and cs.AI | (2505.18235v1)

Abstract: There is a large ongoing scientific effort in mechanistic interpretability to map embeddings and internal representations of AI systems into human-understandable concepts. A key element of this effort is the linear representation hypothesis, which posits that neural representations are sparse linear combinations of 'almost-orthogonal' direction vectors, reflecting the presence or absence of different features. This model underpins the use of sparse autoencoders to recover features from representations. Moving towards a fuller model of features, in which neural representations could encode not just the presence but also a potentially continuous and multidimensional value for a feature, has been a subject of intense recent discourse. We describe why and how a feature might be represented as a manifold, demonstrating in particular that cosine similarity in representation space may encode the intrinsic geometry of a feature through shortest, on-manifold paths, potentially answering the question of how distance in representation space and relatedness in concept space could be connected. The critical assumptions and predictions of the theory are validated on text embeddings and token activations of LLMs.


Summary

The paper introduces a mathematical framework linking cosine similarity to the intrinsic geometry of LLM features.
It validates the multidimensional linear representation hypothesis through empirical tests on text embeddings and token activations.
The findings offer practical insights for enhancing model interpretability, safety, and feature extraction methods.

The Origins of Representation Manifolds in LLMs

Introduction

Mechanistic interpretability aims to map the embeddings and internal representations of LLMs onto human-understandable concepts, and representation manifolds are one proposed piece of that picture. The work is anchored in the linear representation hypothesis (LRH), which proposes that neural representations are sparse linear combinations of nearly orthogonal direction vectors, each signalling the presence or absence of a feature. Recent discussion has extended this hypothesis so that a feature can also carry a continuous, possibly multidimensional value, in which case it is represented as a manifold rather than a single direction.

The primary contribution of the paper is a mathematical framework that links cosine similarity in representation space to the intrinsic geometry of a feature. The theory's critical assumptions and predictions are tested on text embeddings and token activations of LLMs, addressing how distance in representation space relates to relatedness in concept space.

Representation Manifolds as Feature Encodings

The paper states the multidimensional linear representation hypothesis as a generalization of the LRH. Under it, a feature encodes not just presence or absence but a continuous value: each feature occupies a subspace in which its value appears as a unit vector, multiplied by a non-negative scaling factor that represents how strongly the feature is present. The set of unit vectors a feature can take forms its representation manifold, and the paper argues that shortest on-manifold paths between points may reflect the feature's intrinsic geometry and, through it, relatedness in concept space.
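The decomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's construction: the dimension `D`, the helper names `random_subspace`, `compose`, and `read_out`, and the toy "month" feature are all hypothetical.

```python
import numpy as np

D = 64  # hypothetical representation dimension


def random_subspace(dim, k, rng):
    """Return a (dim, k) matrix whose orthonormal columns span a random subspace."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, k)))
    return q


def compose(active_features, dim=D):
    """Sum scale * (basis @ unit_value) over the active features."""
    x = np.zeros(dim)
    for basis, unit_value, scale in active_features:
        if scale < 0:
            raise ValueError("scale must be non-negative")
        if not np.isclose(np.linalg.norm(unit_value), 1.0):
            raise ValueError("feature value must be a unit vector")
        x += scale * (basis @ unit_value)
    return x


def read_out(x, basis):
    """Project x onto a feature subspace; return (scale, unit_value)."""
    coords = basis.T @ x
    scale = np.linalg.norm(coords)
    unit_value = coords / scale if scale > 0 else np.zeros_like(coords)
    return scale, unit_value


rng = np.random.default_rng(0)
month_basis = random_subspace(D, 2, rng)  # hypothetical "month" feature subspace
angle = 2 * np.pi * 3 / 12                # an arbitrary point on its circle
month_value = np.array([np.cos(angle), np.sin(angle)])

# One active feature with strength 1.5; the read-out recovers both parts.
x = compose([(month_basis, month_value, 1.5)])
scale, recovered = read_out(x, month_basis)
```

With a single active feature the read-out is exact. With several active features in merely almost-orthogonal subspaces, the projection would recover each value only approximately.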

Figure 1: Representation manifolds in LLMs: colours, years and dates. The first and third examples show text embeddings obtained from OpenAI's text-embedding-3-large model from prompts relating to English names for colours and dates of the year, respectively.

Continuous Correspondence Hypothesis

Central to the paper is the Continuous Correspondence Hypothesis (CCH), which asserts a continuous, invertible mapping (a homeomorphism) between a feature's possible values and its representation manifold. Under the hypothesis, the manifold in representation space has the same topological form as the feature's value space. Because feature values are represented as unit vectors, the manifold lies on a unit hypersphere, and each feature value can be translated to its representation and back.
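A toy example may help make the hypothesis concrete. The sketch below assumes a hypothetical cyclic "month" feature whose value space is a circle, and maps it continuously and invertibly onto the unit circle. The functions `embed_month` and `decode_month` are illustrative names, not anything defined in the paper.

```python
import numpy as np


def embed_month(m):
    """Map a month value in [0, 12) to a point on the unit circle."""
    angle = 2 * np.pi * (m % 12) / 12
    return np.array([np.cos(angle), np.sin(angle)])


def decode_month(v):
    """Inverse map: recover the month value from a unit-circle point."""
    angle = np.arctan2(v[1], v[0]) % (2 * np.pi)
    return angle * 12 / (2 * np.pi)


values = [0.0, 1.0, 3.5, 6.0, 11.5]
points = [embed_month(m) for m in values]
round_trip = [decode_month(p) for p in points]
```

Every embedded point has unit norm and every value survives the round trip, which is what a continuous one-to-one correspondence between feature values and representations requires in this simple case.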

Figure 2: Representation manifolds in token activations from layer 8 of Mistral 7B, processed via an SAE to extract representations of 'months of the year' and 'days of the week', as demonstrated in previous studies.

Geometric Properties and Cosine Similarity

The paper studies the geometry of these manifolds under the hypothesis that the cosine similarity between two representations decreases as the distance between the corresponding feature values grows. Under that assumption, cosine similarity acts as a proxy for conceptual relatedness, and the paper uses metric-space arguments to argue that the lengths of shortest on-manifold paths may encode the feature's intrinsic geometry.
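One way to probe this relationship numerically is to approximate shortest on-manifold paths with a nearest-neighbour graph, in the style of Isomap, and compare the path lengths with distances between feature values. The sketch below is an assumption-laden illustration rather than the paper's procedure: the function names, the choice of `k`, and the 12-point toy circle are all hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.clip(Xn @ Xn.T, -1.0, 1.0)


def on_manifold_path_lengths(X, k=2):
    """Approximate geodesic lengths via shortest paths on a k-NN graph.

    Edge weights are angles (arccos of cosine similarity) between neighbours.
    """
    A = np.arccos(cosine_similarity_matrix(X))
    n = len(X)
    masked = A.copy()
    np.fill_diagonal(masked, np.inf)  # exclude self from neighbour search
    nearest = np.argsort(masked, axis=1)[:, :k]

    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        G[i, nearest[i]] = A[i, nearest[i]]
        G[nearest[i], i] = A[i, nearest[i]]  # keep the graph symmetric

    # Floyd-Warshall: allow paths through each intermediate node in turn.
    for m in range(n):
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    return G


# Toy data: 12 "months" placed evenly on a unit circle.
months = np.arange(12)
angles = 2 * np.pi * months / 12
X = np.column_stack([np.cos(angles), np.sin(angles)])

S = cosine_similarity_matrix(X)
G = on_manifold_path_lengths(X, k=2)

# Cyclic distance between month values, for comparison.
diff = np.abs(months[:, None] - months[None, :])
feature_dist = np.minimum(diff, 12 - diff)
```

In this toy case the cosine similarity falls as the cyclic distance between months grows, and the graph path lengths are proportional to that distance. Real embeddings are noisier, so the neighbourhood size and the graph approximation would need care.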

Figure 3: Evidence for the cosine similarity reflecting feature-related distances, providing insights into spatial arrangement of features within model representations.

Practical Implications and Applications

The framework has potential applications in AI safety, alignment, and interpretability. By describing the manifold structure of features, the paper offers a basis for designing model interventions, refining feature extraction methods, and improving sparse autoencoders.

Conclusion

The research connects abstract feature modelling with practical interpretability work, offering a way to characterize the manifold structure of representations in LLMs. These insights deepen understanding of LLM internals and may support safer deployment. They also motivate further study of metric-based feature representations and their manifold geometries.

Figure 4: Evidence against isometry with respect to the metric space $\mathcal{Z}_{\text{years}}$, illustrating the nuanced relationship between year encodings and their geometric representation.

Overall, the study contributes a new perspective on feature encoding in LLMs, which may help in developing AI systems whose decision-making is more interpretable and more closely aligned with human reasoning.
