Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces

Published 13 Jul 2025 in cs.CL and cs.LG | (2507.09709v2)

Abstract: Understanding the latent space geometry of LLMs is key to interpreting their behavior and improving alignment. However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To explore this, we conduct a large-scale empirical study of hidden representations in 11 autoregressive models across 6 scientific topics. We find that high-level semantic information consistently resides in low-dimensional subspaces that form linearly separable representations across domains. This separability becomes more pronounced in deeper layers and under prompts that elicit structured reasoning or alignment behavior, even when surface content remains unchanged. These findings support geometry-aware tools that operate directly in latent space to detect and mitigate harmful or adversarial content. As a proof of concept, we train an MLP probe on final-layer hidden states to act as a lightweight latent-space guardrail. This approach substantially improves refusal rates on malicious queries and prompt injections that bypass both the model's built-in safety alignment and external token-level filters.

Summary

  • The paper demonstrates that high-level semantics are captured by a few principal components in LLMs’ latent spaces.
  • Experimental results reveal that semantic representations become linearly separable via SVMs, especially in the deeper layers of the models.
  • The study highlights that chain-of-thought prompts yield distinct latent structures, paving the way for geometry-aware interventions for model safety.

Encoding Semantic Information in Low-Dimensional Linear Subspaces

This essay examines the paper "Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces" (2507.09709), which explores how high-level semantic information is organized within the latent spaces of LLMs. Through extensive experimentation, the paper demonstrates that LLMs encode semantics in low-dimensional linear subspaces, leading to linearly separable representations across diverse semantic domains.

Intrinsic Dimensionality of Representations

The study conducts a comprehensive analysis of the intrinsic dimensionality of latent spaces across various transformer models. A small number of principal components accounts for nearly all the variance in hidden states, indicating that semantic information is compressed into low-dimensional linear subspaces (Figure 1).

Figure 1: Percentage of principal components (relative to hidden dimensionality) required to explain at least 90% of the total variance in physics abstracts, plotted across layer depth. Darker colors indicate the larger models within each model family.

These findings suggest that while hidden representations span the entire space $\mathbb{R}^d$, the effective dimensionality that captures semantic understanding is much lower. Interestingly, this compression does not concentrate at any particular layer depth; models vary in where semantic compression occurs.
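The kind of analysis behind Figure 1 can be illustrated with a short sketch: count how many principal components of a layer's hidden states are needed to reach 90% explained variance. This is a minimal NumPy illustration on synthetic activations, not the paper's code; the 512-dimensional, rank-10 toy data is an assumption made purely for the demo.

```python
import numpy as np

def components_for_variance(hidden_states: np.ndarray, threshold: float = 0.90) -> int:
    """Number of principal components needed to explain `threshold` of the variance.

    hidden_states: (n_samples, d) matrix of one layer's activations.
    """
    centered = hidden_states - hidden_states.mean(axis=0)
    # Singular values of the centered data give the principal variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    cumulative = np.cumsum(variances) / variances.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic example: 512-dim "hidden states" whose variance lives in ~10 directions.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 512))
states = low_rank + 0.01 * rng.normal(size=(1000, 512))
k = components_for_variance(states)
print(k, k / 512)  # far fewer components than the 512-dim hidden space
```

Running this per layer and dividing by the hidden dimensionality yields the kind of curve plotted in Figure 1.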

Linear Separability in Latent Spaces

The paper evaluates whether semantic representations of different topics can be linearly separated within the latent space. A support vector machine (SVM) trained on hidden states shows that representations of different topics form distinct clusters, especially in the final layers, where classification accuracy is high (Figure 2).

Figure 2: SVM classification accuracy on representations of scientific abstracts as a function of layer depth. Results are averaged over 15 pairwise accuracies. Darker colors represent the larger model within each model family.

The research demonstrates that larger models with higher-dimensional spaces are better at linearly separating complex semantic structures. This separability increases with model depth, emphasizing the importance of the final layers in distinguishing semantic domains.
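A minimal version of such a pairwise separability probe can be sketched as follows. For self-containment this uses a plain logistic-regression classifier in NumPy rather than an SVM, and runs on synthetic two-topic activations; the dimensions, offsets, and hyperparameters are assumptions for the demo, not the paper's setup.

```python
import numpy as np

def linear_separability(states_a: np.ndarray, states_b: np.ndarray,
                        epochs: int = 200, lr: float = 0.1) -> float:
    """Train a linear classifier on two sets of hidden states; return its
    training accuracy as a crude linear-separability score."""
    X = np.vstack([states_a, states_b])
    y = np.concatenate([np.zeros(len(states_a)), np.ones(len(states_b))])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)   # standardize features
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):                              # full-batch gradient descent
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))           # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * (p - y).mean()
    return float(((X @ w + b > 0) == y).mean())

# Synthetic "topics": two Gaussian clusters whose means differ only along a
# few directions, mimicking a semantic signal in a low-dimensional subspace.
rng = np.random.default_rng(1)
d = 256
offset = np.zeros(d)
offset[:5] = 3.0
topic_a = rng.normal(size=(300, d))
topic_b = rng.normal(size=(300, d)) + offset
acc = linear_separability(topic_a, topic_b)
print(acc)  # close to 1.0 for well-separated topics
```

Averaging such pairwise accuracies over all topic pairs, per layer, gives a curve like the one in Figure 2.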

Influence of Contextual Instructions and Prompts

The paper investigates how instructions that elicit structured reasoning, such as chain-of-thought (CoT) prompts, affect the geometry of latent spaces. The study finds that CoT instructions lead to distinct, linearly separable representations, suggesting that instructions play a pivotal role in organizing and disentangling semantic content within the model's latent space (Figure 3).

Figure 3: SVM classification accuracy on the representations of the same prompt with and without a one-sentence chain-of-thought instruction.

This behavior illustrates that even when the surface content remains unchanged, the underlying instruction can significantly impact the model's internal representation, offering insights into how semantic reasoning is encoded and influenced by external prompts.

Implications for Model Safety and Alignment

The geometric structure of latent spaces also has practical implications for model safety and alignment. By exploiting linear separability, the study proposes a latent-space guardrail: an MLP probe trained to detect malicious queries directly from hidden states. This approach improves refusal rates for harmful prompts, showcasing the potential of geometry-aware interventions for enhancing model safety (Figure 4).

Figure 4: Refusal rates across evaluation datasets. A paired McNemar test (p < 0.05) confirms that the guardrail significantly alters prompt handling, achieving higher refusal rates on harmful inputs and prompt injections compared to the baselines.
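A toy version of such a guardrail can be sketched as a small two-layer MLP probe in NumPy. The class name, architecture, training procedure, and synthetic "harmful vs. benign" data below are all illustrative assumptions, not the paper's implementation; in practice the probe would be trained on final-layer hidden states extracted from the LLM.

```python
import numpy as np

class LatentGuardrail:
    """Two-layer MLP probe over final-layer hidden states (illustrative only)."""

    def __init__(self, d: int, hidden: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=1 / np.sqrt(d), size=(d, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=1 / np.sqrt(hidden), size=hidden)
        self.b2 = 0.0

    def _forward(self, X):
        h = np.maximum(X @ self.W1 + self.b1, 0.0)       # ReLU features
        z = np.clip(h @ self.W2 + self.b2, -30.0, 30.0)  # clipped logits
        return h, 1.0 / (1.0 + np.exp(-z))               # sigmoid probability

    def fit(self, X, y, epochs: int = 400, lr: float = 0.1):
        """Full-batch gradient descent on binary cross-entropy."""
        for _ in range(epochs):
            h, p = self._forward(X)
            err = (p - y) / len(y)                # dL/dlogit for sigmoid + BCE
            dh = np.outer(err, self.W2) * (h > 0) # backprop through ReLU
            self.W2 -= lr * (h.T @ err)
            self.b2 -= lr * err.sum()
            self.W1 -= lr * (X.T @ dh)
            self.b1 -= lr * dh.sum(axis=0)

    def should_refuse(self, x, threshold: float = 0.5) -> bool:
        return bool(self._forward(x[None, :])[1][0] > threshold)

# Synthetic "hidden states": benign vs. harmful prompts separated along a few
# directions, mimicking a low-dimensional harmfulness signal in latent space.
rng = np.random.default_rng(2)
d = 128
shift = np.zeros(d)
shift[:8] = 2.0
benign = rng.normal(size=(400, d))
harmful = rng.normal(size=(400, d)) + shift
X = np.vstack([benign, harmful])
y = np.concatenate([np.zeros(400), np.ones(400)])
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

guard = LatentGuardrail(d)
guard.fit(X, y)
refusals = np.array([guard.should_refuse(x) for x in X])
acc = (refusals == y.astype(bool)).mean()
print(acc)  # high on this easily separable synthetic data
```

Because the probe reads hidden states rather than tokens, it can in principle fire on injections that evade token-level filters, which is the motivation behind the latent-space placement.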

Conclusion

The paper underscores the importance of latent space geometry in LLMs, revealing that high-level semantics are organized into low-dimensional, linearly separable subspaces. This representational structure not only aids in understanding semantic encoding but also enables practical applications, such as building model safeguards within the latent space. By shedding light on the internal geometric properties, this work opens new avenues for improving model interpretability, control, and safety.
