Belief-state simplex geometries in natural-language LLMs

Establish whether large pretrained language models trained on naturalistic text develop internal simplex-shaped geometric representations whose barycentric coordinates encode probability distributions over discrete latent states, analogous to the belief-state simplices observed in transformers trained on hidden Markov model–generated sequences.
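
To make the geometric claim concrete, here is a minimal sketch (all names, dimensions, and data are hypothetical, not from the paper) of how barycentric coordinates with respect to simplex vertices can be read as a probability distribution over discrete latent states:

```python
import numpy as np

# Hypothetical illustration: if a residual-stream point x lies on a simplex
# with vertices V (k x d), its barycentric coordinates b satisfy
# x = b @ V with b >= 0 and sum(b) == 1, so b can be interpreted as a
# probability distribution over k latent states.

def barycentric_coords(x, V):
    """Least-squares barycentric coordinates of x w.r.t. vertices V (k, d).

    Appends a sum-to-one row to the linear system; nonnegativity is checked
    afterwards rather than enforced.
    """
    k, _ = V.shape
    A = np.vstack([V.T, np.ones((1, k))])     # (d+1, k) augmented system
    y = np.concatenate([x, [1.0]])            # target plus the constraint row
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b

# Toy check: a known mixture of 3 vertices in 8-d space is recovered.
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 8))                   # simplex vertices
b_true = np.array([0.2, 0.5, 0.3])            # belief distribution
x = b_true @ V
b_hat = barycentric_coords(x, V)
print(np.allclose(b_hat, b_true))             # True
print(b_hat.sum())                            # ~1.0
```

If fitted coordinates come out nonnegative and sum to one across held-out activations, that is the signature the problem statement asks about; large violations would argue against a simplex encoding.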

Background

Prior work has shown that transformers trained on sequences generated by hidden Markov models represent belief states as simplex-shaped geometries in their residual streams, with vertices corresponding to latent states and barycentric coordinates encoding belief distributions. It is unknown whether similar geometric encodings arise in LLMs trained on naturalistic text, where latent variables such as discourse mode, referential context, or syntactic role are abstract and lack ground-truth labels.
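
In the HMM setting, a belief state is the Bayesian posterior over hidden states given the tokens observed so far, computed by the normalized forward recursion; every belief is a point on the (k-1)-dimensional probability simplex. A minimal sketch of that recursion follows, with a made-up 3-state HMM (the matrices and function name are illustrative, not the paper's code):

```python
import numpy as np

def belief_trajectory(obs, T, E, prior):
    """Normalized forward recursion: b_t[i] = P(state_t = i | obs_1..t).

    T[i, j] = P(next state j | state i); E[i, o] = P(emit o | state i).
    """
    b = prior.copy()
    trajectory = [b]
    for o in obs:
        b = (b @ T) * E[:, o]   # predict next-state distribution, weight by likelihood
        b = b / b.sum()         # renormalize back onto the simplex
        trajectory.append(b)
    return np.array(trajectory)

# Toy 3-state HMM with binary emissions (all numbers invented for illustration).
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
E = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])
prior = np.ones(3) / 3
beliefs = belief_trajectory([0, 0, 1, 1, 1], T, E, prior)
print(beliefs.round(3))         # each row is a point on the 2-simplex
```

Prior work reports that the residual stream mirrors these simplex-valued trajectories geometrically; the open question is whether anything analogous holds when the latent states are not given by a known generator.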

The paper introduces a pipeline combining sparse autoencoders, k-subspace clustering, and AANet simplex fitting to search for such structures in Gemma-2-9B, yielding preliminary evidence but not a definitive resolution. The open problem is to determine conclusively whether analogous belief-state simplex geometries are learned in natural-language settings.
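
As a rough picture of the final fitting stage, a simplex can be fit to a cloud of activation vectors by factoring them into vertices and simplex-constrained weights. The sketch below uses plain alternating minimization with projected gradient steps as a stand-in for AANet; it is not the paper's pipeline, and the function names, step sizes, and toy data are all assumptions:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def fit_simplex(X, k, n_iter=300, seed=0):
    """Fit k vertices V so that X ~= W @ V with each row of W on the simplex.

    Alternates a projected gradient step on W (step size 1/L) with a
    closed-form least-squares update of V. A simplified stand-in for a
    learned simplex-fitting model, not AANet itself.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    W = np.full((n, k), 1.0 / k)                    # start every row at the barycenter
    V = X[rng.choice(n, size=k, replace=False)]     # initialize vertices from data
    for _ in range(n_iter):
        step = 1.0 / (np.linalg.norm(V @ V.T, 2) + 1e-9)
        grad = (W @ V - X) @ V.T                    # d/dW of 0.5 * ||W V - X||^2
        W = np.apply_along_axis(project_simplex, 1, W - step * grad)
        V, *_ = np.linalg.lstsq(W, X, rcond=None)   # exact V update given W
    return W, V

# Toy check: points sampled inside a known 3-vertex simplex in 16-d space.
rng = np.random.default_rng(1)
V_true = rng.normal(size=(3, 16))
W_true = rng.dirichlet(np.ones(3), size=500)
X = W_true @ V_true
W_hat, V_hat = fit_simplex(X, k=3)
print(np.linalg.norm(W_hat @ V_hat - X) / np.linalg.norm(X))  # relative error (small on this toy)
```

In the pipeline described above, the inputs X would presumably be the clustered, SAE-processed activations; a low reconstruction error together with rows of W behaving like probability distributions is the kind of evidence such a search targets.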

References

"Whether LLMs trained on naturalistic text develop analogous geometric representations remains an open question."

Finding Belief Geometries with Sparse Autoencoders (arXiv:2604.02685, Levinson, 3 Apr 2026), Abstract