Semantic Gravity Wells in LLMs

Updated 14 January 2026
  • Semantic gravity wells provide a conceptual and quantitative framework modeling the statistical pull toward high-probability tokens in LLMs under external constraints.
  • The framework integrates mechanistic analysis of negative constraint failures with a geometric field-theoretic approach, detailing priming and override failure modes.
  • It offers actionable insights for prompt design and intervention strategies by quantifying token-level dynamics in a curved semantic manifold.

A semantic gravity well is a conceptual and quantitative framework for understanding the dynamics of token selection in LLMs, centering on how the model’s inherent statistical “pressure” towards certain tokens interacts with external constraints or attractors. The concept has been developed in two influential lines of research: (1) the mechanistic analysis of negative constraints and constraint failure in instruction-following LLMs (Rana, 12 Jan 2026), and (2) a geometric field-theoretic formalization of LLM text generation, where “information gravity” governs token flow in a curved semantic manifold (Vyshnyvetska, 29 Apr 2025).

1. Semantic Pressure and the Gravity Well Metaphor

In the context of LLMs, a semantic gravity well represents the statistical pull toward high-probability tokens during sequence generation. When negative constraints are applied (e.g., “Do not use word X”), there arises a competition: the model’s baseline probability of emitting X (semantic pressure) versus the suppression induced by the constraint (constraint pressure) (Rana, 12 Jan 2026).

Formally, let $X$ be a forbidden word and $S(X)$ the set of token sequences decoding to $X$. Given a context, the semantic pressure toward $X$ is

$$P_0 = \sum_{s \in S(X)} \prod_{i=1}^{|s|} P(s_i \mid \text{context},\, s_{<i})$$

where $P_0 \in [0,1]$ measures the pre-constraint likelihood of $X$. The empirically observed probability $p$ that the model violates the negative constraint follows a logistic relationship with $P_0$: $p = \sigma(-2.40 + 2.27\,P_0)$, where $\sigma(z) = 1/(1 + e^{-z})$. At $P_0 = 0.1$, violations occur in approximately 9% of generations, rising to over 46% at $P_0 = 0.9$, in a dataset of 40,000 generations across 2,500 prompts.
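The fitted logistic curve is straightforward to evaluate numerically. The sketch below is a minimal illustration of the published fit, not the authors' code:

```python
import math

def violation_probability(p0: float) -> float:
    """Fitted logistic model p = sigma(-2.40 + 2.27 * P0) (Rana, 12 Jan 2026)."""
    # sigma(z) = 1 / (1 + exp(-z)) with z = -2.40 + 2.27 * P0
    return 1.0 / (1.0 + math.exp(2.40 - 2.27 * p0))

for p0 in (0.1, 0.5, 0.9):
    print(f"P0 = {p0:.1f} -> predicted violation rate {violation_probability(p0):.1%}")
```

The curve's monotonic rise captures the core claim: the stronger the baseline pull toward the forbidden token, the likelier the constraint is to fail.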

The gravity well metaphor is similarly formalized in geometric terms: text generation can be viewed as movement over a latent semantic manifold $\mathcal{M}$, with potential wells corresponding to regions of low $\Phi(x; Q) = -\log P(t \mid Q)$, where $x$ is a token embedding and $Q$ the query context (Vyshnyvetska, 29 Apr 2025).

2. Mechanisms of Negative Constraint Failure

Negative constraint failures manifest in two distinct mechanistic modes: priming failures and override failures (Rana, 12 Jan 2026).

A. Priming Failure (87.5% of violations):

The explicit mention of the forbidden word in the instruction inadvertently primes its internal representation:

  • Attentional analysis introduces the Priming Index (PI), defined as the difference between the model's fraction of attention on the forbidden-token mention (TMF) and on the "do not" tokens (NF): $PI = TMF - NF$.
  • In priming failures, $PI > 0$; paradoxically, in certain cases the constraint actually increases $P(X)$: $\Delta P = P_0 - P_1 < 0$.
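Under the stated definition, the Priming Index is simple to compute once per-segment attention fractions are available; the sketch below is hypothetical in its inputs, since the paper's exact pooling over heads and layers is not reproduced here:

```python
def priming_index(tmf: float, nf: float) -> float:
    """PI = TMF - NF: attention fraction on the forbidden-word mention (TMF)
    minus the attention fraction on the "do not" tokens (NF)."""
    return tmf - nf

# PI > 0 indicates a priming-dominated attention pattern (illustrative values).
pi = priming_index(tmf=0.31, nf=0.12)
print(round(pi, 2))
```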

B. Override Failure (12.5%):

Here, initial suppression is overwhelmed by late-inference feed-forward network (FFN) contributions:

  • Decomposition at layer 27, using the logit lens, separates the attention and FFN effects:
    • In failures: $\mathrm{FFNContrib} \approx +0.386$, $\mathrm{AttnContrib} \approx -0.132$.
    • In successes: $\mathrm{FFNContrib} \approx +0.097$, $\mathrm{AttnContrib} \approx -0.048$.
  • The $+0.39$ logit injection from the FFN pathway in failures is roughly four times that in successes and nearly three times the magnitude of the attention suppression, flipping the constraint's net influence at the final stage.
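The decomposition arithmetic can be checked directly; the numbers below are the reported layer-27 contributions, combined in a small sketch:

```python
# Reported layer-27 logit-lens contributions to the forbidden token's logit
# (Rana, 12 Jan 2026).
contrib = {
    "failure": {"ffn": 0.386, "attn": -0.132},
    "success": {"ffn": 0.097, "attn": -0.048},
}

# Net effect per outcome: the FFN injection plus the attention suppression.
net = {k: v["ffn"] + v["attn"] for k, v in contrib.items()}
for outcome in ("failure", "success"):
    c = contrib[outcome]
    print(f"{outcome}: FFN {c['ffn']:+.3f} + Attn {c['attn']:+.3f} = net {net[outcome]:+.3f}")
```

In failures the FFN injection swamps the attention suppression (net +0.254), while in successes the two nearly cancel (net +0.049).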

3. Geometric Models: Information Gravity and the Semantic Manifold

Information gravity theory models LLM token generation as the trajectory of a particle on a Riemannian manifold $\mathcal{M}$ with a learned metric $g_{ij}$ (Vyshnyvetska, 29 Apr 2025):

$$ds^2 = g_{ij}(x)\, dx^i dx^j, \quad g_{ij} = \partial_i \phi \cdot \partial_j \phi$$

The user query $Q$ acts as a source with information mass

$$M(Q) = \alpha H(Q) + \beta D(Q) + \gamma N(Q)$$

where $H(Q)$ is query entropy, $D(Q)$ is context mutual information, and $N(Q)$ is query novelty. Local curvature in $\mathcal{M}$, and thus the potential wells (valleys of low $\Phi$), depends on this information mass, which shapes token-attraction dynamics.
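A toy computation of the information mass, assuming Shannon entropy for $H(Q)$ and placeholder values for $D(Q)$ and $N(Q)$ (the paper's estimators for those two terms are not reproduced here, and the unit weights are assumptions):

```python
import math
from collections import Counter

def query_entropy(tokens):
    """Shannon entropy H(Q) of the query's empirical token distribution, in bits."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_mass(h, d, n, alpha=1.0, beta=1.0, gamma=1.0):
    """M(Q) = alpha*H(Q) + beta*D(Q) + gamma*N(Q); unit weights are placeholders."""
    return alpha * h + beta * d + gamma * n

h = query_entropy("explain semantic gravity wells in llms".split())
print(information_mass(h, d=0.4, n=0.7))  # D(Q), N(Q) are illustrative values
```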

Token generation is then Boltzmann-like:

$$P(t \mid Q) = \frac{\exp(-\Phi(x; Q)/T)}{Z(Q, T)}$$

where $T$ controls sampling stochasticity and $Z$ is a normalizing partition function. As $T \to 0$, generation becomes deterministic, dominated by the deepest part of the gravitational well.
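The Boltzmann rule and the $T \to 0$ limit can be demonstrated on toy potential values (the $\Phi$ values below are illustrative, not taken from the paper):

```python
import math

def token_distribution(phi, T):
    """P(t|Q) = exp(-Phi/T) / Z over a finite candidate set, computed stably."""
    logits = [-p / T for p in phi]
    m = max(logits)                         # subtract max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)                        # partition function Z
    return [w / z for w in weights]

phi = [0.5, 1.2, 3.0, 0.7]                  # toy potentials; lower = deeper well
for T in (2.0, 0.5, 0.05):
    p = token_distribution(phi, T)
    print(f"T={T}: max prob {max(p):.3f} at token {p.index(max(p))}")
```

As the temperature drops, probability mass collapses onto the token at the bottom of the deepest well, matching the deterministic limit described above.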

Empirical phenomena explained by this framework include:

  • Hallucinations: Trajectories wander into "semantic voids" (regions with low training data and high $\Phi$).
  • Prompt Sensitivity: Tiny edits to the query QQ sharply deform the potential well, yielding major shifts in likely continuation.
  • Diversity vs. Coherence: Controlled by the “temperature” parameter, which sets the path’s adherence to the well’s minima.

4. Layer-wise Dynamics and Causal Analysis

Systematic analysis of layer-wise representations using the logit lens reveals a three-phase trajectory (Rana, 12 Jan 2026):

  • Layers 0–20: $P^{(\ell)}(X) \approx 10^{-4}$; the target probability remains flat.
  • Layers 21–27: Divergence; in failures the probability surges, while in successes it remains suppressed.
  • Final layer: baseline/failure $P \approx 0.71$; negative/failure $P \approx 0.66$; negative/success $P \approx 0.08$.
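The logit-lens readout behind these layer-wise probabilities amounts to projecting an intermediate residual-stream state through the unembedding matrix and taking a softmax; a toy sketch (real implementations typically also apply the model's final layer norm before the projection):

```python
import math

def logit_lens_prob(hidden, unembed, target):
    """P^(l)(X): softmax over (unembed @ hidden), read off at the target token.
    Toy dense implementation; `hidden` is d-dim, `unembed` is vocab x d."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return exps[target] / sum(exps)

# 2-d toy residual state and a 3-token vocabulary.
p = logit_lens_prob([1.0, 0.0], [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], target=0)
print(f"{p:.3f}")
```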

Causality is established by activation patching, which entails replacing the residual stream at various layers in a negative-instruction run with the corresponding activations from a baseline run. Patching layers 24–27 in failures increases $P(X)$ by up to $+0.07$, demonstrating that these layers are causally responsible for constraint override.
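Activation patching as described can be sketched with toy stand-ins for layers and cached activations; this illustrates the splice itself, not the paper's implementation:

```python
def run_with_patch(layers, cached_base_acts, patch_layers):
    """Rerun a forward pass, overwriting the residual stream with cached
    baseline-run activations after each layer listed in `patch_layers`."""
    x = cached_base_acts[0]              # shared input (toy scalar residual)
    acts = [x]
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in patch_layers:
            x = cached_base_acts[i + 1]  # splice in the baseline activation
        acts.append(x)
    return acts

# Toy demo: four "layers" acting on a scalar; patch layer 2 from the baseline run.
neg_layers = [lambda x: x + 1.0] * 4
base_acts = [0.0, 2.0, 4.0, 6.0, 8.0]    # cached activations from a baseline run
print(run_with_patch(neg_layers, base_acts, patch_layers={2}))
```

Everything downstream of the patched layer then evolves from the baseline state, which is what lets the method localize causal responsibility to specific layers.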

There is a pronounced suppression asymmetry at decision time:

  • $\Delta P \approx 22.8$ pp in successes versus $\Delta P \approx 5.2$ pp in failures, i.e., suppression is $4.4\times$ weaker in failures.

5. Design Implications and Diagnostic Strategies

The findings on semantic gravity wells directly inform the design of prompts and constraint systems in LLMs (Rana, 12 Jan 2026):

  • Avoid Explicit Naming: Since 87.5% of failures are priming-driven, negative constraints should not mention the forbidden word. Alternatives include categorical bans, euphemisms, or positive phrasing.
  • Anticipate High-Pressure Cases: $P_0$ should be computed for the forbidden token; high-$P_0$ prompts require stronger mitigation.
  • Layer-Targeted Interventions: In principle, one can architecturally damp or override late-layer FFN activations responsible for constraint override.
  • Post-Generation Filtering: For safety-critical applications, downstream filtering or reranking is necessary, as generation-time constraints alone are insufficient.
  • Attention-Based Diagnostics: Monitoring the Priming Index provides a real-time warning of imminent constraint violation.
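Computing $P_0$ ahead of time, as the second point suggests, only requires summing path probabilities over the forbidden word's tokenizations; the model hook below is hypothetical:

```python
import math

def semantic_pressure(tokenizations, cond_logprob, context):
    """P0 = sum over tokenizations s of prod_i P(s_i | context, s_<i).
    `cond_logprob(tok, context, prefix)` is a hypothetical model hook
    returning log P(tok | context, prefix)."""
    total = 0.0
    for s in tokenizations:
        lp = 0.0
        for i, tok in enumerate(s):
            lp += cond_logprob(tok, context, s[:i])
        total += math.exp(lp)       # product of conditionals, via log-space sum
    return total

# Toy conditional model: every token has probability 0.2 regardless of prefix.
toy = lambda tok, ctx, prefix: math.log(0.2)
p0 = semantic_pressure([["ele", "phant"], ["elephant"]], toy, context="")
print(p0)  # 0.2 * 0.2 + 0.2 = 0.24
```

With a real tokenizer and model, the same loop gives the pre-constraint pressure that the logistic fit in Section 1 maps to an expected violation rate.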

6. Unified Theoretical Perspective and Empirical Significance

Both mechanistic and geometric perspectives converge on the semantic gravity well as a critical locus of LLM behavior:

  • The well represents the balance between model priors (statistical propensity) and any applied constraints (instructional or otherwise).
  • Violation rates and failure mechanisms have near-deterministic relationships with quantitative precursors ($P_0$, FFNContrib, PI), allowing prediction and (in principle) intervention.
  • The information gravity model’s use of Riemannian geometry, information mass, and potential theory yields an overarching explanation for observed failure modes, sensitivity phenomena, and diversity–coherence trade-offs in language generation.

In sum, semantic gravity wells formalize the interplay between statistical naturalness and external objectives in LLM output, both predicting the fragility of negative constraints and enabling mechanistic and geometric diagnostics of LLM output dynamics (Rana, 12 Jan 2026, Vyshnyvetska, 29 Apr 2025).
