Semantic Gravity Wells in LLMs
- Semantic gravity wells are a conceptual and quantitative framework that models the statistical pull toward high-probability tokens in LLMs under external constraints.
- The framework integrates mechanistic analysis of negative constraint failures with a geometric field-theoretic approach, detailing priming and override failure modes.
- It offers actionable insights for prompt design and intervention strategies by quantifying token-level dynamics in a curved semantic manifold.
A semantic gravity well is a conceptual and quantitative framework for understanding the dynamics of token selection in LLMs, centering on how the model’s inherent statistical “pressure” towards certain tokens interacts with external constraints or attractors. The concept has been developed in two influential lines of research: (1) the mechanistic analysis of negative constraints and constraint failure in instruction-following LLMs (Rana, 12 Jan 2026), and (2) a geometric field-theoretic formalization of LLM text generation, where “information gravity” governs token flow in a curved semantic manifold (Vyshnyvetska, 29 Apr 2025).
1. Semantic Pressure and the Gravity Well Metaphor
In the context of LLMs, a semantic gravity well represents the statistical pull toward high-probability tokens during sequence generation. When negative constraints are applied (e.g., “Do not use word X”), there arises a competition: the model’s baseline probability of emitting X (semantic pressure) versus the suppression induced by the constraint (constraint pressure) (Rana, 12 Jan 2026).
Formally, let $w$ be a forbidden word and $\mathcal{S}(w)$ the set of token sequences that decode to $w$. Given a context $c$, the semantic pressure toward $w$ is
$$P_{\text{sem}}(w \mid c) = \sum_{s \in \mathcal{S}(w)} p(s \mid c),$$
where $p(s \mid c)$ measures the pre-constraint likelihood of emitting $s$. The empirically observed probability that the model violates the negative constraint follows a logistic relationship with $P_{\text{sem}}$: violations are rare at low pressure and rise steeply as pressure grows, in a dataset of 40,000 generations across 2,500 prompts.
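The pressure-to-violation relationship can be sketched as follows. This is a minimal illustration, not code from the paper: the function names are ad hoc, and the logistic coefficients `a` and `b` are placeholder values, not fitted parameters from Rana (12 Jan 2026).

```python
import math

def semantic_pressure(seq_probs):
    """Semantic pressure toward a forbidden word: the total pre-constraint
    probability mass over all token sequences that decode to it.
    `seq_probs` maps each such sequence to its model probability."""
    return sum(seq_probs.values())

def violation_probability(pressure, a, b):
    """Logistic model of the constraint-violation rate as a function of
    semantic pressure; `a` and `b` stand in for fitted coefficients."""
    return 1.0 / (1.0 + math.exp(-(a * pressure + b)))

# Toy example: two tokenizations of a forbidden word carry the mass.
p = semantic_pressure({("for", "bidden"): 0.30, ("forb", "idden"): 0.05})
rate = violation_probability(p, a=8.0, b=-4.0)
print(f"pressure={p:.2f}, predicted violation rate={rate:.2f}")
```

The key point is that violation risk is predictable from a quantity computable before generation, which motivates the pressure-aware mitigation strategies discussed later.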
The gravity well metaphor is similarly formalized in geometric terms: text generation can be viewed as movement over a latent semantic manifold $\mathcal{M}$, with potential wells corresponding to regions of low potential $U(x, q)$, where $x$ is a token embedding and $q$ the query context (Vyshnyvetska, 29 Apr 2025).
2. Mechanisms of Negative Constraint Failure
Negative constraint failures manifest in two distinct mechanistic modes: priming failures and override failures (Rana, 12 Jan 2026).
A. Priming Failure (87.5% of violations):
The explicit mention of the forbidden word in the instruction inadvertently primes its internal representation:
- Attentional analysis introduces the Priming Index (PI), defined as the difference between the model’s fraction of attention to the forbidden-token mention (TMF) and to the negation “do not” (NF): $\text{PI} = \text{TMF} - \text{NF}$.
- In priming failures, PI is strongly positive: attention concentrates on the mention rather than on the negation. Paradoxically, in certain cases the constraint actually increases the forbidden token’s probability above its unconstrained baseline.
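The Priming Index can be computed directly from a row of attention weights, given the token positions of the forbidden-word mention and of the negation. A minimal sketch with illustrative names and toy weights:

```python
import numpy as np

def priming_index(attn, forbidden_idx, negation_idx):
    """Priming Index: fraction of attention on the forbidden-word
    mention (TMF) minus the fraction on the negation tokens such as
    "do not" (NF). A strongly positive PI signals priming.
    `attn` is a 1-D array of attention weights over instruction positions."""
    total = attn.sum()
    tmf = attn[forbidden_idx].sum() / total
    nf = attn[negation_idx].sum() / total
    return tmf - nf

# Toy attention row over an instruction like: Do not use the word "elephant".
attn = np.array([0.05, 0.05, 0.10, 0.10, 0.10, 0.60])
pi = priming_index(attn, forbidden_idx=[5], negation_idx=[0, 1])
print(f"PI = {pi:.2f}")  # positive: attention concentrates on the mention
```

In practice the attention row would come from the model’s attention tensors at the generation step, averaged over heads; the indexing scheme here is an assumption for illustration.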
B. Override Failure (12.5%):
Here, initial suppression is overwhelmed by late-inference feed-forward network (FFN) contributions:
- Decomposition in layer 27, using the logit lens, separates the attention and FFN contributions to the forbidden token’s logit.
- In failures, the FFN contribution is large and positive while the attention contribution remains suppressive.
- In successes, both contributions remain suppressive.
- In failures, the logit injection from the FFN pathway is almost four times the magnitude of the suppressive attention effect, flipping the constraint’s influence at the final stage.
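The logit-lens decomposition exploits linearity: because a layer writes the sum of its attention and FFN outputs into the residual stream, projecting each component through the unembedding separately splits the layer’s total effect on the forbidden token’s logit into the two pathways. A toy illustration with random weights (not the paper’s model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 5
W_U = rng.normal(size=(d_model, vocab))   # toy unembedding matrix
forbidden = 3                             # vocab index of the forbidden token

# A layer writes attn_out + ffn_out into the residual stream; the logit-lens
# decomposition projects each component through the unembedding separately.
attn_out = rng.normal(size=d_model)
ffn_out = rng.normal(size=d_model)

attn_contrib = attn_out @ W_U[:, forbidden]
ffn_contrib = ffn_out @ W_U[:, forbidden]
print(f"AttnContrib={attn_contrib:+.2f}, FFNContrib={ffn_contrib:+.2f}")
# An override failure corresponds to a large positive FFNContrib
# outweighing a suppressive (negative) AttnContrib.
```

By linearity, the two contributions sum exactly to the projection of the layer’s combined residual write, which is what makes the pathway-level attribution well defined.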
3. Geometric Models: Information Gravity and the Semantic Manifold
Information gravity theory models LLM token generation as the trajectory of a particle on a Riemannian manifold $\mathcal{M}$ with a learned metric (Vyshnyvetska, 29 Apr 2025). The user query acts as a gravitational source whose information mass is determined by the query entropy $H(q)$, the mutual information $I(q; c)$ between query and context, and the query’s novelty. Local curvature of $\mathcal{M}$, and thus the potential wells (valleys of low potential $U$), depends on this information mass, shaping token attraction dynamics.
Token generation is then Boltzmann-like:
$$p(x \mid q) = \frac{1}{Z} \exp\!\left(-\frac{U(x, q)}{T}\right),$$
where the temperature $T$ controls sampling stochasticity and $Z$ is a normalizing partition function. As $T \to 0$, generation becomes deterministic and is dominated by the deepest part of the gravitational well.
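The Boltzmann-like sampling rule is easy to sketch numerically; lowering the temperature concentrates probability mass onto the token at the bottom of the deepest well (the toy potential values here are arbitrary):

```python
import numpy as np

def boltzmann_probs(U, T):
    """Boltzmann-like token distribution p(x) proportional to exp(-U(x)/T):
    `U` is an array of potential values (well depth per token),
    `T` the temperature controlling stochasticity."""
    logits = -np.asarray(U, dtype=float) / T
    logits -= logits.max()              # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()                  # division by the partition function Z

U = np.array([0.5, 1.0, 3.0, 3.5])      # token 0 sits deepest in the well
for T in (2.0, 0.1):
    print(f"T={T}: {np.round(boltzmann_probs(U, T), 3)}")
# As T -> 0 the mass collapses onto argmin U (the deepest well).
```

This mirrors ordinary temperature-scaled softmax sampling with $-U$ playing the role of the logits, which is why the framework maps cleanly onto standard decoding practice.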
Empirical phenomena explained by this framework include:
- Hallucinations: Trajectories wander into “semantic voids” (regions with sparse training data and high potential $U$).
- Prompt Sensitivity: Tiny edits to the query sharply deform the potential well, yielding major shifts in likely continuation.
- Diversity vs. Coherence: Controlled by the “temperature” parameter, which sets the path’s adherence to the well’s minima.
4. Layer-wise Dynamics and Causal Analysis
Systematic analysis of layer-wise representations using the logit lens reveals a three-phase trajectory (Rana, 12 Jan 2026):
- Layers 0–20: the forbidden token’s probability is flat and indistinguishable between success and failure runs.
- Layers 21–27: divergence; in failures the probability surges, in successes it remains suppressed.
- Final layer: under a negative constraint, the forbidden-token probability in failure runs approaches the unconstrained baseline, whereas in success runs it stays strongly suppressed.
Causality is established by activation patching, which entails replacing the residual stream at various layers in a negative-instruction run with the corresponding activations from a baseline run. Patching layers 24–27 in failure runs substantially increases the forbidden token’s probability, establishing that these layers are causally responsible for constraint override.
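Activation patching can be illustrated with a toy residual-stream model (four random layers, not the paper’s architecture). The core mechanic: once the stream at a given layer is replaced with the activation recorded from a baseline run, everything downstream follows the baseline computation, so behavioral changes can be attributed to the patched layer.

```python
import numpy as np

def forward(layers, x, record=False, patch=None):
    """Toy residual-stream forward pass: each layer adds its output to the
    stream. `patch` = (layer_index, activation) overwrites the stream at
    that layer with an activation recorded from another run."""
    h = x.copy()
    acts = []
    for i, W in enumerate(layers):
        h = h + np.tanh(h @ W)
        if patch is not None and i == patch[0]:
            h = patch[1].copy()       # splice in the other run's stream
        if record:
            acts.append(h.copy())
    return (h, acts) if record else h

rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 4)) * 0.3 for _ in range(4)]
x_base, x_neg = rng.normal(size=4), rng.normal(size=4)

_, base_acts = forward(layers, x_base, record=True)       # baseline run
patched = forward(layers, x_neg, patch=(2, base_acts[2])) # patched run
# Downstream of the patch, the run reproduces the baseline computation:
print(np.allclose(patched, forward(layers, x_base)))  # True
```

In a real experiment the "output" compared would be the forbidden token’s probability rather than the raw stream, and patches would target individual layers 24–27.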
There is a pronounced suppression asymmetry at decision time:
- the suppression applied to the forbidden token, measured in percentage points of probability mass, is markedly larger in successes than in failures, i.e., suppression is weaker in failures.
5. Design Implications and Diagnostic Strategies
The findings on semantic gravity wells directly inform the design of prompts and constraint systems in LLMs (Rana, 12 Jan 2026):
- Avoid Explicit Naming: Since 87.5% of failures are priming-driven, negative constraints should not mention the forbidden word. Alternatives include categorical bans, euphemisms, or positive phrasing.
- Anticipate High-Pressure Cases: the semantic pressure toward the forbidden token should be computed in advance; high-pressure prompts require stronger mitigation.
- Layer-Targeted Interventions: In principle, one can architecturally damp or override late-layer FFN activations responsible for constraint override.
- Post-Generation Filtering: For safety-critical applications, downstream filtering or reranking is necessary, as generation-time constraints alone are insufficient.
- Attention-Based Diagnostics: Monitoring the Priming Index provides a real-time warning of imminent constraint violation.
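A minimal post-generation filter of the kind recommended above might look like the following sketch (whole-word, case-insensitive matching only; a production filter would also need to catch inflections and paraphrases):

```python
import re

def violates(text, forbidden_words):
    """Post-generation check: flag outputs containing any forbidden word,
    matched as a whole word, case-insensitively."""
    pattern = r"\b(" + "|".join(map(re.escape, forbidden_words)) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def filter_or_regenerate(candidates, forbidden_words):
    """Reranking fallback: return the first candidate that passes the
    filter, or None to signal that regeneration is needed."""
    for c in candidates:
        if not violates(c, forbidden_words):
            return c
    return None

out = filter_or_regenerate(
    ["The elephant walked by.", "A large gray animal walked by."],
    forbidden_words=["elephant"],
)
print(out)  # "A large gray animal walked by."
```

Such downstream filtering is the backstop the mechanistic findings call for: since generation-time suppression can be overridden in late layers, the constraint must also be enforced on the emitted text.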
6. Unified Theoretical Perspective and Empirical Significance
Both mechanistic and geometric perspectives converge on the semantic gravity well as a critical locus of LLM behavior:
- The well represents the balance between model priors (statistical propensity) and any applied constraints (instructional or otherwise).
- Violation rates and failure mechanisms have near-deterministic relationships with quantitative precursors (semantic pressure, FFN logit contribution, Priming Index), allowing prediction and (in principle) intervention.
- The information gravity model’s use of Riemannian geometry, information mass, and potential theory yields an overarching explanation for observed failure modes, sensitivity phenomena, and diversity–coherence trade-offs in language generation.
In sum, semantic gravity wells formalize the interplay between statistical naturalness and external objectives in LLM output, both predicting the fragility of negative constraints and enabling mechanistic and geometric diagnostics of LLM output dynamics (Rana, 12 Jan 2026, Vyshnyvetska, 29 Apr 2025).