
Measuring (a Sufficient) World Model in LLMs: A Variance Decomposition Framework

Published 19 Jun 2025 in cs.CL, cs.AI, and cs.LG | (2506.16584v1)

Abstract: Understanding whether LLMs possess a world model (a structured understanding of the world that supports generalization beyond surface-level patterns) is central to assessing their reliability, especially in high-stakes applications. We propose a formal framework for evaluating whether an LLM exhibits a sufficiently robust world model, defined as producing consistent outputs across semantically equivalent prompts while distinguishing between prompts that express different intents. We introduce a new evaluation approach to measure this that decomposes model response variability into three components: variability due to user purpose, user articulation, and model instability. An LLM with a strong world model should attribute most of the variability in its responses to changes in foundational purpose rather than superficial changes in articulation. This approach allows us to quantify how much of a model's behavior is semantically grounded rather than driven by model instability or alternative wording. We apply this framework to evaluate LLMs across diverse domains. Our results show how larger models attribute a greater share of output variability to changes in user purpose, indicating a more robust world model. This improvement is not uniform, however: larger models do not consistently outperform smaller ones across all domains, and their advantage in robustness is often modest. These findings highlight the importance of moving beyond accuracy-based benchmarks toward semantic diagnostics that more directly assess the structure and stability of a model's internal understanding of the world.

Summary

  • The paper introduces a three-way variance decomposition framework, distinguishing purpose sensitivity, articulation sensitivity, and model uncertainty in LLM responses.
  • It empirically shows that while larger models generally yield higher purpose sensitivity, performance varies significantly across different domains.
  • The approach advances LLM evaluation by quantifying consistency in mapping user intent to stable outputs, aiding risk assessment and fairness auditing.

Measuring Sufficient World Models in LLMs via Variance Decomposition

This paper presents a conceptual and empirical framework for diagnosing whether LLMs possess a "sufficient" world model, operationalized through consistency and sensitivity of outputs to user intent versus prompt articulation. The framework is instantiated as a three-way variance decomposition: Purpose Sensitivity (PS), Articulation Sensitivity (AS), and Model Uncertainty (MU). This diagnostic is empirically tested across established LLMs and domains with thorough sampling of intent-preserving paraphrases using cross-lingual translation.

Conceptual Framework

The core premise is that a genuine world model for language requires invariance of output distributions to semantically preserved changes in input phrasing, but appropriate sensitivity to changes in user intent. The standard, strong definition stipulates response distributions must be identical for prompts with the same underlying intent and divergent when intent changes. Recognizing practical difficulties in exhaustively defining and evaluating intent, the authors introduce a "sufficient" world model criterion that relies on functional evaluation: given an evaluator mapping outputs to task-relevant values, the distributions over these values should be equivalent for equivalent intents.
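The functional criterion above can be stated compactly. The notation below (evaluator f, response distribution M(p), intent function) is introduced here for illustration and is not necessarily the paper's:

```latex
% M(p):      the model's stochastic response to prompt p
% f:         an evaluator mapping responses to task-relevant values
% intent(p): the underlying user purpose expressed by p
% Sufficiency criterion: for all prompts p, p',
\operatorname{intent}(p) = \operatorname{intent}(p')
  \;\Longrightarrow\;
  f(M(p)) \overset{d}{=} f(M(p'))
% where \overset{d}{=} denotes equality of the induced distributions
% over evaluator values, not of the raw response strings.
```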

This approach explicitly eschews correctness-based evaluation in favor of measuring consistency and semantic generalization, which exposes both overfitting to spurious correlations (articulation sensitivity) and meaningful adaptation to user goals (purpose sensitivity).

Variance Decomposition Diagnostic

The evaluation framework decomposes observed variance in model responses over tasks into:

  • Purpose Sensitivity (PS): Variance attributable to changes in underlying user intent.
  • Articulation Sensitivity (AS): Variance due to different surface phrasings/expressions of fixed intent.
  • Model Uncertainty (MU): Residual variance not explained by intent or phrasing, encompassing sampling stochasticity and epistemic/aleatoric uncertainty.

These components are calculated via nested R² coefficients over a standardized, evaluator-processed response variable. For discrete outputs, a parallel information-theoretic decomposition using mutual information and entropy is provided.
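One plausible way to compute the nested-R² decomposition, assuming each evaluator value is labeled by its underlying purpose and by the paraphrase used (the grouping scheme and function below are an illustrative sketch, not the paper's code):

```python
import numpy as np

def variance_decomposition(y, purpose, articulation):
    """Split the variance of evaluator outputs y into (PS, AS, MU).

    y            : 1-D array of standardized evaluator values
    purpose      : intent label for each observation
    articulation : paraphrase label, nested within purpose

    PS is the variance explained by purpose-level means, AS is the
    extra variance explained by per-paraphrase cell means, and MU is
    the unexplained remainder (so PS + AS + MU = 1).
    """
    y = np.asarray(y, dtype=float)
    purpose = np.asarray(purpose)
    articulation = np.asarray(articulation)
    total = np.var(y)

    # R^2 of the coarse, purpose-level grouping
    y_p = np.empty_like(y)
    for p in np.unique(purpose):
        mask = purpose == p
        y_p[mask] = y[mask].mean()
    r2_p = np.var(y_p) / total

    # R^2 of the finer (purpose, articulation) grouping; refining the
    # partition can only increase between-group variance, so AS >= 0.
    cells = np.stack([purpose, articulation], axis=1)
    y_pa = np.empty_like(y)
    for cell in np.unique(cells, axis=0):
        mask = np.all(cells == cell, axis=1)
        y_pa[mask] = y[mask].mean()
    r2_pa = np.var(y_pa) / total

    return r2_p, r2_pa - r2_p, 1.0 - r2_pa
```

On synthetic data where intent shifts dominate (e.g., a large purpose effect, a small paraphrase effect, and sampling noise), PS should capture most of the variance, mirroring what the paper expects of a robust world model.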

The framework's operational strength is the Meaningful Variability Share (MVS), defined as MVS = PS / (PS + AS), quantifying the fraction of explainable variation due to actual intent rather than spurious prompt cues.
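For discrete evaluator outputs, the information-theoretic variant and MVS might look like the sketch below. This uses entropy normalized by H(Y), which is one natural reading of the mutual-information decomposition; the paper's exact normalization may differ:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a sequence of discrete labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(y, groups):
    """H(Y | G) = sum_g p(g) * H(Y | G = g)."""
    y, groups = np.asarray(y), np.asarray(groups)
    return sum(
        (groups == g).mean() * entropy(y[groups == g])
        for g in np.unique(groups)
    )

def discrete_decomposition(y, purpose, articulation):
    """PS = I(Y; purpose)/H(Y); AS is the extra information from the
    paraphrase; MU is the residual conditional entropy share."""
    h_y = entropy(y)
    cells = [f"{p}|{a}" for p, a in zip(purpose, articulation)]
    ps = (h_y - conditional_entropy(y, purpose)) / h_y
    pa = (h_y - conditional_entropy(y, cells)) / h_y
    return ps, pa - ps, 1.0 - pa

def mvs(ps, as_):
    """Meaningful Variability Share: fraction of explainable
    variation driven by intent rather than phrasing."""
    return ps / (ps + as_)
```

When the output is fully determined by intent, PS = 1 and MVS = 1; articulation-driven answer flips would shift mass from PS to AS and pull MVS down.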

Prompt Generation and Experimental Setup

To rigorously sample prompt variation for fixed user intent, the authors employ chain-based cross-lingual translation: base prompts are translated to and from typologically diverse languages using LLMs, yielding paraphrases with high surface variation and preserved semantics. LLM-based intent matching filters ensure paraphrase fidelity. Diverse prompts are further selected via sentence embedding space coverage.
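The paper does not spell out the embedding-coverage selection algorithm; one common heuristic consistent with the description is greedy farthest-point selection over sentence embeddings, sketched here (the function name and starting rule are our choices):

```python
import numpy as np

def select_diverse(embeddings, k):
    """Greedily pick k paraphrases whose embeddings spread out over
    the embedding space (max-min distance / farthest-point heuristic)."""
    X = np.asarray(embeddings, dtype=float)
    # Seed with the point farthest from the centroid.
    chosen = [int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
    while len(chosen) < k:
        # Distance from each candidate to its nearest chosen point.
        d = np.min(
            np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=-1),
            axis=1,
        )
        d[chosen] = -np.inf  # never re-pick a selected paraphrase
        chosen.append(int(np.argmax(d)))
    return chosen
```

The effect is to drop near-duplicate paraphrases (e.g., two translations that round-trip to almost the same sentence) in favor of surface forms that cover distinct regions of the space.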

For each task-intent pair, models are sampled multiple times per prompt using temperature sampling (T = 1), followed by LLM-based extraction of task-relevant values from free-form responses. This pipeline results in large sets of responses enabling precise variance estimates.
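A minimal sketch of that collection loop; `sample_model` and `extract_value` are hypothetical stand-ins for the LLM call and the LLM-based value extractor, injected as parameters:

```python
def collect_responses(prompts_by_intent, n_samples, sample_model, extract_value):
    """For each (intent, paraphrase) pair, draw n_samples responses at
    temperature 1 and reduce each to a task-relevant value.

    prompts_by_intent : dict mapping intent id -> list of paraphrases
    sample_model      : callable (prompt, temperature) -> response text
    extract_value     : callable (response text) -> task-relevant value
    """
    records = []  # (intent_id, paraphrase_id, extracted value)
    for intent_id, paraphrases in prompts_by_intent.items():
        for para_id, prompt in enumerate(paraphrases):
            for _ in range(n_samples):
                response = sample_model(prompt, temperature=1.0)
                records.append((intent_id, para_id, extract_value(response)))
    return records
```

The resulting (intent, paraphrase, value) triples are exactly the labeled observations the variance decomposition needs.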

Empirical Findings

Across five major domains (health, logistics, finance, travel, social planning) and five LLMs of varying parameter size, the main experimental findings are:

  • Larger models attribute more output variance to purpose sensitivity (i.e., are more responsive to actual intent shifts, with higher PS and MVS).
  • This effect is neither monotonic nor universal: there exist domains and tasks where smaller or mid-sized models match or outperform larger ones on world-model robustness.
  • Articulation sensitivity is not eliminated in larger models; indeed, in some cases, larger LLMs exhibit slightly increased AS, suggesting a trade-off between semantic generalization and sensitivity to surface form.
  • Model uncertainty dominates variance in open-ended tasks, reflecting both the inherent ambiguity of tasks and the LLMs' internal uncertainty.
  • Substantial domain variation: Certain domains (e.g., health/nutrition) show higher world model robustness, while others (e.g., social planning) exhibit greater sensitivity to phrasing and less alignment with intent shifts.

Tabulated results (see Tables 1–4 in the appendix) and decomposition figures provide detailed breakdowns per task, model, and domain. For instance, in health/nutrition and logistics tasks, the 70B LLaMA model consistently demonstrates high PS and MVS, but in personal finance and social planning, its advantage is modest or absent.

Contrary to some expectations, model scale alone does not guarantee robust semantic generalization, and in several settings, smaller or alternative LLMs rival the largest models in PS and MVS.

Implications and Limitations

This diagnostic method advances evaluation methodology for LLMs by shifting emphasis from benchmark accuracy to structural analysis of model generalization and intent alignment. It provides immediate utility for practitioners:

  • Risk assessment for real-world deployment: high AS flags systems whose outputs shift with a user's dialect, formality, or education level, exposing them to disparate performance across user groups.
  • Fairness auditing: Quantitative tracking of AS over user groups enables targeted interventions for equity.
  • Model selection and ablation: The variance decomposition supports fine-grained ablation and model selection without reliance on static or easily saturated benchmarks.

For fundamental research, the findings challenge simplistic scaling narratives and incentivize the development of architectures and training schemes that explicitly separate semantic and superficial features. Future directions could address more complex forms of intent and extend analysis beyond mean and variance (e.g., full distributional matching or higher moments).

Methodological limitations include the reliance on extraction of numeric values (mitigated by proposed categorical/entropy-based variants), evaluator dependence, and a focus on first- and second-moment statistics.

Outlook

This work lays methodological foundations for more principled, user-aligned LLM evaluation. It highlights that model development and deployment must explicitly consider not only what models know, but how reliably they map diverse naturalistic input to semantically stable output. Future research may incorporate interactive, adaptive intent identification and extend this diagnostic to broader modalities (e.g., multimodal, embodied systems) as the frontier of world modeling in AI advances.
