- The paper demonstrates that elicited belief probabilities from LLMs do not consistently align with the decision-making of rational, utility-maximizing agents.
- The authors utilize falsification-oriented tests—such as conditional independence, monotonicity, and prompt consistency—to rigorously assess belief coherence.
- Empirical results across medical diagnostic tasks reveal task- and model-dependent deviations, cautioning against direct reliance on LLM outputs in high-stakes decisions.
Rationality and Belief Coherence in LLMs: A Decision-Theoretic Analysis
Motivation and Context
The paper "Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making" (2602.06286) advances a rigorous decision-theoretic evaluation of LLMs as agents in probabilistic tasks, particularly high-stakes medical diagnosis. Existing approaches focus primarily on accuracy, calibration, and probabilistic self-consistency of LLM outputs; however, the critical issue of whether elicited probabilities from LLMs truly represent actionable internal beliefs, and are consistent with observed decisions, remains unresolved. The authors confront this question by formalizing and empirically testing the coherence between belief elicitation and decision-making, targeting conditions where no rational (Bayesian utility-maximizing) agent could possibly rationalize the joint pattern of reported probabilities and chosen actions.
Decision-Theoretic Framework and Methodology
Leveraging the axiomatic tradition in economics (von Neumann–Morgenstern expected utility, random utility models, prospect theory), the paper formalizes sufficient conditions for rational decision consistency under uncertainty. The subjective probability P_s(θ | x) is elicited agnostically, and the action space A includes diagnostic decisions as well as refusal ("defer"). The methodological core comprises several falsification-oriented tests:
- Conditional Independence (CI): After conditioning on an elicited belief, the outcome should carry no additional information predictive of the choice. Empirically, significant residual conditional mutual information (CMI) between actions and outcomes, given elicited probabilities, indicates belief–decision inconsistency.
- Monotone Pairwise Choice Probabilities: Under independence of irrelevant alternatives (IIA), the relative odds of choosing one action over another must vary monotonically with the subjective probability. Empirical monotonicity violations for action pairs quantify departures from rationality.
- Prompt Consistency: Stability of elicited beliefs across varied prompts (different loss functions, reasoning instructions, Bayesian reasoning) tests the robustness of beliefs as internal states rather than linguistic outputs shaped by extrinsic prompt cues.
- Internal Probabilistic Consistency: Adherence to the law of iterated expectation is tested to probe the coherence and internal validity of probability distributions across auxiliary variables.
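The paper's exact estimators are not reproduced here, but the conditional-independence test can be sketched as follows, assuming a plug-in CMI estimator over discretized belief bins and a within-bin permutation null (all function names and parameters are hypothetical, not the authors' implementation):

```python
import numpy as np

def conditional_mutual_information(a, y, b_bins):
    """Plug-in estimate of I(A; Y | B) from discrete samples."""
    cmi = 0.0
    for b in np.unique(b_bins):
        mask = b_bins == b
        p_b = mask.mean()
        sub_a, sub_y = a[mask], y[mask]
        for av in np.unique(sub_a):
            for yv in np.unique(sub_y):
                p_ay = np.mean((sub_a == av) & (sub_y == yv))
                if p_ay > 0:
                    p_a = np.mean(sub_a == av)
                    p_y = np.mean(sub_y == yv)
                    cmi += p_b * p_ay * np.log(p_ay / (p_a * p_y))
    return cmi

def permutation_pvalue(actions, outcomes, beliefs, n_bins=5, n_perm=200, seed=0):
    """Permute outcomes within belief bins: under conditional independence,
    the observed CMI should not exceed the permutation-null distribution."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(0, 1, n_bins + 1)[1:-1]
    bins = np.digitize(beliefs, edges)
    observed = conditional_mutual_information(actions, outcomes, bins)
    null = []
    for _ in range(n_perm):
        shuffled = outcomes.copy()
        for b in np.unique(bins):
            idx = np.flatnonzero(bins == b)
            shuffled[idx] = rng.permutation(shuffled[idx])
        null.append(conditional_mutual_information(actions, shuffled, bins))
    return observed, float(np.mean(np.array(null) >= observed))
```

A strictly positive observed CMI with a small permutation p-value is evidence that the elicited belief is not decision-sufficient, which is the "leakage" the paper reports.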
The entire evaluation is black-box, relying solely on output-level behavior rather than mechanistic interpretability of internal representations.
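The monotonicity test for a single Yes/No action pair can likewise be sketched at the output level: bin the elicited probabilities and count adjacent-bin decreases in the empirical choice rate (a minimal sketch with hypothetical names; the paper's statistical test for significance of each violation is omitted):

```python
import numpy as np

def monotonicity_violations(beliefs, choices, n_bins=10):
    """Bin elicited probabilities in [0, 1] and count adjacent-bin
    decreases in the fraction choosing 'Yes' (choices in {0, 1})."""
    edges = np.linspace(0, 1, n_bins + 1)
    bins = np.digitize(beliefs, edges[1:-1])  # values in 0..n_bins-1
    rates = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rates.append(choices[mask].mean())
    rates = np.asarray(rates)
    # a rational agent's choice rate should be nondecreasing in belief
    return int(np.sum(np.diff(rates) < 0)), rates
```

Extending this to pairs involving the "defer" action is where, per the paper, violation rates spike for some models.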
Empirical Evaluation: Diagnostic Medical Tasks
Experiments span four clinical diagnosis domains (structural heart disease, diabetes, pediatric fever, infant crying), using both real-world datasets and expert-constructed Bayesian networks. Leading frontier and open-source models are considered, including GPT-5 (in "High Reasoning" and "Minimal Reasoning" variants), Deepseek R1 671B, and Llama-4 Scout 17B.
Numerical Results and Violations
- Conditional Independence: All model/dataset combinations show significant CI violations; bootstrapped CMI estimates are strictly above zero. The predictive improvement from including the outcome variable (after conditioning on the elicited probability) varies from negligible to substantial, with some models (e.g., GPT-5 High Reasoning, Deepseek R1) showing moderate leakage (10–20%), indicating that elicited beliefs are not fully decision-sufficient.
- Monotonicity: Most models display monotonic choice probabilities for Yes/No diagnostic decisions. However, monotonicity violations spike for pairs involving deferral, with Llama-4 Scout showing significant-violation rates of up to 50% on certain tasks. Heterogeneity across models and datasets is considerable.
- Prompt Consistency: For most prompts (MSE, MAE, standard), belief stability is comparable to same-prompt repetition. In contrast, Bayesian-reasoning prompts induce larger deviations, especially in Llama-4. RMSE to ground truth remains above 0.2 for all prompt variants, suggesting that elicited beliefs are shaped as much by prompt formulation as by the underlying task.
- Internal Consistency: LLMs exhibit substantially larger iterated-expectation violations (measured as ALIE(x)/p(x)) than baselines, implying internal probabilistic incoherence even in models that otherwise display strong decision consistency.
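The iterated-expectation check itself is simple to state. Assuming the paper's ALIE(x)/p(x) metric denotes the absolute law-of-iterated-expectation gap normalized by p(x) (a plausible reading, not confirmed by the source), it compares a directly elicited marginal against the mixture implied by elicited conditionals:

```python
def iterated_expectation_gap(p_x, p_z, p_x_given_z):
    """Normalized law-of-iterated-expectation violation:
    | sum_z p(x|z) p(z) - p(x) | / p(x).
    A coherent probability model yields exactly zero."""
    mixture = sum(pz * pxz for pz, pxz in zip(p_z, p_x_given_z))
    return abs(mixture - p_x) / p_x
```

For example, elicited conditionals p(x|z) = [0.2, 0.6] with p(z) = [0.5, 0.5] imply a marginal of 0.4; an elicited marginal of 0.3 yields a normalized gap of 1/3.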
Implications for Theory and Practice
The study provides decisive evidence against interpreting LLM-reported probabilities as subjective beliefs that fully drive rational decision-making, at least under the standard rational utility maximization paradigm. The consistent presence of statistically significant violations in all evaluated models, even after calibration or prompt control, formally demonstrates that the behavioral outputs of LLMs are not rationalizable by classical Bayesian agents. Importantly, the degree of deviation is task- and model-dependent, and in many cases is mild enough that utility maximization remains a practical approximation.
Tension Between Theoretical Violations and Practical Utility
The paper argues that, although strong theoretical violations exist, elicited beliefs can still serve as a useful, if imperfect, lens for understanding model behavior when validated by task-specific tests. Because the robustness of belief–action coherence varies widely, tailored evaluations are necessary in safety-critical applications.
Prospects for Future AI Research
This analysis places the field on sounder footing regarding the interpretation and deployment of LLMs in decision support. It motivates further exploration into the sources and types of rationality violation—such as prompt-induced variability, model-specific inductive biases, and hidden representations. Advancements in mechanistic interpretability, cross-task reasoning stability, and uncertainty modeling are expected to bridge gaps between verbalized probabilities and actionable internal beliefs, ultimately improving trust and reliability in LLM-guided decision-making.
Conclusion
The paper introduces a comprehensive, decision-oriented framework for evaluating belief coherence in LLMs within high-stakes domains. Empirical tests reveal that LLMs' verbalized probabilities and actions systematically deviate from rational-agent models, with internal and cross-task inconsistencies persisting across diverse settings. These findings caution against treating LLM outputs as direct proxies for internal subjective probabilities, underscoring the need for rigorous evaluation and refinement. The principled approach developed here enriches the toolkit for diagnosing and improving LLM behavior, paving the way for safer deployment in consequential applications.