Incoherent Probability Judgments in Large Language Models

Published 30 Jan 2024 in cs.CL and cs.AI | (arXiv:2401.16646v2)

Abstract: Autoregressive LLMs trained for next-word prediction have demonstrated remarkable proficiency at producing coherent text. But are they equally adept at forming coherent probability judgments? We use probabilistic identities and repeated judgments to assess the coherence of probability judgments made by LLMs. Our results show that the judgments produced by these models are often incoherent, displaying human-like systematic deviations from the rules of probability theory. Moreover, when prompted to judge the same event, the mean-variance relationship of probability judgments produced by LLMs shows an inverted-U shape like that seen in humans. We propose that these deviations from rationality can be explained by linking autoregressive LLMs to implicit Bayesian inference and drawing parallels with the Bayesian Sampler model of human probability judgments.

Summary

  • The paper demonstrates that LLMs produce incoherent probability judgments, with probabilistic identities deviating systematically from their theoretical value of zero.
  • Controlled experiments on weather and political events, run at different temperature settings, expose these biases across GPT and LLaMA models.
  • Larger models show smaller deviations and lower variance, and the remaining incoherence is explained by linking autoregressive LLMs to implicit Bayesian inference, paralleling sampling-based accounts of human judgment.

Incoherent Probability Judgments in LLMs

This paper examines the coherence of probability judgments produced by autoregressive LLMs, asking whether models known for generating coherent text remain coherent when tasked with probabilistic reasoning.

Evaluating Coherence in LLMs

The study examines the connection between autoregressive LLM outputs and human probability judgments, asking whether LLMs show the same systematic deviations from probability theory that are well documented in humans.

Methods

Four LLMs (GPT-3.5-turbo, GPT-4, LLaMA-2-7b, and LLaMA-2-70b) were evaluated using probabilistic identities that should equal zero under any coherent assignment of probabilities. The LLMs were prompted to assign probabilities to event pairs related to weather and politics, with a uniform framing of the queries to ensure consistency. Two temperature settings (0 and 1) were used to explore model behavior under varying levels of stochasticity, and the variability of responses was gauged by repeating identical prompts at temperature 1.
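
To make the procedure concrete, here is a minimal sketch of an identity-based probe using one representative identity, P(A) - P(A and B) - P(A and not B), which equals zero for any coherent judge. The query_probability helper is hypothetical: the paper's pipeline prompts the four models above, while the stub below returns simulated noisy judgments so the check is runnable.

```python
import random

# Hypothetical stand-in for the real prompting pipeline: returns a
# simulated noisy probability judgment for a named event. The paper
# instead queries GPT-3.5-turbo, GPT-4, LLaMA-2-7b, and LLaMA-2-70b.
TRUE_P = {"rain": 0.30, "rain and wind": 0.20, "rain and no wind": 0.10}

def query_probability(event: str, temperature: float = 1.0) -> float:
    noise = random.gauss(0.0, 0.05 * temperature)
    return min(1.0, max(0.0, TRUE_P[event] + noise))

def identity_deviation() -> float:
    """P(A) - P(A and B) - P(A and not B); zero for coherent judgments."""
    return (query_probability("rain")
            - query_probability("rain and wind")
            - query_probability("rain and no wind"))

# Repeat the identical query at temperature 1 and average, as in the study.
# Near zero here (coherent ground truth plus zero-mean noise); the paper
# finds that LLM judgments deviate systematically instead.
deviations = [identity_deviation() for _ in range(100)]
print(sum(deviations) / len(deviations))
```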

Results

The results revealed consistent biases among LLMs, characterized by systematic deviations in the probabilistic identities comparable to those observed in human reasoning (Figure 1).

Figure 1: Bias and variability in human probability judgments as revealed by (left) probabilistic identities and (right) mean-variance relationship. Error bars are 95% CI.

The LLMs' probabilistic identities did not equal zero, indicating incoherence. These deviations tracked the imbalance of positive and negative terms within each identity, mirroring patterns seen in humans (Figure 2).

Figure 2: Probabilistic identities based on LLM responses. For coherent judgments, identities should be zero.

Mean-variance relationships in repeated probability judgments also exhibited an inverted-U shape. Model variants with more parameters showed smaller deviations and lower variance, suggesting that greater scale improves coherence without fully achieving it (Figure 3).

Figure 3: The relationship between mean and variance in repeated probability judgments shows an inverted-U shape.
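
The inverted-U is what sample-based accounts predict: a judgment formed by averaging n internal Bernoulli(p) samples has variance p(1 - p)/n across repetitions, which peaks at p = 0.5 and vanishes near 0 and 1. A minimal simulation (an illustration, not the paper's code):

```python
import random

def sampled_judgment(p: float, n: int = 10) -> float:
    """A judgment formed by averaging n internal Bernoulli(p) samples."""
    return sum(random.random() < p for _ in range(n)) / n

# Variance across repeated judgments approximates p * (1 - p) / n,
# an inverted-U that peaks at p = 0.5.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    reps = [sampled_judgment(p) for _ in range(10_000)]
    mean = sum(reps) / len(reps)
    var = sum((x - mean) ** 2 for x in reps) / len(reps)
    print(f"true p={p:.1f}  mean={mean:.3f}  variance={var:.4f}")
```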

Theoretical Implications

Human-Like Deviations

The study relates the incoherent judgments of LLMs to patterns in human probabilistic reasoning, as captured by the Probability Theory plus Noise (PT+N) model and the Bayesian Sampler model. Both models assume judgments are constructed from a limited number of internal samples of the event, attributing deviations from coherence to noise in reading out those samples (PT+N) or to the regularizing influence of a prior (Bayesian Sampler).
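
Concretely, the Bayesian Sampler (Zhu, Sanborn, and Chater, 2020) posits that a judge draws N internal samples of the event and reports the posterior mean under a symmetric Beta(β, β) prior, giving an expected judgment of (Np + β)/(N + 2β), regressed toward 0.5. A minimal sketch of that generative story, with illustrative rather than fitted parameter values:

```python
import random

def bayesian_sampler(p: float, n: int = 10, beta: float = 1.0) -> float:
    """Count successes among n internal Bernoulli(p) samples, then report
    the posterior mean under a symmetric Beta(beta, beta) prior."""
    successes = sum(random.random() < p for _ in range(n))
    return (successes + beta) / (n + 2 * beta)

# The prior pulls judgments toward 0.5; this conservatism is what makes
# probabilistic identities miss zero.
for p in (0.05, 0.50, 0.95):
    mean = sum(bayesian_sampler(p) for _ in range(10_000)) / 10_000
    print(f"true p={p:.2f}  mean judgment={mean:.3f}")
```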

Bayesian Framework for LLMs

The paper proposes a Bayesian interpretation of LLM judgments: autoregressive processes can be linked to implicit Bayesian inference, where the LLM’s conditional probability predictions are viewed as Bayesian updates on prior distributions. The paper discusses how autoregressive training aligns with Bayesian mechanisms, suggesting that LLMs' deviations have parallels to Bayesian sampling in humans.
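
One way to state the link, consistent with the implicit-Bayesian-inference account of in-context learning that the paper builds on: for exchangeable data, autoregressive prediction is equivalent to averaging predictions over a posterior on a latent parameter θ. This is a sketch of the standard de Finetti-style identity, not the paper's exact notation:

```latex
% Next-token prediction as implicit Bayesian inference over a latent \theta:
p(x_{t+1} \mid x_{1:t})
  = \int p(x_{t+1} \mid \theta)\, p(\theta \mid x_{1:t})\, d\theta,
\qquad
p(\theta \mid x_{1:t}) \propto p(\theta) \prod_{i=1}^{t} p(x_i \mid \theta)
```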

Future Directions

The research suggests a strategy for improving the reliability of AI probability outputs: correcting incoherence directly, rather than only calibrating against observed frequencies. The relationship between coherence and accuracy in boundedly rational agents supports this recalibration approach, opening avenues for making probabilistic outputs from LLMs more dependable in practical applications.
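
As a toy illustration of coherence-based adjustment (not a method proposed in the paper), the simplest case projects a pair of complementary judgments back onto the constraint P(A) + P(not A) = 1:

```python
def coherentize_complement(p_a: float, p_not_a: float) -> tuple[float, float]:
    """Project (P(A), P(not A)) onto P(A) + P(not A) = 1 by splitting
    the incoherence residual equally (the least-squares correction)."""
    adjust = (1.0 - (p_a + p_not_a)) / 2.0
    clip = lambda x: min(1.0, max(0.0, x))  # keep results in [0, 1]
    return clip(p_a + adjust), clip(p_not_a + adjust)

print(coherentize_complement(0.65, 0.45))  # ≈ (0.60, 0.40)
```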

Conclusion

Examining the coherence of LLMs' probability judgments reveals systematic biases that resemble human cognitive biases and that plausibly arise from the autoregressive training process itself. This account bridges neural network methods and Bayesian models of reasoning, contributing to a broader understanding of both human and artificial cognition.
