
Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina

Published 25 Oct 2024 in econ.GN, cs.AI, cs.CY, cs.HC, and q-fin.EC | arXiv:2410.19599v3

Abstract: Recent studies suggest LLMs can exhibit human-like reasoning, aligning with human behavior in economic experiments, surveys, and political discourse. This has led many to propose that LLMs can be used as surrogates or simulations for humans in social science research. However, LLMs differ fundamentally from humans: they rely on probabilistic patterns and lack the embodied experiences and survival objectives that shape human cognition. We assess the reasoning depth of LLMs using the 11-20 money request game. Across many models, nearly all advanced prompting approaches fail to replicate human behavior distributions. The causes of failure are diverse and unpredictable, relating to input language, assigned roles, and safeguarding. These results advise caution when using LLMs to study human behavior or as surrogates or simulations for humans.
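The 11-20 money request game used in the paper (introduced by Arad and Rubinstein, 2012) has simple rules: each of two players requests an amount between 11 and 20 shekels and receives that amount, plus a bonus of 20 if their request is exactly one less than the opponent's. This structure makes iterated "level-k" reasoning easy to read off, as the following minimal sketch illustrates (the function names are ours, not from the paper):

```python
def payoff(my_request, other_request):
    """Payoff in the 11-20 money request game: you receive the amount
    you request (11-20), plus a bonus of 20 if your request is exactly
    one less than the other player's request."""
    bonus = 20 if my_request == other_request - 1 else 0
    return my_request + bonus

def best_response(other_request):
    """The request that maximizes payoff against a fixed opponent request."""
    return max(range(11, 21), key=lambda r: payoff(r, other_request))

# Level-k hierarchy: level-0 plays the salient, risk-free choice of 20;
# each level-k player best responds to a level-(k-1) opponent.
action = 20
for k in range(1, 4):
    action = best_response(action)  # level-1 -> 19, level-2 -> 18, level-3 -> 17
```

Because each level of reasoning undercuts the previous one by exactly one shekel, a subject's (or model's) chosen request directly reveals its depth of strategic thinking, which is why the paper uses the distribution of requests to compare LLMs against human play.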
