Do Large Language Models Reason Causally Like Us? Even Better?

Published 14 Feb 2025 in cs.AI and cs.LG | (2502.10215v2)

Abstract: Causal reasoning is a core component of intelligence. LLMs have shown impressive capabilities in generating human-like text, raising questions about whether their responses reflect true understanding or statistical patterns. We compared causal reasoning in humans and four LLMs using tasks based on collider graphs, rating the likelihood of a query variable occurring given evidence from other variables. LLMs' causal inferences ranged from often nonsensical (GPT-3.5) to human-like to often more normatively aligned than those of humans (GPT-4o, Gemini-Pro, and Claude). Computational model fitting showed that one reason for GPT-4o, Gemini-Pro, and Claude's superior performance is they didn't exhibit the "associative bias" that plagues human causal reasoning. Nevertheless, even these LLMs did not fully capture subtler reasoning patterns associated with collider graphs, such as "explaining away".


Summary

The paper shows that some LLMs reason about collider causal structures at a level comparable to humans, and in several cases closer to normative standards than humans.
It compares LLM and human responses across predictive, independence, and explaining-away inferences, using Spearman correlations to quantify alignment.
The study reveals that LLM performance varies by model and by domain, which points to the influence of training data on causal reasoning.

Do LLMs Reason Causally Like Us? Even Better?

Introduction

The research examines the causal reasoning abilities of LLMs and how they compare with human reasoning about causal structures. Causal reasoning involves understanding relationships between variables beyond mere correlation, and it is essential for tasks such as policy recommendation and disease diagnosis.

Methodology

Participants and Models

The study compared human data from Rehder (2017), collected from 48 undergraduate participants, with outputs from four LLMs: GPT-3.5, GPT-4o, Claude-3 Opus, and Gemini-Pro. The models were prompted with the inference tasks at five temperature settings, although only the results at a temperature of 0.0 are reported in detail, for consistency.

Materials and Procedure

A collider causal structure ($C_1 \rightarrow E \leftarrow C_2$) was used within three domains: meteorology, economics, and sociology. Each domain presented predictive and diagnostic inference tasks, which assessed three phenomena: predictive inference, independence of causes, and explaining away. Figure 1 illustrates these tasks.

Figure 1: Visualization of Causal Mechanism per Domain. The leftmost graph represents task X from the diagnostic inference group.
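
For reference, the collider structure constrains the joint distribution as sketched below. The factorization and the independence of the two causes follow directly from the graph. The noisy-OR form of $P(E \mid C_1, C_2)$, with a shared causal strength $m$ and background strength $b$, is an assumed parameterization chosen for illustration; this summary does not state which functional form the paper used. Under that assumption the explaining-away inequality in the last line holds.

```latex
\begin{align}
P(C_1, C_2, E) &= P(C_1)\, P(C_2)\, P(E \mid C_1, C_2) \\
P(C_1 \mid C_2) &= P(C_1) \\
P(E = 1 \mid C_1, C_2) &= 1 - (1 - b)(1 - m)^{C_1 + C_2} \\
P(C_1 = 1 \mid E = 1, C_2 = 1) &< P(C_1 = 1 \mid E = 1)
\end{align}
```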

Results

The analysis revealed distinct reasoning patterns across the four LLMs and humans. Spearman correlations between model and human inferences showed that alignment varied by model, with Claude and GPT-4o tracking human response patterns most closely.
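
As a rough illustration of the alignment measure, the sketch below computes a Spearman rank correlation between two lists of likelihood ratings in plain Python. The rating values are invented for the example and are not data from the study; the function names are likewise hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented ratings on the 0-100 scale, one entry per inference task.
human_ratings = [80, 60, 55, 30, 20]
model_ratings = [95, 70, 40, 50, 5]
rho = spearman(human_ratings, model_ratings)
print(f"Spearman rho = {rho:.2f}")
```

Because the measure depends only on rank order, a model that uses more extreme numbers than humans can still correlate highly with them as long as it orders the tasks similarly.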

Inference Patterns

Predictive Inference: Similar to humans, LLMs recognized that causes increase the likelihood of effects, showcasing fundamental causal reasoning (Figure 2).

Figure 2: Reference Graph (task II).

Independence of Causes: Claude exhibited minimal independence violations, whereas GPT-3.5 and Gemini-Pro violated independence more strongly, indicating greater associative bias.
Explaining Away: GPT-4o showed the strongest explaining away, while Gemini-Pro and GPT-3.5 showed none, indicating an associative rather than causal understanding.

Performance metrics further showed that LLM responses varied more across domains than human responses did, suggesting that domain knowledge influences LLM reasoning.

Figure 3: The parameter values from the 4-parameter causal Bayes net fits. Error bars are standard deviations.
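
To make the idea of fitting a causal Bayes net to ratings concrete, here is a minimal sketch. It is not the paper's procedure: the caption mentions a 4-parameter model whose parameters are not listed in this summary, so the sketch assumes a simpler 3-parameter noisy-OR collider (cause base rate, causal strength, background strength) and fits it by brute-force grid search on squared error. The "observed" ratings are generated from the model itself rather than taken from the study.

```python
from itertools import product


def noisy_or_predictions(base_rate, strength, background):
    """Model-implied ratings (0-100) for five collider queries.

    Queries: P(E=1 | 0, 1, 2 causes present), then
    P(C1=1 | E=1, C2=0) and P(C1=1 | E=1, C2=1).
    The noisy-OR form is an assumption made for this sketch.
    """
    def p_effect(n_causes):
        return 1.0 - (1.0 - background) * (1.0 - strength) ** n_causes

    preds = [p_effect(0), p_effect(1), p_effect(2)]
    for c2 in (0, 1):
        num = base_rate * p_effect(1 + c2)
        den = num + (1.0 - base_rate) * p_effect(c2)
        preds.append(num / den)
    return [100.0 * p for p in preds]


def fit_grid(ratings, step=0.05):
    """Return (sse, params) minimizing squared error over a parameter grid."""
    grid = [round(i * step, 10) for i in range(1, int(round(1 / step)))]
    best = None
    for params in product(grid, repeat=3):
        preds = noisy_or_predictions(*params)
        sse = sum((r - p) ** 2 for r, p in zip(ratings, preds))
        if best is None or sse < best[0]:
            best = (sse, params)
    return best


# Synthetic ratings generated from known parameters, then recovered by the fit.
observed = noisy_or_predictions(0.5, 0.8, 0.2)
best_sse, best_params = fit_grid(observed)
print(best_params, best_sse)
```

A grid search is used only to keep the example dependency-free and deterministic; a gradient-based or simplex optimizer would be the more usual choice for real data.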

Discussion

This study indicates that LLMs achieve causal reasoning comparable to human capabilities, with variations attributable to domain knowledge embedded in their training data. Different LLMs showed different propensities toward associative versus normative reasoning, reflected in their differing response ranges.

Claude and GPT-4o were notable for high correlations with normative inferences, surpassing those of humans. Conversely, Gemini-Pro and GPT-3.5 aligned less well with the normative model, highlighting limitations in their causal reasoning.

The study supports fitting causal Bayes nets to LLM inferences as a normative benchmark, which reveals the extent to which LLMs draw on domain knowledge. Additionally, the ability of some LLMs (such as GPT-4o) to avoid associative reasoning points to progress toward AI systems with stronger causal reasoning abilities.

Conclusion

The research indicates that while LLMs like GPT-4o and Claude align closely with normative causal reasoning standards, others such as Gemini-Pro still exhibit significant associative tendencies. These models may therefore support decision-making, provided that the influence of domain knowledge on their inferences is taken into account. Future research should examine more complex causal structures and analyze domain-knowledge effects in greater depth.


Explain it Like I'm 14

What this paper is about (in simple terms)

This paper asks: Do big AI chatbots think about cause and effect like people do? The authors test four well-known LLMs (GPT-3.5, GPT-4o, Claude-3 Opus, and Gemini-Pro) on the same kind of cause-and-effect problems that college students solved in a psychology study. They compare how “human-like” the AIs are, and how close they are to what the math says you should conclude in these situations.

The main questions the researchers asked

Can LLMs do basic cause-and-effect reasoning, not just repeat patterns from text?
Do they show the same kinds of mistakes people make?
Which models match the “correct” answers best (the ones predicted by a standard math model of causality)?
Do models rely on background world knowledge in ways people didn’t in this experiment?

How the study worked (with simple analogies)

The team focused on a simple cause-and-effect setup called a “collider.” Think of it like this:

Two different causes, C1 and C2, can both lead to the same effect, E.
Example: Rain (C1) and the sprinkler (C2) can both make the grass wet (E).

From this setup, four types of questions were asked:

Predictive: If we know which causes are happening, how likely is the effect?
- Example: If the sprinkler is on and it’s raining, how likely is wet grass?
Independence of causes: Before looking at the effect, the two causes shouldn’t affect each other.
- Example: Just knowing the sprinkler is on shouldn’t make rain more or less likely.
Diagnostic (effect present): If the effect is happening (the grass is wet), learning that one cause happened (sprinkler was on) should make the other cause less likely (rain). This is called “explaining away.”
- In everyday words: If you know the grass is wet and you also know the sprinkler was on, you don’t need rain to explain the wetness, so rain becomes less likely.
Diagnostic (effect absent): If the effect is not happening (grass is dry), both causes become less likely, and “explaining away” isn’t expected.
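
The four question types above can be checked numerically on a toy version of the rain/sprinkler example. This is a minimal sketch under assumptions: the effect is modeled with a noisy-OR, and the base rate, causal strength, and background strength are made-up values, not parameters from the study.

```python
from itertools import product

NAMES = ("C1", "C2", "E")  # e.g. rain, sprinkler, wet grass


def collider_joint(base_rate=0.3, strength=0.8, background=0.1):
    """Joint P(C1, C2, E) for C1 -> E <- C2 with an assumed noisy-OR effect."""
    joint = {}
    for c1, c2, e in product((0, 1), repeat=3):
        p_causes = 1.0
        for c in (c1, c2):
            p_causes *= base_rate if c else 1.0 - base_rate
        p_effect = 1.0 - (1.0 - background) * (1.0 - strength) ** (c1 + c2)
        joint[(c1, c2, e)] = p_causes * (p_effect if e else 1.0 - p_effect)
    return joint


def prob(joint, query, evidence=None):
    """P(query | evidence) by enumeration; both are dicts keyed by NAMES."""
    evidence = evidence or {}

    def matches(state, cond):
        return all(state[NAMES.index(k)] == v for k, v in cond.items())

    den = sum(p for s, p in joint.items() if matches(s, evidence))
    num = sum(p for s, p in joint.items()
              if matches(s, evidence) and matches(s, query))
    return num / den


joint = collider_joint()

# Predictive: more causes present makes the effect more likely.
p_e_both = prob(joint, {"E": 1}, {"C1": 1, "C2": 1})
p_e_one = prob(joint, {"E": 1}, {"C1": 1, "C2": 0})
p_e_none = prob(joint, {"E": 1}, {"C1": 0, "C2": 0})

# Independence: without knowing E, C2 says nothing about C1.
p_c1_given_c2 = prob(joint, {"C1": 1}, {"C2": 1})
p_c1_given_not_c2 = prob(joint, {"C1": 1}, {"C2": 0})

# Diagnostic, effect present: learning C2 happened "explains away" C1.
p_c1_e = prob(joint, {"C1": 1}, {"E": 1})
p_c1_e_c2 = prob(joint, {"C1": 1}, {"E": 1, "C2": 1})

# Diagnostic, effect absent: C1 drops below its base rate, with no explaining away.
p_c1_no_e = prob(joint, {"C1": 1}, {"E": 0})
p_c1_no_e_c2 = prob(joint, {"C1": 1}, {"E": 0, "C2": 1})

print(p_e_none, p_e_one, p_e_both)
print(p_c1_given_c2, p_c1_given_not_c2)
print(p_c1_e, p_c1_e_c2)
print(p_c1_no_e, p_c1_no_e_c2)
```

Under the noisy-OR assumption the two causes stay exactly independent when the effect is absent, which is why no explaining away shows up in the last pair of values.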

What did participants do?

Humans: In a past study, college students learned short stories from three areas (weather, economy, society) describing these cause-effect links. Then they rated how likely something was (0–100).
AIs: The same stories and questions were given to the four LLMs via their APIs. The models were asked to reply with a single number from 0 to 100. The test used temperature 0.0 (the “most consistent” setting) so the outputs didn’t vary randomly.
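
Since the models were asked to answer with a single number from 0 to 100, any replication has to turn free-text replies into ratings. The helper below is a hypothetical sketch of that step, not the authors' code: it takes the first number in a reply and rejects anything outside the scale.

```python
import re


def parse_rating(reply):
    """Return the first number in `reply` as a float in [0, 100], else None.

    Hypothetical helper: the original study's parsing rules are not
    described in this summary.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    if not 0.0 <= value <= 100.0:
        return None
    return value


examples = ["75", "Likelihood: 40.5", "I cannot say", "150"]
parsed = [parse_rating(text) for text in examples]
print(parsed)
```

Returning `None` instead of clamping keeps out-of-range or missing answers visible, so they can be counted or re-prompted rather than silently folded into the data.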

What is the “math model” they compared to?

A causal Bayes net (think: a formal calculator for cause-and-effect). It tells you the “normative” answer—what you should conclude if you follow the rules of probability exactly.
They also tried a psychology-inspired model called the “mutation sampler,” which imitates how people might take a few mental “samples” instead of doing perfect math, leading to human-like shortcuts and mistakes.
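
The sketch below illustrates how a limited number of mental "samples" could produce an associative bias. It is loosely inspired by the mutation sampler idea and rests on several assumptions that this summary does not confirm: a Metropolis-style chain that flips one variable at a time, a start from the two "prototype" states (everything absent or everything present), and made-up noisy-OR parameters. To keep the result deterministic, it computes the expected visit frequencies of the chain exactly instead of drawing random samples.

```python
from itertools import product

STATES = list(product((0, 1), repeat=3))  # (c1, c2, e)


def collider_joint(base_rate=0.5, strength=0.8, background=0.2):
    """Normative joint for C1 -> E <- C2 with an assumed noisy-OR effect."""
    joint = {}
    for c1, c2, e in STATES:
        p_causes = 1.0
        for c in (c1, c2):
            p_causes *= base_rate if c else 1.0 - base_rate
        p_effect = 1.0 - (1.0 - background) * (1.0 - strength) ** (c1 + c2)
        joint[(c1, c2, e)] = p_causes * (p_effect if e else 1.0 - p_effect)
    return joint


def flip(state, index):
    """Return a copy of `state` with one variable toggled."""
    flipped = list(state)
    flipped[index] = 1 - flipped[index]
    return tuple(flipped)


def expected_visits(joint, chain_length):
    """Average state distribution over a Metropolis chain of `chain_length` states."""
    dist = {s: 0.0 for s in STATES}
    dist[(0, 0, 0)] = 0.5  # prototype: everything absent
    dist[(1, 1, 1)] = 0.5  # prototype: everything present
    total = dict(dist)
    for _ in range(chain_length - 1):
        nxt = {s: 0.0 for s in STATES}
        for state, mass in dist.items():
            stay = mass
            for i in range(3):
                target = flip(state, i)
                # Propose each flip with probability 1/3, accept by Metropolis rule.
                move = mass / 3.0 * min(1.0, joint[target] / joint[state])
                nxt[target] += move
                stay -= move
            nxt[state] += stay
        dist = nxt
        for s in STATES:
            total[s] += dist[s]
    return {s: v / chain_length for s, v in total.items()}


def independence_gap(dist):
    """P(C1=1 | C2=1) - P(C1=1 | C2=0); zero under the normative model."""
    def cond(c2):
        num = sum(p for (a, b, _), p in dist.items() if a == 1 and b == c2)
        den = sum(p for (a, b, _), p in dist.items() if b == c2)
        return num / den
    return cond(1) - cond(0)


joint = collider_joint()
short_gap = independence_gap(expected_visits(joint, 8))
long_gap = independence_gap(expected_visits(joint, 5000))
print(f"short chain gap: {short_gap:.3f}, long chain gap: {long_gap:.3f}")
```

With only a few steps the chain stays near its starting prototypes, so the two causes look positively associated; with many steps the visit frequencies approach the normative joint and the gap shrinks toward zero.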

What they found and why it matters

Here are the main takeaways:

All models did the basics: They understood that more causes make the effect more likely (predictive reasoning worked).
Explaining away (diagnostic with effect present):
- GPT-4o and Claude showed strong “explaining away” (very close to what the math predicts).
- Gemini-Pro and GPT-3.5 didn’t explain away—they sometimes did the opposite, treating causes as if they made each other more likely.
Independence of causes (before seeing the effect, causes shouldn’t influence each other):
- Humans often break this rule a bit (a known human bias).
- Gemini-Pro and GPT-3.5 also broke it, and even more than humans.
- Claude violated it the least.
- GPT-4o showed a small violation in the opposite direction.
How close were the models to humans overall?
- Claude and GPT-4o lined up most with human answer patterns.
How close were the models to the “correct math” answers?
- GPT-4o and Claude matched the math model best (even better than the average human in this study).
- GPT-3.5 and Gemini-Pro were worse than humans by this measure.
Do models use background knowledge?
- Humans were trained to ignore their prior knowledge and just use the given story, and their answers didn’t change much across domains.
- The AIs did vary more by topic (weather vs. economy vs. society), suggesting they pulled in outside knowledge from training.
Are AI answers too extreme?
- The AIs used the full 0–100 range more than humans did. One practical reason: humans used a slider that started at 50, which nudged them toward the middle.
Human-like mistakes in AIs:
- A psychology-inspired “shortcut” model (mutation sampler) fit humans well and also fit three of the four AIs better than the pure math model. This suggests many AIs show human-like associative shortcuts—except GPT-4o, which stayed closest to the math.

Why this is important

Good news: Some AI models (GPT-4o and Claude) can reason about cause and effect in ways that are both human-like and close to the correct math.
Caution: Other models (GPT-3.5 and Gemini-Pro) showed biases like people do—sometimes even stronger—such as failing to explain away or treating independent causes as if they go together.
Big picture: As AIs help with real decisions (health, policy, safety), we need to know their strengths and biases in cause-and-effect reasoning, not just their ability to sound fluent.

What this means for the future

Testing matters: We should routinely check AIs for causal reasoning biases, because these can change by model, topic, and even prompt settings.
Better design: Insights from this work can guide training and prompting so models follow causal rules more reliably.
More to explore: This study focused on one simple network (a collider). Future work could test more complex cause-and-effect patterns, actions and interventions, and how prompting (like chain-of-thought) or temperature settings change performance.

In short: Some LLMs can reason about causes impressively well—sometimes even more “by-the-book” than people—while others still fall into common human-like traps. Knowing the difference helps us use AI more safely and wisely.
