
CauSight: Learning to Supersense for Visual Causal Discovery

Published 1 Dec 2025 in cs.CV (arXiv:2512.01827v1)

Abstract: Causal thinking enables humans to understand not just what is seen, but why it happens. To replicate this capability in modern AI systems, we introduce the task of visual causal discovery. It requires models to infer cause-and-effect relations among visual entities across diverse scenarios instead of merely perceiving their presence. To this end, we first construct the Visual Causal Graph dataset (VCG-32K), a large-scale collection of over 32,000 images annotated with entity-level causal graphs, and further develop CauSight, a novel vision-language model to perform visual causal discovery through causally aware reasoning. Our training recipe integrates three components: (1) training data curation from VCG-32K, (2) Tree-of-Causal-Thought (ToCT) for synthesizing reasoning trajectories, and (3) reinforcement learning with a designed causal reward to refine the reasoning policy. Experiments show that CauSight outperforms GPT-4.1 on visual causal discovery, achieving over a threefold performance boost (21% absolute gain). Our code, model, and dataset are fully open-sourced at project page: https://github.com/OpenCausaLab/CauSight.

Summary

  • The paper introduces CauSight, a framework that infers cause-and-effect relationships among visual entities using the annotated VCG-32K dataset.
  • It employs a Tree-of-Causal-Thought strategy with Monte Carlo Tree Search and reinforcement learning to optimize reasoning trajectories.
  • Results demonstrate a 21% recall gain over GPT-4.1 and enhanced cross-domain generalization on datasets like Objects365.


Introduction

The paper "CauSight: Learning to Supersense for Visual Causal Discovery" (2512.01827) introduces a novel framework for visual causal discovery, a task in which models must infer cause-and-effect relationships among visual entities across diverse scenarios. This task transcends traditional visual recognition, requiring models to understand not only what is present in a scene but also why events occur. To support this goal, the authors present the Visual Causal Graph dataset (VCG-32K), comprising over 32,000 images annotated with causal graphs at the entity level. The proposed model, CauSight, leverages vision-language models (VLMs) to perform causally aware reasoning and demonstrates superior performance in visual causal discovery (Figure 1).

Figure 1: A comparison between VLMs that understand (a) scene graph, which specifies spatial relations between entities; (b) causal graph, which captures causal mechanisms between entities. Genuine reasoning requires discovering causal relations.

Methodology

The CauSight framework integrates three main components to facilitate visual causal discovery:

  1. Training Data Curation: VCG-32K provides entity-level causal graph annotations derived from popular datasets like MS-COCO and Objects365, ensuring a robust grounding for causal inference.
  2. Tree-of-Causal-Thought (ToCT): This component synthesizes reasoning trajectories using a tree-based strategy where region selection, entity recognition, and causality orientation are executed in a loop. The Monte Carlo Tree Search (MCTS) algorithm supports this process by exploring multiple reasoning paths, ultimately generating high-quality trajectories.
  3. Reinforcement Learning: A specialized causal reward guides reinforcement learning, refining the model’s reasoning policy. The Group Relative Policy Optimization (GRPO) algorithm is employed for policy optimization without requiring a separate value function, enabling efficient adaptation to visual causal structures (Figure 2).

    Figure 2: The two-stage annotation pipeline of VCG-32K: bounding box refinement and causal relationship labeling.
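The loop of region selection, entity recognition, and causality orientation searched by MCTS can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the rollout reward, depth limit, and node bookkeeping are stand-ins (in ToCT, the reward would come from scoring the finished causal graph).

```python
import math
import random

# The three reasoning actions ToCT executes in a loop (per the paper);
# the search mechanics below are a generic MCTS stand-in.
ACTIONS = ("select_region", "recognize_entity", "orient_causality")
MAX_DEPTH = 6

class Node:
    def __init__(self, trajectory=()):
        self.trajectory = trajectory  # reasoning actions taken so far
        self.children = {}            # action -> child Node
        self.visits = 0
        self.value = 0.0

def ucb(parent, child, c=1.4):
    # Upper Confidence Bound: trade off exploitation vs. exploration.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent.visits) / child.visits)

def rollout_reward(trajectory):
    # Toy reward: trajectories that use all three action types score highest.
    return len(set(trajectory)) / len(ACTIONS)

def mcts(root, iterations=200):
    for _ in range(iterations):
        node, path = root, [root]
        # 1. Selection: descend through fully expanded nodes by UCB.
        while len(node.children) == len(ACTIONS) and len(node.trajectory) < MAX_DEPTH:
            node = max(node.children.values(), key=lambda ch: ucb(path[-1], ch))
            path.append(node)
        # 2. Expansion: try one untried action.
        untried = [a for a in ACTIONS if a not in node.children]
        if untried and len(node.trajectory) < MAX_DEPTH:
            action = random.choice(untried)
            node.children[action] = Node(node.trajectory + (action,))
            node = node.children[action]
            path.append(node)
        # 3. Simulation and 4. Backpropagation.
        reward = rollout_reward(node.trajectory)
        for n in path:
            n.visits += 1
            n.value += reward
    # The most-visited child of the root is the preferred first step.
    best = max(root.children.values(), key=lambda ch: ch.visits)
    return best.trajectory

random.seed(0)
best = mcts(Node())
print(best)  # a 1-tuple: the best first reasoning action found by the search
```

In the paper's pipeline, a stronger teacher model would run this kind of search and only the highest-scoring trajectories would be kept for supervised fine-tuning.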

Results

Experimental analysis showcases CauSight's significant advantages in visual causal discovery, particularly when compared to proprietary models such as GPT-4.1 and OpenAI o3. Notable improvements include a 21% absolute gain in recall over GPT-4.1, emphasizing CauSight's efficacy in constructing coherent causal graphs from visual data (Figure 3).


Figure 3: Illustration of a single synthesized reasoning trajectory. The teacher model can repeatedly execute three key actions to extend the reasoning trajectory.

Additionally, CauSight demonstrated strong cross-domain generalization on the Objects365 dataset subset, outperforming baseline models that lack explicit causal reasoning frameworks. Its ability to integrate causal prior knowledge through ToCT and the RL stage, while balancing detection and reasoning, underpins this performance.

Implications and Future Directions

CauSight represents a pivotal step towards integrating causal reasoning in VLMs, emphasizing the importance of understanding causal relationships for both practical applications like robotics and autonomous systems, and theoretical developments in AI interpretability and intelligence. Future research could explore expanding model capabilities to incorporate dynamic scenes or temporal causality, further aligning AI systems with human-like reasoning (Figure 4).

Figure 4: Model generalizability across three OOD benchmarks.

Conclusion

The paper details a comprehensive approach to advancing visual causal discovery by leveraging a novel dataset, VCG-32K, and proposing a causally aware model, CauSight. The model not only enhances the understanding of causality in visual contexts but also points toward broader implications in AI reasoning and decision-making. By grounding AI systems in causal inference, researchers can expect improved performance and generalization, setting a foundational precedent for future explorations in causal AI.


Explain it Like I'm 14

Explaining “CauSight: Learning to Supersense for Visual Causal Discovery”

What is this paper about?

This paper is about teaching AI to look at a picture and figure out not just what things are, but why things are the way they are. In other words, the AI learns cause-and-effect in images. For example, if a glass is on a stack of books that sits on a laptop, the AI should understand that pulling out the laptop could make the books fall and the glass drop. The authors call this skill visual causal discovery.

What were the researchers trying to do?

The paper has three simple goals:

  • Build a large, trustworthy dataset of images that shows objects and who affects whom (cause-and-effect links).
  • Create a new model, called CauSight, that can discover these cause-and-effect links in many different kinds of pictures.
  • Train the model in a way that helps it think through causes step by step and improve through feedback, not just copy answers.

How did they do it? (Methods in everyday language)

To reach these goals, the researchers did three main things.

  1. They built a dataset for cause-and-effect in images (VCG-32K)
  • What it is: 32,000+ images with detailed boxes around objects and arrows showing which object causes another to be in its current state (like “the table supports the vase”).
  • How they decided a cause exists: If removing object A would change object B’s state right now (for example, removing a chair would make a person no longer “sit”), then A causes B.
  • Why this matters: Most older datasets only say where things are (like “cup on table”) but not why (“table supports cup”). This new dataset captures the “why.”
  2. They taught the model to reason in steps using a “Tree-of-Causal-Thought” (ToCT)
  • Think of it like exploring a maze: at each step, the model picks a region to look at, identifies objects there, and then decides who affects whom. It builds a tree of possible reasoning paths and uses a search method to pick the best path.
  • Monte Carlo Tree Search (MCTS): This is like trying several promising paths in a game to see which one likely leads to a win. Here, “winning” means correctly figuring out cause-and-effect in the image.
  • A stronger “teacher” model generates many step-by-step examples. The student model (CauSight) learns from only the best ones, so it picks up good habits.
  3. They improved the model with practice and rewards (reinforcement learning)
  • Reinforcement learning is like training with a coach who scores your performance. The model tries, gets feedback, and improves.
  • The “causal reward” gives points for:
    • Recall: finding the true cause-and-effect connections.
    • Precision: avoiding wrong connections.
    • Format: writing the answer in the correct structure.
  • This helps the model not only think clearly but also produce clean, reliable outputs.
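The reward described above can be sketched in a few lines. The weights, the edge representation, and the group-normalization step are illustrative assumptions, not the paper's exact formulation; the normalization shows the GRPO idea of scoring a group of sampled answers relative to each other instead of learning a separate value function.

```python
def causal_reward(pred_edges, true_edges, well_formatted,
                  w_recall=0.5, w_precision=0.4, w_format=0.1):
    """Toy causal reward over directed (cause, effect) edge pairs:
    recall + precision terms plus a format bonus. Weights are illustrative."""
    pred, true = set(pred_edges), set(true_edges)
    hits = pred & true
    recall = len(hits) / len(true) if true else 0.0
    precision = len(hits) / len(pred) if pred else 0.0
    return w_recall * recall + w_precision * precision + (w_format if well_formatted else 0.0)

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled answer's reward against
    the group's mean and standard deviation (no value network needed)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid dividing by zero for identical rewards
    return [(r - mean) / std for r in rewards]

true_edges = [("table", "vase"), ("chair", "person")]
# One correct edge, one with the causal direction reversed:
pred_edges = [("table", "vase"), ("person", "chair")]
r = causal_reward(pred_edges, true_edges, well_formatted=True)
print(round(r, 2))  # 0.55: half recall, half precision, full format bonus
```

Answers that find more true edges with fewer reversed or spurious ones earn higher rewards, and the group-relative advantage pushes the policy toward them.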

Extra note on evaluation: When judging the model, they match predicted objects to real ones and check whether the arrows (who causes whom) point in the right direction. They mostly care about the structure of the connections, not the exact names of the objects.
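A common way to implement this kind of structure-level scoring is to match predicted boxes to ground-truth boxes by IoU, then compare directed edges under that mapping. The sketch below illustrates that general recipe; the greedy matching, the 0.5 threshold, and the helper names are assumptions, not the paper's exact protocol.

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); returns intersection-over-union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def match_entities(pred_boxes, gt_boxes, thr=0.5):
    """Greedily match each predicted box to the best unused ground-truth box."""
    matches, used = {}, set()
    for i, p in enumerate(pred_boxes):
        best_j, best_iou = None, thr
        for j, g in enumerate(gt_boxes):
            if j in used:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j is not None:
            matches[i] = best_j
            used.add(best_j)
    return matches

def edge_recall(pred_edges, gt_edges, matches):
    """Fraction of ground-truth directed edges recovered after matching.
    Direction matters: a reversed arrow counts as a miss."""
    mapped = {(matches[a], matches[b]) for a, b in pred_edges
              if a in matches and b in matches}
    return len(mapped & set(gt_edges)) / len(gt_edges) if gt_edges else 0.0

pred_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
gt_boxes = [(1, 1, 10, 10), (20, 20, 31, 31)]
m = match_entities(pred_boxes, gt_boxes)
print(edge_recall([(0, 1)], [(0, 1)], m))  # 1.0: the directed edge is recovered
```

Note that only the graph structure is scored: an object matched by overlap counts even if the predicted label differs, which is consistent with the paper's focus on connections rather than exact names.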

What did they find, and why is it important?

  • Big performance gains: CauSight beat powerful general-purpose AI systems (including GPT-4.1) on discovering cause-and-effect in images, with more than a threefold improvement on average in key metrics.
  • Strong generalization: It worked well not only on images it “trained on,” but also on images from a different dataset. That means it learned useful causal principles, not just memorized examples.
  • Better balance of “seeing” and “thinking”: The model kept solid object detection while becoming much better at reasoning about cause and effect. Other models were often good at one but weak at the other.
  • Step-by-step thinking helps: Training with ToCT (good examples of reasoning steps) and then reinforcement learning (practice with feedback) gave the best results, especially on new, unseen images.

This matters because understanding “why” is crucial for safe decisions in the real world. For example:

  • Robots can plan safer actions (don’t pull the laptop if it will topple a glass).
  • Self-driving cars can reason about chains of events (if this car brakes, what will that cause?).
  • AI becomes more explainable and trustworthy by showing the causal links it believes in.

What’s the bigger impact?

  • A new task for AI: Visual causal discovery gives AI a way to move beyond “what is there” to “why it matters.”
  • Public resources: The dataset (VCG-32K), the code, and the model (CauSight) are open-source, so others can build on this work.
  • Safer, smarter systems: With better causal understanding, AI can make more reliable choices, especially in areas like robotics, autonomous driving, and home assistants.
  • A path forward: The two-phase training—first learn good reasoning steps, then improve with rewards—shows a practical way to teach AI to think more like humans about cause and effect.

In short, this paper takes a major step toward AI that doesn’t just see the world—it understands how actions lead to consequences.

