Improving constraint-based discovery with robust propagation and reliable LLM priors

Published 28 Sep 2025 in cs.LG | (2509.23570v1)

Abstract: Learning causal structure from observational data is central to scientific modeling and decision-making. Constraint-based methods aim to recover conditional independence (CI) relations in a causal directed acyclic graph (DAG). Classical approaches such as PC and subsequent methods orient v-structures first and then propagate edge directions from these seeds, assuming perfect CI tests and exhaustive search of separating subsets -- assumptions often violated in practice, leading to cascading errors in the final graph. Recent work has explored using LLMs as experts, prompting sets of nodes for edge directions, and could augment edge orientation when assumptions are not met. However, such methods implicitly assume perfect experts, which is unrealistic for hallucination-prone LLMs. We propose MosaCD, a causal discovery method that propagates edges from a high-confidence set of seeds derived from both CI tests and LLM annotations. To filter hallucinations, we introduce shuffled queries that exploit LLMs' positional bias, retaining only high-confidence seeds. We then apply a novel confidence-down propagation strategy that orients the most reliable edges first, and can be integrated with any skeleton-based discovery method. Across multiple real-world graphs, MosaCD achieves higher accuracy in final graph construction than existing constraint-based methods, largely due to the improved reliability of initial seeds and robust propagation strategies.


Summary

The paper introduces MosaCD, a causal discovery method that combines LLM-based orientation seeding with traditional CI tests to build more accurate causal graphs.
It orients the most reliable edges first, which limits the cascading orientation errors typical of PC-style algorithms.
Evaluations on benchmark graphs show higher F1 scores than existing constraint-based methods, which the paper attributes to more reliable seeds and more robust propagation.


Introduction

The paper proposes MosaCD, a causal discovery method that adds LLM-based orientation seeding to constraint-based discovery. Its goals are to make the initial seed orientations more reliable and to reduce the error propagation inherent in traditional methods such as the PC algorithm. The central idea is to derive a high-confidence set of seeds from both conditional independence (CI) tests and LLM annotations, rather than relying on either source alone.

Methodology and Algorithm

MosaCD proceeds in phases. It first builds an undirected skeleton with a standard constraint-based method, then seeds orientations with high-confidence directions derived from CI tests and LLM annotations, and finally refines the graph with a confidence-down propagation strategy that orients the most reliable edges first. The high-level steps of MosaCD are:

Skeleton Search: A skeleton graph is constructed by running a series of CI tests to eliminate edges. Existing algorithms such as PC, CPC, and PC-stable can be used for this phase.
LLM-based Orientation Seeding: The key innovation of MosaCD. An LLM is asked to propose edge directions, and only high-confidence answers are retained as seeds. To filter hallucinations, the step uses shuffled queries that exploit LLMs' positional bias.
Iterative Orientation Propagation: Propagation rules are applied to orient further edges. The procedure prioritizes evidence-supported orientations, which reduces error propagation.
Least-Conflict Resolution: Each remaining undirected edge is given the direction that conflicts with the least CI evidence.
Optional Finalization: LLM votes are used to resolve any edges whose direction is still ambiguous.
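
The shuffled-query filter in the seeding step can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `ask` callable standing in for an LLM call, the wording of the answer options, the `min_agreement` threshold, and the choice to enumerate every ordering of the options instead of sampling random shuffles. The idea it shows is only the general one: a model that answers by position rather than by content changes its answer when the options are reordered, so unstable answers can be discarded.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interface: receives the two variable names and the answer
# options in the order shown to the model, returns the option text it picked.
AskFn = Callable[[str, str, Sequence[str]], str]


def seed_direction(
    u: str,
    v: str,
    ask: AskFn,
    min_agreement: float = 1.0,  # assumed threshold, not from the paper
) -> Optional[Tuple[str, str]]:
    """Return (cause, effect) if the answer is stable across option orderings."""
    options = [f"{u} causes {v}", f"{v} causes {u}", "unsure"]
    votes: Counter = Counter()
    orderings = list(permutations(options))  # all 6 orderings of 3 options
    for shown in orderings:
        votes[ask(u, v, list(shown))] += 1

    answer, count = votes.most_common(1)[0]
    if count / len(orderings) < min_agreement:
        return None  # answer depends on option position: drop the seed
    if answer == options[0]:
        return (u, v)
    if answer == options[1]:
        return (v, u)
    return None  # the model was consistently unsure


def position_biased(u: str, v: str, shown: Sequence[str]) -> str:
    """Toy stand-in for an LLM that always picks the first option."""
    return shown[0]


def content_stable(u: str, v: str, shown: Sequence[str]) -> str:
    """Toy stand-in for an LLM that answers by content, ignoring order."""
    return "smoking causes cancer"
```

With the toy models above, the position-biased one yields no seed, while the content-stable one yields the directed pair `("smoking", "cancer")`.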

According to the paper, the algorithm converges to the correct partially directed acyclic graph (PDAG) when a perfect CI oracle is available.

Theoretical Analysis

The paper argues analytically that MosaCD's orientation strategy improves on traditional methods: prioritizing non-colliders over colliders reduces the overall error rate. The argument rests on a stylized model of search space traversal in which detecting independence with CI tests is less error-prone than detecting dependence.
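
To make the notion of a non-collider orientation concrete, here is a minimal sketch of a standard rule from constraint-based discovery, written with an explicit separating-set check: if a -> b is oriented, b - c is undirected, a and c are non-adjacent, and b appears in the separating set recorded for (a, c), then b is a non-collider on that triple and the edge is oriented b -> c. This is a generic textbook-style rule and not necessarily MosaCD's exact procedure; the function name and data layout are assumptions for illustration.

```python
from typing import Dict, FrozenSet, Set, Tuple

Edge = Tuple[str, str]  # (tail, head), i.e. tail -> head


def propagate_non_colliders(
    adj: Dict[str, Set[str]],                  # skeleton adjacency
    directed: Set[Edge],                       # seed orientations
    sepsets: Dict[FrozenSet[str], Set[str]],   # separating set per non-adjacent pair
) -> Set[Edge]:
    """Orient b -> c whenever a -> b, b - c, a and c non-adjacent, b in sepset(a, c)."""
    directed = set(directed)  # work on a copy
    changed = True
    while changed:
        changed = False
        # sorted() returns a snapshot list, so adding to the set is safe
        for a, b in sorted(directed):
            for c in sorted(adj[b]):
                if c == a or c in adj[a]:
                    continue  # triple must be unshielded
                if (b, c) in directed or (c, b) in directed:
                    continue  # edge already oriented
                sep = sepsets.get(frozenset((a, c)))
                if sep is not None and b in sep:
                    # b separates a and c, so b cannot be a collider here
                    directed.add((b, c))
                    changed = True
    return directed


# Chain skeleton A - B - C - D with a single seed A -> B.
chain_adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
chain_sepsets = {
    frozenset(("A", "C")): {"B"},
    frozenset(("B", "D")): {"C"},
    frozenset(("A", "D")): {"B"},
}
chain_result = propagate_non_colliders(chain_adj, {("A", "B")}, chain_sepsets)
```

On this chain the single seed propagates along the whole path, so `chain_result` contains A -> B, B -> C, and C -> D. A wrong seed would propagate the same way, which is why seed reliability matters for this family of rules.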

Empirical Evaluation

MosaCD is evaluated on standard benchmark datasets. It achieves higher F1 scores than both traditional constraint-based methods and recent LLM-integrated approaches. The paper attributes the gains to more accurate seeding and more robust propagation.
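
For readers unfamiliar with the metric, the following is a minimal sketch of an F1 score computed over directed edges. The summary does not state the paper's exact F1 definition, so this is one common convention and an assumption: an edge counts as correct only if both endpoints and the direction match, which means a reversed edge is penalized as both a false positive and a false negative.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (tail, head), i.e. tail -> head


def directed_edge_f1(pred: Set[Edge], true: Set[Edge]) -> float:
    """F1 over directed edges; returns 0.0 when no edge is correct."""
    tp = len(pred & true)  # edges with matching endpoints and direction
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


true_edges = {("A", "B"), ("B", "C"), ("C", "D")}
pred_edges = {("A", "B"), ("C", "B"), ("C", "D")}  # B -> C is reversed
score = directed_edge_f1(pred_edges, true_edges)
```

Here two of the three predicted edges are correct, so precision and recall are both 2/3 and the F1 score is also 2/3.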

Figure 1: Accuracy of orientation seeds. Number of true and false directions discovered by MosaCD's LLM-based orientation seeding compared to standard PC procedures.

The empirical section also contains ablation studies demonstrating the robustness of MosaCD against variations in description informativeness and initialization conditions, further affirming its practical advantage.

Conclusion

The MosaCD method presents a promising enhancement to causal discovery algorithms, particularly in leveraging LLMs as informed yet cautious contributors to the orientation seeding process. By improving the reliability of initial seeds and employing a more strategic propagation process, MosaCD reduces the propagation of errors traditionally resulting from inaccuracies in CI tests. While the method requires a robust LLM and initial computational resources for orientation seeding, the improved accuracy and robustness of the resulting causal graphs offer significant advantages for scientific modeling and decision-making processes reliant on causal inference.

Future exploration could involve optimizing MosaCD's integration with more specific domain knowledge or further exploring the balance between LLM reliance and computational resource allocation, particularly in data-rich environments.
