LLM-Constructed Bayesian Networks
- LLM-constructed Bayesian networks are probabilistic graphical models generated via natural language prompts that fuse pretrained model knowledge with causal reasoning.
- They employ methods like PromptBN and ReActBN to autonomously structure networks and estimate conditional probability tables, ensuring acyclicity and accuracy.
- Applications in high-stakes domains such as financial trading demonstrate improved performance and risk management compared to traditional learning algorithms.
LLMs have emerged as both structure generators and probabilistic parameterization engines for Bayesian networks (BNs), enabling automatic or semi-automatic construction of expressive, interpretable probabilistic graphical models across a range of domains. LLM-constructed Bayesian networks leverage the factual, causal, and statistical knowledge encoded in pretrained models, operationalized via natural language prompts, to generate either the directed acyclic graph (DAG) structure, the conditional probability tables (CPTs), or both. The resulting workflows produce BNs that encode explicit, verifiable dependencies among variables, thereby combining the transparency of graphical modeling with the contextual reasoning abilities of LLMs (Zhang et al., 1 Nov 2025, Nafar et al., 21 May 2025, Kuang et al., 30 Nov 2025).
1. Structure Discovery: LLM-Driven Bayesian Network Construction
LLMs can be utilized as central agents for BN structure elicitation, either in a data-free regime utilizing their world knowledge or in data-aware settings that combine model reasoning with statistical evaluation. In the PromptBN algorithm (Zhang et al., 1 Nov 2025), the LLM receives a meta-prompt consisting of variable names, natural-language descriptions, and distributional information. The LLM outputs the graph structure in two JSON-based formats: node-centric (listing parents for each node) and edge-centric (enumerating directed edges with justifications). The agent validates acyclicity and agreement between representations, retrying if necessary.
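The validation step can be sketched as follows. This is a minimal illustration, not the paper's implementation; `has_cycle` and `representations_agree` are hypothetical helper names for checking acyclicity of the node-centric output and its agreement with the edge-centric output:

```python
from collections import defaultdict

def has_cycle(parents):
    """Detect a directed cycle in a node-centric parent-list
    representation (node -> list of parent nodes) via DFS coloring."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def visit(node):
        color[node] = GRAY
        for p in parents.get(node, []):
            if color[p] == GRAY:
                return True          # back edge: cycle found
            if color[p] == WHITE and visit(p):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in parents)

def representations_agree(parents, edges):
    """Check that the node-centric (parents) and edge-centric
    (list of (parent, child) pairs) outputs describe the same graph."""
    from_parents = {(p, c) for c, ps in parents.items() for p in ps}
    return from_parents == set(edges)
```

If either check fails, the agent would re-prompt the LLM for a corrected structure, as described above.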
ReActBN extends this by integrating observational data and structure scores such as the Bayesian Information Criterion (BIC):

$$\mathrm{BIC}(G; D) = \log P(D \mid \hat{\theta}, G) - \frac{k}{2}\log N,$$

where $k$ is the number of free parameters and $N$ the sample size. The LLM proposes and evaluates candidate graph modifications (add, remove, or flip edges), using the structure-score differential to guide iterative refinement. The LLM receives context on current scores and candidate moves, promoting interpretability via explicit reasoning steps.
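The decomposable BIC score that drives this refinement loop can be sketched for fully discrete data as follows (a simplified illustration under the standard per-family log-likelihood-minus-penalty form; the function name and data layout are assumptions, not ReActBN's actual interface):

```python
import math
from collections import Counter

def bic_score(data, parents, states):
    """Decomposable BIC of a discrete BN structure.

    data    : list of dicts mapping variable -> observed state
    parents : dict variable -> list of parent variables
    states  : dict variable -> list of possible states
    """
    n = len(data)
    score = 0.0
    for var, pa in parents.items():
        # Sufficient statistics: (parent config, child state) and parent counts.
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        cond = Counter(tuple(row[p] for p in pa) for row in data)
        # Maximized log-likelihood contribution of this family.
        for (u, x), n_ux in joint.items():
            score += n_ux * math.log(n_ux / cond[u])
        # Penalty: free parameters of this CPT, (|X|-1) per parent config.
        q = 1
        for p in pa:
            q *= len(states[p])
        k = q * (len(states[var]) - 1)
        score -= 0.5 * k * math.log(n)
    return score
```

A candidate move (add/remove/flip) would be scored by the differential `bic_score(data, modified_parents, states) - bic_score(data, parents, states)`, which only requires rescoring the affected families in practice.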
A tabular summary of key methods:
| Method | LLM Role | Data Required | Query Complexity |
|---|---|---|---|
| PromptBN | Structure generator | None | $O(1)$ |
| ReActBN | Structure + refinement | Optional | $O(1)$ per iteration |
| PairwisePrompt/BFSPrompt | Local edge suggestion | None | $O(n^2)$ / $O(n)$ |
PromptBN achieves perfect recovery of small, classical nets in the zero-data regime (e.g., SHD=0 on Asia), and ReActBN outperforms traditional learners (PC-Stable, HC) in low-data scenarios (Zhang et al., 1 Nov 2025).
2. Parameter Estimation: Populating CPTs with LLM-Extracted Probabilities
Given a fixed DAG (whether LLM- or expert-constructed), LLMs can be queried directly for probabilistic parameterization. For each node $X_i$ and parent configuration $\mathrm{pa}(X_i) = u$, natural-language prompts elicit either scalar or vector-valued probability estimates:
- EPK (Explicit Probability Knowledge) prompts separately for each entry $P(X_i = x \mid \mathrm{pa}(X_i) = u)$.
- Full Dist prompts request the full conditional distribution in one call.
- Token-probability baselines use model next-token likelihoods in multiple-choice format (Nafar et al., 21 May 2025).
Extracted LLM outputs are normalized to valid distributions:

$$\hat{P}(X_i = x \mid u) = \frac{s_x}{\sum_{x'} s_{x'}},$$

where $s_x$ is the raw score for state $x$. With empirical data, priors from the LLM and MLE estimates are combined via linear pooling,

$$P(x \mid u) = \lambda\, P_{\mathrm{LLM}}(x \mid u) + (1 - \lambda)\, P_{\mathrm{MLE}}(x \mid u),$$

or via Bayesian pseudocounts,

$$P(x \mid u) = \frac{N_{x,u} + \alpha\, P_{\mathrm{LLM}}(x \mid u)}{N_u + \alpha},$$

where $N_{x,u}$ is the count of state $x$ under parent configuration $u$, $\alpha$ controls the prior strength, and $N_u$ is the total count for $u$.
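The three combination rules above can be sketched directly (illustrative helper names; distributions are plain `{state: probability}` dicts):

```python
def normalize(scores):
    """Map raw LLM scores {state: s_x} to a valid distribution."""
    total = sum(scores.values())
    return {x: s / total for x, s in scores.items()}

def linear_pool(p_llm, p_mle, lam):
    """Linear pooling: lambda * P_LLM + (1 - lambda) * P_MLE."""
    return {x: lam * p_llm[x] + (1 - lam) * p_mle[x] for x in p_llm}

def pseudocount_posterior(counts, p_llm, alpha):
    """Bayesian pseudocounts: (N_xu + alpha * P_LLM) / (N_u + alpha),
    where counts holds the empirical N_xu for one parent configuration."""
    n_u = sum(counts.values())
    return {x: (counts.get(x, 0) + alpha * p_llm[x]) / (n_u + alpha)
            for x in p_llm}
```

Note that the pseudocount form degrades gracefully: with no data (`counts` empty) it returns the LLM prior, and as counts grow it converges to the MLE.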
Empirically, EPK with normalization significantly outperforms uniform and random baselines. The median KL divergence to ground truth matches MLE with 30 i.i.d. samples; hybrid LLM+data approaches (EDP) achieve superior parameterization even with 3–10 samples per CPT row (Nafar et al., 21 May 2025).
3. Contextual Applications and Hybrid Architectures
In operational systems, LLM-constructed BNs have been employed for high-stakes decision-making, notably in algorithmic financial trading. For the options wheel strategy, an LLM interprets the current market, technical signals, and psychological variables to construct a bespoke DAG for each trade (Kuang et al., 30 Nov 2025). The LLM’s output specifies relevant nodes and causal directions; historical data matching the current context are selected to populate CPTs. Probabilistic inference over the BN then yields posterior distributions for outcomes (e.g., profit, assignment probability, drawdown risk).
Risk metrics (expected return, Value at Risk, Sharpe ratio) are calculated directly from BN marginals. A feedback mechanism incrementally updates CPTs via Dirichlet counts and prompts the LLM to revise structure templates based on trade outcomes and observed edge-effectiveness. Over nearly two decades of out-of-sample deployment, such architectures achieve transparent, explainable decisions with superior risk-adjusted returns (Sharpe ratio 1.08 vs. 0.62; max drawdown –8.2% vs. –60% for baseline) (Kuang et al., 30 Nov 2025).
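The feedback loop's two numerical pieces can be sketched as follows. This is a simplified illustration, not the deployed system: `update_cpt_row` shows a Dirichlet-count increment for one CPT row, and `risk_metrics` computes expected return and a historical-style Value at Risk from a discrete BN marginal over profit-and-loss outcomes (the quantile convention is an assumption):

```python
def update_cpt_row(counts, outcome):
    """Incremental Dirichlet-count update after an observed outcome;
    renormalizing the counts recovers the posterior CPT row."""
    counts = dict(counts)
    counts[outcome] = counts.get(outcome, 0) + 1
    return counts

def risk_metrics(outcome_dist, quantile=0.05):
    """Expected return and VaR from a discrete marginal
    {pnl_value: probability} produced by BN inference."""
    expected = sum(v * p for v, p in outcome_dist.items())
    # VaR: smallest outcome at which cumulative probability
    # reaches the target quantile.
    cum = 0.0
    var = min(outcome_dist)
    for v in sorted(outcome_dist):
        cum += outcome_dist[v]
        if cum >= quantile:
            var = v
            break
    return expected, var
```
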
4. Prompt Engineering, Scalability, and Model Selection
Prompt design critically impacts BN quality. Concise, unambiguous variable and state descriptions are recommended, with additional elaboration required for polysemous or domain-specific nodes. In the structure phase, chain-of-thought rationale is enforced for interpretability but offers no advantage in numerical accuracy for CPT extraction (Nafar et al., 21 May 2025).
Computational scalability is shaped by LLM–workflow interaction:
- PromptBN and ReActBN both achieve $O(1)$ LLM queries per graph, a sharp reduction over $O(n^2)$ pairwise-query methods.
- However, prompt length grows with the number of variables $n$ and may reach token limits.
- Structure recovery and parameter accuracy both decline as BN size increases, consistent with increased combinatorial ambiguity and model uncertainty (Zhang et al., 1 Nov 2025).
Model choice is also domain-dependent. Among general LLMs, GPT-4o and o3-pro perform consistently, with vertical fine-tuning yielding incremental gains for specific variable semantics (Nafar et al., 21 May 2025, Zhang et al., 1 Nov 2025).
5. Empirical Performance and Evaluation Metrics
Performance is assessed on benchmark datasets spanning healthcare, engineering, finance, and synthetic nets:
- Structure accuracy: Structural Hamming Distance (SHD), Normalized Hamming Distance (NHD).
- Parameter fit: KL divergence between learned and ground-truth joint distributions, aggregated via local CPT comparisons weighted by parent marginals.
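Both metric families can be sketched compactly (illustrative implementations; the SHD convention of counting a reversed edge as one error is an assumption, and `weighted_cpt_kl` aggregates per-row KL exactly as described above):

```python
import math

def shd(edges_true, edges_learned):
    """Structural Hamming Distance: missing + extra + reversed edges,
    counting each reversal as a single error."""
    t, l = set(edges_true), set(edges_learned)
    rev = {(a, b) for (a, b) in l - t if (b, a) in t}
    missing = {(a, b) for (a, b) in t - l if (b, a) not in l}
    extra = (l - t) - rev
    return len(missing) + len(extra) + len(rev)

def weighted_cpt_kl(p_parent, cpt_true, cpt_learned):
    """KL(true || learned) per parent configuration u, weighted by
    the parent marginal P(u). CPTs map u -> {state: probability}."""
    total = 0.0
    for u, w in p_parent.items():
        total += w * sum(p * math.log(p / cpt_learned[u][x])
                         for x, p in cpt_true[u].items() if p > 0)
    return total
```
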
In zero-data settings, LLM-constructed BNs match or exceed hand-engineered models on small nets and remain competitive at moderate scale. In low-data regimes, LLM–data hybrids (ReActBN for structure; EDP for CPTs) outperform classical learning algorithms, with variance and systematic bias reduced in rare parent configurations (Zhang et al., 1 Nov 2025, Nafar et al., 21 May 2025).
A summary table of empirical findings:
| Regime | LLM method | Metric | Comparative Result |
|---|---|---|---|
| No data | PromptBN | SHD/NHD | Perfect/competitive; O(1) queries [Asia, Insurance] |
| Low data (k=30) | EPK+EDP | KL | Matches/exceeds MLE-k, lower variance |
| Data-aware | ReActBN | SHD | Superior to PC, HC, RAI-BF on benchmarks |
6. Limitations and Prospective Directions
Current approaches rely on strict parsing of LLM output (structured JSON or equivalent), incurring fragility if out-of-schema responses occur. For very large networks, both prompt length and the LLM’s capacity to reason over variables become limiting. All structure discovery methods require validation of acyclicity; large-scale generalizations may necessitate grammar-constrained generation or hybrid multi-expert approaches (Zhang et al., 1 Nov 2025).
Extensions under consideration include robust prompt-schema optimization, ensemble LLMs to reduce individual model variance, adaptive prompt truncation, and extension to dynamic or causal networks (e.g., incorporating do-calculus, continuous/mixed distributions). Downstream tasks may require application-specific error criteria beyond KL or SHD, particularly where absolute risk tolerances are enforced.
LLM-constructed Bayesian networks constitute a unifying paradigm blending expressive causal probabilistic models with advanced pretrained knowledge representations, with demonstrated utility in both classic structure discovery and parameter learning, and practical effectiveness in data-impoverished and high-stakes operational contexts (Zhang et al., 1 Nov 2025, Nafar et al., 21 May 2025, Kuang et al., 30 Nov 2025).