
The Third Pillar of Causal Analysis? A Measurement Perspective on Causal Representations

Published 23 May 2025 in cs.LG | (2505.17708v2)

Abstract: Causal reasoning and discovery, two fundamental tasks of causal analysis, often face challenges in applications due to the complexity, noisiness, and high-dimensionality of real-world data. Despite recent progress in identifying latent causal structures using causal representation learning (CRL), what makes learned representations useful for causal downstream tasks and how to evaluate them are still not well understood. In this paper, we reinterpret CRL using a measurement model framework, where the learned representations are viewed as proxy measurements of the latent causal variables. Our approach clarifies the conditions under which learned representations support downstream causal reasoning and provides a principled basis for quantitatively assessing the quality of representations using a new Test-based Measurement EXclusivity (T-MEX) score. We validate T-MEX across diverse causal inference scenarios, including numerical simulations and real-world ecological video analysis, demonstrating that the proposed framework and corresponding score effectively assess the identification of learned representations and their usefulness for causal downstream tasks.

Summary

  • The paper proposes a measurement model framework that views learned representations as proxy measurements for true latent causal variables.
  • It introduces the T-MEX score, a novel metric based on conditional independence tests, to quantify the exclusivity of learned measurements.
  • Experiments in both simulated and real-world settings validate T-MEX's effectiveness in assessing causal validity and guiding model selection.

This paper proposes a new framework for understanding and evaluating Causal Representation Learning (CRL) (2505.17708). The authors argue that while CRL aims to identify latent causal variables from observed data, existing methods lack a clear understanding of what makes learned representations useful for downstream causal tasks and how to properly evaluate them.

The central contribution is the reinterpretation of CRL through a measurement model framework. In this framework, learned representations ($\widehat{\Zb}$) are viewed as proxy measurements of the true underlying latent causal variables ($\Zb$). A measurement model $\Mcal = \langle \Zb, \widehat{\Zb}, \{h_j\}_{j=1}^{M}\rangle$ specifies that each observed measurement variable $\widehat{\Zb}_{A_j}$ is a deterministic function $h_j$ of a subset of latent causal variables, its "causal parents" $\Zb_{\text{pa}(\widehat{\Zb}_{A_j})}$. This perspective allows for a more precise characterization of when a learned representation can support causal reasoning. The paper also introduces the concept of a "causally valid" measurement model, meaning the learned measurements $\widehat{\Zb}$ can be used as a drop-in replacement for the true causal variables $\Zb$ in a specific statistical estimand $g$ (i.e., $g(\Zb) = g(\widehat{\Zb})$).
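As a toy illustration of this framing (all variables and functional forms here are invented for exposition, not taken from the paper), a measurement model with two latents on a causal chain and one exclusive measurement per latent might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Latent causal variables on a chain Z1 -> Z2.
Z1 = rng.normal(size=n)
Z2 = 0.8 * Z1 + rng.normal(size=n)

# Deterministic measurement functions h_j: each learned component
# depends on exactly one latent, so the measurements are "exclusive".
Z_hat_1 = np.tanh(Z1)       # h_1 is a function of Z1 only
Z_hat_2 = Z2**3 + 2.0 * Z2  # h_2 is a function of Z2 only

Z_hat = np.column_stack([Z_hat_1, Z_hat_2])
```

Causal validity then asks whether a downstream estimand $g$ computed from `Z_hat` matches the one computed from the true latents; invertible per-coordinate measurements like these preserve, for example, the conditional independence relations among the latents.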

Building on this framework, the paper introduces a novel evaluation metric called the Test-based Measurement EXclusivity (T-MEX) score. The T-MEX score quantifies how well the learned representations align with a hypothesized measurement model, specifically focusing on the "exclusivity" of measurements. Exclusivity refers to whether a measurement variable $\widehat{\Zb}_{A_j}$ measures only its designated causal parent(s) $\Zb_i$ and is independent of other causal variables, conditioned on its true parents.

The T-MEX score is computed as follows:

  1. Define an adjacency matrix $V \in \{0,1\}^{N \times M}$ based on the hypothesized measurement model, where $V_{ij}=1$ if latent variable $\Zb_i$ is a parent of measurement $\widehat{\Zb}_{A_j}$, and $0$ otherwise.
  2. Construct a matrix $\widehat{W} \in \{0,1\}^{N \times M}$ based on statistical tests for conditional independence. For each pair $(\Zb_i, \widehat{\Zb}_{A_j})$, test the null hypothesis $\cH_0(i, j): \widehat\Zb_{A_j}\indep \Zb_i\given \Zb_{[N]\setminus\{i\}}$. Set $\widehat{W}_{ij}=1$ if $\cH_0(i,j)$ is rejected (dependency detected), and $0$ otherwise.
  3. The T-MEX score is the Hamming distance between $V$ and $\widehat{W}$: $\text{T-MEX}(V,\widehat{W}) \coloneqq \sum_{i=1}^N\sum_{j=1}^M \mathbbm{1}(V_{ij} \neq \widehat{W}_{ij})$. A lower T-MEX score indicates better alignment with the hypothesized measurement model. The paper provides a proposition bounding the expected T-MEX score when the learned representations perfectly align with the model, highlighting its dependence on the significance level ($\alpha$) and power ($\beta$) of the conditional independence tests used.
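For intuition, here is a minimal worked example with made-up matrices (two latents, two measurements, rows indexing latents and columns indexing measurements):

```python
import numpy as np

# Hypothesized model: each measurement exclusively measures one latent,
# so the adjacency matrix is the identity.
V = np.array([[1, 0],
              [0, 1]])

# Suppose the CI tests additionally detect a dependence of the first
# measurement on the second latent (an exclusivity violation).
W_hat = np.array([[1, 0],
                  [1, 1]])

t_mex = np.sum(V != W_hat)  # Hamming distance between the matrices
print(t_mex)  # 1: one cell disagrees, i.e., one violated condition
```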

The authors critique existing CRL evaluation metrics like the coefficient of determination ($R^2$) and the Mean Correlation Coefficient (MCC). They demonstrate that $R^2$ measures predictability rather than identifiability and can be misleading when latent variables are causally related. MCC, even with permutations, fails to capture whether individual latent variables are truly disentangled in the learned representations when causal dependencies exist.
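This $R^2$ failure mode can be reproduced in a few lines: with causally related latents, an entangled representation that mixes $\Zb_1$ and $\Zb_2$ still predicts $\Zb_1$ almost perfectly (a synthetic sketch, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Causally related latents: Z2 is strongly driven by Z1.
Z1 = rng.normal(size=n)
Z2 = 2.0 * Z1 + 0.1 * rng.normal(size=n)

# An entangled "learned" representation that mixes both latents.
Z_hat = Z1 + Z2

# R^2 of linearly predicting Z1 from Z_hat is near 1, even though
# Z_hat is clearly not an exclusive measurement of Z1.
beta = np.cov(Z_hat, Z1)[0, 1] / np.var(Z_hat)
resid = Z1 - beta * Z_hat
r2 = 1.0 - np.var(resid) / np.var(Z1)
```

Because $\Zb_1$ and $\Zb_2$ are highly correlated, predictability of $\Zb_1$ says nothing about whether the representation isolates it, which is exactly the gap T-MEX is designed to close.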

The T-MEX score's effectiveness is validated through two main experiments:

  1. Numerical Simulation:
    • Setup: Five causal variables ($\Zb_1, ..., \Zb_5$) are generated from a linear structural causal model. $\Zb_4$ (treatment) and $\Zb_5$ (outcome) are observed, while $\Zb_1, \Zb_2, \Zb_3$ are latent, with $\Zb_1$ being a confounder. The task is to estimate the Average Treatment Effect (ATE) of $\Zb_4$ on $\Zb_5$ by adjusting for $\Zb_1$, which is learned as $\widehat{\Zb}_{A_1}$ from mixed observations $\Xb = f(\Zb_1, \Zb_2, \Zb_3)$. Three CRL models are compared: Model A (well-trained), Model B (insufficiently trained), and Model C (corrupted).
    • Results: Model A achieved a low T-MEX score and low ATE bias. Models B and C had high T-MEX scores and high ATE bias. $R^2$ scores were high for all models regarding $\Zb_1$ and did not correlate well with ATE bias, unlike T-MEX, which showed a strong correlation. This demonstrates T-MEX's ability to assess both identifiability and causal validity.
  2. Real-world Ecological Experiment (ISTAnt benchmark):
    • Setup: Data from the ISTAnt benchmark [cadei2024smoke] involves video recordings of ant triplets to estimate the ATE of a pathogen exposure (treatment $\Tb$) on grooming behavior (latent outcome $\Yb$, learned as $\widehat{\Yb}$). The measurement model hypothesizes that $\widehat{\Yb}$ exclusively measures $\Yb$. 2,400 different models were trained.
    • Results: Models with T-MEX=0 (indicating alignment with the exclusivity assumption that $\widehat{\Yb}$ is independent of $\Tb$ given $\Yb$) showed significantly lower ATE bias and higher classification accuracy compared to models with T-MEX=1. A Mann-Whitney U test confirmed that models with T-MEX=1 had a statistically significantly higher average absolute ATE bias. This highlights T-MEX's utility in real-world scenarios, especially as it can be computed from observational data without requiring a validation set from a randomized controlled trial (RCT) for ATE bias calculation.
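The confounded-treatment setup in the first experiment can be sketched as follows (coefficients and noise scales are illustrative, not the paper's): adjusting for the confounder recovers the true effect, while the unadjusted regression is biased by the backdoor path.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Linear SCM loosely mirroring the simulation: Z1 confounds the
# effect of the treatment Z4 on the outcome Z5.
Z1 = rng.normal(size=n)                        # confounder
Z4 = 1.5 * Z1 + rng.normal(size=n)             # treatment
Z5 = 2.0 * Z4 + 1.0 * Z1 + rng.normal(size=n)  # outcome; true ATE = 2.0

def linear_ate(treatment, outcome, adjustment):
    """OLS coefficient of the treatment after adjusting for covariates."""
    X = np.column_stack([np.ones_like(treatment), treatment, adjustment])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef[1]

ate_adjusted = linear_ate(Z4, Z5, Z1)  # backdoor adjustment for Z1
ate_naive = np.polyfit(Z4, Z5, 1)[0]   # no adjustment: biased upward
```

In the paper's setting the confounder $\Zb_1$ is not observed directly; the adjustment must use a learned $\widehat{\Zb}_{A_1}$, and T-MEX flags whether that substitution is safe.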

The paper concludes that the measurement model perspective and the T-MEX score provide a principled and practical way to evaluate CRL methods. T-MEX is agnostic to the specific conditional independence test used, allowing practitioners to choose tests suitable for their assumptions. A key advantage is its applicability even when true ATE bias is unknown, common in real-world settings lacking RCT data.
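For instance, under linear-Gaussian assumptions a simple partial-correlation test can serve as the plug-in CI test (this is a generic Fisher-z test, not the PCM test the paper uses):

```python
import numpy as np
from math import erfc

def fisher_z_ci_test(x, y, z_cond):
    """p-value for H0: x _||_ y | z_cond via partial correlation.
    Only valid under (approximately) linear-Gaussian data."""
    n = len(x)
    Z = np.column_stack([np.ones(n), z_cond])
    # Residualize x and y on the conditioning set.
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    # Fisher z-transform; s = number of conditioning variables.
    s = z_cond.shape[1] if z_cond.ndim > 1 else 1
    z = np.arctanh(r) * np.sqrt(n - s - 3)
    return erfc(abs(z) / np.sqrt(2))  # two-sided normal p-value
```

Any function with this signature can be passed as the CI test in a T-MEX implementation; kernel-based tests are a drop-in alternative when linearity is implausible.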

Practical Implications:

  • Principled Evaluation: T-MEX offers a more reliable metric than $R^2$ or MCC for assessing whether learned representations correctly identify and isolate specific latent causal factors, especially when these factors are causally related.
  • Model Selection and Debugging: Developers can use T-MEX to select among different CRL models or hyperparameter settings, preferring those with lower T-MEX scores for a given hypothesized measurement model. It can also help diagnose why a CRL model might be failing in downstream causal tasks (e.g., by identifying which exclusivity conditions are violated).
  • Downstream Task Suitability: Before deploying a CRL model for a specific causal inference task (like ATE estimation), T-MEX can provide an upfront assessment of whether the learned representations are "causally valid" for that task, particularly regarding confounding or mediation.
  • Guidance for CRL Algorithm Design: The measurement model framework itself can guide the design of new CRL algorithms by making explicit the desired relationship (measurement functions and exclusivity) between latent variables and their representations.
  • Conditional Independence Testing: The choice of conditional independence test (e.g., the PCM test used in the paper, kernel-based tests, or mutual information-based tests) is crucial. Practitioners need to select tests appropriate for their data types and computational resources. The paper uses the PCM test [lundborg2024projected].
    # Pseudocode for computing T-MEX
    # Inputs:
    #   Z_samples: samples of true latent causal variables (N_samples x N_latents)
    #   Z_hat_samples: samples of learned representations (N_samples x M_representations)
    #   V_adj_matrix: hypothesized adjacency matrix (N_latents x M_representations)
    #   ci_test_function: returns a p-value for H0: x _||_ y | z_cond
    #   alpha: significance level for the CI tests
    import numpy as np

    def compute_t_mex(Z_samples, Z_hat_samples, V_adj_matrix, ci_test_function, alpha):
        N_latents = Z_samples.shape[1]
        M_representations = Z_hat_samples.shape[1]
        W_hat_adj_matrix = np.zeros((N_latents, M_representations), dtype=int)

        for j in range(M_representations):  # for each learned representation component
            Z_hat_j = Z_hat_samples[:, j]
            for i in range(N_latents):  # for each true latent causal variable
                Z_i = Z_samples[:, i]
                # Conditioning set: all other true latents Z_k with k != i
                Z_cond = np.delete(Z_samples, i, axis=1)

                # H0: Z_hat_j _||_ Z_i | Z_cond; reject (p < alpha) => dependency
                p_value = ci_test_function(Z_hat_j, Z_i, Z_cond)
                W_hat_adj_matrix[i, j] = int(p_value < alpha)

        # T-MEX: Hamming distance between hypothesized and tested adjacency
        t_mex_score = int(np.sum(V_adj_matrix != W_hat_adj_matrix))
        return t_mex_score
  • Computational Considerations: Computing T-MEX involves multiple conditional independence tests. The computational cost will depend on the number of latent variables, representation dimensions, sample size, and the complexity of the chosen CI test. For high-dimensional data, efficient CI tests are necessary. The paper notes multiple testing adjustments (e.g., Bonferroni-Holm) might be needed if tests are run on the same dataset.
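The Bonferroni-Holm adjustment mentioned above is straightforward to implement (a generic step-down sketch; `statsmodels.stats.multitest.multipletests` with `method="holm"` provides the same correction):

```python
import numpy as np

def holm_reject(p_values, alpha=0.05):
    """Bonferroni-Holm step-down procedure: returns a boolean array
    marking which null hypotheses are rejected at family-wise level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k).
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject
```

Applied to the $N \times M$ grid of CI-test p-values, this controls the family-wise error of the dependencies entered into $\widehat{W}$.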

In essence, this work provides a clearer theoretical grounding and a practical tool for assessing the quality and utility of representations learned by CRL methods, aiming to establish CRL as a more robust "third pillar" of causal analysis alongside causal reasoning and discovery.
