Causal Models with Design
- Causal models with design are formal structural causal models augmented with explicit study design features such as sampling, selection, and measurement processes.
- They employ augmented graphical models and structural equations to transparently communicate assumptions and perform identifiability analysis in complex empirical studies.
- This framework enables unbiased estimation of causal effects in various settings, including case-control studies, trials with noncompliance, and surveys with missing data.
Causal models with design formally integrate structural causal models (SCMs) with an explicit representation of study design, treatment assignment, missing data mechanisms, and measurement processes. The framework represents both the underlying causal structure and the operational details of data collection, which supports systematic communication of assumptions, identifiability analysis, and unbiased estimation of causal effects from potentially incomplete or non-standard empirical studies. It is distinguished by its use of augmented causal DAGs and structural assignments that capture both the substantive and the design-induced features of empirical data, so that causal calculus can be applied directly even in complex settings such as case-control studies, trials with noncompliance, or surveys with structured missingness (Karvanen, 2012, Karvanen, 2014).
1. Formal Foundations: Graphical Representation and Structural Equations
A causal model with design is a nonparametric SCM augmented by population nodes, selection nodes, and data nodes corresponding to specific sampling and measurement mechanisms. Let
- $U = \{U_1, \dots, U_m\}$ denote independent exogenous variables;
- $X = \{X_1, \dots, X_n\}$ the endogenous (causal) variables of substantive interest;
- $S = \{S_1, \dots, S_n\}$ binary selection (e.g., sampling or missingness indicator) nodes;
- $X^* = \{X_1^*, \dots, X_n^*\}$ measurement or data nodes, where each $X_j^*$ equals $X_j$ when $S_j = 1$, NA otherwise.
The full SCM is specified as $M = (U, X, S, X^*, F)$, where for each $X_j$ there is a structural equation $X_j = f_j(\mathrm{pa}(X_j), U_j)$, and for each selector $S_j$, $S_j = g_j(\mathrm{pa}(S_j), U_{S_j})$, with $S_j \in \{0, 1\}$ (Karvanen, 2012). The observed data are determined via structural assignments at the data stage: $X_j^* = X_j$ if $S_j = 1$, otherwise NA.
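The structural assignments above can be simulated directly. The following is a minimal sketch, assuming a toy model with one exposure, one outcome, and a single selection node that depends on the outcome. The function name `simulate_design_scm`, the coefficients, and the logistic selection rule are illustrative assumptions and are not taken from Karvanen (2012).

```python
import numpy as np

def simulate_design_scm(n, seed=0):
    """Simulate a toy SCM with a selection node and data nodes (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Exogenous variables: mutually independent noise terms.
    u_x = rng.normal(size=n)
    u_y = rng.normal(size=n)
    u_s = rng.uniform(size=n)
    # Endogenous (causal) variables with structure X -> Y.
    x = u_x
    y = 0.5 * x + u_y
    # Binary selection node, here a function of Y and its own noise (assumed design).
    p_select = 1.0 / (1.0 + np.exp(-y))
    s = (u_s < p_select).astype(int)
    # Data nodes: equal to the latent value when selected, NA (NaN) otherwise.
    x_star = np.where(s == 1, x, np.nan)
    y_star = np.where(s == 1, y, np.nan)
    return {"x": x, "y": y, "s": s, "x_star": x_star, "y_star": y_star}

sim = simulate_design_scm(1000)
```

Keeping the latent arrays (`x`, `y`) separate from the data nodes (`x_star`, `y_star`) mirrors the distinction between population-level variables and what the analyst actually observes.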
Graphically, nodes are positioned along two axes: the causal order (parents to the left of children) and the time or stage at which each variable is measured or determined. This two-dimensional layout directly visualizes the flow from population to data and the nature/timing of missing or selection events.
2. Study Design Mechanisms and Missing Data Structures
Causal models with design systematically encode the mechanisms underlying complex sampling and missing data:
- Selection nodes represent sampling into the study, participation, or missingness events, and are structured according to the design (e.g., case-control, risk-set sampling).
- Data nodes distinguish between latent (true) variables and observed/measured values. For example, in a case-control study, a variable such as the exposure is observed only for individuals sampled as cases or controls (Karvanen, 2012).
- The fully specified SCM guides which variables' values are observed under each realized selection and measurement path, enabling transparent analysis of which relationships are estimable, given possibly incomplete data.
Missing data mechanisms (MCAR, MAR, MNAR) are translated into structural equations and DAG paths involving the selection nodes. For instance, if a variable $X_j$ is a direct parent of its own selection node $S_j$ (an edge $X_j \to S_j$), then $X_j$ may be missing not at random (MNAR). If the only connection between $X_j$ and its selector $S_j$ is through observed data nodes, the mechanism is MAR or MCAR (Karvanen, 2012, Karvanen, 2014).
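The practical consequence of these graph patterns can be illustrated by simulation. The sketch below is a hedged example, assuming a standard normal latent variable and a logistic selection rule; all names and parameter values are hypothetical. It compares the complete-case mean when the selection node has no parents (MCAR) with the case where the variable points into its own selection node (MNAR).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(loc=0.0, scale=1.0, size=n)  # latent variable, true mean 0

# MCAR: the selection node has no parents among the model variables.
s_mcar = rng.uniform(size=n) < 0.5

# MNAR: the variable is a direct parent of its own selection node (X -> S).
s_mnar = rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-2.0 * x))

mean_mcar = x[s_mcar].mean()  # complete-case mean under MCAR
mean_mnar = x[s_mnar].mean()  # complete-case mean under MNAR
```

Under these assumptions the MCAR complete-case mean stays close to the true mean, while the MNAR complete-case mean is shifted upward because larger values are more likely to be observed.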
3. Identification, Causal Calculus, and Do-Calculus Application
The primary utility of the causal model with design is in providing a graphical and algebraic foundation for applying causal calculus to observational or experiment-derived data, often with incomplete observation patterns. Once the full design-augmented DAG is drawn:
- Identifiability questions are resolved using d-separation and the ID algorithm after collapsing or marginalizing unneeded measurement nodes.
- Standard do-calculus rules (insertion/deletion of observations, action/observation exchange, and insertion/deletion of actions) apply to the augmented graph. Crucially, the identifiability of an effect such as $P(y \mid \mathrm{do}(x))$ is assessed after accounting for missingness and sampling as encoded by selection nodes (Karvanen, 2012, Karvanen, 2014).
For example, in a case-control design, estimation of causal effects requires replacing each conditional in the front-door or back-door formula with its appropriately weighted or design-corrected empirical estimate, reflecting the selection structure.
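As a hedged illustration of a design-corrected conditional, the sketch below simulates a population with a binary exposure and a rare binary outcome, applies case-control selection that depends on the outcome only, and compares a naive estimate of $P(Y = 1 \mid X = 1)$ with an inverse-selection-probability weighted one. It assumes the selection probabilities are known by design and that there is no confounding; the variable names, the helper `risk`, and all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
x = rng.binomial(1, 0.3, size=n)                    # binary exposure
y = rng.binomial(1, np.where(x == 1, 0.04, 0.02))   # rare binary outcome

# Case-control selection node: depends on Y only (assumed known design).
p_sel = np.where(y == 1, 1.0, 0.05)                 # all cases, 5% of controls
s = rng.uniform(size=n) < p_sel

def risk(y_obs, x_obs, w, level):
    """Weighted estimate of P(Y=1 | X=level)."""
    mask = x_obs == level
    return np.sum(w[mask] * y_obs[mask]) / np.sum(w[mask])

ys, xs = y[s], x[s]
naive = risk(ys, xs, np.ones(ys.size), 1)           # ignores the design
weighted = risk(ys, xs, 1.0 / p_sel[s], 1)          # inverse selection probability
population = y[x == 1].mean()                       # benchmark, unavailable in practice
```

In this setup the naive estimate is inflated by the oversampling of cases, whereas the weighted estimate tracks the population risk. In a real case-control study the selection probabilities may be unknown, in which case additional information or assumptions would be needed.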
4. Practical Workflows and Statistical Estimation
The structure afforded by the model leads to a unified estimation workflow, even in the presence of nonlinear dependencies and missing data:
- Qualitative causal structure is specified (possibly via expert elicitation and data-driven tests).
- The design DAG is augmented per sampling and measurement protocol.
- Identification is verified graphically; the appropriate do-calculus formula is derived.
- Missing data are handled by design-appropriate procedures, for example multiple imputation under MAR assumptions guided by the d-separation relationships in the graph (Karvanen, 2014).
- Causal effects are estimated by plugging nonparametric or semiparametric estimates (e.g., generalized additive models) into the identified formulas (such as the front-door estimator).
- Results are combined across imputations and reported with uncertainty quantification per Rubin's rules or other design-based approaches.
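The final pooling step can be sketched as follows. This is a minimal implementation of Rubin's rules for a scalar estimand, assuming that each imputed data set has already produced a point estimate and a variance estimate; the function name `pool_rubin` and the input values are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                 # pooled point estimate
    w = u.mean()                     # within-imputation variance
    b = q.var(ddof=1)                # between-imputation variance
    t = w + (1.0 + 1.0 / m) * b      # total variance
    return q_bar, t

# Hypothetical effect estimates from m = 3 imputed data sets.
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The total variance adds the between-imputation spread, inflated by the factor $1 + 1/m$, to the average within-imputation variance, so uncertainty due to missing data is carried into the reported interval.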
This approach is equally applicable in classical epidemiological studies, clinical trials with complex missingness, or process monitoring settings with arbitrary assignment or measurement mechanisms.
5. Design-Based Paradigm: Assignment Mechanisms and Estimands
Causal models with design relate closely to the design-based approach in causal inference, which emphasizes that all randomness arises from the assignment or sampling mechanism, not from superpopulation assumptions (Lu et al., 27 Nov 2025, Aronow et al., 15 May 2025, Heng et al., 2023). Key concepts include:
- The distinction between SUTVA (Stable Unit Treatment Value Assumption), which implies no interference and no hidden treatment versions, and the weaker “No Unmodeled Revealable Variation Assumption” (NURVA), which only constrains potential outcomes over the support of the design (Aronow et al., 15 May 2025).
- Inferential targets such as the Average Treatment Effect (ATE) are defined as functionals of the assignment distribution and potential outcomes; in designs with interference, alternative exposures and associated estimands (e.g., Average Expected Exposure Difference) are defined in terms of the exposure mapping. For each unit, potential outcomes are associated with assignments, exposures, and possibly more elaborate versions if interference or hidden versions are present.
Design-based randomization tests, covariate adjustment, and finite-population inference with missing outcomes are all naturally embedded in the causal model with design framework, as the assignment and measurement mechanisms are made explicit and estimable from the augmented DAG and structural equations (Heng et al., 2023).
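A design-based randomization test can be sketched in a few lines. The example below is a Monte Carlo Fisher randomization test of the sharp null hypothesis of no effect for any unit, assuming complete randomization with a fixed number of treated units and fully observed outcomes. The function name `randomization_test`, the test statistic (difference in means), and the simulated data are illustrative assumptions and are not drawn from the cited papers.

```python
import numpy as np

def randomization_test(y, z, n_draws=5000, seed=0):
    """Monte Carlo Fisher randomization test of the sharp null of no effect.

    Assumes complete randomization with a fixed number of treated units.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=int)

    def diff_in_means(assign):
        return y[assign == 1].mean() - y[assign == 0].mean()

    observed = diff_in_means(z)
    # Re-draw assignments from the design; outcomes stay fixed under the sharp null.
    draws = np.array([diff_in_means(rng.permutation(z)) for _ in range(n_draws)])
    p_value = (1 + np.sum(np.abs(draws) >= abs(observed))) / (1 + n_draws)
    return observed, p_value

# Hypothetical data: 20 units, 10 treated, constant additive effect of 2.
rng = np.random.default_rng(3)
z = rng.permutation(np.repeat([0, 1], 10))
y = rng.normal(size=20) + 2.0 * z
obs, p = randomization_test(y, z)
```

The only source of randomness in the reference distribution is the assignment mechanism itself, which is the defining feature of the design-based view described above. A different design (for example, blocked or clustered assignment) would require re-drawing assignments from that design rather than permuting freely.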
6. Application Domains and Impact
Causal models with design have been extensively applied to case-control, nested case-control, and two-stage case-cohort studies, as well as randomized trials with noncompliance and observational studies of complex systems (Karvanen, 2012, Karvanen, 2014). Key advantages include:
- Systematic handling of selection and measurement processes,
- Unified identifiability analysis via extended DAGs,
- Transparent communication of all causal and design assumptions,
- Direct applicability of do-calculus in non-standard designs,
- Efficient and unbiased estimation via likelihood factorization or design-based weighted estimating equations,
- Clear graphical criteria for identifiability even in the presence of structured missingness or complex sampling.
Examples include case-control analyses in which causal effects are not identifiable unless additional conditions hold, nested case-control and two-stage case-cohort studies analyzed via risk-set adjustment, and trials with noncompliance in which intent-to-treat quantities are read off the graph.
7. Best Practices and Recommendations
The cited literature suggests the following practices:
- Draw the full two-dimensional DAG encoding both causal order and observation/sampling times.
- Indicate all observed, missing, unobserved, and determined nodes, clearly distinguishing between latent variables, selection variables, and measured data.
- Use the DAG to derive the identification formula, referencing back-door, front-door, or g-formula as appropriate for the study design.
- In estimation, follow the factorization of the full likelihood from the model; use semiparametric or nonparametric regression models as dictated by variable types and nonlinearities.
- Ensure that the final estimation and inference pipeline is consistent with the augmented structure; conduct sensitivity analyses for unmeasured confounding or nonignorable missingness as necessary (Karvanen, 2012, Karvanen, 2014, Heng et al., 2023, Aronow et al., 15 May 2025).
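When the graph licenses back-door adjustment, the identification formula $P(y \mid \mathrm{do}(x)) = \sum_z P(y \mid x, z) P(z)$ can be estimated by plugging in empirical frequencies. The sketch below assumes complete data, a single discrete adjustment variable, and a binary outcome; the function name `backdoor_adjust` and the simulated data-generating process are hypothetical.

```python
import numpy as np

def backdoor_adjust(x, y, z, x_level):
    """Plug-in estimate of P(Y=1 | do(X=x_level)) adjusting for a discrete Z."""
    x, y, z = map(np.asarray, (x, y, z))
    total = 0.0
    for z_val in np.unique(z):
        stratum = (z == z_val) & (x == x_level)
        if not stratum.any():
            raise ValueError("positivity violated: empty stratum")
        # P(Y=1 | X=x_level, Z=z_val) weighted by P(Z=z_val).
        total += y[stratum].mean() * np.mean(z == z_val)
    return total

# Hypothetical confounded data: Z -> X, Z -> Y, X -> Y.
rng = np.random.default_rng(4)
n = 200_000
z = rng.binomial(1, 0.5, size=n)
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
y = rng.binomial(1, 0.1 + 0.2 * x + 0.4 * z)

adjusted = backdoor_adjust(x, y, z, 1) - backdoor_adjust(x, y, z, 0)
naive = y[x == 1].mean() - y[x == 0].mean()
```

In this simulation the true risk difference is 0.2 by construction; the adjusted contrast should recover it approximately, while the naive contrast is confounded by `z`. With missing data, the same plug-in step would be applied within each imputed data set or combined with design weights, depending on what the augmented graph supports.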
In sum, causal models with design provide a unified framework for representing, analyzing, and communicating scientific causal hypotheses together with the design features of an empirical study, supporting both methodological clarity and sound statistical inference across scientific disciplines.