Stratified Propensity Score Analysis

Updated 27 January 2026

Stratified Propensity Score Analysis is a method that partitions observational data into strata based on estimated propensity scores, so that treated and control units within each stratum have approximately balanced covariate distributions and treatment effects can be estimated with reduced confounding bias.
Strata are typically formed by quantile-based or Bayesian techniques, and stratum-specific effects are aggregated through weighted averages, trading off bias control against statistical efficiency.
Extensions of the method include adjustments for continuous exposures, merging of RCT and observational data, and strategies for handling covariate shifts in supervised learning.

Stratified Propensity Score Analysis is an established methodology in causal inference for observational studies, experimental generalization, and certain supervised learning problems involving covariate shift. The core concept is to partition data into strata defined by the estimated propensity score (the probability of treatment assignment or sample membership given covariates), within which treated and control units have similar covariate distributions. This approach facilitates bias reduction, variance control, computational tractability, and interpretable effect estimation, and is extensible to continuous exposures, risk-adjusted merging of datasets, and robust subgroup analysis.

1. Fundamental Concepts and Formal Framework

Let $i=1, \dots, n$ index observational units with covariates $X_i$, binary treatment indicator $T_i \in \{0,1\}$, and outcome $Y_i$. The propensity score is $e(x) = \Pr(T=1\,|\,X=x)$ (Aikens et al., 2020). The propensity score is a balancing score: among units with the same value of $e(X)$, the distribution of $X$ is the same for treated and control units (Rosenbaum & Rubin, 1983). Within a stratum of the estimated score, this balance holds only approximately.

Stratification divides the sample into $K$ disjoint strata $\mathcal{S}_1, \dots, \mathcal{S}_K$ by quantiles of the estimated propensity score. Within each stratum, treatment assignment can be regarded as approximately randomized, so the stratum-specific average treatment effect (ATE) can be estimated with little residual confounding bias. The overall effect is a weighted average, $\hat \tau_{strat} = \sum_{k=1}^K w_k\,\hat \tau_k$, where $\hat \tau_k$ is the within-stratum effect estimate and $w_k$ is a stratum weight: weights proportional to stratum size target the ATE, while weights proportional to the number of treated units in the stratum target the average treatment effect on the treated (ATT) (Aikens et al., 2020, Poletto et al., 2024).
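As a worked illustration of the two weighting choices, write $n_k$ for the number of units in stratum $k$ and $n_{k1}$, $n_{k0}$ for its treated and control counts. The difference-in-means form below is a sketch of the standard estimator under that notation, not a formula quoted from the cited papers.

```latex
\hat\tau_k = \bar Y_{k1} - \bar Y_{k0}, \qquad
w_k^{\mathrm{ATE}} = \frac{n_k}{n}, \qquad
w_k^{\mathrm{ATT}} = \frac{n_{k1}}{\sum_{j=1}^{K} n_{j1}}
```

For a hypothetical case with $K=2$, $n_1 = 60$ (of which $n_{11} = 10$ treated), $n_2 = 40$ (of which $n_{21} = 30$ treated), $\hat\tau_1 = 1$ and $\hat\tau_2 = 3$, the ATE weighting gives $0.6 \cdot 1 + 0.4 \cdot 3 = 1.8$, whereas the ATT weighting gives $0.25 \cdot 1 + 0.75 \cdot 3 = 2.5$.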

For continuous exposures, the generalized propensity score (GPS) enables stratification for dose-response estimation by partitioning on the predicted exposure model index; within stratum, OLS is used to estimate linear outcome models, and strata-specific coefficient estimates are pooled (Garès et al., 2020).

2. Stratification Algorithms and Implementations

Strata are commonly constructed as quantiles of the estimated propensity scores (e.g., quintiles, deciles), yielding approximately equal-size blocks (Aikens et al., 2020, Wallin et al., 2024, Poletto et al., 2024). In the pilot design used by stratamatch, a subset of control units is held out to train a prognostic score model $s(x) = E[Y | T=0, X=x]$, which is then used to stratify the remaining analysis set; fitting the score on held-out units reduces overfitting to the analysis sample (Aikens et al., 2020).

Algorithm steps:

Estimate the propensity (or prognostic) score via logistic regression or machine learning.
Split the sample into $K$ strata by quantiles of the score.
Within each stratum, compare outcomes between treated and control units (means, matched-pair differences, regression).
Aggregate the stratum estimates to obtain the overall effect.
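Calculate the variance using the standard formula for a weighted sum of independent stratum estimates: $\operatorname{Var}(\hat{\tau}_{strat}) \approx \sum_{k=1}^{K} w_k^2\,\operatorname{Var}(\hat\tau_k)$, where the within-stratum variance of a mean difference is $\sigma_{k1}^2 / n_{k1} + \sigma_{k0}^2 / n_{k0}$, with $\sigma_{kt}^2$ and $n_{kt}$ the outcome variance and number of units in treatment arm $t$ of stratum $k$ (Aikens et al., 2020, Poletto et al., 2024).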
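The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in any cited paper: the function name `stratified_ate` is hypothetical, the scores are assumed to be already estimated (here the true propensity stands in for a fitted one), the data are synthetic with a true effect of 2, and ATE weights $w_k = n_k / n$ are used.

```python
import random
from statistics import mean, variance


def stratified_ate(scores, treated, outcomes, k=5):
    """Quantile-stratified difference in means and its approximate variance.

    scores: estimated propensity scores (assumed to be fitted elsewhere)
    treated: 0/1 treatment indicators
    outcomes: observed outcomes
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    # Cut the score-ordered units into k blocks of near-equal size.
    blocks = [order[j * n // k:(j + 1) * n // k] for j in range(k)]

    tau_hat, var_hat = 0.0, 0.0
    for block in blocks:
        y1 = [outcomes[i] for i in block if treated[i] == 1]
        y0 = [outcomes[i] for i in block if treated[i] == 0]
        if len(y1) < 2 or len(y0) < 2:
            raise ValueError("a stratum has too few treated or control units")
        w = len(block) / n  # ATE weighting: proportional to stratum size
        tau_hat += w * (mean(y1) - mean(y0))
        var_hat += w ** 2 * (variance(y1) / len(y1) + variance(y0) / len(y0))
    return tau_hat, var_hat


# Synthetic example: x confounds treatment and outcome; the true effect is 2.
rng = random.Random(0)
x = [rng.random() for _ in range(4000)]
e = [0.2 + 0.6 * xi for xi in x]  # true propensity, standing in for a fitted score
t = [1 if rng.random() < ei else 0 for ei in e]
y = [2.0 * ti + 3.0 * xi + rng.gauss(0, 1) for ti, xi in zip(t, x)]

naive = mean(yi for yi, ti in zip(y, t) if ti) - mean(yi for yi, ti in zip(y, t) if not ti)
tau, var = stratified_ate(e, t, y, k=5)
print(f"naive: {naive:.3f}  stratified: {tau:.3f}  se: {var ** 0.5:.3f}")
```

In this setup the unadjusted difference in means is inflated by the confounder, while the quintile-stratified estimate should land much closer to the true effect; some residual bias remains because $x$ still varies within each stratum.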

For GPS stratification (continuous $T$ ), stratify on fitted values from the exposure model and fit outcome regressions within each stratum, pooling coefficients by stratum size (Garès et al., 2020).
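A minimal sketch of this GPS stratification follows, assuming a single covariate, a linear exposure model, and a within-stratum simple regression of the outcome on the exposure. The helper names `ols_line` and `gps_stratified_slope` and the synthetic data (true dose-response slope 1.5) are illustrative assumptions, not the procedure of Garès et al.

```python
import random
from statistics import mean


def ols_line(u, v):
    """Intercept and slope of the least-squares line v = a + b * u."""
    mu, mv = mean(u), mean(v)
    sxy = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    sxx = sum((ui - mu) ** 2 for ui in u)
    b = sxy / sxx
    return mv - b * mu, b


def gps_stratified_slope(x, t, y, k=5):
    """Pool within-stratum slopes of y on t, stratifying on the fitted exposure."""
    a, b = ols_line(x, t)  # exposure model: continuous exposure regressed on x
    index = [a + b * xi for xi in x]  # fitted values used as the stratification index
    n = len(x)
    order = sorted(range(n), key=lambda i: index[i])
    pooled = 0.0
    for j in range(k):
        block = order[j * n // k:(j + 1) * n // k]
        _, slope = ols_line([t[i] for i in block], [y[i] for i in block])
        pooled += len(block) / n * slope  # pool by stratum size
    return pooled


# Synthetic data: x drives both the exposure and the outcome.
rng = random.Random(1)
x = [rng.random() for _ in range(3000)]
t = [2.0 * xi + rng.gauss(0, 1) for xi in x]
y = [1.5 * ti + 4.0 * xi + rng.gauss(0, 1) for ti, xi in zip(t, x)]

_, naive_slope = ols_line(t, y)
pooled_slope = gps_stratified_slope(x, t, y, k=5)
print(f"naive slope: {naive_slope:.3f}  stratified slope: {pooled_slope:.3f}")
```

The unstratified slope absorbs the effect of $x$ on the outcome, whereas the pooled within-stratum slope should be much closer to 1.5 because $x$ varies little inside each block.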

Subgroup-specific propensity score analysis requires explicit balancing constraints on covariates within each pre-specified subgroup, accomplished by joint moment equations in the CBPS or GMM framework (Li et al., 2024).

3. Statistical Efficiency, Bias, and Variance Properties

Under a semiparametric potential outcomes framework, stratification yields efficiency gains by reducing the heterogeneity of the conditional mean function within each stratum. The gain from parametric stratification shrinks as the partition becomes finer, because less conditional-mean variation remains within each stratum (Kono, 2023). The efficiency gain is given explicitly by $\Delta^{uk \rightarrow p} = \sum_{k=1}^{K} \pi_k\,e_k(1-e_k)\,\operatorname{Var}(\mu_1(X)-\mu_0(X)\,|\,\mathcal{X}_k)$. As $K$ increases and the strata become fine, the gain vanishes, and nonparametric estimators recover maximal efficiency (Kono, 2023).
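The behaviour of this expression as $K$ grows can be checked numerically. The sketch below evaluates the formula as written, under interpretive assumptions that the source does not spell out: $\pi_k$ is taken to be the share of units in stratum $k$, $e_k$ the mean propensity in that stratum, and $\mathcal{X}_k$ the stratum itself. The conditional effect $\mu_1(x)-\mu_0(x) = x$ and the propensity $e(x) = 0.2 + 0.6x$ are synthetic choices made only for illustration.

```python
from statistics import mean, pvariance


def efficiency_gain(e, cate, k):
    """Evaluate sum_k pi_k * e_k * (1 - e_k) * Var(cate | stratum k)."""
    n = len(e)
    order = sorted(range(n), key=lambda i: e[i])
    total = 0.0
    for j in range(k):
        block = order[j * n // k:(j + 1) * n // k]
        pi_k = len(block) / n  # assumed: share of units in the stratum
        e_k = mean(e[i] for i in block)  # assumed: mean propensity in the stratum
        total += pi_k * e_k * (1 - e_k) * pvariance([cate[i] for i in block])
    return total


# Deterministic grid so the result does not depend on random draws.
n = 1000
x = [(i + 0.5) / n for i in range(n)]
e = [0.2 + 0.6 * xi for xi in x]
cate = list(x)  # synthetic conditional effect mu_1(x) - mu_0(x) = x

gains = {k: efficiency_gain(e, cate, k) for k in (1, 2, 5, 10, 50)}
for k, g in gains.items():
    print(k, round(g, 6))
```

Under these assumptions the computed quantity decreases monotonically in $K$ and is close to zero at $K = 50$, in line with the statement that the gain vanishes for fine partitions.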

Monte Carlo studies show that stratified estimators:

Achieve negligible bias when confounding is well adjusted and the model is appropriately specified.
Yield stable but conservative variance estimates under closed-form pooled linearized or model-based formulas; the nonparametric bootstrap, which accounts for uncertainty in both score estimation and strata cutpoints, gives more accurate variance estimates (Garès et al., 2020).
Are robust to moderate model misspecification, provided balance diagnostics are monitored (Poletto et al., 2024).

Quintile-based stratification ($K=5$) typically removes roughly $90\%$ of the bias due to measured confounders across a wide range of settings (Cochran, 1968); increasing $K$ beyond five reduces bias further, at the cost of higher variance because each stratum contains fewer units (Poletto et al., 2024, Orihara et al., 2024).

4. Extensions and Advanced Stratification Designs

Prognostic Score and Dual Stratification

Stratification can precede or complement propensity-score matching: units are first stratified on a prognostic score $s(x)$, and within each block treated and control units are then matched or compared on $e(x)$ (Aikens et al., 2020). The result is greater outcome homogeneity and lower within-block variance, which improves statistical efficiency and makes conclusions less sensitive to unmeasured confounding (Aikens et al., 2020).

Wijayatunga’s analysis demonstrates that the greatest dimension reduction in the confounders is achieved by jointly stratifying on both propensity and outcome scores. Merging strata according to joint balancing-score equalities minimizes the cardinality of the required conditioning set while preserving unbiased effect estimation (Wijayatunga, 2018).

Bayesian Stratification

Bayesian approaches integrate design-phase uncertainty in the number of strata $K$ , strata boundaries, and score estimation. General Bayesian procedures (Gibbs posterior, RJ-MCMC) provide posterior inference for ATE while averaging over uncertainty in $K$ and stratification (Orihara et al., 2024, Liao et al., 2018). Posterior credible intervals reflect stratification and model estimation uncertainty, greatly improving coverage properties, especially in finite samples or poor overlap regimes (Orihara et al., 2024).

For quantile-stratified Bayesian propensity score analysis, posterior draws of the score model induce multiple possible stratifications, which are propagated to the effect estimation stage, yielding a full posterior for the effect and diagnostics for design-stage sensitivity (Liao et al., 2018).

Subgroup Guaranteed Balance

New stratification algorithms, such as G-SBPS (and kernelized kG-SBPS), enforce covariate mean balance within all subgroups simultaneously by embedding subgroup indicators and interactions within the balancing moment equations. This structure guarantees subgroup balance directly, improves subgroup effect estimation under model misspecification, and leverages nonparametric kernel bases for higher-dimensional functional class balance (Li et al., 2024).

5. Practical Considerations, Diagnostics, and Computational Aspects

Stratification is computationally efficient, scaling linearly in sample size when strata are constructed to have moderate size (typically hundreds to a thousand units per block) (Aikens et al., 2020). This enables optimal matching or regression adjustment within blocks for large observational datasets.

The choice of $K$ is governed by the bias-variance trade-off and by minimum cell sizes: each stratum should contain enough treated and control units for stable estimation. Five strata are a robust default in most applications (Poletto et al., 2024, Aikens et al., 2020, Chan, 2021).

Covariate balance should be checked within each stratum after stratification; an absolute standardized mean difference (ASMD) below $0.1$ is generally considered acceptable (Wallin et al., 2024, Aikens et al., 2020). Plotting propensity-score densities and tabulating the ASMD of each covariate within each stratum aids overlap diagnostics and indicates when the stratification or the propensity model should be reconsidered.
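A within-stratum balance check of this kind might look like the following sketch. The function names are hypothetical, the data are synthetic, and one convention is assumed throughout: every difference is standardized by the full-sample pooled standard deviation so that stratum values are comparable with the unstratified one. Other denominators, such as a stratum-specific standard deviation, are also used in practice and give larger values.

```python
import random
from math import sqrt
from statistics import mean, variance


def mean_diff(x, treated):
    """Treated-minus-control difference in the mean of covariate x."""
    x1 = [xi for xi, ti in zip(x, treated) if ti == 1]
    x0 = [xi for xi, ti in zip(x, treated) if ti == 0]
    return mean(x1) - mean(x0)


def asmd_by_stratum(scores, treated, x, k=5):
    """Return (overall ASMD, list of ASMDs within each score stratum)."""
    x1 = [xi for xi, ti in zip(x, treated) if ti == 1]
    x0 = [xi for xi, ti in zip(x, treated) if ti == 0]
    # Assumed convention: full-sample pooled SD as the common denominator.
    sd = sqrt((variance(x1) + variance(x0)) / 2)
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    within = []
    for j in range(k):
        block = order[j * n // k:(j + 1) * n // k]
        diff = mean_diff([x[i] for i in block], [treated[i] for i in block])
        within.append(abs(diff) / sd)
    return abs(mean_diff(x, treated)) / sd, within


# Synthetic data in which x strongly predicts treatment.
rng = random.Random(3)
x = [rng.random() for _ in range(10000)]
e = [0.2 + 0.6 * xi for xi in x]  # true propensity, standing in for a fitted score
t = [1 if rng.random() < ei else 0 for ei in e]

raw, within = asmd_by_stratum(e, t, x, k=5)
print(f"unstratified ASMD: {raw:.3f}")
print("within-stratum ASMD:", [round(v, 3) for v in within])
```

Strata whose ASMD exceeds the chosen threshold are candidates for finer stratification or for revisiting the propensity model.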

For sensitivity to unmeasured confounding, stratification on prognostic scores (or outcome scores) increases design sensitivity in Rosenbaum’s framework; strata with lower within-block variance require larger bias parameters ( $\Gamma$ ) to overturn significant effects (Aikens et al., 2020).

6. Extensions: Generalization, Covariate Shift, Dose–Response, and Data Merging

Generalization via Bound Tightening

Stratified propensity score methods narrow Manski-style worst-case bounds for population ATE by constructing stratum-specific bounds and aggregating, with precision gain a function of overlap $Q$ between sample and population score distributions (Chan, 2021). For moderate overlap ($Q = 0.5$–$0.75$), bound widths can be reduced by $25$–$45\%$.
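The aggregation of stratum-specific bounds can be illustrated with a deliberately simplified sketch. It assumes an outcome bounded in $[y_{min}, y_{max}]$, treats the sample estimate as exact in strata the sample covers, and falls back to the worst-case interval in strata it does not cover. The function `stratified_bounds` and all numbers are hypothetical, and the construction in Chan (2021) may differ in its details.

```python
def stratified_bounds(pop_weights, stratum_effects, y_min=0.0, y_max=1.0):
    """Aggregate stratum-level bounds into bounds on the population ATE.

    pop_weights: population share of each score stratum (should sum to 1)
    stratum_effects: sample effect estimate per stratum, or None where the
        sample has no usable treated and control units
    """
    lower = upper = 0.0
    for w, tau in zip(pop_weights, stratum_effects):
        if tau is None:
            # No information: the effect can lie anywhere in the worst-case range.
            lo, hi = y_min - y_max, y_max - y_min
        else:
            lo = hi = tau
        lower += w * lo
        upper += w * hi
    return lower, upper


# Five equally weighted population strata; the sample covers only the first three.
weights = [0.2, 0.2, 0.2, 0.2, 0.2]
effects = [0.10, 0.15, 0.20, None, None]
lower, upper = stratified_bounds(weights, effects)
print(f"bounds: [{lower:.2f}, {upper:.2f}]  width: {upper - lower:.2f}")
```

In this toy case the interval width is 0.8, compared with 2.0 when nothing is assumed about any stratum; the width is driven by the population share of uncovered strata, which is how overlap enters.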

Covariate Shift in Supervised Learning

Stratified propensity score learning improves target-domain prediction under covariate shift by training separate models within score-defined blocks, removing bias from distributional mismatch and outperforming importance weighting in high dimensions (Autenrieth et al., 2021). Each stratum acts as a local domain adaptation region with balanced covariates.
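The idea of fitting separate models inside score-defined blocks can be sketched on a one-dimensional toy problem. Everything here is an illustrative assumption rather than the method of Autenrieth et al.: the score is the true probability of belonging to the source sample (in practice it would be estimated), the per-block model is a simple least-squares line, and the baseline is a single global line fitted on the source data, not importance weighting.

```python
import random
from math import sqrt
from statistics import mean


def ols_line(u, v):
    """Intercept and slope of the least-squares line v = a + b * u."""
    mu, mv = mean(u), mean(v)
    b = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / sum((ui - mu) ** 2 for ui in u)
    return mv - b * mu, b


def source_score(x):
    # Assumed known here: probability that a point with input x is a source point.
    return sqrt(1 - x) / (sqrt(1 - x) + sqrt(x))


rng = random.Random(2)
xs = [rng.random() ** 2 for _ in range(2000)]      # source inputs, dense near 0
ys = [xi ** 2 + rng.gauss(0, 0.05) for xi in xs]   # labelled source outcomes
xt = [1 - rng.random() ** 2 for _ in range(2000)]  # target inputs, dense near 1
truth = [xi ** 2 for xi in xt]                     # noiseless target values

# Baseline: one global model fitted on the source sample only.
a, b = ols_line(xs, ys)
mse_global = mean((a + b * xi - yi) ** 2 for xi, yi in zip(xt, truth))

# Stratify the pooled inputs on the score, then fit one model per block.
k = 5
pooled = sorted(xs + xt, key=source_score)
cuts = [source_score(pooled[j * len(pooled) // k]) for j in range(1, k)]


def block_of(x):
    """Index of the score block containing x."""
    return sum(source_score(x) >= c for c in cuts)


models = {}
for j in range(k):
    idx = [i for i, xi in enumerate(xs) if block_of(xi) == j]
    models[j] = ols_line([xs[i] for i in idx], [ys[i] for i in idx])

pred = [models[block_of(xi)][0] + models[block_of(xi)][1] * xi for xi in xt]
mse_strat = mean((p - yi) ** 2 for p, yi in zip(pred, truth))
print(f"global MSE: {mse_global:.5f}  stratified MSE: {mse_strat:.5f}")
```

Because source and target inputs are similar within a block, each local model is evaluated close to where it was trained, which is the sense in which a stratum acts as a local adaptation region.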

Dose–Response and Quantitative Exposure

For continuous exposures, stratified GPS estimators partition by exposure model index, fit outcome regressions in each block, and pool coefficients to estimate dose–response; bootstrap variance estimation is recommended for inference (Garès et al., 2020).

Merging Observational and Experimental Data

Stratified analysis supports merging randomized controlled trial (RCT) and observational database (ODB) data by assigning both to shared propensity-score blocks and using “spike-in” or dynamically weighted convex-combination estimators that trade off the bias of the ODB estimate against the variance of the RCT estimate within each stratum (Rosenman et al., 2018). Risk-adjusted strata (including prognostic scores) further enhance efficiency and bias robustness.
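The bias-variance trade-off behind a convex combination can be made concrete with a generic sketch. It is not the estimator of Rosenman et al.: the function name is hypothetical, the two within-stratum estimates are assumed independent, the RCT estimate is assumed unbiased, and the bias of the observational estimate is treated as a known or guessed input. Under those assumptions the weight below minimizes the mean squared error of the combined estimate.

```python
def combine_within_stratum(tau_rct, var_rct, tau_obs, var_obs, bias_obs=0.0):
    """MSE-minimizing convex combination of an RCT and an observational estimate.

    Returns (combined estimate, weight placed on the observational estimate).
    """
    # Minimizes lam^2 * (var_obs + bias_obs^2) + (1 - lam)^2 * var_rct over lam.
    lam = var_rct / (var_rct + var_obs + bias_obs ** 2)
    return lam * tau_obs + (1 - lam) * tau_rct, lam


# Hypothetical stratum: a small, noisy RCT and a large, precise observational sample.
est_no_bias, lam_no_bias = combine_within_stratum(1.0, 0.04, 1.2, 0.01)
est_biased, lam_biased = combine_within_stratum(1.0, 0.04, 1.2, 0.01, bias_obs=0.3)
print(f"no suspected bias: weight {lam_no_bias:.2f}, estimate {est_no_bias:.3f}")
print(f"suspected bias 0.3: weight {lam_biased:.2f}, estimate {est_biased:.3f}")
```

When no bias is suspected the more precise observational estimate dominates; as the suspected bias grows, weight shifts back to the RCT estimate. Stratum-level results would then be aggregated with stratum weights as in the classical estimator.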

7. Application Domains and Recent Empirical Evidence

Stratified propensity score analysis is employed in biomedical observational studies, education assessment (test equating without anchor tests) (Wallin et al., 2024), sports analytics (e.g., baseball pitching strategy evaluation) (Nakahara et al., 2022), large-scale program generalization, and cosmology/domain adaptation tasks (Autenrieth et al., 2021). Simulation studies and real-data evaluations confirm robust bias reduction, scalability, and enhanced subgroup effect validity for both classical and modern variants.

Key simulation findings:

Stratification is conceptually transparent, maintains real data units, allows inspection of effect heterogeneity, and is robust to moderate misspecification.
Bias and coverage are near-nominal for carefully constructed strata; limitations include potential residual confounding from insufficient overlap or insufficient stratum size.
Bayesian and kernelized extensions address uncertainty and functional misspecification robustly.

Empirical summaries:

In test equating, stratification closely approximates local equity by conditioning on covariate-based scores (Wallin et al., 2024).
In clinical subgroup analysis, kernelized guaranteed balance improves inference under model misspecification (Li et al., 2024).
In RCT/ODB data merging, spike-in and dual-stratified methods minimize RMSE under distributional match (Rosenman et al., 2018).

Stratified propensity score analysis, in its various forms, is a foundational tool for unbiased treatment effect estimation in non-randomized studies, efficient generalization, robust subgroup inference, and machine learning under covariate shift. Proper model specification, strata construction, and rigorous diagnostics are critical to achieving its theoretical and practical advantages.