PCA-QS: PCA Guided Quantile Sampling
- The paper introduces PCA-QS, a method that combines PCA with quantile stratified sampling for efficient, structure-preserving data reduction.
- It employs PCA projection to guide quantile binning and stratified subsampling, maintaining the original data distribution for improved modeling fidelity.
- Empirical benchmarks demonstrate that PCA-QS achieves competitive regression and clustering performance, with minimal accuracy loss relative to full-dataset analysis.
Principal Component Analysis guided Quantile Sampling (PCA-QS) is a family of data reduction algorithms designed to efficiently subsample large datasets while preserving their statistical and geometric structure. The method combines principal component analysis (PCA) with quantile-based stratified sampling, thus enabling both computational efficiency and high fidelity across a range of modeling tasks. Unlike traditional dimensionality reduction, PCA-QS leverages the principal components solely for guiding stratification, preserving the dataset in its original feature space. The approach has received detailed formalization and empirical validation in recent work (Hui-Mean et al., 10 Jan 2026, Hui-Mean et al., 23 Jun 2025).
1. Formal Problem Setup and Notation
Let $X \in \mathbb{R}^{n \times p}$ be a column-centered data matrix (optionally standardized), with $n$ observations and $p$ features. The goal is to produce a subsample of size $m$ (retention rate $r = m/n$, equivalently $m = \lceil rn \rceil$) such that the subsample both represents the full data distribution and is amenable to large-scale statistical modeling (e.g., regression, clustering).
PCA preprocessing involves constructing the empirical covariance $\hat{\Sigma} = \frac{1}{n-1} X^\top X$, computing the eigen-decomposition $\hat{\Sigma} = V \Lambda V^\top$ with $\lambda_1 \ge \cdots \ge \lambda_p$, and selecting the top $k$ components $V_k = [v_1, \dots, v_k]$ such that the cumulative explained variance $\sum_{j=1}^{k} \lambda_j / \sum_{j=1}^{p} \lambda_j$ exceeds a chosen threshold. Each row $z_i = V_k^\top x_i$ of $Z = X V_k$ gives the projection of sample $i$ onto the $k$ leading principal axes.
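The variance-threshold rule for picking $k$ can be sketched in NumPy. The `choose_k` helper and the synthetic low-rank data are illustrative, not from the source:

```python
import numpy as np

def choose_k(X, var_threshold=0.90):
    """Pick the number of principal components whose cumulative
    explained variance first exceeds `var_threshold`."""
    Xc = X - X.mean(axis=0)                      # column-center
    cov = (Xc.T @ Xc) / (len(Xc) - 1)            # empirical covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigenvalues, descending
    ratio = np.cumsum(eigvals) / eigvals.sum()   # cumulative explained variance
    return int(np.searchsorted(ratio, var_threshold) + 1)

rng = np.random.default_rng(0)
# Low-rank-plus-noise data: 2 strong directions among 10 features.
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10)) \
    + 0.05 * rng.normal(size=(500, 10))
k = choose_k(X, 0.90)
```

On data with two dominant directions, the rule selects a small $k$, matching the scree-plot heuristics discussed in Section 6.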
2. Algorithmic Framework and Quantile Stratification
PCA-QS performs data reduction in four core steps:
- PCA projection: Compute the top-$k$ principal components via SVD or eigen-decomposition and project all samples: $Z = X V_k \in \mathbb{R}^{n \times k}$.
- Quantile thresholding: Partition each PC axis $j$ into $q$ bins using the empirical quantiles $\hat{Q}_j(1/q), \dots, \hat{Q}_j((q-1)/q)$, with the outermost bins extending to $\pm\infty$.
- Bin assignment: Assign each sample a composite quantile index $b_i = (b_{i1}, \dots, b_{ik})$, where $b_{ij} \in \{1, \dots, q\}$ encodes which quantile bin the $i$th sample falls into along the $j$th PC axis.
- Stratified sampling: For each composite bin $B$, randomly sample $m_B \approx r\,|G_B|$ points from the group $G_B = \{i : b_i = B\}$.
Sample inclusion probabilities are defined as $\pi_i = m_B / |G_B| \approx r$ for $i$ in bin $B$. Sampling may be performed by multinomial draws, independent weighted sampling, or systematic rounding to guarantee exactly $m$ points.
The method preserves the empirical distribution by ensuring all regions of the projected PC space are represented. Unlike conventional PCA, the original feature space is retained—the principal components are only used as a stratification guide (Hui-Mean et al., 23 Jun 2025).
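The four steps above can be sketched end-to-end in NumPy. The `pca_qs` function, its default parameters, and the per-bin rounding rule are illustrative choices, not the reference implementation:

```python
import numpy as np

def pca_qs(X, k=2, q=5, r=0.05, rng=None):
    """Minimal PCA-QS sketch: PCA-guided quantile stratified subsampling.
    Returns row indices into X; selected rows stay in the original feature space."""
    rng = np.random.default_rng(rng)
    Xc = X - X.mean(axis=0)

    # 1. PCA projection onto the top-k principal axes (via SVD).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                                         # n x k scores

    # 2-3. Quantile thresholding and composite bin assignment.
    cuts = np.quantile(Z, np.linspace(0, 1, q + 1)[1:-1], axis=0)   # (q-1) x k
    bins = np.stack([np.searchsorted(cuts[:, j], Z[:, j])
                     for j in range(k)], axis=1)              # values in 0..q-1
    composite = np.ravel_multi_index(tuple(bins.T), (q,) * k)

    # 4. Stratified sampling: draw ~r of each occupied composite bin.
    keep = []
    for b in np.unique(composite):
        members = np.flatnonzero(composite == b)
        m_b = max(1, int(round(r * len(members))))
        keep.append(rng.choice(members, size=m_b, replace=False))
    return np.concatenate(keep)

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 8))
idx = pca_qs(X, k=2, q=5, r=0.05, rng=0)   # subsample of roughly r*n rows
```

Note that the returned indices select rows of the original $X$; the PC scores $Z$ are discarded after stratification, consistent with the method's structure-preserving design.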
3. Theoretical Properties
Under the assumption that $x_1, \dots, x_n$ are i.i.d. samples from a distribution $P$ on $\mathbb{R}^p$, PCA-QS achieves the following distributional guarantees (Hui-Mean et al., 23 Jun 2025):
- Quantile consistency: Empirical multivariate quantiles converge to their population counterparts at the standard $O_p(n^{-1/2})$ rate.
- Combined MSE: Distributional error for measures such as the Hellinger distance decays as $n$ and the subsample size grow, combining projection bias, quantization error, and sampling variance.
- KL divergence: With suitable kernel density smoothing, the KL divergence between the PCA-QS subsample distribution and $P$ vanishes as the sample size increases.
- Wasserstein distance: The Wasserstein distance between subsample and population likewise converges, with the effective rate governed by the number of bins $q$ and the number of retained components $k$.
These rates reflect trade-offs between the bias from projecting to $k$ PCs, quantization error from the $q$ bins, and sampling variance. Under Gaussian assumptions, stratification ensures nearly unbiased coverage of the PC directions, and the subsample converges in distribution to the full data as $n \to \infty$ (Hui-Mean et al., 10 Jan 2026).
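A quick Monte Carlo check illustrates the quantile-consistency behaviour on synthetic Gaussian data (the setup is illustrative; the rates above are the source's claims):

```python
import numpy as np

# Empirical quantiles tighten around the population quantile as n grows,
# consistent with the O(n^{-1/2}) convergence of empirical quantiles.
rng = np.random.default_rng(0)
errors = {}
for n in (1_000, 100_000):
    x = rng.standard_normal(n)
    errors[n] = abs(np.quantile(x, 0.5) - 0.0)   # true N(0,1) median is 0
```

At $n = 10^5$ the median error is on the order of a few thousandths, roughly $\sqrt{100}$ times smaller than at $n = 10^3$, as the $n^{-1/2}$ rate predicts.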
4. Computational Complexity and Implementation
The computational cost of PCA-QS is dominated by PCA computation and quantile thresholding:
- PCA (exact SVD): $O(np \min(n, p))$.
- Projection: $O(npk)$.
- Quantile sorting: $O(kn \log n)$ to obtain the quantile cutpoints.
- Bin assignment: $O(nk \log q)$ via binary search on the quantile cutpoints.
- Group sampling: $O(n)$.
Total runtime is $O(npk + kn \log n)$ when randomized PCA algorithms are used. Streaming quantile algorithms and incremental PCA reduce time and space costs further (Hui-Mean et al., 23 Jun 2025, Hui-Mean et al., 10 Jan 2026). For typical use, moderate settings of $k$ and $q$ are recommended, unless extreme class imbalance or high data dimensionality necessitates finer stratification.
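The $O(npk)$ cost of randomized PCA can be illustrated with a basic randomized range-finder (a textbook sketch, not the implementation used in the cited work):

```python
import numpy as np

def randomized_top_k(X, k, oversample=5, rng=None):
    """Randomized range-finder for the top-k right singular vectors.
    Costs O(npk) instead of the O(np*min(n,p)) of a full exact SVD."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    Omega = rng.standard_normal((p, k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(X @ Omega)                    # orthonormal basis for range(X @ Omega)
    _, _, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return Vt[:k]                                     # k x p, rows orthonormal

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 50))
Vk = randomized_top_k(X, k=3, rng=0)
```

The small sketch matrix `Q.T @ X` has only $k +$ oversample rows, so the final SVD is cheap; the dominant cost is the two $O(npk)$ matrix products.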
5. Empirical Benchmarks and Comparative Analysis
Rigorous empirical studies on synthetic and real-world benchmarks demonstrate:
- Superior distributional fidelity: Metrics such as KL divergence, energy distance, Mahalanobis distance, and maximum mean discrepancy (MMD) are consistently lower for PCA-QS than for simple random sampling (SRS), leverage-score sampling, or coreset methods at matched retention rates.
- Regression fidelity: Linear-regression MSE on PCA-QS samples remains near the full-data baseline of $1.000$ (SRS: $1.074$; leverage: $1.072$; coreset: $1.072$), indicating only a marginal accuracy loss relative to full data while maintaining high structural fidelity (Hui-Mean et al., 10 Jan 2026).
- Scalability and runtime: PCA-QS achieves practical runtimes (e.g., $0.21$ s per sample set under the reported benchmark settings) compared to SRS ($0.015$ s), leverage ($0.09$ s), and coreset ($4.6$ s).
- Clustering: On CoverType (7 classes, $n = 581{,}012$), k-means silhouette scores on PCA-QS subsamples closely match the full-data values, typically within $0.05$–$0.1$ across runs.
Selected table excerpt for a PCA-QS sample of the UCI CreditCard dataset:
| Metric | PCA-QS (mean±std) | SRS (mean±std) |
|---|---|---|
| Jensen–Shannon | 0.0000 (0.0000) | 0.0308 (0.0203) |
| Energy distance | 0.0660 (0.0056) | 4.1291 (0.5101) |
| KL divergence | 0.0332 (0.0040) | 17.1552 (0.1638) |
| MMD | 0.0067 (0.0000) | 0.4000 (0.0000) |
| Mahalanobis distance | 9.1997 (0.0029) | 8.9671 (0.4555) |
(Hui-Mean et al., 23 Jun 2025)
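The distributional metrics above can be reproduced in miniature. The sketch below implements the energy distance (as a V-statistic) and contrasts a faithful subsample with a deliberately shifted point cloud; the data and sizes are illustrative:

```python
import numpy as np

def energy_distance(A, B):
    """V-statistic energy distance between two point clouds (rows are samples)."""
    def mean_pdist(U, V):
        # Mean pairwise Euclidean distance between rows of U and rows of V.
        d = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=-1)
        return d.mean()
    return 2 * mean_pdist(A, B) - mean_pdist(A, A) - mean_pdist(B, B)

rng = np.random.default_rng(0)
full = rng.normal(size=(800, 5))
sub = full[rng.choice(800, size=80, replace=False)]   # a faithful 10% subsample
shifted = full + 2.0                                  # a clearly mismatched cloud

d_sub = energy_distance(full, sub)      # near zero: distributions agree
d_shift = energy_distance(full, shifted)  # large: distributions differ
```

A subsample drawn from the data itself scores near zero, while a shifted cloud scores high, which is the sense in which the table's low PCA-QS values indicate distributional fidelity.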
6. Parameter Selection and Practical Deployment
Parameter choices for PCA-QS are determined by variance retention, spectral gap in eigenvalues, and quantization–projection error trade-offs:
- PC count $k$: Select $k$ so that the cumulative variance captured exceeds a threshold (e.g., $0.70$–$0.95$). Scree plots and spectral-gap heuristics are effective (Hui-Mean et al., 10 Jan 2026, Hui-Mean et al., 23 Jun 2025).
- Quantile bins $q$: Moderate defaults suffice for typical $n$; higher $q$ increases fidelity but leads to bin sparsity when $k$ is large or the retention rate is small. Bin merging, oversampling, or uniform fallback are appropriate in sparse regions.
- Retention rate $r$: Optimal ranges are $0.01$–$0.10$, with lower values yielding greater computational savings at modest fidelity cost.
Sparse bin handling uses bin merging, internal oversampling, or fallback to SRS as appropriate. Parameter cross-validation using empirical regression MSE or distributional distances can refine settings (Hui-Mean et al., 23 Jun 2025).
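Sparse-bin handling can be sketched as follows; the `min_bin` threshold and the pooling rule are illustrative choices, not prescriptions from the source:

```python
import numpy as np

def stratified_with_fallback(composite, r, min_bin=20, rng=None):
    """Sample at rate r within each composite bin; bins smaller than
    `min_bin` are pooled and handled by simple random sampling (SRS fallback)."""
    rng = np.random.default_rng(rng)
    keep, pooled = [], []
    for b in np.unique(composite):
        members = np.flatnonzero(composite == b)
        if len(members) < min_bin:
            pooled.append(members)          # too sparse: defer to SRS pool
        else:
            m_b = max(1, int(round(r * len(members))))
            keep.append(rng.choice(members, size=m_b, replace=False))
    if pooled:
        pool = np.concatenate(pooled)
        m_p = max(1, int(round(r * len(pool))))
        keep.append(rng.choice(pool, size=m_p, replace=False))
    return np.concatenate(keep)

rng = np.random.default_rng(0)
composite = rng.integers(0, 30, size=5_000)   # toy composite bin labels
idx = stratified_with_fallback(composite, r=0.1, rng=0)
```

Pooling undersized bins keeps per-stratum sample counts meaningful while preserving the overall retention rate; bin merging along adjacent quantiles is an alternative with the same intent.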
7. Extensions and Applications
PCA-QS is adaptable to a variety of data modalities and subsampling objectives:
- Experimental design: Stratification can be extended with adaptive experimental-design objectives within bins.
- Active learning: Uncertainty-based subsampling may replace uniform selection inside bins.
- Large-scale regression, clustering, and classification: Demonstrated efficacy for high-dimensional tabular, time-series, and text/NLP data.
The method's emphasis on interpretable, structure-preserving reduction makes it well-suited for summarizing datasets prior to downstream modeling, benchmarking, or exploratory analysis. Its empirical and theoretical properties indicate broad applicability in statistical computing and machine learning workflows (Hui-Mean et al., 10 Jan 2026, Hui-Mean et al., 23 Jun 2025).