PCA-QS: PCA Guided Quantile Sampling
- The paper introduces PCA-QS, a method that combines PCA with quantile stratified sampling for efficient, structure-preserving data reduction.
- It employs PCA projection to guide quantile binning and stratified subsampling, maintaining the original data distribution for improved modeling fidelity.
- Empirical benchmarks demonstrate that PCA-QS achieves competitive regression and clustering performance, with minimal accuracy loss relative to full-dataset analysis.
Principal Component Analysis guided Quantile Sampling (PCA-QS) is a family of data reduction algorithms designed to efficiently subsample large datasets while preserving their statistical and geometric structure. The method combines principal component analysis (PCA) with quantile-based stratified sampling, thus enabling both computational efficiency and high fidelity across a range of modeling tasks. Unlike traditional dimensionality reduction, PCA-QS leverages the principal components solely for guiding stratification, preserving the dataset in its original feature space. The approach has received detailed formalization and empirical validation in recent work (Hui-Mean et al., 10 Jan 2026, Hui-Mean et al., 23 Jun 2025).
1. Formal Problem Setup and Notation
Let $X \in \mathbb{R}^{n \times p}$ be a column-centered data matrix (optionally standardized), with $n$ observations and $p$ features. The goal is to produce a subsample of size $m$ (retention rate $r = m/n$, equivalently $m = \lceil rn \rceil$) such that the subsample both represents the full data distribution and is amenable to large-scale statistical modeling (e.g., regression, clustering).
PCA preprocessing involves constructing the empirical covariance $\hat{\Sigma} = \frac{1}{n-1} X^\top X$, computing the eigen-decomposition $\hat{\Sigma} = V \Lambda V^\top$ with $\lambda_1 \ge \cdots \ge \lambda_p$, and selecting the top $k$ components $V_k = [v_1, \dots, v_k]$ such that the cumulative explained variance $\sum_{j=1}^{k} \lambda_j / \sum_{j=1}^{p} \lambda_j$ exceeds a chosen threshold. Each row $z_i = V_k^\top x_i$ of $Z = X V_k$ gives the projection of sample $i$ onto the $k$ leading principal axes.
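The variance-threshold rule for picking $k$ can be sketched in NumPy. The `choose_k` helper and the synthetic low-rank data are illustrative, not from the source:

```python
import numpy as np

def choose_k(X, var_threshold=0.90):
    """Pick the number of principal components whose cumulative
    explained variance first exceeds `var_threshold`."""
    Xc = X - X.mean(axis=0)                      # column-center
    cov = (Xc.T @ Xc) / (len(Xc) - 1)            # empirical covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigenvalues, descending
    ratio = np.cumsum(eigvals) / eigvals.sum()   # cumulative explained variance
    return int(np.searchsorted(ratio, var_threshold) + 1)

rng = np.random.default_rng(0)
# Low-rank-plus-noise data: 2 strong directions among 10 features.
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10)) \
    + 0.05 * rng.normal(size=(500, 10))
k = choose_k(X, 0.90)
```

On data with two dominant directions, the rule selects a small $k$, matching the scree-plot heuristics discussed in Section 6.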
2. Algorithmic Framework and Quantile Stratification
PCA-QS performs data reduction in four core steps:
- PCA projection: Compute the top-$k$ principal components via SVD or eigen-decomposition and project all samples: $Z = X V_k \in \mathbb{R}^{n \times k}$.
- Quantile thresholding: Partition each PC axis $j$ into $q$ bins using the empirical quantiles $\hat{Q}_j(1/q), \dots, \hat{Q}_j((q-1)/q)$, with the outermost bins extending to $\pm\infty$.
- Bin assignment: Assign each sample a composite quantile index $b_i = (b_{i1}, \dots, b_{ik})$, where $b_{ij} \in \{1, \dots, q\}$ encodes which quantile bin the $i$th sample falls into along the $j$th PC axis.
- Stratified sampling: For each composite bin $B$, randomly sample $m_B \approx r\,|G_B|$ points from the group $G_B = \{i : b_i = B\}$.
Sample inclusion probabilities are defined as $\pi_i = m_B / |G_B| \approx r$ for $i$ in bin $B$. Sampling may be performed by multinomial draws, independent weighted sampling, or systematic rounding to guarantee exactly $m$ points.
The method preserves the empirical distribution by ensuring all regions of the projected PC space are represented. Unlike conventional PCA, the original feature space is retained—the principal components are only used as a stratification guide (Hui-Mean et al., 23 Jun 2025).
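The four steps above can be sketched end-to-end in NumPy. The `pca_qs` function, its default parameters, and the per-bin rounding rule are illustrative choices, not the reference implementation:

```python
import numpy as np

def pca_qs(X, k=2, q=5, r=0.05, rng=None):
    """Minimal PCA-QS sketch: PCA-guided quantile stratified subsampling.
    Returns row indices into X; selected rows stay in the original feature space."""
    rng = np.random.default_rng(rng)
    Xc = X - X.mean(axis=0)

    # 1. PCA projection onto the top-k principal axes (via SVD).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                                         # n x k scores

    # 2-3. Quantile thresholding and composite bin assignment.
    cuts = np.quantile(Z, np.linspace(0, 1, q + 1)[1:-1], axis=0)   # (q-1) x k
    bins = np.stack([np.searchsorted(cuts[:, j], Z[:, j])
                     for j in range(k)], axis=1)              # values in 0..q-1
    composite = np.ravel_multi_index(tuple(bins.T), (q,) * k)

    # 4. Stratified sampling: draw ~r of each occupied composite bin.
    keep = []
    for b in np.unique(composite):
        members = np.flatnonzero(composite == b)
        m_b = max(1, int(round(r * len(members))))
        keep.append(rng.choice(members, size=m_b, replace=False))
    return np.concatenate(keep)

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 8))
idx = pca_qs(X, k=2, q=5, r=0.05, rng=0)   # subsample of roughly r*n rows
```

Note that the returned indices select rows of the original $X$; the PC scores $Z$ are discarded after stratification, consistent with the method's structure-preserving design.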
3. Theoretical Properties
Under the assumption that $x_1, \dots, x_n$ are i.i.d. samples from a distribution $P$ on $\mathbb{R}^p$, PCA-QS achieves the following distributional guarantees (Hui-Mean et al., 23 Jun 2025):
- Quantile consistency: Empirical multivariate quantiles converge to their population counterparts at the standard $O_p(n^{-1/2})$ rate.
- Combined MSE: Distributional error for measures such as the Hellinger distance decays as $n$ and the subsample size grow, combining projection bias, quantization error, and sampling variance.
- KL divergence: With suitable kernel density smoothing, the KL divergence between the PCA-QS subsample distribution and $P$ vanishes as the sample size increases.
- Wasserstein distance: The Wasserstein distance between subsample and population likewise converges, with the effective rate governed by the number of bins $q$ and the number of retained components $k$.
These rates reflect trade-offs between the bias from projecting to $k$ PCs, quantization error from the $q$ bins, and sampling variance. Under Gaussian assumptions, stratification ensures nearly unbiased coverage of the PC directions, and the subsample converges in distribution to the full data as $n \to \infty$ (Hui-Mean et al., 10 Jan 2026).
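A quick Monte Carlo check illustrates the quantile-consistency behaviour on synthetic Gaussian data (the setup is illustrative; the rates above are the source's claims):

```python
import numpy as np

# Empirical quantiles tighten around the population quantile as n grows,
# consistent with the O(n^{-1/2}) convergence of empirical quantiles.
rng = np.random.default_rng(0)
errors = {}
for n in (1_000, 100_000):
    x = rng.standard_normal(n)
    errors[n] = abs(np.quantile(x, 0.5) - 0.0)   # true N(0,1) median is 0
```

At $n = 10^5$ the median error is on the order of a few thousandths, roughly $\sqrt{100}$ times smaller than at $n = 10^3$, as the $n^{-1/2}$ rate predicts.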
4. Computational Complexity and Implementation
The computational cost of PCA-QS is dominated by PCA computation and quantile thresholding:
- PCA (exact SVD): $O(np \min(n, p))$.
- Projection: $O(npk)$.
- Quantile sorting: $O(kn \log n)$ to obtain the quantile cutpoints.
- Bin assignment: $O(nk \log q)$ via binary search on the quantile cutpoints.
- Group sampling: $O(n)$.
Total runtime is $O(npk + kn \log n)$ when randomized PCA algorithms are used. Streaming quantile algorithms and incremental PCA reduce time and space costs further (Hui-Mean et al., 23 Jun 2025, Hui-Mean et al., 10 Jan 2026). For typical use, moderate settings of $k$ and $q$ are recommended, unless extreme class imbalance or high data dimensionality necessitates finer stratification.
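The $O(npk)$ cost of randomized PCA can be illustrated with a basic randomized range-finder (a textbook sketch, not the implementation used in the cited work):

```python
import numpy as np

def randomized_top_k(X, k, oversample=5, rng=None):
    """Randomized range-finder for the top-k right singular vectors.
    Costs O(npk) instead of the O(np*min(n,p)) of a full exact SVD."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    Omega = rng.standard_normal((p, k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(X @ Omega)                    # orthonormal basis for range(X @ Omega)
    _, _, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return Vt[:k]                                     # k x p, rows orthonormal

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 50))
Vk = randomized_top_k(X, k=3, rng=0)
```

The small sketch matrix `Q.T @ X` has only $k +$ oversample rows, so the final SVD is cheap; the dominant cost is the two $O(npk)$ matrix products.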
5. Empirical Benchmarks and Comparative Analysis
Rigorous empirical studies on synthetic and real-world benchmarks demonstrate:
- Superior distributional fidelity: Metrics such as KL divergence, energy distance, Mahalanobis distance, and maximum mean discrepancy (MMD) are consistently lower for PCA-QS than for simple random sampling (SRS), leverage-score sampling, or coreset methods at matched retention rates.
- Regression fidelity: Linear-regression MSE on PCA-QS samples remains near the full-data baseline of $1.000$ (SRS: $1.074$; leverage: $1.072$; coreset: $1.072$), indicating only a marginal accuracy loss relative to full data while maintaining high structural fidelity (Hui-Mean et al., 10 Jan 2026).
- Scalability and runtime: PCA-QS achieves practical runtimes (e.g., $0.21$ s per sample set under the reported benchmark settings) compared to SRS ($0.015$ s), leverage ($0.09$ s), and coreset ($4.6$ s).
- Clustering: On CoverType (7 classes, $n = 581{,}012$), k-means silhouette scores on PCA-QS subsamples closely match the full-data values, typically within $0.05$–$0.1$ across runs.
Selected table excerpt for a PCA-QS sample of the UCI CreditCard dataset:
| Metric | PCA-QS (mean±std) | SRS (mean±std) |
|---|---|---|
| Jensen–Shannon | 0.0000 (0.0000) | 0.0308 (0.0203) |
| Energy distance | 0.0660 (0.0056) | 4.1291 (0.5101) |
| KL divergence | 0.0332 (0.0040) | 17.1552 (0.1638) |
| MMD | 0.0067 (0.0000) | 0.4000 (0.0000) |
| Mahalanobis distance | 9.1997 (0.0029) | 8.9671 (0.4555) |
(Hui-Mean et al., 23 Jun 2025)
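The distributional metrics above can be reproduced in miniature. The sketch below implements the energy distance (as a V-statistic) and contrasts a faithful subsample with a deliberately shifted point cloud; the data and sizes are illustrative:

```python
import numpy as np

def energy_distance(A, B):
    """V-statistic energy distance between two point clouds (rows are samples)."""
    def mean_pdist(U, V):
        # Mean pairwise Euclidean distance between rows of U and rows of V.
        d = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=-1)
        return d.mean()
    return 2 * mean_pdist(A, B) - mean_pdist(A, A) - mean_pdist(B, B)

rng = np.random.default_rng(0)
full = rng.normal(size=(800, 5))
sub = full[rng.choice(800, size=80, replace=False)]   # a faithful 10% subsample
shifted = full + 2.0                                  # a clearly mismatched cloud

d_sub = energy_distance(full, sub)      # near zero: distributions agree
d_shift = energy_distance(full, shifted)  # large: distributions differ
```

A subsample drawn from the data itself scores near zero, while a shifted cloud scores high, which is the sense in which the table's low PCA-QS values indicate distributional fidelity.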
6. Parameter Selection and Practical Deployment
Parameter choices for PCA-QS are determined by variance retention, spectral gap in eigenvalues, and quantization–projection error trade-offs:
- PC count $k$: Select $k$ so that the cumulative variance captured exceeds a threshold (e.g., $0.70$–$0.95$). Scree plots and spectral-gap heuristics are effective (Hui-Mean et al., 10 Jan 2026, Hui-Mean et al., 23 Jun 2025).
- Quantile bins $q$: Moderate defaults suffice for typical $n$; higher $q$ increases fidelity but leads to bin sparsity when $k$ is large or the retention rate is small. Bin merging, oversampling, or uniform fallback are appropriate in sparse regions.
- Retention rate $r$: Optimal ranges are $0.01$–$0.10$, with lower values yielding greater computational savings at modest fidelity cost.
Sparse bin handling uses bin merging, internal oversampling, or fallback to SRS as appropriate. Parameter cross-validation using empirical regression MSE or distributional distances can refine settings (Hui-Mean et al., 23 Jun 2025).
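Sparse-bin handling can be sketched as follows; the `min_bin` threshold and the pooling rule are illustrative choices, not prescriptions from the source:

```python
import numpy as np

def stratified_with_fallback(composite, r, min_bin=20, rng=None):
    """Sample at rate r within each composite bin; bins smaller than
    `min_bin` are pooled and handled by simple random sampling (SRS fallback)."""
    rng = np.random.default_rng(rng)
    keep, pooled = [], []
    for b in np.unique(composite):
        members = np.flatnonzero(composite == b)
        if len(members) < min_bin:
            pooled.append(members)          # too sparse: defer to SRS pool
        else:
            m_b = max(1, int(round(r * len(members))))
            keep.append(rng.choice(members, size=m_b, replace=False))
    if pooled:
        pool = np.concatenate(pooled)
        m_p = max(1, int(round(r * len(pool))))
        keep.append(rng.choice(pool, size=m_p, replace=False))
    return np.concatenate(keep)

rng = np.random.default_rng(0)
composite = rng.integers(0, 30, size=5_000)   # toy composite bin labels
idx = stratified_with_fallback(composite, r=0.1, rng=0)
```

Pooling undersized bins keeps per-stratum sample counts meaningful while preserving the overall retention rate; bin merging along adjacent quantiles is an alternative with the same intent.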
7. Extensions and Applications
PCA-QS is adaptable to a variety of data modalities and subsampling objectives:
- Experimental design: Stratification can be extended with adaptive experimental-design objectives within bins.
- Active learning: Uncertainty-based subsampling may replace uniform selection inside bins.
- Large-scale regression, clustering, and classification: Demonstrated efficacy for high-dimensional tabular, time-series, and text/NLP data.
The method's emphasis on interpretable, structure-preserving reduction makes it well-suited for summarizing datasets prior to downstream modeling, benchmarking, or exploratory analysis. Its empirical and theoretical properties indicate broad applicability in statistical computing and machine learning workflows (Hui-Mean et al., 10 Jan 2026, Hui-Mean et al., 23 Jun 2025).