
PCA-QS: PCA Guided Quantile Sampling

Updated 17 January 2026
  • The paper introduces PCA-QS, a method that combines PCA with quantile stratified sampling for efficient, structure-preserving data reduction.
  • It employs PCA projection to guide quantile binning and stratified subsampling, maintaining the original data distribution for improved modeling fidelity.
  • Empirical benchmarks demonstrate that PCA-QS achieves competitive regression and clustering performance, with minimal accuracy loss relative to full-dataset analysis.

Principal Component Analysis guided Quantile Sampling (PCA-QS) is a family of data reduction algorithms designed to efficiently subsample large datasets while preserving their statistical and geometric structure. The method combines principal component analysis (PCA) with quantile-based stratified sampling, thus enabling both computational efficiency and high fidelity across a range of modeling tasks. Unlike traditional dimensionality reduction, PCA-QS leverages the principal components solely for guiding stratification, preserving the dataset in its original feature space. The approach has received detailed formalization and empirical validation in recent work (Hui-Mean et al., 10 Jan 2026, Hui-Mean et al., 23 Jun 2025).

1. Formal Problem Setup and Notation

Let $X \in \mathbb{R}^{n \times d}$ be a column-centered data matrix (optionally standardized), with $n$ observations and $d$ features. The goal is to produce a subsample of size $m \ll n$ (retention rate $\delta = m/n$, also denoted $s$) such that the subsample both represents the full data distribution and is amenable to large-scale statistical modeling (e.g., regression, clustering).

PCA preprocessing involves constructing the empirical covariance $S = (1/n) X^\top X$, computing the eigen-decomposition $S = V \Lambda V^\top$, and selecting the top $k$ components $V_k$ and $\Lambda_k$ such that $Z = X V_k \in \mathbb{R}^{n \times k}$. Each row $Z_i \in \mathbb{R}^k$ gives the projection of sample $i$ onto the leading principal axes.
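The preprocessing above can be sketched with NumPy. This is a minimal illustration, assuming $X$ is already column-centered; the function name `pca_project` is hypothetical, and the eigenvectors are obtained from the SVD of $X$ rather than from an explicit covariance matrix (the two are equivalent up to sign).

```python
import numpy as np

def pca_project(X, k):
    """Project centered data onto its top-k principal axes.

    Minimal sketch: X (n x d) is assumed column-centered, so the rows of
    Vt from the SVD of X are the eigenvectors of S = (1/n) X^T X.
    """
    # Economy SVD: X = U diag(s) Vt; rows of Vt are the principal axes.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k].T        # d x k matrix of leading eigenvectors
    Z = X @ Vk           # n x k score matrix Z = X V_k
    return Z, Vk

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
X -= X.mean(axis=0)      # column-center, as the setup requires
Z, Vk = pca_project(X, k=3)
```

Note that $Z$ is used only to guide stratification; the subsample itself is drawn from the rows of $X$ in the original feature space.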

2. Algorithmic Framework and Quantile Stratification

PCA-QS performs data reduction in four core steps:

  1. PCA projection: Compute the top-$k$ principal components via SVD or eigen-decomposition and project all samples: $Z = X V_k$.
  2. Quantile thresholding: Partition each PC axis $j = 1, \dots, k$ into $g$ bins using empirical quantiles $Q_j(u)$, with $u \in \{0, 1/g, 2/g, \dots, 1\}$.
  3. Bin assignment: Assign each sample $i$ a composite quantile index $q_i = (q_{i1}, \dots, q_{ik})$, where $q_{ij}$ encodes which quantile bin the $i$th sample falls into along the $j$th PC axis.
  4. Stratified sampling: For each composite bin $b$, randomly sample $m_b = \min(\lceil \delta |\mathcal{B}_b| \rceil, |\mathcal{B}_b|)$ points from group $\mathcal{B}_b$.

Sample inclusion probabilities are defined as $p_i = \delta / |\mathcal{B}_{q_i}|$ for $i$ in bin $\mathcal{B}_{q_i}$. Sampling may be performed by multinomial draws, independent weighted sampling, or systematic rounding to guarantee exactly $m$ points.
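The four steps above can be sketched end to end as follows. This is a hedged illustration, not the authors' reference implementation: the function name `pca_qs` and the per-bin rounding rule $m_b = \min(\lceil \delta |\mathcal{B}_b| \rceil, |\mathcal{B}_b|)$ follow the description above, and the function returns row indices so that the subsample stays in the original feature space.

```python
import numpy as np

def pca_qs(X, k=2, g=5, delta=0.1, seed=0):
    """Sketch of PCA-QS: PCA-guided quantile stratified subsampling.

    Returns indices into X; the subsample is X[idx], i.e., original
    features, with the PC projection used only for stratification.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    # 1) PCA projection onto the top-k axes via SVD.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T
    # 2) Quantile thresholding: g bins per PC axis from empirical quantiles.
    edges = [np.quantile(Z[:, j], np.linspace(0, 1, g + 1)) for j in range(k)]
    # 3) Composite bin index q_i = (q_i1, ..., q_ik) via binary search
    #    against the g-1 interior cutpoints of each axis.
    q = np.stack([np.clip(np.searchsorted(edges[j][1:-1], Z[:, j]), 0, g - 1)
                  for j in range(k)], axis=1)
    # 4) Stratified sampling: ceil(delta * |B_b|) points from each bin.
    keys = np.ravel_multi_index(q.T, (g,) * k)
    chosen = []
    for b in np.unique(keys):
        members = np.flatnonzero(keys == b)
        m_b = min(int(np.ceil(delta * len(members))), len(members))
        chosen.append(rng.choice(members, size=m_b, replace=False))
    return np.sort(np.concatenate(chosen))

idx = pca_qs(np.random.default_rng(1).normal(size=(2000, 10)),
             k=2, g=5, delta=0.05)
```

Because each nonempty bin contributes at least one point, the realized sample size slightly exceeds $\delta n$; a systematic-rounding pass (mentioned above) would trim it to exactly $m$.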

The method preserves the empirical distribution by ensuring all regions of the projected PC space are represented. Unlike conventional PCA, the original feature space is retained—the principal components are only used as a stratification guide (Hui-Mean et al., 23 Jun 2025).

3. Theoretical Properties

Under the assumption that the $X_i$ are i.i.d. samples from a distribution $P$ on $\mathbb{R}^d$, PCA-QS achieves the following distributional guarantees (Hui-Mean et al., 23 Jun 2025):

  • Quantile consistency: Empirical multivariate quantiles converge at rate $O_p(n^{-1/2})$.
  • Combined MSE: Distributional error for measures such as $L_2$ or Hellinger decays as $O(k^{-2} + Q^{-2} + n^{-1})$.
  • KL divergence: With suitable kernel density smoothing, the KL divergence between the PCA-QS subsample distribution and $P$ scales as $O(n^{-4/(k+4)})$.
  • Wasserstein distance: $W_2(P_{n,k,Q}, P) = O(n^{-1/d})$, with effective convergence governed by $k$.

These rates reflect trade-offs between the bias from projecting to $k$ PCs, quantization error from $Q$ bins (the quantile bin count, denoted $g$ in Section 2), and sampling variance. Under Gaussian assumptions $X \sim \mathcal{N}(0, \Sigma)$, stratification ensures nearly unbiased coverage of PC directions; the subsample converges in distribution to the full data as $\delta \to 1$ (Hui-Mean et al., 10 Jan 2026).

4. Computational Complexity and Implementation

The computational cost of PCA-QS is dominated by PCA computation and quantile thresholding:

  • PCA (exact SVD): $O(ndk)$.
  • Projection: $O(ndk)$.
  • Quantile sorting: $O(kn \log n)$ to compute the quantile cutpoints.
  • Bin assignment: $O(nk \log g)$ via binary search on the quantile cutpoints.
  • Group sampling: $O(n)$.

Total runtime is $O(ndk + kn \log n)$ with randomized PCA algorithms. Streaming quantile algorithms and incremental PCA reduce time and space costs further (Hui-Mean et al., 23 Jun 2025, Hui-Mean et al., 10 Jan 2026). For typical use, parameter settings with moderate $k$ and $g$ are recommended, unless extreme class imbalance or high data dimensionality necessitates finer stratification.
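The two dominant per-axis costs can be isolated in a small sketch (the helper name `quantile_bins` is hypothetical): computing the $g + 1$ empirical cutpoints requires an internal sort, $O(n \log n)$, after which binary-search assignment of all $n$ samples costs only $O(n \log g)$.

```python
import numpy as np

def quantile_bins(z, g):
    """Assign each value of one PC axis to one of g quantile bins.

    Two dominant costs: an O(n log n) sort inside np.quantile to get
    the g+1 cutpoints, then O(n log g) binary-search assignment.
    """
    cuts = np.quantile(z, np.linspace(0.0, 1.0, g + 1))  # sorts internally
    # Search against interior cutpoints only, so labels lie in {0,...,g-1}.
    return np.searchsorted(cuts[1:-1], z, side="right")

z = np.random.default_rng(2).normal(size=10_000)
labels = quantile_bins(z, g=5)
```

For continuous data the resulting bins are nearly equal-sized, which is what makes the per-bin sampling step in Section 2 well balanced.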

5. Empirical Benchmarks and Comparative Analysis

Rigorous empirical studies on synthetic and real-world benchmarks demonstrate:

  • Superior distributional fidelity: Metrics such as KL divergence, energy distance, Mahalanobis distance, and maximum mean discrepancy (MMD) consistently yield lower values for PCA-QS than for simple random sampling (SRS), leverage-score, or coreset methods at matched retention rates.
  • Regression fidelity: Mean squared error for linear regression fits on PCA-QS samples is $\approx 1.075$ (SRS: 1.074; leverage: 1.072; coreset: 1.072; full data: 1.000), i.e., only about 7.5% above the full-data MSE baseline while maintaining high structural fidelity (Hui-Mean et al., 10 Jan 2026).
  • Scalability and runtime: PCA-QS achieves practical runtimes (e.g., 0.21 s per sample set for $n = 10{,}000$, $d = 500$, $k = 2$–$5$) compared to SRS (0.015 s), leverage (0.09 s), and coreset (4.6 s).
  • Clustering: On CoverType (7 classes, $\delta = 0.2$), k-means silhouette scores on PCA-QS subsamples match full-data values within 0.05 in 88.9% of runs and within 0.1 in 99.5%.

Selected table excerpt for a 5% sample of UCI CreditCard ($n = 30{,}000$, $d = 23$, $k = 10$, $Q = 10$):

Metric               | PCA-QS (mean ± std) | SRS (mean ± std)
Jensen–Shannon       | 0.0000 (0.0000)     | 0.0308 (0.0203)
Energy distance      | 0.0660 (0.0056)     | 4.1291 (0.5101)
KL divergence        | 0.0332 (0.0040)     | 17.1552 (0.1638)
MMD                  | 0.0067 (0.0000)     | 0.4000 (0.0000)
Mahalanobis distance | 9.1997 (0.0029)     | 8.9671 (0.4555)

(Hui-Mean et al., 23 Jun 2025)

6. Parameter Selection and Practical Deployment

Parameter choices for PCA-QS are determined by variance retention, spectral gap in eigenvalues, and quantization–projection error trade-offs:

  • PC count $k$: Select $k$ so that the cumulative variance captured, $\rho_k = (\sum_{j=1}^{k} \lambda_j)/(\sum_{j=1}^{d} \lambda_j)$, exceeds a threshold (e.g., 0.70–0.95). Scree plots and spectral-gap heuristics are effective (Hui-Mean et al., 10 Jan 2026, Hui-Mean et al., 23 Jun 2025).
  • Quantile bins $g$: Default $g = 5$ for moderate $k$; higher $g$ increases fidelity but leads to bin sparsity for large $k$ or small $n$. Bin merging, oversampling, or a uniform-sampling fallback is recommended in sparse regions.
  • Retention rate $\delta$ (also denoted $s$): Optimal ranges are 0.01–0.10, with lower values yielding greater computational savings at modest fidelity cost.

Sparse bin handling uses bin merging, internal oversampling, or fallback to SRS as appropriate. Parameter cross-validation using empirical regression MSE or distributional distances can refine settings (Hui-Mean et al., 23 Jun 2025).
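The $\rho_k$ threshold rule can be sketched directly (the helper name `choose_k` is hypothetical; the 0.90 default sits inside the 0.70–0.95 range suggested above):

```python
import numpy as np

def choose_k(X, threshold=0.90):
    """Pick the PC count k as the smallest k with rho_k >= threshold.

    rho_k = (sum of top-k eigenvalues) / (sum of all eigenvalues),
    computed from singular values of the centered data (lambda ~ s^2).
    """
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    lam = s**2                           # proportional to eigenvalues of S
    rho = np.cumsum(lam) / lam.sum()     # rho_k for k = 1, ..., d
    return int(np.searchsorted(rho, threshold) + 1)
```

In practice one would cross-check the chosen $k$ against a scree plot or spectral gap, as the bullets above recommend.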

7. Extensions and Applications

PCA-QS is adaptable to a variety of data modalities and subsampling objectives:

  • Experimental design: Stratification can be extended with adaptive experimental-design objectives within bins.
  • Active learning: Uncertainty-based subsampling may replace uniform selection inside bins.
  • Large-scale regression, clustering, and classification: Demonstrated efficacy for high-dimensional tabular, time-series, and text/NLP data.

The method's emphasis on interpretable, structure-preserving reduction makes it well-suited for summarizing datasets prior to downstream modeling, benchmarking, or exploratory analysis. Its empirical and theoretical properties indicate broad applicability in statistical computing and machine learning workflows (Hui-Mean et al., 10 Jan 2026, Hui-Mean et al., 23 Jun 2025).
