
GPA-Induced Contamination in Geometric Morphometrics

Updated 2 February 2026
  • GPA-induced contamination is defined as the artificial statistical dependence created when global alignment is performed on the full dataset before splitting, violating independence assumptions.
  • Empirical studies and theoretical analysis show that this contamination leads to an optimistic bias in predictive performance, especially in small-sample or high-dimensional settings.
  • Implementing a two-stage alignment protocol and leveraging spatially aware models are effective strategies to mitigate data leakage and ensure valid model evaluation.

GPA-induced contamination refers to the artificial statistical dependence introduced into machine learning pipelines for geometric morphometrics (GMM) when Generalized Procrustes Analysis (GPA)—a global shape alignment procedure—is performed prior to train–test splitting. This contamination, sometimes termed data leakage, violates the conditional independence assumptions fundamental to statistical and machine learning model evaluation and can induce systematic bias in downstream predictive performance assessments. The phenomenon and its consequences have been formally characterized in recent work (Courtenay, 26 Jan 2026).

1. Generalized Procrustes Analysis and Data-Dependent Preprocessing

Generalized Procrustes Analysis is the canonical method for aligning landmark-based shape data, removing effects of translation, rotation, and scale. Given $n$ landmark configurations $X_i \in \mathbb{R}^{p \times k}$, GPA finds scale ($\beta_i$), rotation ($\Gamma_i$), and translation ($\alpha_i$) parameters for each specimen to minimize the summed squared pairwise Procrustes distances:

$$Q = \min_{\{\beta_i, \Gamma_i, \alpha_i\}} \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \left\| (\beta_i X_i \Gamma_i + \mathbf{1}_p \alpha_i) - (\beta_j X_j \Gamma_j + \mathbf{1}_p \alpha_j) \right\|^2$$

The procedure iterates until a mean reference shape $\bar{X}$ is stabilized. Critically, $\bar{X}$ and all aligned coordinates depend on the entire pool $X_1, \dots, X_n$. It is standard (though statistically unsound) practice in morphometric ML to apply GPA globally to the full dataset before partitioning into training, validation, and test splits. As a result, aligned coordinates for test specimens are not independent of the training data.
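
For concreteness, the following is a minimal NumPy sketch of such an alignment loop, implementing partial Procrustes superimposition with unit centroid-size scaling. The helper names (`center`, `scale_to_unit`, `opa_rotate`, `gpa`) are our own, and the details (initialization, convergence test) are illustrative rather than a reference implementation.

```python
import numpy as np

def center(X):
    """Translate a (p, k) landmark configuration so its centroid sits at the origin."""
    return X - X.mean(axis=0)

def scale_to_unit(X):
    """Rescale a centered configuration to unit centroid size."""
    return X / np.linalg.norm(X)

def opa_rotate(X, ref):
    """Rotate X onto ref via the optimal orthogonal Procrustes rotation."""
    U, _, Vt = np.linalg.svd(X.T @ ref)
    if np.linalg.det(U @ Vt) < 0:      # forbid reflections: keep det(R) = +1
        U[:, -1] *= -1
    return X @ (U @ Vt)

def gpa(configs, tol=1e-10, max_iter=100):
    """Iterative GPA: returns aligned configurations and the mean reference shape."""
    aligned = [scale_to_unit(center(X)) for X in configs]
    ref = aligned[0]
    for _ in range(max_iter):
        aligned = [opa_rotate(X, ref) for X in aligned]
        new_ref = scale_to_unit(np.mean(aligned, axis=0))
        if np.linalg.norm(new_ref - ref) < tol:
            break
        ref = new_ref
    return aligned, ref
```

Note that the returned reference `ref` is a function of every input configuration, which is precisely the data dependence at issue in the next section.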

2. Mechanism and Formal Characterization of Contamination

GPA is a global operator: removing, replacing, or adding even a single specimen alters the reference shape $\bar{X}$, which in turn perturbs all aligned specimens. Empirical studies employing bootstrap alignments reveal that the Procrustes distance between a fixed specimen's alignments (computed with and without that specimen included in the pool) is inversely related to the total sample size, with effects intensifying at smaller $n$.

The core consequence is that the representation of both training and test shapes is mutually entangled. Thus, the aligned test data are statistically dependent on the training set and vice versa. This violates the usual ML paradigm of conditional independence of train and test data, resulting in models that may spuriously "learn" regularities introduced by joint alignment rather than by properties of the underlying data-generating process itself. The net effect is an optimistic bias in performance metrics such as RMSE or classification accuracy, especially acute in small-sample or high-dimensional regimes (Courtenay, 26 Jan 2026).
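
A small experiment in the spirit of these bootstrap studies can be built on the helpers from the Section 1 sketch (`gpa`, `opa_rotate`, `center`, `scale_to_unit`). The Gaussian shape model, sample sizes, and trial counts below are illustrative assumptions, not the paper's simulation design.

```python
import numpy as np
rng = np.random.default_rng(0)

def frame_rotation(src, dst):
    """Proper rotation that best aligns shape src onto shape dst."""
    U, _, Vt = np.linalg.svd(src.T @ dst)
    if np.linalg.det(U @ Vt) < 0:
        U[:, -1] *= -1
    return U @ Vt

def leave_one_out_shift(n, p=10, k=2, trials=20):
    """Mean displacement of specimen 0's aligned coordinates when it is
    excluded from reference estimation, measured in a common frame."""
    shifts = []
    for _ in range(trials):
        pool = [rng.normal(size=(p, k)) for _ in range(n)]
        aligned_with, ref_with = gpa(pool)        # specimen 0 is in the pool
        _, ref_without = gpa(pool[1:])            # reference estimated without it
        proj = opa_rotate(scale_to_unit(center(pool[0])), ref_without)
        R = frame_rotation(ref_without, ref_with) # put both solutions in one frame
        shifts.append(np.linalg.norm(aligned_with[0] - proj @ R))
    return np.mean(shifts)

for n in (10, 40, 160):
    print(n, leave_one_out_shift(n))   # shrinks roughly like 1/n
```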

3. Dimensionality, Predictive Error, and the "Diagonal" in Sample–Landmark Space

The geometry of Procrustes shape space imposes concrete constraints on error scaling under GPA. Following Courtenay, the dimension of the tangent space after aligning $p$ landmarks in $k$ dimensions is

$$d = pk - k - \frac{k(k-1)}{2} - 1$$

Given isotropic noise of variance $\sigma^2$ in each tangent-space direction, the expected total variance after GPA is $d\sigma^2$. For naive regression or distance-based models acting on these $d$ dimensions, the expected mean-squared error (MSE) and root-mean-squared error (RMSE) are:

$$\mathbb{E}[\mathrm{MSE}] = d\sigma^2, \qquad \mathbb{E}[\mathrm{RMSE}] = \sigma\sqrt{d}$$

This implies that, for fixed RMSE, the number of landmarks $p$ must grow in step with the sample size $n$, tracing a characteristic diagonal of the sample–landmark grid. In 2D, the optimal diagonal has slope $1/3$; in 3D, the slope is $2/9$. Empirical simulations recover these theoretical expectations, with observed slopes of 0.33 in 2D and approximately 0.22 in 3D (Courtenay, 26 Jan 2026). These scaling laws reflect the fundamental "no-free-lunch" constraints imposed by the geometry of GPA-aligned data.
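
These expressions are easy to sanity-check numerically. The sketch below (with an illustrative noise level $\sigma$ and our own helper name `tangent_dim`) evaluates $d$ and the expected RMSE for a few $(p, k)$ combinations.

```python
def tangent_dim(p, k):
    """Kendall tangent-space dimension after removing translation (k),
    rotation (k(k-1)/2), and scale (1) from the p*k raw coordinates."""
    return p * k - k - k * (k - 1) // 2 - 1

sigma = 0.05                              # illustrative per-direction noise
for p, k in [(10, 2), (10, 3), (40, 2), (40, 3)]:
    d = tangent_dim(p, k)                 # 2D: 2p - 4; 3D: 3p - 7
    rmse = sigma * d ** 0.5               # E[RMSE] = sigma * sqrt(d)
    print(f"p={p}, k={k}: d={d}, expected RMSE={rmse:.3f}")
```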

4. Role of Spatial Covariation and Model Architecture

Neglecting spatial autocorrelation among landmarks exacerbates the downstream impact of GPA-induced contamination. Linear regression on vectorized landmark coordinates (treating all $pk$ coordinates as independent inputs) disregards the true geometric and biological adjacency of landmarks. In contrast, spatially aware architectures, such as convolutional neural networks with kernels respecting the $p \times k$ landmark "grid," can exploit local covariation.

In simulation, a convolutional model achieves a lower mean RMSE than a vectorized linear model, demonstrating that proper modeling of landmark adjacency recovers predictive signal otherwise eroded by both the global alignment and the disregard for spatial structure (Courtenay, 26 Jan 2026).
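
As one hypothetical instantiation of such a spatially aware architecture, the PyTorch sketch below contrasts a vectorized linear baseline with a small 1D convolutional network that treats the $p$ landmarks as a sequence with $k$ coordinate channels; layer widths and kernel sizes are our own illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

p, k = 20, 2  # landmarks and dimensions (illustrative)

# Baseline: flatten the configuration and treat all p*k coordinates as independent.
linear_model = nn.Sequential(nn.Flatten(), nn.Linear(p * k, 1))

# Spatially aware: 1D convolutions over the landmark sequence, with the k
# coordinate axes as channels, so each kernel sees neighboring landmarks jointly.
conv_model = nn.Sequential(
    nn.Conv1d(in_channels=k, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(16, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 1),
)

x = torch.randn(8, k, p)          # batch of 8 aligned configurations
print(conv_model(x).shape)        # -> torch.Size([8, 1])
print(linear_model(x).shape)      # -> torch.Size([8, 1])
```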

5. Methodology for Eliminating Cross-Set Contamination

To prevent GPA-induced data leakage, alignment operations must be partitioned between training and held-out data. Courtenay proposes the following two-stage alignment protocol:

  1. Apply GPA to the training data $X_{\text{train}}$, producing aligned training coordinates and the training reference shape $\bar{X}_{\text{train}}$.
  2. For each test specimen $Y$:
    • Center at origin (identical centroid treatment as for training),
    • [Optional] Remove scale,
    • Rotate $Y$ onto $\bar{X}_{\text{train}}$ using the optimal orthogonal Procrustes rotation,
    • Aggregate to yield the aligned test set.

At no point does test data influence $\bar{X}_{\text{train}}$ or the training alignment, thus maintaining strict independence. In cross-validation contexts, this procedure must be repeated for each train–test split so that independence is preserved in every fold (Courtenay, 26 Jan 2026).
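
Expressed in code, again reusing the helpers from the Section 1 sketch, the protocol reduces to fitting GPA on the training pool and projecting each held-out specimen onto the frozen reference (the data layout and names below are our own assumptions):

```python
import numpy as np

# Illustrative data: lists of (p, k) landmark arrays.
rng = np.random.default_rng(1)
train_shapes = [rng.normal(size=(10, 2)) for _ in range(30)]
test_shapes = [rng.normal(size=(10, 2)) for _ in range(10)]

# Step 1: GPA on the training pool only (helpers from the Section 1 sketch).
train_aligned, ref_train = gpa(train_shapes)

def align_to_train(Y, ref):
    """Step 2: center, (optionally) remove scale, and rotate one held-out
    specimen onto the frozen training reference; the reference is never
    updated, so test data cannot influence the training alignment."""
    return opa_rotate(scale_to_unit(center(Y)), ref)

test_aligned = [align_to_train(Y, ref_train) for Y in test_shapes]
```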

6. Best Practices and Practical Constraints

The following guidelines prevent GPA-induced contamination in morphometric ML workflows:

  • Always perform data splits before any data-dependent transform (e.g., GPA, PCA, normalization).
  • Compute GPA solely on the training set, then align or project held-out specimens.
  • Adhere to the sample-to-landmark scaling of Section 3: growing sample size with landmark count along the slope-$1/3$ diagonal in 2D, or the slope-$2/9$ diagonal in 3D, keeps RMSE stable.
  • Utilize spatially aware ML models such as convolutional or graph-based networks to respect and leverage landmark spatial relationships.
  • During cross-validation, re-align test folds for each split rather than applying a global superimposition (see the sketch below).

These practices ensure valid model evaluation and conform to the statistical geometry of Procrustes shape space (Courtenay, 26 Jan 2026).
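
For the cross-validation point, a per-fold re-alignment sketch using scikit-learn's `KFold` together with the Section 1 helpers might look as follows (the data and the model-fitting step are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
shapes = [rng.normal(size=(10, 2)) for _ in range(50)]  # n=50 specimens

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(shapes):
    # Re-fit GPA on this fold's training specimens only.
    train_aligned, ref = gpa([shapes[i] for i in train_idx])
    # Project held-out specimens onto the frozen training reference.
    test_aligned = [opa_rotate(scale_to_unit(center(shapes[i])), ref)
                    for i in test_idx]
    # ... fit and evaluate the model on this fold's leakage-free alignment ...
```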

7. Foundational Theoretical Constraints

GPA removes $k$ translational, $\frac{k(k-1)}{2}$ rotational, and one scaling degree of freedom from the original $pk$ shape coordinates, leading to a reduced tangent space of dimension $d = pk - k - \frac{k(k-1)}{2} - 1$. As a consequence, GPA is inherently a global, data-dependent transformation: its effect on any configuration depends on the total composition of the pool. This geometric constraint renders it impossible to achieve statistical independence between split subsets if GPA is performed globally prior to partitioning.

The observed diagonal in sample–landmark error scaling underlines the necessity for commensurate growth of sample size with landmark count, or the adoption of model structures that incorporate spatial covariation. The ultimate implication is that, for ML applications in geometric morphometrics, preprocessing workflows must be meticulously partitioned and model evaluation protocols must fully account for the global nature of GPA (Courtenay, 26 Jan 2026).
