GPA-Induced Contamination in Geometric Morphometrics
- GPA-induced contamination is defined as the artificial statistical dependence created when global alignment is performed on the full dataset before splitting, violating independence assumptions.
- Empirical studies and theoretical analysis show that this contamination leads to an optimistic bias in predictive performance, especially in small-sample or high-dimensional settings.
- Implementing a two-stage alignment protocol and leveraging spatially aware models are effective strategies to mitigate data leakage and ensure valid model evaluation.
GPA-induced contamination refers to the artificial statistical dependence introduced into machine learning pipelines for geometric morphometrics (GMM) when Generalized Procrustes Analysis (GPA), a global shape alignment procedure, is performed prior to train–test splitting. This contamination, a form of data leakage, violates the conditional independence assumptions fundamental to statistical and machine learning model evaluation and can induce systematic bias in downstream assessments of predictive performance. The phenomenon and its consequences have been formally characterized by Courtenay (Courtenay, 26 Jan 2026).
1. Generalized Procrustes Analysis and Data-Dependent Preprocessing
Generalized Procrustes Analysis is the canonical method for aligning landmark-based shape data, removing the effects of translation, rotation, and scale. Given landmark configurations $X_1, \dots, X_n \in \mathbb{R}^{k \times m}$ (each with $k$ landmarks in $m$ dimensions), GPA finds scale ($\beta_i$), rotation ($R_i$), and translation ($\gamma_i$) parameters for each specimen to minimize the summed squared pairwise Procrustes distances:

$$\min_{\{\beta_i, R_i, \gamma_i\}} \sum_{i < j} \bigl\lVert (\beta_i X_i R_i + \mathbf{1}_k \gamma_i^{\top}) - (\beta_j X_j R_j + \mathbf{1}_k \gamma_j^{\top}) \bigr\rVert^2.$$

The procedure iterates until a mean reference shape $\bar{\mu}$ stabilizes. Critically, $\bar{\mu}$ and all aligned coordinates depend on the entire pool $\{X_1, \dots, X_n\}$. It is standard (though statistically unsound) practice in morphometric ML to apply GPA globally to the full dataset before partitioning into training, validation, and test splits. As a result, aligned coordinates for test specimens are not independent of the training data.
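The iterative procedure can be sketched as follows. This is a minimal NumPy sketch of classical GPA under the definitions above, not Courtenay's implementation; for brevity, the orthogonal Procrustes step does not exclude reflections.

```python
import numpy as np

def gpa(configs, tol=1e-10, max_iter=100):
    """Iterative GPA sketch: center and scale each configuration, then
    repeatedly rotate every specimen onto a running mean shape until the
    mean stabilizes. configs: array of shape (n, k, m)."""
    X = configs - configs.mean(axis=1, keepdims=True)      # remove translation
    X = X / np.linalg.norm(X, axis=(1, 2), keepdims=True)  # remove scale
    mean = X[0].copy()
    for _ in range(max_iter):
        for i in range(len(X)):
            # orthogonal Procrustes rotation of specimen i onto the mean
            u, _, vt = np.linalg.svd(X[i].T @ mean)
            X[i] = X[i] @ (u @ vt)
        new_mean = X.mean(axis=0)
        new_mean /= np.linalg.norm(new_mean)
        if np.linalg.norm(new_mean - mean) < tol:
            break
        mean = new_mean
    return X, mean
```

Note that the mean shape is recomputed from every specimen on each pass, which is exactly why the alignment of any one specimen depends on the whole pool.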
2. Mechanism and Formal Characterization of Contamination
GPA is a global operator: removing, replacing, or adding even a single specimen alters the reference shape $\bar{\mu}$, which in turn perturbs all aligned specimens. Empirical studies employing bootstrap alignments reveal that the Procrustes distance between a fixed specimen's alignments (with and without itself included in the pool) is inversely related to the total sample size $n$, with effects intensifying at smaller $n$.
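This pool dependence is easy to exhibit directly. The sketch below is a deliberately simplified, one-pass construction (the reference is a plain average of centered, scaled configurations, with no iterative re-rotation): it aligns the same specimen against references built from two different pools and shows that its coordinates change.

```python
import numpy as np

def center_scale(X):
    """Remove translation and scale from one configuration."""
    X = X - X.mean(axis=0)
    return X / np.linalg.norm(X)

def rotate_onto(X, ref):
    """Ordinary orthogonal Procrustes rotation of X onto ref."""
    u, _, vt = np.linalg.svd(X.T @ ref)
    return X @ (u @ vt)

rng = np.random.default_rng(1)
pool = [center_scale(rng.standard_normal((6, 2))) for _ in range(10)]

ref_small = np.mean(pool[:5], axis=0)  # reference from 5 specimens
ref_large = np.mean(pool, axis=0)      # reference from all 10

a = rotate_onto(pool[0], ref_small)
b = rotate_onto(pool[0], ref_large)

# The same specimen lands on different coordinates depending on who else
# is in the pool -- the mechanism behind GPA-induced contamination.
shift = np.linalg.norm(a - b)
print(f"alignment shift: {shift:.4f}")
```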
The core consequence is that the representation of both training and test shapes is mutually entangled. Thus, the aligned test data are statistically dependent on the training set and vice versa. This violates the usual ML paradigm of conditional independence of train and test data, resulting in models that may spuriously "learn" regularities introduced by joint alignment rather than by properties of the underlying data-generating process itself. The net effect is an optimistic bias in performance metrics such as RMSE or classification accuracy, especially acute in small-sample or high-dimensional regimes (Courtenay, 26 Jan 2026).
3. Dimensionality, Predictive Error, and the "Diagonal" in $(n, k)$ Space
The geometry of Procrustes shape space imposes concrete constraints on error scaling under GPA. Following Courtenay, the dimension of the tangent space after aligning $k$ landmarks in $m$ dimensions is

$$d = km - m - \frac{m(m-1)}{2} - 1.$$

Given isotropic noise of variance $\sigma^2$ in each tangent-space direction, the expected total variance after GPA is $\sigma^2 d$. For naive regression or distance-based models acting on these $d$ dimensions, the expected mean-squared error (MSE) and root-mean-squared error (RMSE) are

$$\mathrm{MSE} = \sigma^2 d, \qquad \mathrm{RMSE} = \sigma \sqrt{d}.$$

This implies that, for fixed RMSE, the number of landmarks $k$ should scale roughly linearly with the sample size $n$. The optimal diagonal in the $(n, k)$ grid therefore has a fixed slope determined by the dimension $m$, shallower in 3D than in 2D. Empirical simulations recover these theoretical expectations, with observed slopes of 0.33 in 2D and approximately 0.22 in 3D (Courtenay, 26 Jan 2026). These scaling laws reflect the fundamental "no-free-lunch" constraints imposed by the geometry of GPA-aligned data.
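These quantities are simple to compute. The helpers below (names are ours) evaluate the tangent-space dimension and the corresponding expected RMSE under isotropic noise:

```python
import numpy as np

def tangent_dim(k, m):
    """Dimension of the Procrustes tangent space after removing translation (m),
    rotation (m(m-1)/2), and scale (1) from k landmarks in m dimensions."""
    return k * m - m - m * (m - 1) // 2 - 1

def expected_rmse(k, m, sigma):
    """Expected RMSE for isotropic noise of variance sigma**2 per direction."""
    return sigma * np.sqrt(tangent_dim(k, m))

print(tangent_dim(10, 2))  # 2D: 2k - 4 = 16
print(tangent_dim(10, 3))  # 3D: 3k - 7 = 23
```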
4. Role of Spatial Covariation and Model Architecture
Neglecting spatial autocorrelation among landmarks exacerbates the downstream impact of GPA-induced contamination. Linear regression on vectorized landmark coordinates (treating the $km$ coordinate inputs as independent) disregards the true geometric and biological adjacency of landmarks. In contrast, spatially aware architectures, such as convolutional neural networks with kernels that respect the landmark "grid," can exploit local covariation.
In simulation, a vectorized linear model attains a higher mean RMSE than a convolutional model, demonstrating that proper modeling of landmark adjacency recovers predictive signal otherwise eroded by both the global alignment and the disregard for spatial structure (Courtenay, 26 Jan 2026).
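The benefit of respecting adjacency can be seen in a toy example (ours, not the paper's CNN): when displacements vary smoothly along a chain of landmarks, a local averaging kernel recovers the signal far better than treating each landmark independently.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 200
t = np.linspace(0, 2 * np.pi, k)
signal = np.sin(t)                      # smooth displacement along the landmark chain
noisy = signal + 0.5 * rng.standard_normal(k)

kernel = np.ones(5) / 5                 # local averaging over adjacent landmarks
smoothed = np.convolve(noisy, kernel, mode="same")

mse_independent = np.mean((noisy - signal) ** 2)   # each landmark treated independently
mse_adjacency = np.mean((smoothed - signal) ** 2)  # exploiting spatial adjacency
print(f"independent: {mse_independent:.3f}  adjacency-aware: {mse_adjacency:.3f}")
```

The averaging kernel plays the role of a (fixed, untrained) convolutional filter: it pools information from neighboring landmarks, shrinking noise variance while leaving the smooth signal nearly intact.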
5. Methodology for Eliminating Cross-Set Contamination
To prevent GPA-induced data leakage, alignment operations must be partitioned between training and held-out data. Courtenay proposes the following two-stage alignment protocol:
- Apply GPA to the training data only, producing aligned training shapes and the reference shape $\bar{\mu}$.
- For each test specimen $Y$:
- Center $Y$ at the origin (identical centroid treatment as in training),
- [Optional] Remove scale,
- Rotate $Y$ onto $\bar{\mu}$ using the optimal orthogonal Procrustes rotation,
- Aggregate the results to yield the aligned test set.
At no point does test data influence $\bar{\mu}$ or the training alignment, thus maintaining strict independence. In cross-validation contexts, this procedure must be repeated for each train–test split to ensure that independence is preserved (Courtenay, 26 Jan 2026).
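The two-stage protocol can be sketched as follows. This is a minimal NumPy sketch under the assumptions above; function names are ours, and the orthogonal Procrustes step does not exclude reflections.

```python
import numpy as np

def center_scale(X):
    """Remove translation and scale from one configuration."""
    X = X - X.mean(axis=0)
    return X / np.linalg.norm(X)

def rotate_onto(X, ref):
    """Optimal orthogonal Procrustes rotation of X onto ref."""
    u, _, vt = np.linalg.svd(X.T @ ref)
    return X @ (u @ vt)

def gpa_train(train, tol=1e-10, max_iter=100):
    """Stage 1: fit GPA on the training configurations only.
    Returns aligned training shapes and the frozen reference shape."""
    X = np.stack([center_scale(c) for c in train])
    mean = X[0]
    for _ in range(max_iter):
        X = np.stack([rotate_onto(x, mean) for x in X])
        new_mean = X.mean(axis=0)
        new_mean /= np.linalg.norm(new_mean)
        if np.linalg.norm(new_mean - mean) < tol:
            break
        mean = new_mean
    return X, mean

def align_test(test, mean):
    """Stage 2: project held-out specimens onto the frozen training reference.
    The test data never touch the reference estimation."""
    return np.stack([rotate_onto(center_scale(c), mean) for c in test])
```

Because `mean` is computed from the training set alone, the aligned test coordinates are a deterministic function of each test specimen and the training fit, preserving the train–test independence the section describes.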
6. Best Practices and Practical Constraints
The following guidelines are recommended to avoid GPA-induced contamination in morphometric ML workflows:
- Always perform data splits before any data-dependent transform (e.g., GPA, PCA, normalization).
- Compute GPA solely on the training set, then align or project held-out specimens.
- Adhere to sample-to-landmark scaling: sample size $n$ should grow roughly linearly with landmark count $k$ (more steeply in 3D than in 2D) to ensure RMSE stability.
- Utilize spatially aware ML models such as convolutional or graph-based networks to respect and leverage landmark spatial relationships.
- During cross-validation, re-align test folds for each split rather than applying a global superimposition.
These practices ensure valid model evaluation and conform to the statistical geometry of Procrustes shape space (Courtenay, 26 Jan 2026).
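In code, the per-fold re-alignment requirement amounts to refitting the superimposition inside the cross-validation loop. The skeleton below is a sketch with hypothetical callables `fit_gpa`, `project`, and `evaluate` supplied by the user (e.g., a training-only GPA, a projection onto its reference, and a model-scoring routine).

```python
import numpy as np

def kfold_indices(n, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(idx, n_folds):
        yield np.setdiff1d(idx, fold), fold

def cross_validate(configs, n_folds, fit_gpa, project, evaluate):
    """Leakage-free CV: the superimposition is refit on every training fold,
    and the held-out fold is projected onto that fold's frozen reference."""
    scores = []
    for train_idx, test_idx in kfold_indices(len(configs), n_folds):
        aligned_train, mean = fit_gpa(configs[train_idx])  # GPA on training fold only
        aligned_test = project(configs[test_idx], mean)    # test fold never shapes the reference
        scores.append(evaluate(aligned_train, aligned_test))
    return float(np.mean(scores))
```

The anti-pattern this replaces is a single global GPA followed by k-fold splitting of the already-aligned coordinates, which leaks every test fold into every training fold's representation.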
7. Foundational Theoretical Constraints
GPA removes $m$ translational, $m(m-1)/2$ rotational, and one scaling degree of freedom from the original $km$ shape coordinates, leading to a reduced tangent space of dimension $km - m - m(m-1)/2 - 1$. As a consequence, GPA is inherently a global, data-dependent transformation: its effect on any configuration depends on the total composition of the pool. This geometric constraint makes it impossible to achieve statistical independence between split subsets if GPA is performed globally prior to partitioning.
The observed diagonal in $(n, k)$ error scaling underlines the necessity of commensurate growth of sample size with landmark count, or of model structures that incorporate spatial covariation. The ultimate implication is that, for ML applications in geometric morphometrics, preprocessing workflows must be meticulously partitioned and model evaluation protocols must fully account for the global nature of GPA (Courtenay, 26 Jan 2026).