UMAP Mixup for On-Manifold Data Augmentation
- The paper introduces an on-manifold augmentation scheme combining UMAP embeddings with Mixup to confine synthetic samples to plausible data regions.
- The methodology leverages a UMAP-based embedding to perform Mixup in latent space, thereby enhancing regularization and reducing off-manifold interpolation.
- Experimental evaluations demonstrate lower RMSE on tabular and time-series regression tasks compared to standard Mixup and manifold Mixup approaches.
UMAP Mixup is a data augmentation scheme designed for deep learning regression models, in which synthetic examples are generated through convex combinations of embedded representations constrained to remain close to the data manifold. The method leverages a UMAP-based embedding to perform Mixup operations "on-manifold," addressing failures of conventional Mixup that may produce implausible samples by interpolating beyond the support of the data distribution. UMAP Mixup seeks to strike a balance between enhanced regularization from Mixup and geometric fidelity via manifold learning, leading to improved generalization, especially in tabular and time-series domains (El-Laham et al., 2023).
1. Origins and Motivation
Standard Mixup, as proposed by Zhang et al. (2017), replaces input-label pairs $(x_i, y_i)$ and $(x_j, y_j)$ with convex interpolations $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$ and $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$, where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. Mixup encourages models to exhibit local linearity in the vicinity of training samples, thereby reducing overfitting and enhancing test-time robustness. However, when the input and label spaces do not admit a globally linear structure (common in non-visual or structured data), this process can generate samples lying outside the true data manifold (i.e., in "off-manifold" regions), leading to manifold intrusion and implausible training examples.
UMAP (Uniform Manifold Approximation and Projection) is a nonlinear dimensionality reduction technique that constructs a data-space graph encoding local neighborhood relationships, with membership probabilities $P = \{p_{ij}\}$, then seeks an embedding-space graph $Q = \{q_{ij}\}$ that preserves these relationships by minimizing the fuzzy cross-entropy
$$C(P, Q) = \sum_{i \neq j} \left[ p_{ij} \log \frac{p_{ij}}{q_{ij}} + (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}} \right].$$
By embedding data via a learned map $h_{\theta_1}$, UMAP regularization enforces geometric preservation in intermediate representations. UMAP Mixup leverages this property, executing Mixup in embedding space to restrict interpolations to locations supported by the underlying data geometry.
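The cross-entropy objective above can be sketched numerically. The following is a minimal NumPy illustration over a flat array of candidate edges; the function name and toy values are illustrative, not from the paper:

```python
import numpy as np

def umap_cross_entropy(p, q, eps=1e-9):
    """Fuzzy cross-entropy C(P, Q) between data-space edge probabilities
    p_ij and embedding-space probabilities q_ij, both given as flat
    arrays over the same set of candidate edges."""
    p = np.clip(p, eps, 1.0 - eps)
    q = np.clip(q, eps, 1.0 - eps)
    attract = p * np.log(p / q)                          # pulls connected pairs together
    repel = (1.0 - p) * np.log((1.0 - p) / (1.0 - q))    # pushes unconnected pairs apart
    return float(np.sum(attract + repel))

# Identical graphs incur (near-)zero cost; mismatched ones are penalized.
p = np.array([0.9, 0.1, 0.8])
assert umap_cross_entropy(p, p) < 1e-6
assert umap_cross_entropy(p, np.array([0.1, 0.9, 0.2])) > 1.0
```

The two terms make the attraction/repulsion structure explicit: the first penalizes embeddings that separate strongly connected pairs, the second penalizes embeddings that collapse weakly connected ones.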
2. Methodology and Algorithm
The UMAP Mixup approach parameterizes the predictor as $f_\theta = g_{\theta_2} \circ h_{\theta_1}$, with $h_{\theta_1}$ serving as the UMAP-inspired embedding layer and $g_{\theta_2}$ as the regressor.
Single iteration workflow:
- Sample a positive edge $(i, j)$ from the UMAP data graph (with probability $p_{ij}$).
- Draw $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$.
- Compute embeddings $z_i = h_{\theta_1}(x_i)$ and $z_j = h_{\theta_1}(x_j)$.
- Interpolate: $\tilde{z} = \lambda z_i + (1 - \lambda) z_j$.
- Predict: $\tilde{y} = g_{\theta_2}(\tilde{z})$.
- Target: $y_{\mathrm{mix}} = \lambda y_i + (1 - \lambda) y_j$.
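The single-iteration workflow above can be sketched as follows, using hypothetical linear maps in place of the network's embedding $h_{\theta_1}$ and regressor $g_{\theta_2}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for the embedding h_{theta_1} and the
# regressor g_{theta_2} (the method uses neural network layers).
W_h = rng.normal(size=(4, 2))   # 4-d input -> 2-d latent space
w_g = rng.normal(size=2)        # 2-d latent -> scalar prediction
h = lambda x: x @ W_h
g = lambda z: z @ w_g

# One iteration for a sampled positive edge (x_i, x_j)
x_i, x_j = rng.normal(size=4), rng.normal(size=4)
y_i, y_j = 1.0, 3.0

lam = rng.beta(0.4, 0.4)                       # lambda ~ Beta(alpha, alpha)
z_tilde = lam * h(x_i) + (1 - lam) * h(x_j)    # interpolate embeddings
y_hat = g(z_tilde)                             # predict from the mixed latent
y_mix = lam * y_i + (1 - lam) * y_j            # interpolated regression target
loss = (y_hat - y_mix) ** 2                    # squared-error Mixup loss
assert loss >= 0 and z_tilde.shape == (2,)
```

Because these stand-ins are linear, mixing latents coincides with mixing predictions; with nonlinear layers the two differ, which is exactly where on-manifold interpolation matters.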
The batchwise Mixup loss is $\mathcal{L}_{\mathrm{mix}} = \frac{1}{|E^{+}|} \sum_{(i,j) \in E^{+}} \ell\big(\tilde{y}_{ij}, y_{\mathrm{mix},ij}\big)$, typically with $\ell$ taken as mean-squared error. The UMAP regularizer is incorporated into the total objective
$$\mathcal{L} = \mathcal{L}_{\mathrm{mix}} + \gamma \, C(P, Q_{\theta_1}).$$
In mini-batch training, positive edges $E^{+}$ and negative pairs $E^{-}$ are sampled, yielding the practical estimate $\hat{C}(P, Q_{\theta_1})$ of the full cross-entropy.
Algorithmic skeleton:
```
Input: D = {(x_i, y_i)}, UMAP params, Mixup α, reg weight γ
Initialize θ₁, θ₂
Precompute or update edge-probs p_{i,j} (via UMAP)
for epoch:
    Sample minibatch of positive edges E⁺ and negatives E⁻
    Compute UMAP regularizer Ĉ(P, Q_{θ₁}) over E⁺ ∪ E⁻
    for each (i, j) in E⁺:
        λ ← Beta(α, α)
        z_i ← h_{θ₁}(x_i), z_j ← h_{θ₁}(x_j)
        z̃ ← λ z_i + (1 − λ) z_j
        ỹ ← g_{θ₂}(z̃), y_mix ← λ y_i + (1 − λ) y_j
        L_mix += ℓ(ỹ, y_mix)
    Total loss L ← (L_mix / |E⁺|) + γ · Ĉ
    θ ← θ − η ∇_θ L
until convergence
```
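The loss computation in this skeleton can be sketched in NumPy. The linear layers, the simplified $q_{ij} = 1/(1 + d^2)$ form, and all names below are assumptions for illustration; a real implementation would use neural layers and autodiff for the gradient step:

```python
import numpy as np

rng = np.random.default_rng(1)

def total_loss(X, y, edges_pos, edges_neg, p, W_h, w_g, alpha=0.4, gamma=0.1):
    """Forward pass of the combined objective on one mini-batch: mean
    Mixup loss over positive edges plus gamma times an edge-sampled
    cross-entropy estimate. Linear maps stand in for h_{theta_1} and
    g_{theta_2}; real training would backpropagate through this."""
    Z = X @ W_h                                      # embeddings z_i = h(x_i)

    # Mixup term over sampled positive edges E+
    L_mix = 0.0
    for i, j in edges_pos:
        lam = rng.beta(alpha, alpha)
        z_t = lam * Z[i] + (1 - lam) * Z[j]          # interpolate latents
        y_mix = lam * y[i] + (1 - lam) * y[j]        # interpolate targets
        L_mix += (z_t @ w_g - y_mix) ** 2            # squared error
    L_mix /= len(edges_pos)

    def q(i, j):
        # Simplified embedding-space edge probability; UMAP's general
        # form is 1 / (1 + a * d^(2b)) with fitted constants a, b.
        return 1.0 / (1.0 + np.sum((Z[i] - Z[j]) ** 2))

    # Sampled cross-entropy: attraction on positives, repulsion on negatives
    C = sum(-p[e] * np.log(q(*e) + 1e-9) for e in edges_pos)
    C += sum(-np.log(1.0 - q(*e) + 1e-9) for e in edges_neg)
    C /= len(edges_pos) + len(edges_neg)

    return L_mix + gamma * C

# Toy usage with random data and hypothetical shapes
X, y = rng.normal(size=(6, 4)), rng.normal(size=6)
W_h, w_g = rng.normal(size=(4, 2)), rng.normal(size=2)
edges_pos, edges_neg = [(0, 1), (2, 3)], [(0, 5), (1, 4)]
p = {e: 0.9 for e in edges_pos}
L = total_loss(X, y, edges_pos, edges_neg, p, W_h, w_g)
assert np.isfinite(L) and L > 0
```

Note how negatives contribute only repulsion terms: their data-space probability is treated as zero, which is the standard negative-sampling approximation of the full cross-entropy.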
3. Theoretical Rationale
UMAP Mixup regularizes the model under the vicinal risk minimization (VRM) framework, specifically addressing the limitation that standard Mixup may interpolate outside the high-density region of the data. By constraining $h_{\theta_1}$ through UMAP regularization, the latent $z$-space preserves local neighborhoods and global topology, contingent on the data satisfying mild manifold assumptions.
Mixup in $z$-space thus approximates interpolation on a "coordinate chart" closely tracking the data manifold, minimizing the selection of off-manifold synthetic samples. This constraint helps prevent the model from fitting spurious modes and enhances generalization. VRM theory supports the view that restricting synthetic sample generation to plausible regions tightens generalization error bounds. Empirical observations corroborate improved model performance in settings where the manifold assumption is satisfied (tabular UCI and time-series forecasting tasks) (El-Laham et al., 2023).
4. Experimental Evaluation
UMAP Mixup was evaluated across multiple regression benchmarks:
- Tabular UCI regression: Boston Housing (13 features → house price), Concrete compressive strength (8 features → strength), Yacht hydrodynamics (6 features → residual resistance). Model: 2-layer MLP trained with the Adam optimizer.
- Time-Series Forecasting: One-step forecasting with 60-day lookback using LSTM; datasets include GOOG (stable regime), RCL (distributional shift), GME (high-volatility).
The principal evaluation metric was RMSE (Root-Mean-Squared Error) on held-out folds, with the following comparative summary:
| Dataset | ERM | Mixup | Manifold Mixup | UMAP Mixup |
|---|---|---|---|---|
| Boston Housing | 3.14 ± 0.67 | 3.01 ± 0.71 | 3.10 ± 0.76 | 3.27 ± 0.66 |
| Concrete | 5.11 ± 0.59 | 5.92 ± 0.55 | 5.08 ± 0.62 | 4.83 ± 0.79 |
| Yacht | 0.91 ± 0.34 | 4.19 ± 0.63 | 0.80 ± 0.24 | 0.71 ± 0.21 |
| GOOG | 2.47 ± 0.05 | 2.47 ± 0.03 | 2.50 ± 0.03 | 2.43 ± 0.04 |
| RCL | 4.74 ± 0.69 | 4.07 ± 0.43 | 4.30 ± 0.60 | 3.13 ± 0.61 |
| GME | 3.66 ± 0.33 | 2.77 ± 0.49 | 3.83 ± 0.47 | 2.73 ± 0.37 |
UMAP Mixup achieves best or near-best test RMSE in the majority of experiments, with the most pronounced gains under regime shifts or distributional perturbations (e.g., RCL and GME time series).
5. Hyperparameter Configuration
- Mixup parameter $\alpha$: optimized by cross-validation.
- UMAP regularizer weight $\gamma$: selected by cross-validation, with a fixed default otherwise.
- UMAP graph construction:
- Number of neighbors ($k$): 15–50, selected by grid search.
- Metric: Euclidean; min_dist = 0.1 (default).
- Graph edges are computed prior to training and sampled in mini-batches.
Default configuration values for $\alpha$, $\gamma$, $k$, and the latent dimension follow the selections above.
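As a rough illustration of precomputing edge probabilities before training, the sketch below builds a $k$-NN graph with distance-decayed weights. This is a simplified stand-in for UMAP's fuzzy simplicial set construction, which additionally fits per-point bandwidths by binary search and symmetrizes the graph via a fuzzy union; all names here are hypothetical:

```python
import numpy as np

def knn_edge_probs(X, k=3):
    """Assign each point's k nearest neighbors an edge probability that
    decays with distance beyond the nearest neighbor, normalized by a
    crude local neighborhood scale (a simplification of UMAP's graph)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # exclude self-edges
    probs = {}
    for i in range(n):
        nbrs = np.argsort(d[i])[:k]                  # k nearest neighbors
        rho = d[i, nbrs[0]]                          # nearest-neighbor distance
        sigma = max(d[i, nbrs].mean() - rho, 1e-9)   # crude local scale
        for j in nbrs:
            probs[(i, j)] = float(np.exp(-max(d[i, j] - rho, 0.0) / sigma))
    return probs

X = np.random.default_rng(2).normal(size=(20, 5))
p = knn_edge_probs(X, k=3)
assert len(p) == 60                                  # 20 points x 3 neighbors
assert all(0.0 < v <= 1.0 for v in p.values())       # valid edge probabilities
```

Each point's nearest neighbor always receives probability 1, mirroring UMAP's local-connectivity assumption; mini-batch training then samples positive edges from this dictionary in proportion to their weights.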
6. Practical and Implementation Considerations
UMAP Mixup is well-suited for data believed to inhabit a low-dimensional manifold, including tabular, time-series, and medical-signal modalities. The primary computational cost arises from maintaining and sampling UMAP edges and computing the regularizer, which under mini-batch sampling scales linearly in the number of sampled edges per epoch and is compatible with GPU acceleration.
Under computational constraints, the method is amenable to warm-starting from a pretrained UMAP embedding or to down-scheduling the UMAP regularization (e.g., applying it only every 2–5 steps). Visualization of the latent $z$-space (e.g., via t-SNE) is recommended to confirm that local neighborhoods are preserved and that Mixup paths lie within high-density regions.
UMAP Mixup operationalizes the principle of "on-manifold" Mixup through explicit topological regularization in embedding space, synthesizing training examples that adhere more closely to the data’s intrinsic geometry and, in practice, yielding improved generalization in non-visual regression contexts (El-Laham et al., 2023).