Simplicial SMOTE: Geometric Oversampling
- Simplicial SMOTE is an advanced oversampling method that leverages high-dimensional simplices to synthesize minority class samples from local convex hulls.
- It generalizes classical SMOTE by sampling over p-simplices, thus offering improved distributional coverage and enabling closer approximation to the decision boundary.
- Empirical evaluations demonstrate significant improvements in F₁-score and Matthews correlation coefficient across various benchmark and synthetic datasets.
Simplicial SMOTE is an advanced geometric oversampling algorithm for addressing class imbalance in supervised learning. Extending the original SMOTE paradigm, it employs high-dimensional simplicial complexes to generate synthetic minority class samples that more densely and flexibly cover the feature space. By sampling from convex hulls or simplices—rather than the edges—of a k-nearest-neighbor (kNN) graph, Simplicial SMOTE achieves improved local distributional coverage and enables algorithmic generalizations of several established SMOTE variants (Kachan et al., 5 Mar 2025).
1. Background and Motivation
SMOTE (Synthetic Minority Oversampling Technique) introduced a geometric mechanism for class balancing by interpolating new minority points between existing ones along the edges defined by their kNN graph. Though successful, SMOTE’s reliance on 1-simplices (edges) restricts the synthetic distribution to unions of line segments, leading to insufficient filling of high-dimensional or nonconvex minority class regions. Tools from topological data analysis, specifically Vietoris–Rips clique complexes, supply the theoretical framework for more expressive local models, motivating the construction of Simplicial SMOTE (Kachan et al., 5 Mar 2025).
2. Construction of Neighborhood Simplicial Complex
Given a minority-class sample set X = {x₁, …, xₙ}, a symmetric kNN graph G is constructed using the binary relation: {xᵢ, xⱼ} is an edge of G iff xⱼ ∈ Nₖ(xᵢ) or xᵢ ∈ Nₖ(xⱼ), with Nₖ(x) denoting the k nearest neighbors of x. The higher-order neighborhood geometry is captured via the clique (Vietoris–Rips) complex Cl(G), whose p-simplices are all vertex subsets of cardinality p + 1 in which every pair is an edge in G. The d-skeleton includes all simplices of dimension at most d.
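The construction above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation; the function names (`symmetric_knn_graph`, `clique_simplices`) and the brute-force clique enumeration (fine for small n and d, exponential in general) are assumptions made here for clarity:

```python
import numpy as np
from itertools import combinations

def symmetric_knn_graph(X, k):
    """Edge set of the symmetric kNN graph: {i, j} is an edge iff
    j is among the k nearest neighbors of i, or vice versa."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)              # exclude self-neighbors
    nn = np.argsort(dists, axis=1)[:, :k]        # indices of k nearest neighbors
    edges = set()
    for i in range(len(X)):
        for j in nn[i]:
            edges.add(frozenset((i, int(j))))
    return edges

def clique_simplices(n, edges, d):
    """All simplices of the clique (Vietoris-Rips) complex up to dimension d:
    a set of p + 1 vertices is a p-simplex iff every pair is an edge."""
    simplices = []
    for p in range(1, d + 1):                    # a p-simplex has p + 1 vertices
        for verts in combinations(range(n), p + 1):
            if all(frozenset(e) in edges for e in combinations(verts, 2)):
                simplices.append(verts)
    return simplices

# Four corners of a unit square with k = 2: each corner's two nearest
# neighbors are the adjacent corners, so the edges are the square's sides
# and no 2-simplices (triangles) appear.
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
edges = symmetric_knn_graph(X, k=2)
print(len(edges))                                # 4 edges, no diagonals
```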
3. Simplicial SMOTE Algorithm
Simplicial SMOTE synthesizes minority class points by uniformly sampling from the convex hulls (simplices) in the d-skeleton of Cl(G). The algorithm proceeds as follows:
- Input: Minority points X = {x₁, …, xₙ}, neighborhood size k, maximum simplex dimension d, target number of synthetic samples N.
- Build Cl(G), enumerate all maximal simplices in its d-skeleton.
- For each synthetic point:
  - Uniformly sample a maximal simplex σ = {x₀, …, xₚ} from the enumeration.
  - Draw barycentric weights (λ₀, …, λₚ) ~ Dir(1, …, 1) (uniform on the p-simplex).
  - Compute x_new = Σᵢ λᵢxᵢ over the sampled vertices.
  - Append x_new to the augmented set.
This process generalizes the original SMOTE’s pairwise interpolation to convex combinations over arbitrary local neighborhoods, ensuring that synthetic instances can reside anywhere in the convex hull of up to d + 1 close minority points (Kachan et al., 5 Mar 2025).
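The sampling loop above admits a direct numpy sketch. The function names and the module-level `rng` are illustrative assumptions; Dirichlet(1, …, 1) weights give the uniform distribution on the simplex, so each synthetic point is a uniform draw from the chosen convex hull:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_simplex(X, simplex, rng):
    """One synthetic point, uniform on the convex hull of the simplex's
    vertices: barycentric weights drawn from Dirichlet(1, ..., 1)."""
    verts = X[list(simplex)]                     # shape (p + 1, n_features)
    lam = rng.dirichlet(np.ones(len(verts)))     # uniform on the p-simplex
    return lam @ verts                           # convex combination

def simplicial_oversample(X, simplices, n_new, rng):
    """Generate n_new synthetic points, choosing a maximal simplex
    uniformly at random for each."""
    picks = rng.integers(len(simplices), size=n_new)
    return np.array([sample_from_simplex(X, simplices[i], rng) for i in picks])

# Every synthetic point lands inside the triangle's convex hull.
X = np.array([[0., 0.], [1., 0.], [0., 1.]])
synth = simplicial_oversample(X, [(0, 1, 2)], n_new=5, rng=rng)
```

Classical SMOTE is recovered as the special case where every `simplex` has exactly two vertices.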
4. Comparison with Classical SMOTE
The classical SMOTE algorithm samples only from 1-simplices, each synthetic point being x_new = λxᵢ + (1 − λ)xⱼ with λ ~ U(0, 1). In contrast, Simplicial SMOTE generalizes sampling to unions of p-simplices (convex hulls of p + 1 points), drastically expanding the local model to higher-dimensional domains.
Empirically and theoretically, this confers two principal advantages:
- Distributional coverage: High-dimensional simplices densely fill the local convex region, minimizing gaps present in purely edge-based models.
- Boundary proximity: For a set of minority points equidistant from a majority point at the origin, the distance from the origin to their convex hull (zero for a filled triangle containing the origin) is strictly less than the distance to any edge (r/2 for an equilateral triangle of circumradius r), permitting a closer approximation to the minority–majority decision boundary (Kachan et al., 5 Mar 2025).
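The boundary-proximity claim can be checked numerically. The sketch below, a worked example rather than the paper's construction, places three minority points at radius 1 around a majority point at the origin: edge-based interpolation never gets closer to the origin than r/2, while the filled 2-simplex contains the origin itself (its centroid):

```python
import numpy as np

# Three minority points equidistant (radius 1) from a majority point at the origin.
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
tri = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def dist_origin_to_segment(a, b):
    """Distance from the origin to the segment ab (a 1-simplex)."""
    ab = b - a
    t = np.clip(-a @ ab / (ab @ ab), 0.0, 1.0)   # clamp foot of perpendicular
    return np.linalg.norm(a + t * ab)

# Closest point reachable by edge-based SMOTE: r/2 from the origin.
min_edge_dist = min(dist_origin_to_segment(tri[i], tri[(i + 1) % 3]) for i in range(3))
print(min_edge_dist)                              # 0.5

# The filled 2-simplex reaches the origin: barycentric weights (1/3, 1/3, 1/3)
# give the centroid, which coincides with the origin here.
print(np.linalg.norm(tri.mean(axis=0)))           # ~0.0
```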
5. Simplicial Extensions of SMOTE Variants
Simplicial SMOTE’s geometric data model enables direct generalization of graph-based SMOTE extensions:
- Simplicial Borderline SMOTE: Restricts simplex selection to the “borderline set” of minority points and their minority neighbors, sampling only from simplices containing at least one borderline point.
- Simplicial Safe-level SMOTE: For a simplex σ, the Dirichlet concentration parameters are set in proportion to each vertex’s safe level sl(xᵢ) (the number of minority points among its k nearest neighbors), biasing the sampling toward safe minority regions.
- Simplicial ADASYN: Assigns each simplex an adaptivity weight proportional to the share of majority points in its vertices’ neighborhoods, allocating more synthetic samples to neighborhoods with higher majority presence.
These extensions maintain the simplicial sampling core, with modifications to either the sampling domain, barycentric weight distribution, or frequency per simplex (Kachan et al., 5 Mar 2025).
6. Theoretical Properties and Empirical Performance
The union of p-simplices more accurately models the local convex hulls of minority clusters. As the maximal simplex dimension d approaches the intrinsic dimensionality of a cluster, the average projection distance from majority points to the minority convex domain decreases, allowing the local decision boundary to move closer to the majority class.
Empirical results across 21 UCI/LIBSVM benchmark datasets (dimensions 7–294, imbalance ratios 9–130) and 4 synthetic topological datasets (moons, Swiss rolls, concentric spheres, Gaussian in sphere) demonstrate:
- Simplicial SMOTE yields mean F₁-score improvements over SMOTE of approximately 4.5% for k-NN classifiers (up to +29.3% on “car_eval_4”) and 5.0% for gradient boosting (up to +25.7% on “oil”).
- Consistent improvements are observed in Matthews correlation coefficient.
- Simplicial forms of Borderline SMOTE, Safe-level SMOTE, and ADASYN outperform their classical counterparts.
- On synthetic data with complex topology, non-local sampling methods (e.g., global or Gaussian oversampling) fail, while Simplicial SMOTE achieves the best F₁-score (Kachan et al., 5 Mar 2025).
7. Context and Implications
Simplicial SMOTE generalizes the SMOTE framework by leveraging higher-order geometric and topological constructs to obtain a more representative and flexible sampling of minority class regions. A plausible implication is that this approach could be extended beyond binary class imbalance to structured, multi-class, or manifold-based learning problems. The empirical superiority of simplicial variants across diverse architectures and data modalities suggests broad utility for imbalanced learning scenarios where local minority structure is crucial (Kachan et al., 5 Mar 2025).