
SMOTE: Synthetic Minority Oversampling Technique

Updated 16 January 2026
  • SMOTE is a data-level approach that generates synthetic minority samples via linear interpolation of nearest neighbors to overcome class imbalance in classification tasks.
  • Its methodology densifies the minority class decision region by blending features of nearby samples, thereby reducing overfitting compared to random duplication.
  • Numerous variants, including Borderline-SMOTE and Deep SMOTE, address boundary sparsity and replication issues, improving model performance in domains such as medical diagnosis and fraud detection.

The Synthetic Minority Oversampling Technique (SMOTE) is a widely adopted data-level method for mitigating class imbalance in supervised learning, particularly classification. SMOTE constructs synthetic instances of the minority class as convex combinations of minority instances and their nearest neighbors in feature space, systematically expanding the decision region assigned to the minority class. It is foundational to a large family of oversampling algorithms, and its efficacy and mechanism have led to numerous enhancements and domain-specific adaptations.

1. Algorithmic Design and Mathematical Foundations

SMOTE operates by generating new minority samples through linear interpolation. Let the minority sample set be $D_{\text{min}} = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\}$, and let $K$ be the nearest-neighbor parameter (default $K=5$). For each $\mathbf{x}_i \in D_{\text{min}}$, the $K$ nearest neighbors among the minority set are identified using an appropriate metric (typically Euclidean for continuous features). For the desired oversampling ratio $N$ (expressed as a percentage), SMOTE proceeds as follows:

  • For each $\mathbf{x}_i$, generate $\lceil N/100 \rceil$ synthetic samples:

    • Randomly select a neighbor $\mathbf{x}_{z_i} \in N_K(\mathbf{x}_i)$, the $K$-NN set of $\mathbf{x}_i$.
    • Draw a random interpolation weight $\lambda \sim \mathcal{U}(0,1)$.
    • Construct the synthetic vector by

    $$\mathbf{x}_{\text{new}} = \mathbf{x}_i + \lambda\,(\mathbf{x}_{z_i} - \mathbf{x}_i), \quad \lambda \in [0,1]$$

  • The synthetic instances are appended to the training set as minority-class examples (Chawla et al., 2011, Apostolopoulos, 2020).

This approach generalizes naive oversampling (random duplication) by densifying the minority region in feature space, without restricting new instances to pre-existing points.
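The interpolation procedure above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: it assumes a purely continuous feature matrix `X_min` and uses brute-force neighbor search, and the function name `smote` is ours.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority points by interpolating each
    random seed toward one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    m = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :k]     # k nearest neighbors per point
    seeds = rng.integers(0, m, size=n_new)             # random seed indices
    nbrs = knn[seeds, rng.integers(0, k, size=n_new)]  # one neighbor each
    lam = rng.random((n_new, 1))           # interpolation weights in [0, 1)
    return X_min[seeds] + lam * (X_min[nbrs] - X_min[seeds])

X_min = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
X_syn = smote(X_min, n_new=8, k=3, rng=0)
```

Because every synthetic point is a convex combination of two minority points, the generated samples here remain inside the unit square spanned by the seeds, which is exactly the convex-hull restriction discussed below.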

2. Theoretical Analysis and Density Properties

Recent theoretical works have revealed that standard SMOTE, under default settings, asymptotically approximates the minority-class density but chiefly by duplicating extant samples. Specifically, as the number of minority samples $n$ increases with $K/n \to 0$, the law of generated samples converges to the true minority distribution. However, there is a "copying effect": the distribution of synthetic points is biased towards their central seed, and SMOTE-generated density vanishes near the boundary of the minority support. This underpopulates the class boundary, reducing support for outlier or frontier instances (Sakho et al., 2024). The precise density formula involves integration over the $K$-NN ball of the seed $\mathbf{x}_c$, and boundary effects are made rigorous through non-asymptotic bounds.

Enhancements such as CV-SMOTE (cross-validated neighbor parameter) and Multivariate Gaussian SMOTE (MGS) have been proposed to counteract boundary sparsity and the tendency to replicate density, respectively. MGS augments the minority class by sampling from the local empirical Gaussian distribution determined by the $K$ neighbors, populating regions beyond the minority convex hull (Sakho et al., 2024).
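The MGS idea of replacing interpolation with local Gaussian sampling can be sketched as follows. The neighborhood definition (seed plus its $K$ neighbors) and the covariance regularization are illustrative assumptions on our part, not necessarily the exact estimator of Sakho et al.:

```python
import numpy as np

def mgs_sample(X_min, n_new, k=5, rng=None):
    """Draw synthetic points from a Gaussian fitted to each random seed's
    local k-NN neighborhood (seed included), instead of interpolating.
    Unlike linear interpolation, draws may fall outside the convex hull."""
    rng = np.random.default_rng(rng)
    m, p = X_min.shape
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]
    out = np.empty((n_new, p))
    for t, s in enumerate(rng.integers(0, m, size=n_new)):
        nbh = np.vstack([X_min[s:s + 1], X_min[knn[s]]])    # seed + neighbors
        mu = nbh.mean(axis=0)
        cov = np.cov(nbh, rowvar=False) + 1e-6 * np.eye(p)  # regularized
        out[t] = rng.multivariate_normal(mu, cov)
    return out
```

Because the samples come from a full-covariance Gaussian rather than a segment between two points, this variant can populate the boundary regions where plain SMOTE's density vanishes.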

3. Algorithmic Enhancements and Variants

The SMOTE family includes numerous variants that address its limitations or adapt it to specific domains:

  • Borderline-SMOTE: Focuses sample generation near the class boundary as determined by majority-neighbor counts in the $K$ nearest-neighbor graph (Glazkova, 2020).
  • ADASYN: Allocates more synthetic samples to “hard-to-learn” minority points surrounded by majority neighbors, using an adaptive distribution (Glazkova, 2020).
  • SMOTE-ENC: Encodes nominal features by a scalar reflecting minority-class affinity, enabling interpolation on both mixed and purely nominal feature sets (Mukherjee et al., 2021).
  • Geometric SMOTE (G-SMOTE): Samples synthetic points uniformly inside a deformed spheroid or hypersphere around the minority seed, bounded by the nearest majority instance, increasing diversity and coverage (Douzas et al., 2017).
  • Deep SMOTE: Uses a neural network regressor to learn the interpolation operator, generating reproducible synthetic samples and reducing run-to-run stochasticity seen in classic SMOTE (Mansourifar et al., 2020).
  • CGMOS (Certainty Guided Minority OverSampling): Weights minority seeds by the expected gain in classifier certainty under a Bayesian KDE model, with a provable guarantee to outperform vanilla SMOTE on training data (Zhang et al., 2016).
  • Hybrid Methods: Frameworks such as SMOTE-RUS-NC combine SMOTE with Neighborhood Cleaning Rule and Random UnderSampling to enhance sample quality and stability, outperforming classic sampling strategies even at extreme imbalance ratios (Newaz et al., 2022).
  • Counterfactual-Based SMOTE (CFA-SMOTE): Synthesizes plausible minority samples using counterfactual reasoning from XAI, then densifies them via SMOTE for domains with rare outlier events (Temraz et al., 14 Nov 2025).
  • SMOTE-CLS (SMOTE with Customizing Latent Space): Integrates a VAE-based filtering of minority points in latent space by disentangling class and sample difficulty, interpolating only among high-density clusters to remove noise and improve small-disjunct coverage (Hong et al., 2024).
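To make the first variant above concrete, the "danger"-point selection at the heart of Borderline-SMOTE can be sketched as follows (a simplified illustration of the selection rule only; thresholds follow the common Borderline-SMOTE1 convention, and the function name is ours):

```python
import numpy as np

def borderline_seeds(X, y, minority=1, k=5):
    """Return indices of 'danger' minority points: those whose k-NN
    neighborhood over the full dataset is majority-dominated but not
    entirely majority (all-majority neighborhoods are treated as noise).
    Only these seeds would then be passed to the SMOTE interpolation step."""
    danger = []
    for i in np.where(y == minority)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                       # exclude the point itself
        nbrs = np.argsort(d)[:k]
        n_maj = int(np.sum(y[nbrs] != minority))
        if k / 2 <= n_maj < k:              # borderline, not pure noise
            danger.append(i)
    return np.array(danger, dtype=int)
```

On a toy dataset with one minority cluster far from the majority and two minority points sitting next to it, only the two near-boundary points are selected, so interpolation is concentrated where the decision boundary is actually contested.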

4. Practical Applications and Evaluation

SMOTE and its variants are employed in domains with severe class imbalance, e.g., medical diagnosis, fraud detection, climate-change-driven outlier analysis, and multi-class text classification (Apostolopoulos, 2020, Temraz et al., 14 Nov 2025, Glazkova, 2020). Empirical evaluations employ metrics beyond overall accuracy: ROC AUC, F₁, precision, recall, and (when applicable) confusion matrices reporting real-vs-synthetic classification error.

Typical experimental setups balance minority-to-majority class counts either exactly or to a specified ratio. Classifiers include tree-based learners (Random Forest, REP-Tree), neural networks, SVMs (e.g., SPEGASOS), and statistical models. Adaptive or hybrid approaches (GMM-cluster filtering, ensemble embeddings) often yield improvements upwards of +0.05–0.59 in minority class F₁, with gains in AUC and G-mean on highly imbalanced sets (Tripathi et al., 2021, Douzas et al., 2017, Newaz et al., 2022). Models built on pure-SMOTE augmentation may exhibit inflated accuracy if synthetic samples dominate the test batches, necessitating metrics that isolate real sample error (Apostolopoulos, 2020).

5. Workflow Integration, Hyperparameter Selection, and Limitations

In practice, SMOTE should be applied inside each cross-validation fold or pipeline stage to avoid information leakage. Oversampling is typically preceded by rigorous feature selection to remove noisy or irrelevant features, since SMOTE interpolates blindly across all dimensions (Apostolopoulos, 2020). Parameter selection involves tuning the neighbor count $K$, the oversampling ratio $N$, and, for some variants, cluster counts or density thresholds. A small $K$ avoids interpolation across distant samples (reducing noise injection), while a large $K$ increases coverage but risks synthesizing ambiguous points. For hybrid and adaptive variants, auxiliary parameters (e.g., the RUS ratio, GMM threshold, or VAE latent bandwidths) are grid-searched or selected by cross-validation.
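The leakage-safe placement can be sketched as follows: oversampling is applied to the training portion of each fold only, never before the split. The toy dataset and the simple interpolation-based oversampler are assumptions for illustration (in real use one would call a library implementation, e.g. an imbalanced-learn-style sampler, at the marked line):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# toy imbalanced data: 180 majority vs 20 minority points (illustrative)
X = np.vstack([rng.normal(0, 1, (180, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 180 + [1] * 20)

def oversample(X_tr, y_tr, rng):
    """Stand-in for SMOTE: interpolate random minority pairs until balanced."""
    Xm = X_tr[y_tr == 1]
    need = int((y_tr == 0).sum() - (y_tr == 1).sum())
    i = rng.integers(0, len(Xm), need)
    j = rng.integers(0, len(Xm), need)
    lam = rng.random((need, 1))
    X_new = Xm[i] + lam * (Xm[j] - Xm[i])
    return np.vstack([X_tr, X_new]), np.concatenate([y_tr, np.ones(need, int)])

scores = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    # resample the TRAINING fold only; the test fold stays untouched
    X_tr, y_tr = oversample(X[tr], y[tr], rng)
    clf = LogisticRegression().fit(X_tr, y_tr)
    scores.append(f1_score(y[te], clf.predict(X[te])))
mean_f1 = float(np.mean(scores))
```

Oversampling before the split would place synthetic points interpolated from test-fold minority samples into the training set, optimistically biasing every fold's score.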

Pitfalls include:

  • Synthetic instances may violate domain constraints, especially in medical applications, where feature plausibility must be validated.
  • Excessive augmentation can intrude into majority regions, boosting false positives.
  • Boundary points and small minority clusters may remain under-supported without specialization (G-SMOTE, SMOTE-CLS).
  • Noise amplification or label errors propagate via linear interpolation; filtering or cleaning steps (NC rule, VAE/KDE density filtering) can partially address this (Hong et al., 2024).
  • Empirically, in tasks with moderate imbalance, sophisticated rebalancing may yield little gain over careful modeling ("No rebalancing suffices"), while extreme ratios necessitate augmentation (Sakho et al., 2024).

6. Domain-Specific Adaptations and Novel Uses

SMOTE has increasingly been embedded in domain-integrated workflows. In privacy-preserving data collaboration (DC), SMOTE-anchored construction provides synthetic anchor sets that enhance feature selection and global predictive accuracy while minimizing data leakage risks (Imakura et al., 2022). In climate event prediction, CFA-SMOTE synthesizes plausible minority (extreme event) samples via counterfactual decomposition, then augments their neighborhood via classic SMOTE (Temraz et al., 14 Nov 2025). In multi-class and text domains, SMOTE and its descendants improve shallow classifier F₁-scores by up to 10–15pp, but neural models may show resilience to imbalance and benefit more modestly (Glazkova, 2020).

7. Empirical Validation and Comparative Performance

Large-scale evaluations confirm the general efficacy and limitations of SMOTE and its variants:

  • On binary imbalanced datasets (KEEL, UCI), adaptive-SMOTE, G-SMOTE, and hybrid frameworks (SMOTE-RUS-NC, SRN-BRF) consistently outperform vanilla SMOTE in minority class metrics and overall AUC, especially in severe imbalance settings (Douzas et al., 2017, Newaz et al., 2022).
  • Certainty-guided and VAE-filtered approaches demonstrate theoretical and empirical improvements in classifier likelihood and minority-specific AUC (Zhang et al., 2016, Hong et al., 2024).
  • In extreme imbalance, counterfactual or geometric augmentation (CFA-SMOTE, G-SMOTE) halves error rates on climate-event prediction (Temraz et al., 14 Nov 2025).
  • Information-theoretic distance weighting (MISMOTE, MAESMOTE, RESMOTE, TESMOTE) improves ROC performance by up to 0.03 AUC on multi-feature public and accident datasets (Sharifirad et al., 2018).

SMOTE remains central to class imbalance resolution but should be tailored—via appropriate variant selection, parameter tuning, and empirical validation—to dataset specifics and task constraints. Its ongoing evolution spans geometric, probabilistic, and domain-integrated extensions.
