- The paper shows that bootstrapping nearly achieves oracle performance, outperforming SMOTE in moderate to high-dimensional settings.
- It introduces a theoretical framework that decomposes excess risk into oracle risk and a distinct transfer cost linked to label shift.
- Empirical simulations confirm that random oversampling is often more effective than synthetic methods under conventional imbalanced learning conditions.
Classification Imbalance as a Transfer Learning Problem
This paper, "Classification Imbalance as Transfer Learning" (2601.10630), provides a rigorous statistical framework for imbalanced classification by formulating it as a transfer learning task under label (prior) shift. The work systematically analyzes oversampling-based rebalancing methods—particularly SMOTE and bootstrapping—by decomposing estimation error into the rate attainable under balanced training and an explicit, quantifiable "cost of transfer" attributable to mismatches between the true and estimated conditional distributions in the minority class.
Theoretical Framework
The central premise is that practitioners care about classifier performance under a balanced target distribution (equal class priors), even when only data generated under an imbalanced source distribution is available. The target setting is formally defined as label shift, where the class prior changes while class-conditional feature distributions are assumed invariant. Thus, classifiers must be trained such that empirical risk is minimized for the balanced target, not the source sampling regime.
The authors formalize this with distributions:
- $P_{X,Y}$: the observed (imbalanced) source distribution
- $Q_{X,Y}$: the hypothetical balanced target distribution
- $f^*(x) := Q(Y=1 \mid X=x)$: the target conditional probability

The aim is to estimate $f^*$ using only data sampled from the source.
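As an illustration of this setup (our own sketch, not code from the paper), label shift can be simulated with class-conditional Gaussians that stay fixed while only the class prior changes; the means, priors, and sample sizes below are invented for the example:

```python
import numpy as np

def sample(n, prior, rng):
    """Draw (X, Y) with class prior P(Y=1) = prior and fixed
    class-conditional laws: N(0, I) for Y=0, N(mu, I) for Y=1."""
    y = rng.binomial(1, prior, size=n)
    mu = np.array([2.0, 2.0])                 # minority-class mean (assumed)
    x = rng.normal(size=(n, 2)) + np.outer(y, mu)
    return x, y

rng = np.random.default_rng(0)
X_src, y_src = sample(5000, prior=0.05, rng=rng)   # imbalanced source P
X_tgt, y_tgt = sample(5000, prior=0.50, rng=rng)   # balanced target Q

# Only the prior differs between the two draws; the class-conditional
# feature distributions are identical, which is exactly the label-shift
# assumption of the framework.
print("source minority fraction:", y_src.mean())
print("target minority fraction:", y_tgt.mean())
```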
Analysis of Rebalancing via Synthetic Data Generation
A general "rebalancing" meta-algorithm is studied, encapsulating methods that create synthetic minority-class samples (possibly via a learned model) and concatenate them to the training set to approximate a balanced training regime. This covers classical approaches such as:
- Bootstrapping: Random oversampling from observed minority samples.
- SMOTE: Generation of synthetic points along line segments between minority samples and their neighbors.
- Density- or score-based generative methods: generation via kernel density estimation or diffusion models.
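To make the first two instances of the meta-algorithm concrete, here is a minimal sketch (our illustration, not the paper's code) of both generators applied to a minority sample matrix; the neighbor count `k` and sample sizes are arbitrary choices:

```python
import numpy as np

def bootstrap_oversample(X_min, n_new, rng):
    """Random oversampling: resample observed minority rows with replacement."""
    idx = rng.integers(0, len(X_min), size=n_new)
    return X_min[idx]

def smote_oversample(X_min, n_new, rng, k=5):
    """SMOTE: pick a minority point, pick one of its k nearest minority
    neighbors, and interpolate uniformly along the connecting segment."""
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # exclude self-matches
    nbrs = np.argsort(d, axis=1)[:, :k]           # k nearest neighbors per point
    base = rng.integers(0, len(X_min), size=n_new)
    nb = nbrs[base, rng.integers(0, k, size=n_new)]
    t = rng.uniform(size=(n_new, 1))              # interpolation weights in [0, 1]
    return X_min[base] + t * (X_min[nb] - X_min[base])

rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 3))                  # toy minority sample, d = 3
X_boot = bootstrap_oversample(X_min, 50, rng)
X_smote = smote_oversample(X_min, 50, rng)
```

Note the structural difference the theory exploits: bootstrapping only reuses observed points, while SMOTE places new points inside segments between neighbors.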
The primary theoretical results decompose the excess risk (the difference in expected loss under $Q$ between the estimated $f$ and the oracle $f^*$) into two terms:
- Oracle risk: the convergence rate attainable if balanced data from $Q$ were observed;
- Cost of transfer: a discrepancy measure reflecting the divergence between the true and estimated minority-class feature distributions.
Excess Risk Bounds
For arbitrary synthetic generators, the excess risk under $Q$ is bounded by

$$
\mathbb{E}_Q\big[\ell(Y, f(X)) - \ell(Y, f^*(X))\big] \;\le\; \mathrm{Oracle}_{N+J}(Q) \;+\; \mathrm{Cost}\big(P_{X \mid Y=1},\, \hat{P}_{X \mid Y=1}\big),
$$

where $J$ is the number of synthetic samples and the cost term is typically measured in total-variation or $\chi^2$ divergence.
Under additional Lipschitz and strong-convexity assumptions on the loss, localized Rademacher complexity arguments yield fast convergence rates for parametric models, with an additive transfer cost governed by the $\chi^2$-divergence:

$$
\|f - f^*\|_{Q_X} \;\lesssim\; w_{N+J}(Q_X) \;+\; w_{N+J}\big(\hat{P}_{X \mid Y=1}\big) \;+\; \chi^2\big(\hat{P}_{X \mid Y=1};\, P_{X \mid Y=1}\big),
$$

where $w_{N+J}(\cdot)$ denotes the rate function arising from the localized-complexity argument.
A primary theoretical and numerical insight is that bootstrapping nearly matches the estimation rate of an oracle observing $2N_1$ minority samples. In contrast, SMOTE's error is dominated by a term scaling as $N_1^{-1/d}$, reflecting the curse of dimensionality typical of nearest-neighbor-based methods. The result is non-asymptotic, explicitly quantifying the dependence on dimension and sample size. Simulations reinforce that SMOTE's risk often exceeds bootstrapping's even in moderate dimensions, with the risk ratio growing in $d$.
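To make the dimensional dependence concrete (our own arithmetic, using the scalings above and a parametric-style oracle rate of order $N_1^{-1/2}$ for comparison): with $N_1 = 100$ minority samples,

$$
d = 10:\qquad N_1^{-1/d} \;=\; 100^{-1/10} \;\approx\; 0.63
\qquad \text{vs.} \qquad
N_1^{-1/2} \;=\; 0.1,
$$

so already at $d = 10$ the nearest-neighbor term is roughly an order of magnitude larger than the oracle-style rate, and the gap widens as $d$ grows.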
Key Theoretical Result:
- For moderate or high d, bootstrapping is strongly preferred to SMOTE unless class-conditional structure can be explicitly exploited (e.g., via parametric density estimation or high-quality diffusion models).
The analysis further clarifies that SMOTE's guarantees depend critically on smoothness assumptions (global Lipschitz continuity with respect to the features), which limits its applicability to modern, highly non-linear models without explicit regularization.
Plug-in and Undersampling Baselines
An alternative, plug-in estimator is also considered: estimate $P(Y=1 \mid X)$ from the imbalanced data, then apply an algebraic adjustment to recover $Q(Y=1 \mid X)$. This approach is simple and implementation-friendly, especially in high-dimensional or structured domains (e.g., text, images) where sample generation may be unrealistic. However, its convergence rate is sub-optimal compared to rebalancing methods, scaling unfavorably in the rare-class prior.
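A minimal sketch of such an algebraic adjustment (the standard Bayes-rule posterior correction for label shift; the numeric priors below are illustrative, and the exact estimator in the paper may differ):

```python
import numpy as np

def adjust_posterior(p_src, pi_src, pi_tgt=0.5):
    """Convert a posterior P(Y=1|X) estimated under source prior pi_src
    into the posterior under target prior pi_tgt, assuming the
    class-conditional feature distributions are unchanged (label shift)."""
    w1 = pi_tgt / pi_src                # minority-class reweighting factor
    w0 = (1 - pi_tgt) / (1 - pi_src)    # majority-class reweighting factor
    return (w1 * p_src) / (w1 * p_src + w0 * (1 - p_src))

# Scores under a 5% minority prior map to much higher balanced-target
# posteriors; a source score equal to the source prior maps to exactly 0.5.
p = adjust_posterior(np.array([0.05, 0.30, 0.90]), pi_src=0.05)
```

The design choice worth noting is that no new samples are generated: only the decision scores are rescaled, which is why the approach stays cheap in high-dimensional domains.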
Undersampling of the majority class achieves bounds analogous to those for bootstrapping (up to logarithmic factors), with the tradeoff of reduced statistical efficiency in small-sample or severely skewed regimes.
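Undersampling admits an equally short sketch (our illustration; sample sizes and the 10% prior are arbitrary):

```python
import numpy as np

def undersample_majority(X, y, rng):
    """Keep all minority points and a same-size, without-replacement
    subsample of the majority class, yielding a balanced training set."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    sel = np.concatenate([min_idx, keep])
    return X[sel], y[sel]

rng = np.random.default_rng(2)
y = (rng.uniform(size=200) < 0.1).astype(int)   # ~10% minority prior
X = rng.normal(size=(200, 4))
Xb, yb = undersample_majority(X, y, rng)
```

The statistical-efficiency tradeoff noted above is visible directly: the balanced set has only $2N_1$ rows, discarding most majority observations.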
Practical and Theoretical Implications
The work provides clear, theoretically justified guidance for practitioners:
- In most practical scenarios (especially when $d$ is large and $N_1$ is small), random oversampling is at least as good as, and often better than, SMOTE.
- Rebalancing via density estimation or advanced generative models should be considered only when additional structural information about the minority-class distribution is available.
- Plug-in adjustment is easy to implement but should be expected to underperform sample-generating methods statistically.
- Undersampling is computationally attractive when the majority count $N_0$ is large.
Importantly, the results challenge the default use of SMOTE and its variants for imbalanced data. The curse-of-dimensionality term in SMOTE's error scaling is sharp and unavoidable in the analyzed framework.
Extensions
The methodology generalizes to:
- Arbitrary target prior mixtures (cost-sensitive risk, regulatory compliance, etc.).
- Data-dependent choices of synthetic sample count.
- Diffusion model–based generation under suitable assumptions.
Numerical Results
Simulations confirm the theoretical predictions: as d grows, the gap between SMOTE's excess risk and that of bootstrapping widens, with bootstrapping consistently outperforming SMOTE unless d is very small.
Future Directions
Open problems include:
- Adaptive estimation or data-driven selection of the minimal-cost transfer method,
- Leveraging more sophisticated transfer learning paradigms beyond label shift,
- Sharp characterization or exploitation of hidden structure in real-world minority-class distributions.
Conclusion
By framing classification imbalance as label-shift transfer learning and rigorously analyzing the cost of transfer for oversampling methods, this work (2601.10630) demonstrates that—contrary to common practice—bootstrapping is preferable to SMOTE in moderate and high-dimensional settings. The transfer learning perspective enables unification and extension of existing methods, yielding improved, non-asymptotic guarantees and directly actionable methodological guidance. These results have strong implications for both theory and practice in imbalanced learning, especially in contemporary high-dimensional ML applications.