
Adaptive Minibatch Subsampling in Deep Learning

Updated 16 January 2026
  • The paper introduces UIDS, a framework using first- and second-order influence functions to guide adaptive, unweighted subsampling, improving risk reduction and generalization.
  • The methodology combines greedy batch selection and probabilistic influence sampling to efficiently compute Hessian and gradient metrics for data point selection.
  • Empirical results demonstrate that UIDS achieves faster convergence and lower error rates compared to traditional subsampling methods in classification and regression tasks.

Unweighted Influence Data Subsampling (UIDS) is a principled algorithmic framework that leverages statistical influence functions to select subsets of data for model training, aiming to maximize statistical efficiency and predictive performance under unweighted subsample regimes. Unlike classical weighted subsampling schemes, UIDS constructs subsamples in which all retained data points contribute equally (i.e., unit weight) during model fitting, while selection is guided by influence scores derived from first-order perturbations of the empirical risk minimizer or related objectives. This approach yields improved risk reduction, computational efficiency, and, under certain regimes, can achieve out-of-sample generalization superior to models trained on the full dataset (Raj et al., 2020, Wang et al., 2019, Ting et al., 2017).

1. Formalism and Influence Function Theory

UIDS is predicated on the use of influence functions, which quantify the infinitesimal impact of perturbing or reweighting a given data point on the model parameter $\hat\theta$ estimated by empirical risk minimization. For a training set $Z=\{z_i\}_{i=1}^N$ and loss function $\ell(z,\theta)$, the estimator is

$$\hat\theta(Z) = \arg\min_{\theta \in \Theta} \sum_{i=1}^N \ell(z_i, \theta).$$

The (Bouligand) influence function for $z$ is

$$\mathrm{BIF}(z; \hat\theta, Z) = -H_Z^{-1} \nabla_\theta \ell(z, \hat\theta(Z)),$$

with $H_Z = \sum_{i=1}^N \nabla^2_\theta \ell(z_i, \hat\theta(Z))$ the empirical Hessian, assumed invertible under standard regularity conditions (Raj et al., 2020).
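As a concrete (illustrative, not paper-supplied) instance, for squared loss $\ell(z,\theta)=\tfrac12(y - x^\top\theta)^2$ both the Hessian and the per-point gradients have closed forms, so the BIF vectors can be computed in a few lines; the small ridge term is an assumption added here for numerical stability:

```python
import numpy as np

def influence_scores(X, y, theta_hat, ridge=1e-6):
    """BIF(z_i) = -H^{-1} grad_i for squared loss; one row per training point."""
    n, d = X.shape
    # Empirical Hessian of the total squared loss: H = X^T X (+ ridge for stability)
    H = X.T @ X + ridge * np.eye(d)
    # Per-point gradients: grad_i = -(y_i - x_i^T theta) x_i
    residuals = y - X @ theta_hat
    grads = -residuals[:, None] * X            # shape (n, d)
    # Solve for all points at once: BIF = -H^{-1} grads^T, transposed back
    return -np.linalg.solve(H, grads.T).T      # shape (n, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=50)
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
bif = influence_scores(X, y, theta_hat)
```

A useful sanity check: at the empirical risk minimizer the per-point gradients sum to zero, so the BIF vectors sum to (numerically) zero as well.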

Second-order influence functions extend this analysis to the effect of upweighting a sample on the out-of-sample risk $R(\hat\theta)$, which is especially relevant for model selection and validation. For a training point $z_i$ and test point $z_j$,

$$\phi(z_i, z_j) = -(\nabla_\theta \ell(\hat\theta; z_j))^\top H_{\hat\theta}^{-1} \nabla_\theta \ell(\hat\theta; z_i),$$

and the total influence of $z_i$ over a test/validation set $V$ is

$$\phi_i = \frac{1}{|V|} \sum_{j \in V} \phi(z_i, z_j).$$

This formalism enables model-specific and risk-sensitive subsampling (Wang et al., 2019).
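The averaging over $V$ can be vectorized: $\phi_i = -(\tfrac{1}{|V|}\sum_j g_j)^\top H^{-1} g_i$ needs only one linear solve. A hedged sketch for least squares (function and variable names are illustrative, not from the papers):

```python
import numpy as np

def total_influence(X_tr, y_tr, X_val, y_val, theta):
    """phi_i = (1/|V|) sum_j phi(z_i, z_j) for squared loss, all i at once."""
    d = X_tr.shape[1]
    H = X_tr.T @ X_tr + 1e-6 * np.eye(d)               # ridged empirical Hessian
    g_tr = -(y_tr - X_tr @ theta)[:, None] * X_tr      # per-point train gradients
    g_val = -(y_val - X_val @ theta)[:, None] * X_val  # per-point val gradients
    # phi_i = -(mean_j g_j)^T H^{-1} g_i, via a single solve against the mean
    return -g_tr @ np.linalg.solve(H, g_val.mean(axis=0))

rng = np.random.default_rng(0)
X, Xv = rng.normal(size=(80, 4)), rng.normal(size=(20, 4))
theta = rng.normal(size=4)
y = X @ theta + 0.1 * rng.normal(size=80)
yv = Xv @ theta + 0.1 * rng.normal(size=20)
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
phi = total_influence(X, y, Xv, yv, theta_hat)         # one score per train point
```

This agrees term-by-term with the pairwise definition above but replaces $n \cdot |V|$ solves with one.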

2. UIDS Algorithmic Schemes

Greedy Batch Influence Selection

A general instance of UIDS for subset selection proceeds iteratively. At each step, it fits parameters $\hat\theta_S$ on the current subset $S$, computes candidate influence scores $\Delta_R(z)$ for $z \in Z \setminus S$ (the projected reduction in risk $R$), then adds the top $m$ points with maximal positive $\Delta_R(z)$:

$$\Delta_R(z) = -\nabla_\theta R(\hat\theta_S)^\top H_S^{-1} \nabla_\theta \ell(z, \hat\theta_S).$$

For validation risk, with $V$ the validation set,

$$\Delta_{\mathrm{val}}(z) = -\left( \sum_{v\in V} \nabla_\theta \ell(v, \hat\theta_S) \right)^\top H_S^{-1} \nabla_\theta \ell(z, \hat\theta_S).$$

An $\varepsilon$-greedy variant mixes top-influence and random points to avoid overfitting (Raj et al., 2020).
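One greedy batch step can be sketched as follows for least squares; this is an illustrative reconstruction under the formulas above, not the authors' code, and all names are assumptions:

```python
import numpy as np

def greedy_step(X, y, X_val, y_val, S, m):
    """Refit on subset S, score candidates by Delta_val, add the top-m."""
    d = X.shape[1]
    theta_S, *_ = np.linalg.lstsq(X[S], y[S], rcond=None)      # fit on subset
    H_S = X[S].T @ X[S] + 1e-6 * np.eye(d)                     # subset Hessian
    g_val = -(y_val - X_val @ theta_S)[:, None] * X_val        # val gradients
    s = np.linalg.solve(H_S, g_val.sum(axis=0))                # H_S^{-1} sum_v g_v
    cand = np.setdiff1d(np.arange(len(y)), S)                  # unselected points
    g_c = -(y[cand] - X[cand] @ theta_S)[:, None] * X[cand]    # candidate grads
    delta = -(g_c @ s)                                         # Delta_val scores
    return np.concatenate([S, cand[np.argsort(delta)[-m:]]])   # append top-m

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta = np.array([2.0, -1.0, 0.5])
y = X @ theta + 0.1 * rng.normal(size=100)
Xv = rng.normal(size=(30, 3))
yv = Xv @ theta + 0.1 * rng.normal(size=30)
S = np.arange(10)                          # seed subset of size M = 10
S = greedy_step(X, y, Xv, yv, S, m=5)      # subset grows to 15 points
```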

Probabilistic Influence Sampling

UIDS may also operate as a probabilistic scheme, replacing hard dropout (selecting all points with $\phi_i < 0$) with a sampling probability $\pi_i = \pi(\phi_i)$ that decreases in total influence:

  • Linear: $\pi_i = \mathrm{clip}(1 - \alpha \phi_i,\ 0,\ 1)$
  • Sigmoid: $\pi_i = 1 / (1 + \exp(\alpha \phi_i / \Delta))$

Points are drawn i.i.d. Bernoulli($\pi_i$), yielding subsets whose risk behavior is controlled by the smoothness of $\pi$ (Wang et al., 2019).
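The two maps and the Bernoulli draws take only a few lines; $\alpha$ and $\Delta$ are scale parameters, and the concrete values below are illustrative assumptions:

```python
import numpy as np

def pi_linear(phi, alpha=1.0):
    # Keep-probability decreasing linearly in influence, clipped to [0, 1]
    return np.clip(1.0 - alpha * phi, 0.0, 1.0)

def pi_sigmoid(phi, alpha=1.0, delta=1.0):
    # Smooth decreasing map; phi = 0 gives probability 1/2
    return 1.0 / (1.0 + np.exp(alpha * phi / delta))

rng = np.random.default_rng(0)
phi = rng.normal(size=1000)                  # total influence scores
keep = rng.random(1000) < pi_sigmoid(phi)    # i.i.d. Bernoulli(pi_i) draws
subset = np.flatnonzero(keep)                # indices of retained points
```

Because both maps are decreasing in $\phi$, points whose upweighting would reduce validation risk are kept with higher probability, while harmful points are dropped only stochastically rather than deterministically.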

Unweighted Influence Estimation

In the variant of (Ting et al., 2017), for estimators admitting an asymptotically linear influence-function expansion

$$\hat\theta(P_N) = \theta(P) + \frac{1}{N} \sum_{i=1}^N \psi(X_i; \theta) + o_p(1),$$

the optimal Poisson sampling probability is $p_i \propto \|\psi_i\|$, scaled to the chosen subsample size $n$. Then, forming pseudo-observations $u_i = \psi_i / p_i$,

$$\theta_{\mathrm{UIDS}} = \theta_0 + \frac{1}{n} \sum_{i \in S} u_i,$$

where $\theta_0$ is a pilot estimate. Each sampled case is assigned unit weight for retraining.
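A toy sketch for mean estimation, where $\psi(X_i;\theta) = X_i - \theta$ has a closed form. The pilot split and the cap of $p_i$ at 1 are standard Poisson-sampling details assumed here, not paper text, and the correction term is normalized by the full sample size $N$ so that it is an unbiased Horvitz–Thompson estimate of the mean influence in this example:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(size=2000)           # skewed data: heterogeneous influence
N, n = X.size, 200                       # full size and target subsample size

theta0 = X[:100].mean()                  # pilot estimate from a small prefix
psi = X - theta0                         # influence-function values psi_i
p = np.minimum(1.0, n * np.abs(psi) / np.abs(psi).sum())  # p_i ∝ ||psi_i||

S = rng.random(N) < p                    # independent Bernoulli (Poisson) draws
u = psi[S] / p[S]                        # pseudo-observations u_i = psi_i / p_i
theta_uids = theta0 + u.sum() / N        # one-step corrected estimate
```

Averaged over repeated draws, the corrected estimate recovers the full-sample mean, since $\mathbb{E}\big[\sum_{i \in S} u_i\big] = \sum_i \psi_i$.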

3. Theoretical Guarantees

UIDS algorithms are accompanied by several nonasymptotic and asymptotic performance bounds:

  • Risk Reduction: For greedy batch selection, after $p$ adaptive steps, the risk of the UIDS subset model $\theta_{M+pm}^g$ relative to the expected risk of the random-subset model $\theta_{M+pm}^r$ satisfies

$$R(\theta_{M+pm}^g) - \mathbb{E}[R(\theta_{M+pm}^r)] \le \sum_{i=1}^{p} A'_{M+(i-1)m} + G\, m^2 \sum_{i=1}^p \frac{1}{(M + i m)^2},$$

where $A'$ is the instantaneous gain over random selection and $G$ is a constant. This guarantees UIDS is strictly better than uniform sampling whenever some points have large influence (Raj et al., 2020).

  • Subset Superiority: If the covariance between the sampling perturbations and the influence scores, $\mathrm{Cov}(\phi, \epsilon)$, is negative, then the subset model's risk improves over the full-set model: $R(\hat\theta_\epsilon; Q') \le R(\hat\theta; Q')$. Lemma 1 shows the sample mean of the influence scores is zero, so subsets chosen with negative correlation to $\phi$ are systematically superior (Wang et al., 2019).
  • Distributional Robustness: For probabilistic $\epsilon(\phi)$ maps with bounded gradient, the worst-case risk over a $\chi^2$-ball of nearby distributions is Lipschitz in $\phi$, enabling regulation of generalization behavior and avoiding overfitting to a validation set (Wang et al., 2019).
  • Variance Optimality: UIDS achieves the minimal possible asymptotic variance for estimating $\hat\theta$ via subsampling, matching the Horvitz–Thompson estimator while enabling unweighted fitting (Ting et al., 2017).

4. Computational Complexity and Practical Considerations

UIDS implementations are dominated by linear-algebra operations on the Hessian $H$ and gradient vectors.

  • Hessian Operations: Naïve inversion scales as $O(d^3)$ in dimension $d$, but conjugate gradient or sketching reduces this to $O(d^2)$ per Hessian-vector product (Raj et al., 2020).
  • Influence Vector Computation: For $n$ points and $k$ iterations of preconditioned conjugate gradient (PCG), the cost is approximately $O(ndk + nd)$ (Wang et al., 2019).
  • Memory Footprint: Storing $H$, $s$, and $\phi$ costs $O(d^2)$ and $O(n)$, respectively.
  • Retraining: The final ERM fit on the selected subset scales as $O(rndT')$, where $r < 1$ is the sampling ratio.

UIDS is computationally tractable for GLMs and moderate-to-large $n$ when efficient linear solvers are used. Approximate Hessian methods (sketching, Hessian-free CG) scale to datasets with $10^6$–$10^8$ samples. Batch size $m$ and seed size $M$ require practical tuning; small $m$ controls the greedy approximation error, and $M \approx d$ ensures Hessian invertibility.
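The Hessian-free idea above can be sketched in a few lines: solve $s = H^{-1} g$ by conjugate gradient using only Hessian-vector products, never materializing $H$. For least squares, $Hv = X^\top(Xv)$ costs $O(nd)$ per product; the ridge term and all names below are illustrative assumptions:

```python
import numpy as np

def hvp(X, v, ridge=1e-6):
    # Hessian-vector product X^T (X v) + ridge * v, without forming X^T X
    return X.T @ (X @ v) + ridge * v

def cg_solve(X, g, iters=100, tol=1e-12):
    """Conjugate gradient for H x = g using only hvp() calls."""
    x = np.zeros_like(g)
    r = g - hvp(X, x)                  # initial residual
    p, rs = r.copy(), r @ r
    for _ in range(iters):
        Ap = hvp(X, p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:               # residual small enough: stop early
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
g = rng.normal(size=5)
s = cg_solve(X, g)                     # approximately H^{-1} g
```

Since CG touches $H$ only through products, the same routine scales to the sketched or Hessian-free regimes mentioned above.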

UIDS generalizes to non-differentiable models (e.g. tree ensembles) via proxy linear models, with strong empirical results on model transfer (Raj et al., 2020).

5. Empirical Outcomes and Benchmarks

UIDS has been evaluated on diverse tasks:

  • Model Selection: On Amazon employee-access and MNIST datasets, UIDS reaches target accuracy with 40–50% fewer samples than uniform subsampling (Raj et al., 2020).
  • Regression: On California housing regression, UIDS achieves RMSE 1.5 versus 2.0 for random sampling at the same subset sizes.
  • Hyperparameter Tuning: For random forest + Hyperband on Boston housing, UIDS finds optimal configurations 2× faster than random baseline.
  • Large-Scale Classification: On text, image, and CTR datasets (UCI, Criteo, Avazu, industrial sets), UIDS (linear/sigmoid variants) attains lower out-of-sample log-loss than uniform, weighted, or dropout schemes. On the 100M-sample “Company” dataset, UIDS achieves test log-loss of 0.1952 versus 0.1955 for the full set (Wang et al., 2019).
  • Subset-Model Superiority: Empirical results confirm that subset models via UIDS can outperform full-set ERM on future risk (Wang et al., 2019).

6. Limitations and Practical Guidance

UIDS efficacy depends on model and data regularity:

  • Hessian Nondegeneracy: Non-convex losses (e.g. deep networks) may violate invertibility, degrading influence estimates.
  • Influence Variability: Substantial gains over random sampling appear when data exhibits heterogeneous influence.
  • Overfitting: Pure dropout strategies can overfit validation sets, yielding poor generalization on unseen data; smoothing the sampling map $\pi$ mitigates this risk (Wang et al., 2019).
  • Subset Size: Very small subsets or extreme dimensionality ($d \gg n$) can introduce instability; an initial subset size $M \approx d$ is recommended.
  • Memory Constraints: For very large dd, memory and computational demands call for approximate methods.
  • Applicability: For non-differentiable models, proxies enable effective transfer but may not fully match target model behavior.

Guidance suggests practical batch sizes $m \in [1, 10]$, a validation set of 10–20% of the data for influence scoring, and $\varepsilon$-greedy random mixing ($\varepsilon \approx 0.1$) for robustness.
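The $\varepsilon$-greedy mix admits a minimal sketch (function and parameter names are illustrative): each of the $m$ picks is, with probability $\varepsilon$, a uniformly random remaining candidate instead of the highest-influence one.

```python
import numpy as np

def epsilon_greedy_pick(scores, m, epsilon=0.1, rng=None):
    """Pick m candidate indices, mixing greedy and random choices."""
    rng = rng or np.random.default_rng(0)
    order = list(np.argsort(scores)[::-1])     # candidate indices, best first
    picks = []
    for _ in range(m):
        if rng.random() < epsilon:
            # exploration: uniformly random among the remaining candidates
            picks.append(order.pop(rng.integers(len(order))))
        else:
            # exploitation: next-best remaining candidate
            picks.append(order.pop(0))
    return picks

chosen = epsilon_greedy_pick(np.array([0.9, 0.1, 0.5, 0.7]), m=2, epsilon=0.1)
```

With $\varepsilon = 0$ this reduces to the pure greedy top-$m$ rule.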

7. Comparison with Alternative Subsampling Schemes

UIDS distinguishes itself from weighted influence sampling and from leverage-score or gradient-based subsampling:

  • Weighted Schemes: Horvitz–Thompson weighting achieves minimal variance but requires specialized handling for weights; UIDS attains equivalent variance with unweighted retraining (Ting et al., 2017).
  • Leverage and Gradient Sampling: Solely exploiting leverage scores (design-matrix geometry) or gradients is less efficient than influence-based weighting, especially when both sources of variability are important.
  • Dropout Subsampling: Hard data dropout (dropping all points with $\phi_i > 0$) is brittle; UIDS's smooth probabilistic sampling achieves better robustness and generalization (Wang et al., 2019).
  • Computational Tradeoff: Weighted influence sampling may be computationally more intensive due to repeated influence calculation; UIDS mitigates this by single-pass calculation and unweighted fitting.

Empirical comparisons demonstrate UIDS requires 30–50% as many samples as uniform subsampling to achieve comparable estimator error in regression tasks, and consistently outperforms leverage-only and gradient-only schemes.


UIDS, as formalized in multiple frameworks, leverages influence-function theory to deliver flexible, statistically efficient, and computationally tractable subsampling algorithms for modern machine learning and statistical estimation tasks (Raj et al., 2020, Wang et al., 2019, Ting et al., 2017).
