
ASSS: Antagonistic Soft Selection Subsampling

Updated 12 January 2026
  • The paper introduces ASSS as a novel adversarial framework that recasts data subsampling into a learnable, task-aware process using a minimax game between selector and task networks.
  • It employs the Gumbel-Softmax trick for continuous relaxation, enabling gradient-friendly sample weighting and effective end-to-end optimization.
  • Empirical evaluations on multiple tabular datasets show that ASSS outperforms traditional heuristic methods, sometimes improving over full data training through intelligent denoising.

Antagonistic Soft Selection Subsampling (ASSS) is an adversarial, fully differentiable data reduction paradigm designed to address the computational bottlenecks that arise in training predictive models on large-scale datasets. ASSS recasts data subsampling as a learnable, task-aware process, replacing static, task-agnostic preprocessing heuristics with a continuous and optimizable selection strategy. A minimax game between a selector network and a predictive (task) network governs the retention of informative samples, with the optimization objective rooted in the information bottleneck principle. Empirical evaluations indicate that ASSS outperforms standard heuristic subsampling methods, sometimes even surpassing the performance obtained by training on the full dataset through intelligent denoising (Lyu et al., 5 Jan 2026).

1. Adversarial Framework

Given a labeled dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \dots, K\}$, ASSS establishes an adversarial (minimax) training dynamic between two neural networks:

  • Selector Network ($G_\phi$): Assigns each input $x_i$ a real-valued logit $s_i$, producing a selection probability $p_i = \sigma(s_i)$, where $\sigma$ is the logistic sigmoid function. The resulting $p_i$ reflects the “soft” probability of including $x_i$ in the subsample.
  • Task Network ($C_\theta$): Receives each $x_i$ attenuated by a continuous weight $z_i$ and outputs class probabilities $C_\theta(z_i x_i) \in \Delta^{K-1}$ for subsequent prediction.

The underlying optimization is bi-level but is approximated in practice by alternating gradient steps:

$$\min_{\phi}\; C_G(\phi, \theta^*(\phi)) \quad \text{subject to} \quad \theta^*(\phi) = \arg\min_{\theta} C_c(\theta, \phi)$$

Here, the task-network loss $C_c$ is the cross-entropy over weighted samples, and the selector-network loss $C_G$ balances task fidelity, sparsity, and entropy (diversity) of the selected distribution.

Instead of direct, intractable nested optimization, ASSS alternates between updating $\theta$ and $\phi$ via stochastic gradient descent steps, thus yielding a practical minimax training regime that endows the selector with task awareness.

2. Continuous Weighting via Gumbel-Softmax

To enable direct optimization via gradient descent, ASSS introduces continuous relaxation of sample inclusion through the Gumbel-Softmax trick. For each sample ii:

  • Uniform random variables $u_i, u'_i \sim \mathrm{Uniform}(0, 1)$ are drawn.
  • Gumbel noises $g_i, g'_i$ are computed as $g_i = -\log(-\log u_i)$ and $g'_i = -\log(-\log u'_i)$.
  • At temperature $\tau > 0$, the sample weight $z_i$ is set as:

$$z_i = \frac{\exp\bigl((\log p_i + g_i)/\tau\bigr)}{\exp\bigl((\log p_i + g_i)/\tau\bigr) + \exp\bigl((\log(1-p_i) + g'_i)/\tau\bigr)}$$

As $\tau \rightarrow 0$, $z_i$ approaches a hard Bernoulli draw; for higher $\tau$, the relaxation is soft and more gradient-friendly. Annealing $\tau$ from $1.0$ to $0.1$ during training is empirically effective. This parameterization allows gradients $\partial C_c/\partial\phi$ to be propagated end-to-end from the task network back to the selector network.
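As a numeric sanity check, the relaxation can be simulated directly. A minimal NumPy sketch (the helper name `gumbel_softmax_weight` is illustrative, not from the paper):

```python
import numpy as np

def gumbel_softmax_weight(p, tau, rng):
    """Relaxed Bernoulli weight z for inclusion probability p at temperature tau."""
    g, g2 = -np.log(-np.log(rng.uniform(size=2)))  # two independent Gumbel(0,1) draws
    a = (np.log(p) + g) / tau
    b = (np.log(1.0 - p) + g2) / tau
    m = max(a, b)  # numerically stable two-way softmax
    return np.exp(a - m) / (np.exp(a - m) + np.exp(b - m))

rng = np.random.default_rng(0)
# High temperature -> soft weights spread over (0,1); low temperature -> near-binary draws
soft = [gumbel_softmax_weight(0.7, tau=1.0, rng=rng) for _ in range(1000)]
hard = [gumbel_softmax_weight(0.7, tau=0.05, rng=rng) for _ in range(1000)]
```

At $\tau = 0.05$ the weights concentrate near 0 or 1 with empirical mean close to $p = 0.7$, matching the hard-Bernoulli limit described above.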

3. Loss Functions and Objective

The learning dynamics in ASSS are governed by a pair of loss functions:

  • Task-Network Loss (predictive fidelity):

$$C_c(\theta,\phi) = -\frac{1}{B}\sum_{i=1}^B z_i \sum_{k=1}^K \mathbf{1}\{y_i=k\} \log\bigl[C_\theta(z_i x_i)_k\bigr]$$

This is the standard cross-entropy over a mini-batch of size $B$, where each data point’s contribution is weighted by $z_i$.

  • Selector-Network Loss (fidelity, sparsity, entropy):

$$C_G(\phi, \theta) = C_c(\theta, \phi) + \lambda \, \frac{1}{B}\sum_{i=1}^B p_i - \beta \, \Bigl[\frac{1}{B}\sum_{i=1}^B\bigl(p_i\log p_i+(1-p_i)\log(1-p_i)\bigr)\Bigr]$$

The selector is penalized for exceeding a desired sample “budget” (the $\lambda$ term) and regularized to promote selection diversity (the $\beta$ term), preventing both sample collapse and excessive retention.

  • Minimax Game: Training alternately minimizes $C_c$ with respect to $\theta$ (task network) and $C_G$ with respect to $\phi$ (selector network).
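The two losses can be written out concretely. A minimal NumPy sketch (helper names are illustrative, not from the paper):

```python
import numpy as np

def task_loss(probs, labels, z):
    """C_c: cross-entropy with each sample's contribution scaled by its weight z_i."""
    B = len(labels)
    true_class_prob = probs[np.arange(B), labels]  # probability assigned to the true class
    return -np.mean(z * np.log(true_class_prob))

def selector_loss(C_c, p, lam, beta):
    """C_G = C_c + lambda * mean(p) - beta * mean(p log p + (1-p) log(1-p))."""
    eps = 1e-12  # guard against log(0)
    neg_entropy = np.mean(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
    return C_c + lam * np.mean(p) - beta * neg_entropy

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # C_theta outputs for a batch of 3
labels = np.array([0, 1, 0])
z = np.array([1.0, 1.0, 0.0])                            # third sample fully down-weighted
p = np.array([0.9, 0.8, 0.1])                            # selector probabilities
C_c = task_loss(probs, labels, z)
C_G = selector_loss(C_c, p, lam=0.5, beta=0.1)
```

Note that the down-weighted third sample contributes nothing to $C_c$, while the sparsity term adds $\lambda$ times the mean selection probability to $C_G$.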

4. Information Bottleneck Interpretation

There is a principled link between ASSS and the Information Bottleneck (IB) formalism. In IB, the objective is

$$\max\; I(Z;Y) - B' I(Z;X)$$

where $Z \in \{0,1\}^N$ is a binary vector marking selected samples, $Y$ are the labels, and $X$ is the data.

The objective in ASSS aligns as follows:

  • $I(Z;Y)$ is lower-bounded by $\mathbb{E}_{Z,Y}[\log q(y \mid Z)] + H(Y)$, with $q(y \mid Z)$ approximated by the task network $C_\theta$. The negative cross-entropy $-C_c$ is thus a direct surrogate for this bound.
  • $I(Z;X) \approx \mathbb{E}[\|Z\|_1]$, corresponding to the sparsity penalty over $\sum_i p_i$.

Consequently, minimizing $C_G$ approximates maximizing the IB objective, balancing the predictive sufficiency of the subset and its compressiveness. This theoretical connection elucidates why ASSS selectively retains samples that are maximally informative for downstream prediction (Lyu et al., 5 Jan 2026).
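The lower bound on $I(Z;Y)$ follows from the standard variational argument, since the KL divergence between the true posterior $p(y \mid Z)$ and the task network's approximation $q(y \mid Z)$ is non-negative:

```latex
I(Z;Y) = H(Y) - H(Y \mid Z)
       = H(Y) + \mathbb{E}_{Z,Y}\bigl[\log p(y \mid Z)\bigr]
       \geq H(Y) + \mathbb{E}_{Z,Y}\bigl[\log q(y \mid Z)\bigr].
```

Because $H(Y)$ is a constant of the data, maximizing this bound over $\theta$ and $\phi$ reduces to minimizing the cross-entropy $C_c$.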

5. Training Algorithm and Deployment

Training proceeds via the following high-level procedure:

  1. Mini-batch sampling: Draw $\{(x_i, y_i)\}_{i=1}^B$.
  2. Selector step: Compute logits $s_i = G_\phi(x_i)$, selection probabilities $p_i = \sigma(s_i)$, sample Gumbel noises, and form weights $z_i$ (per the equation above).
  3. Task-network update: Compute $C_c(\theta, \phi)$ on the weighted mini-batch and update $\theta$ via gradient descent.
  4. Selector-network update: Compute $C_G(\phi, \theta)$ with fresh or reused Gumbel draws, and update $\phi$.
  5. Annealing: Adjust the temperature $\tau$.
  6. Stabilization: Employ the Two-Time-Scale Update Rule (TTUR), gradient clipping, and baseline subtraction as necessary.
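The steps above can be sketched end to end on a toy problem. This is a minimal NumPy illustration, not the paper's implementation: both networks are replaced by single linear layers, the synthetic labels depend only on feature 0, and finite-difference gradients stand in for backpropagation to keep the sketch dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, B = 4, 2, 64
# Toy data: the label depends only on feature 0 (a stand-in for a real dataset).
X = rng.normal(size=(512, d))
y = (X[:, 0] > 0).astype(int)

phi = np.zeros(d)                # selector parameters (linear scorer standing in for G_phi)
theta = np.zeros((d, K))         # task parameters (linear classifier standing in for C_theta)
lam, beta = 0.1, 0.1
eta_theta, eta_phi = 0.5, 0.05   # TTUR: task network updated faster than the selector

def losses(Xb, yb, tau, noise):
    """Return (C_c, C_G) for one mini-batch under the current parameters."""
    eps = 1e-12
    p = np.clip(1 / (1 + np.exp(-Xb @ phi)), eps, 1 - eps)
    g, g2 = noise
    z = 1 / (1 + np.exp(((np.log(1 - p) + g2) - (np.log(p) + g)) / tau))  # Gumbel-Softmax weights
    logits = (z[:, None] * Xb) @ theta
    logits = logits - logits.max(axis=1, keepdims=True)                   # stable log-softmax
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    C_c = -np.mean(z * logp[np.arange(len(yb)), yb])
    neg_ent = np.mean(p * np.log(p) + (1 - p) * np.log(1 - p))
    return C_c, C_c + lam * np.mean(p) - beta * neg_ent

def num_grad(f, w, h=1e-5):
    """Central finite differences in place of backprop (sketch only)."""
    grad = np.zeros_like(w)
    for i in np.ndindex(w.shape):
        w[i] += h; hi = f()
        w[i] -= 2 * h; lo = f()
        w[i] += h
        grad[i] = (hi - lo) / (2 * h)
    return grad

for step in range(60):
    tau = 1.0 - 0.9 * step / 59                          # anneal tau from 1.0 to 0.1
    idx = rng.integers(0, len(X), size=B)
    Xb, yb = X[idx], y[idx]
    noise = -np.log(-np.log(rng.uniform(size=(2, B))))   # Gumbel draws, reused across both steps
    theta -= eta_theta * num_grad(lambda: losses(Xb, yb, tau, noise)[0], theta)  # task step on C_c
    phi -= eta_phi * num_grad(lambda: losses(Xb, yb, tau, noise)[1], phi)        # selector step on C_G
```

Even with these crude stand-ins, the alternation trains a usable classifier, which is the structural point of the algorithm; a real deployment would use automatic differentiation and the stabilization measures listed in step 6.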

At inference, $p_i$ is computed for the full dataset, and the subsample is formed by either:

  • thresholding ($p_i > \kappa$) for the desired dataset compression; or
  • selecting the top-$M$ samples by $p_i$.
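Both deployment modes reduce to simple index selection. A small sketch (the function name `select_subset` is illustrative):

```python
import numpy as np

def select_subset(p, kappa=None, M=None):
    """Deployment-time selection from per-sample inclusion probabilities p.

    Pass exactly one of kappa (threshold) or M (top-M budget).
    Returns the indices of retained samples.
    """
    if kappa is not None:
        return np.flatnonzero(p > kappa)       # all samples with p_i > kappa
    return np.argsort(p)[::-1][:M]             # M highest-probability samples

p = np.array([0.9, 0.2, 0.7, 0.4, 0.95])
keep_thresh = select_subset(p, kappa=0.5)      # indices 0, 2, 4
keep_topM = select_subset(p, M=3)              # same three samples here
```

Thresholding fixes a confidence level but yields a data-dependent subset size, while top-$M$ fixes the budget exactly; the choice depends on whether compression ratio or selectivity is the binding constraint.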

6. Empirical Evaluation and Quantitative Findings

ASSS was empirically assessed on four large-scale, real-world tabular datasets from the KEEL repository (Connect-4, KDD_Cup, FARS, Shuttle), each posing distinct challenges in terms of size, dimensionality, class balance, and boundary complexity.

  • Evaluation setup: 5-fold cross-validation with 10 repeats; each method retained 30% of the full data.
  • Classifier: 3-layer MLP, identical across all baselines.
  • Selector: 2 hidden layers, Adam optimizer, learning rates $\eta_\theta=10^{-3}$, $\eta_\phi=10^{-4}$, $\beta=0.1$, annealed $\tau$.
  • Metrics: Accuracy, macro-averaged F-measure, macro AUC, PRR (Performance Retention Rate).

Comparison to random sampling, $K$-means clustering, and nearest-neighbor thinning yielded the following results (PRR; higher is better):

| Dataset   | ASSS   | Clustering | NN Thinning | Random |
|-----------|--------|------------|-------------|--------|
| Connect-4 | 92.5%  | 85.6%      | 60.9%       | ~70%   |
| FARS      | 99.2%  | 84.0%      | 75.3%       |        |
| KDD_Cup   | 109.7% | 95.4%      | 88.2%       |        |
| Shuttle   | ≈98.1% | 96.7%      | 97.2%       |        |

ASSS consistently outperformed all heuristic subsamplers, with the KDD_Cup dataset demonstrating PRR exceeding $100\%$, indicating effective denoising and improved generalization relative to the full dataset (Lyu et al., 5 Jan 2026).
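The PRR formula is not spelled out in this summary; assuming the usual reading (the subsample-trained score as a percentage of the full-data score), it is straightforward to compute:

```python
def performance_retention_rate(score_subsample, score_full):
    """PRR as a percentage; values above 100% mean the subsample-trained model
    beat the model trained on the full dataset (e.g. through denoising)."""
    return 100.0 * score_subsample / score_full

# Hypothetical accuracies, for illustration only: 0.83 after subsampling vs 0.82 full-data
prr = performance_retention_rate(0.83, 0.82)   # exceeds 100%
```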

7. Practical Considerations and Limitations

ASSS shows strong advantages for tasks characterized by:

  • Complex, non-linear decision boundaries: Geometry-based heuristics become ineffective, while ASSS’s gradient-driven selector adapts to the task-targeted information content.
  • Noisy or imbalanced data: The selector network can identify and filter misleading or redundant samples, and, as in the KDD_Cup case, sometimes enhances performance beyond the full-data baseline.
  • Clusterable (easy) problems: On datasets with clear cluster structure or low intrinsic complexity, ASSS performs on par with clustering/thinning, without sacrificing fidelity.

Key hyperparameters for effective deployment include $\lambda$ (sparsity–fidelity trade-off), the $\tau$ annealing schedule, and the learning-rate ratio ($\eta_\theta \gg \eta_\phi$).

Limitations:

  • Increased computational overhead due to the adversarial training loop.
  • Training stability is sensitive, necessitating TTUR, gradient clipping, and possibly baseline subtraction.
  • Application to date has been confined to supervised classification of tabular datasets; further exploration is needed for other data modalities or unsupervised settings.

In summary, Antagonistic Soft Selection Subsampling operationalizes data reduction as a learnable, information-theoretic, and task-aware process. By jointly optimizing predictive fidelity and subsample compactness, ASSS establishes a new standard for effective large-scale data learning and provides foundational insights for differentiable dataset selection frameworks (Lyu et al., 5 Jan 2026).
