
ASSS: Antagonistic Soft Selection Subsampling

Updated 12 January 2026
  • The paper introduces ASSS as a novel adversarial framework that recasts data subsampling into a learnable, task-aware process using a minimax game between selector and task networks.
  • It employs the Gumbel-Softmax trick for continuous relaxation, enabling gradient-friendly sample weighting and effective end-to-end optimization.
  • Empirical evaluations on multiple tabular datasets show that ASSS outperforms traditional heuristic methods, sometimes improving over full data training through intelligent denoising.

Antagonistic Soft Selection Subsampling (ASSS) is an adversarial, fully differentiable data reduction paradigm designed to address the computational bottlenecks that arise in training predictive models on large-scale datasets. ASSS recasts data subsampling as a learnable, task-aware process, replacing static, task-agnostic preprocessing heuristics with a continuous and optimizable selection strategy. A minimax game between a selector network and a predictive (task) network governs the retention of informative samples, with the optimization objective rooted in the information bottleneck principle. Empirical evaluations indicate that ASSS outperforms standard heuristic subsampling methods, sometimes even surpassing the performance obtained by training on the full dataset through intelligent denoising (Lyu et al., 5 Jan 2026).

1. Adversarial Framework

Given a labeled dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \dots, K\}$, ASSS establishes an adversarial (minimax) training dynamic between two neural networks:

  • Selector Network ($G_\phi$): Assigns each input $x_i$ a real-valued logit $s_i$, producing a selection probability $p_i = \sigma(s_i)$, where $\sigma$ is the logistic sigmoid function. The resulting $p_i$ reflects the “soft” probability of including $x_i$ in the subsample.
  • Task Network ($C_\theta$): Receives each $x_i$ attenuated by a continuous weight $z_i$ and outputs class probabilities $C_\theta(z_i x_i) \in \Delta^{K-1}$ for subsequent prediction.

The underlying optimization is bi-level but is approximated in practice by alternating gradient steps:

$$\min_{\phi}\; C_G(\phi, \theta^*(\phi)) \quad \text{subject to} \quad \theta^*(\phi) = \arg\min_{\theta} C_c(\theta, \phi)$$

Here, the task-network loss $C_c$ is the cross-entropy over weighted samples, and the selector-network loss $C_G$ balances task fidelity, sparsity, and entropy (diversity) of the selected distribution.

Instead of direct, intractable nested optimization, ASSS alternates between updating $\theta$ and $\phi$ via stochastic gradient descent steps, thus yielding a practical minimax training regime that endows the selector with task awareness.

2. Continuous Weighting via Gumbel-Softmax

To enable direct optimization via gradient descent, ASSS introduces continuous relaxation of sample inclusion through the Gumbel-Softmax trick. For each sample ii:

  • Uniform random variables $u_i, u'_i \sim \mathrm{Uniform}(0, 1)$ are drawn.
  • Gumbel noises $g_i, g'_i$ are computed as $g_i = -\log(-\log u_i)$ and $g'_i = -\log(-\log u'_i)$.
  • At temperature $\tau > 0$, the sample weight $z_i$ is set as:

$$z_i = \frac{\exp\bigl((\log p_i + g_i)/\tau\bigr)}{\exp\bigl((\log p_i + g_i)/\tau\bigr) + \exp\bigl((\log(1-p_i) + g'_i)/\tau\bigr)}$$

As $\tau \rightarrow 0$, $z_i$ approaches a hard Bernoulli draw; for higher $\tau$, the relaxation is soft and more gradient-friendly. Annealing $\tau$ from $1.0$ to $0.1$ during training is empirically effective. This parameterization allows gradients $\partial C_c/\partial\phi$ to be propagated end-to-end from the task network back to the selector network.
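As a numeric sanity check, the relaxation can be simulated directly. A minimal NumPy sketch (the helper name `gumbel_softmax_weight` is illustrative, not from the paper):

```python
import numpy as np

def gumbel_softmax_weight(p, tau, rng):
    """Relaxed Bernoulli weight z for inclusion probability p at temperature tau."""
    g, g2 = -np.log(-np.log(rng.uniform(size=2)))  # two independent Gumbel(0,1) draws
    a = (np.log(p) + g) / tau
    b = (np.log(1.0 - p) + g2) / tau
    m = max(a, b)  # numerically stable two-way softmax
    return np.exp(a - m) / (np.exp(a - m) + np.exp(b - m))

rng = np.random.default_rng(0)
# High temperature -> soft weights spread over (0,1); low temperature -> near-binary draws
soft = [gumbel_softmax_weight(0.7, tau=1.0, rng=rng) for _ in range(1000)]
hard = [gumbel_softmax_weight(0.7, tau=0.05, rng=rng) for _ in range(1000)]
```

At $\tau = 0.05$ the weights concentrate near 0 or 1 with empirical mean close to $p = 0.7$, matching the hard-Bernoulli limit described above.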

3. Loss Functions and Objective

The learning dynamics in ASSS are governed by a pair of loss functions:

  • Task-Network Loss (predictive fidelity):

$$C_c(\theta,\phi) = -\frac{1}{B}\sum_{i=1}^B z_i \sum_{k=1}^K \mathbf{1}\{y_i=k\} \log\bigl[C_\theta(z_i x_i)_k\bigr]$$

This is the standard cross-entropy over a mini-batch of size $B$, where each data point’s contribution is weighted by $z_i$.

  • Selector-Network Loss (fidelity, sparsity, entropy):

$$C_G(\phi, \theta) = C_c(\theta, \phi) + \lambda \, \frac{1}{B}\sum_{i=1}^B p_i - \beta \, \Bigl[\frac{1}{B}\sum_{i=1}^B\bigl(p_i\log p_i+(1-p_i)\log(1-p_i)\bigr)\Bigr]$$

The selector is penalized for exceeding a desired sample “budget” (the $\lambda$ term) and regularized to promote selection diversity (the $\beta$ term), preventing both sample collapse and excessive retention.

  • Minimax Game: Training alternately minimizes $C_c$ with respect to $\theta$ (task network) and $C_G$ with respect to $\phi$ (selector network).
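The two losses can be written out concretely. A minimal NumPy sketch (helper names are illustrative, not from the paper):

```python
import numpy as np

def task_loss(probs, labels, z):
    """C_c: cross-entropy with each sample's contribution scaled by its weight z_i."""
    B = len(labels)
    true_class_prob = probs[np.arange(B), labels]  # probability assigned to the true class
    return -np.mean(z * np.log(true_class_prob))

def selector_loss(C_c, p, lam, beta):
    """C_G = C_c + lambda * mean(p) - beta * mean(p log p + (1-p) log(1-p))."""
    eps = 1e-12  # guard against log(0)
    neg_entropy = np.mean(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
    return C_c + lam * np.mean(p) - beta * neg_entropy

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # C_theta outputs for a batch of 3
labels = np.array([0, 1, 0])
z = np.array([1.0, 1.0, 0.0])                            # third sample fully down-weighted
p = np.array([0.9, 0.8, 0.1])                            # selector probabilities
C_c = task_loss(probs, labels, z)
C_G = selector_loss(C_c, p, lam=0.5, beta=0.1)
```

Note that the down-weighted third sample contributes nothing to $C_c$, while the sparsity term adds $\lambda$ times the mean selection probability to $C_G$.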

4. Information Bottleneck Interpretation

There is a principled link between ASSS and the Information Bottleneck (IB) formalism. In IB, the objective is

$$\max\; I(Z;Y) - B' I(Z;X)$$

where $Z \in \{0,1\}^N$ is a binary vector marking selected samples, $Y$ are the labels, and $X$ is the data.

The objective in ASSS aligns as follows:

  • $I(Z;Y)$ is lower-bounded by $\mathbb{E}_{Z,Y}[\log q(y \mid Z)] + H(Y)$, with $q(y \mid Z)$ approximated by the task network $C_\theta$. The negative cross-entropy $-C_c$ is thus a direct surrogate for this bound.
  • $I(Z;X) \approx \mathbb{E}[\|Z\|_1]$, corresponding to the sparsity penalty over $\sum_i p_i$.

Consequently, minimizing $C_G$ approximates maximizing the IB objective, balancing the predictive sufficiency of the subset and its compressiveness. This theoretical connection elucidates why ASSS selectively retains samples that are maximally informative for downstream prediction (Lyu et al., 5 Jan 2026).
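The lower bound on $I(Z;Y)$ follows from the standard variational argument, since the KL divergence between the true posterior $p(y \mid Z)$ and the task network's approximation $q(y \mid Z)$ is non-negative:

```latex
I(Z;Y) = H(Y) - H(Y \mid Z)
       = H(Y) + \mathbb{E}_{Z,Y}\bigl[\log p(y \mid Z)\bigr]
       \geq H(Y) + \mathbb{E}_{Z,Y}\bigl[\log q(y \mid Z)\bigr].
```

Because $H(Y)$ is a constant of the data, maximizing this bound over $\theta$ and $\phi$ reduces to minimizing the cross-entropy $C_c$.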

5. Training Algorithm and Deployment

Training proceeds via the following high-level procedure:

  1. Mini-batch sampling: Draw $\{(x_i, y_i)\}_{i=1}^B$.
  2. Selector step: Compute logits $s_i = G_\phi(x_i)$, selection probabilities $p_i = \sigma(s_i)$, sample Gumbel noises, and form weights $z_i$ (per the equation above).
  3. Task-network update: Compute $C_c(\theta, \phi)$ on the weighted mini-batch and update $\theta$ via gradient descent.
  4. Selector-network update: Compute $C_G(\phi, \theta)$ with fresh or reused Gumbel draws, and update $\phi$.
  5. Annealing: Adjust the temperature $\tau$.
  6. Stabilization: Employ the Two-Time-Scale Update Rule (TTUR), gradient clipping, and baseline subtraction as necessary.
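The steps above can be sketched end to end on a toy problem. This is a minimal NumPy illustration, not the paper's implementation: both networks are replaced by single linear layers, the synthetic labels depend only on feature 0, and finite-difference gradients stand in for backpropagation to keep the sketch dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, B = 4, 2, 64
# Toy data: the label depends only on feature 0 (a stand-in for a real dataset).
X = rng.normal(size=(512, d))
y = (X[:, 0] > 0).astype(int)

phi = np.zeros(d)                # selector parameters (linear scorer standing in for G_phi)
theta = np.zeros((d, K))         # task parameters (linear classifier standing in for C_theta)
lam, beta = 0.1, 0.1
eta_theta, eta_phi = 0.5, 0.05   # TTUR: task network updated faster than the selector

def losses(Xb, yb, tau, noise):
    """Return (C_c, C_G) for one mini-batch under the current parameters."""
    eps = 1e-12
    p = np.clip(1 / (1 + np.exp(-Xb @ phi)), eps, 1 - eps)
    g, g2 = noise
    z = 1 / (1 + np.exp(((np.log(1 - p) + g2) - (np.log(p) + g)) / tau))  # Gumbel-Softmax weights
    logits = (z[:, None] * Xb) @ theta
    logits = logits - logits.max(axis=1, keepdims=True)                   # stable log-softmax
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    C_c = -np.mean(z * logp[np.arange(len(yb)), yb])
    neg_ent = np.mean(p * np.log(p) + (1 - p) * np.log(1 - p))
    return C_c, C_c + lam * np.mean(p) - beta * neg_ent

def num_grad(f, w, h=1e-5):
    """Central finite differences in place of backprop (sketch only)."""
    grad = np.zeros_like(w)
    for i in np.ndindex(w.shape):
        w[i] += h; hi = f()
        w[i] -= 2 * h; lo = f()
        w[i] += h
        grad[i] = (hi - lo) / (2 * h)
    return grad

for step in range(60):
    tau = 1.0 - 0.9 * step / 59                          # anneal tau from 1.0 to 0.1
    idx = rng.integers(0, len(X), size=B)
    Xb, yb = X[idx], y[idx]
    noise = -np.log(-np.log(rng.uniform(size=(2, B))))   # Gumbel draws, reused across both steps
    theta -= eta_theta * num_grad(lambda: losses(Xb, yb, tau, noise)[0], theta)  # task step on C_c
    phi -= eta_phi * num_grad(lambda: losses(Xb, yb, tau, noise)[1], phi)        # selector step on C_G
```

Even with these crude stand-ins, the alternation trains a usable classifier, which is the structural point of the algorithm; a real deployment would use automatic differentiation and the stabilization measures listed in step 6.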

At inference, $p_i$ is computed for the full dataset, and the subsample is formed by either:

  • thresholding ($p_i > \kappa$) for the desired dataset compression; or
  • selecting the top-$M$ samples by $p_i$.
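Both deployment modes reduce to simple index selection. A small sketch (the function name `select_subset` is illustrative):

```python
import numpy as np

def select_subset(p, kappa=None, M=None):
    """Deployment-time selection from per-sample inclusion probabilities p.

    Pass exactly one of kappa (threshold) or M (top-M budget).
    Returns the indices of retained samples.
    """
    if kappa is not None:
        return np.flatnonzero(p > kappa)       # all samples with p_i > kappa
    return np.argsort(p)[::-1][:M]             # M highest-probability samples

p = np.array([0.9, 0.2, 0.7, 0.4, 0.95])
keep_thresh = select_subset(p, kappa=0.5)      # indices 0, 2, 4
keep_topM = select_subset(p, M=3)              # same three samples here
```

Thresholding fixes a confidence level but yields a data-dependent subset size, while top-$M$ fixes the budget exactly; the choice depends on whether compression ratio or selectivity is the binding constraint.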

6. Empirical Evaluation and Quantitative Findings

ASSS was empirically assessed on four large-scale, real-world tabular datasets from the KEEL repository (Connect-4, KDD_Cup, FARS, Shuttle), each posing distinct challenges in terms of size, dimensionality, class balance, and boundary complexity.

  • Evaluation setup: 5-fold cross-validation with 10 repeats; each method retained 30% of the full data.
  • Classifier: 3-layer MLP, identical across all baselines.
  • Selector: 2 hidden layers, Adam optimizer, learning rates $\eta_\theta=10^{-3}$, $\eta_\phi=10^{-4}$, $\beta=0.1$, annealed $\tau$.
  • Metrics: Accuracy, macro-averaged F-measure, macro AUC, PRR (Performance Retention Rate).

Comparison to random sampling, $K$-means clustering, and nearest-neighbor thinning yielded the following results (PRR; higher is better):

| Dataset   | ASSS   | Clustering | NN Thinning | Random |
|-----------|--------|------------|-------------|--------|
| Connect-4 | 92.5%  | 85.6%      | 60.9%       | ~70%   |
| FARS      | 99.2%  | 84.0%      | 75.3%       |        |
| KDD_Cup   | 109.7% | 95.4%      | 88.2%       |        |
| Shuttle   | ≈98.1% | 96.7%      | 97.2%       |        |

ASSS consistently outperformed all heuristic subsamplers, with the KDD_Cup dataset demonstrating PRR exceeding $100\%$, indicating effective denoising and improved generalization relative to the full dataset (Lyu et al., 5 Jan 2026).
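The PRR formula is not spelled out in this summary; assuming the usual reading (the subsample-trained score as a percentage of the full-data score), it is straightforward to compute:

```python
def performance_retention_rate(score_subsample, score_full):
    """PRR as a percentage; values above 100% mean the subsample-trained model
    beat the model trained on the full dataset (e.g. through denoising)."""
    return 100.0 * score_subsample / score_full

# Hypothetical accuracies, for illustration only: 0.83 after subsampling vs 0.82 full-data
prr = performance_retention_rate(0.83, 0.82)   # exceeds 100%
```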

7. Practical Considerations and Limitations

ASSS shows strong advantages for tasks characterized by:

  • Complex, non-linear decision boundaries: Geometry-based heuristics become ineffective, while ASSS’s gradient-driven selector adapts to the task-targeted information content.
  • Noisy or imbalanced data: The selector network can identify and filter misleading or redundant samples, and, as in the KDD_Cup case, sometimes enhances performance beyond the full-data baseline.
  • Clusterable (easy) problems: On datasets with clear cluster structure or low intrinsic complexity, ASSS performs on par with clustering/thinning, without sacrificing fidelity.

Key hyperparameters for effective deployment include $\lambda$ (sparsity–fidelity trade-off), the $\tau$ annealing schedule, and the learning-rate ratio ($\eta_\theta \gg \eta_\phi$).

Limitations:

  • Increased computational overhead due to the adversarial training loop.
  • Training stability is sensitive, necessitating TTUR, gradient clipping, and possibly baseline subtraction.
  • Application to date has been confined to supervised classification of tabular datasets; further exploration is needed for other data modalities or unsupervised settings.

In summary, Antagonistic Soft Selection Subsampling operationalizes data reduction as a learnable, information-theoretic, and task-aware process. By jointly optimizing predictive fidelity and subsample compactness, ASSS establishes a new standard for effective large-scale data learning and provides foundational insights for differentiable dataset selection frameworks (Lyu et al., 5 Jan 2026).
