
Relaxed Top-K Operators

Updated 9 December 2025
  • RelaxedTopK is a class of techniques that approximate the non-differentiable top-k operation through methods like optimal transport, convex analysis, and successive halving.
  • These approaches enable gradient-based optimization by providing smooth surrogates, which are essential for efficient learning in applications such as deep models, ranking, and recommendations.
  • Practical schemes, including bucketed and threshold-based methods, trade exact selection for improved computational efficiency and scalable parallel performance on modern hardware.

RelaxedTopK operators encompass a significant class of techniques for relaxing the non-differentiable top-k selection operation. These relaxations are crucial in high-performance machine learning, optimization, and systems contexts, enabling gradient-based optimization and efficient parallelization where exact top-k selection is either computationally inefficient or incompatible with backpropagation. Approaches span continuous relaxations using optimal transport or convex analysis, approximate bucketed schemes for parallel hardware, and judiciously weakened semantics for lock-free data structures and databases.

1. Mathematical Formulations of RelaxedTopK

RelaxedTopK refers generically to smooth or approximate surrogates for the top-k operator, which hard-selects the k largest elements from a vector $x \in \mathbb{R}^n$. The main classes are:

  • Optimal Transport (SOFT Top-k): The top-k mask is reframed as an extremal solution of an optimal transport (OT) plan $\Gamma^*$ between the data vector and "bins." The discrete mask $A \in \{0,1\}^n$ is relaxed via an entropy-regularized transport plan $\Gamma^{*,\epsilon}$, leading to a smooth output $A^\epsilon$ dependent on the regularization parameter $\epsilon$ (Xie et al., 2020).
  • Convex Analysis & Permutahedron LP: The hard mask is the solution to $\arg\max_{y \in P(1_k)} \langle x, y\rangle$, where $P(1_k)$ is the permutahedron with $k$ ones. Smoothing via a $p$-norm yields $z^\star = \arg\max_{z\in P}\langle x, z\rangle - \lambda \|z\|_p^p$, giving a continuous, sparsity-preserving relaxation (Sander et al., 2023).
  • Successive Halving (Tournament): A sequence of pairwise softmax "matches" reduces the candidate set, providing a differentiable approximation with complexity $O(n \log(n/k))$ (Pietruszka et al., 2020).
  • Simple Thresholding via Sigmoid: DFTopK defines $f_k(x)_i = \sigma((x_i - \theta(x))/\tau)$ with temperature $\tau$ and threshold $\theta$ between $x_{[k]}$ and $x_{[k+1]}$, yielding closed-form, linear-time continuous relaxations (Zhu et al., 13 Oct 2025).
  • Bucketed Approximate Top-k: The input is partitioned into $b$ buckets, a per-bucket top-$k_b$ is performed, and the results are merged, significantly increasing parallelism and throughput on accelerators (Key et al., 2024).
  • Relaxation in Priority Queues: The $k$-LSM priority queue allows up to $\rho = T k$ elements (for $T$ threads) to bypass strict minimum semantics in delete_min, trading determinism for scalability (Wimmer et al., 2015).
  • Statistical Estimation Under Order Constraints: RelaxedTopK can denote returning the expected top-k under uncertain values and partial orders, formally outputting the $k$ items with the highest expectation given order constraints (Amarilli et al., 2017).
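Of these, the sigmoid-threshold form is the simplest to make concrete. The following NumPy sketch is illustrative only: the helper name is ours, and it uses a full sort where a linear-time selection (e.g., Quickselect) would be used in practice.

```python
import numpy as np

def relaxed_topk_sigmoid(x, k, tau=0.1):
    """Sigmoid-threshold relaxation of top-k (DFTopK-style sketch).

    The threshold is placed midway between the k-th and (k+1)-th
    largest values, so entries well above it map near 1 and entries
    well below it map near 0.
    """
    xs = np.sort(x)[::-1]                  # descending order (sketch only)
    theta = 0.5 * (xs[k - 1] + xs[k])      # threshold between x_[k] and x_[k+1]
    return 1.0 / (1.0 + np.exp(-(x - theta) / tau))

x = np.array([3.0, -1.0, 2.5, 0.1, 7.0])
mask = relaxed_topk_sigmoid(x, k=2, tau=0.05)
# Entries for the two largest values (7.0 and 3.0) come out near 1,
# the rest near 0, and the mask sums to approximately k = 2.
```

As $\tau \to 0$ the mask hardens toward the exact top-2 indicator; at larger $\tau$ it spreads mass toward entries near the threshold.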

Each of these formulations is designed to match, as closely as needed, the discrete semantics, while providing differentiability or computational advantages appropriate to the target application.

2. Algorithms and Computational Complexity

The principal RelaxedTopK algorithms and their computational complexity properties are as follows:

  • SOFT Top-k via Sinkhorn OT: Each Sinkhorn iteration is $O(n)$ for $m=2$ bins. The total forward pass is $O(nL)$ for $L$ iterations, with in-place GPU implementation feasible. The backward pass can leverage either unrolling (memory $O(Ln)$) or a closed-form implicit Jacobian (Xie et al., 2020).
  • Convex Permutahedron/Isotonic Regression: After initial sorting ($O(n \log n)$), the Pool Adjacent Violators (PAV) algorithm or Dykstra's alternating projections solve the isotonic problem in $O(n)$ or $O(Tn)$ for $T$ iterations (TPU/parallel-friendly). Backpropagation employs explicit block-based Jacobians (Sander et al., 2023).
  • Successive Halving: Requires $T = \lceil \log_2(n/k)\rceil$ rounds, each with $O(n)$ pairwise operations, overall $O(n \log(n/k))$; significant practical speedup for large $n, k$ (Pietruszka et al., 2020).
  • DFTopK Closed-Form: Selects $x_{[k]}$ and $x_{[k+1]}$ via linear-time selection (e.g., Quickselect, $O(n)$ average) and applies an elementwise sigmoid ($O(n)$); total time $O(n)$ per forward or backward pass (Zhu et al., 13 Oct 2025).
  • Bucketed Approximate Top-k: Parallel per-bucket top-$k_b$ in $O(n/b + k_b\log k_b)$ time per bucket, with an optional final $O(k\log k)$ stage if secondary selection is needed, maximizing throughput on GPU/TPU. Empirically, speedups of 3–6× over exact top-k are reported (Key et al., 2024).
  • $k$-LSM Priority Queue: Amortized $O(\log n)$ updates; relaxation allows the queue to avoid contention at the cost of determinism (Wimmer et al., 2015).
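The PAV inner solver behind the isotonic route is short enough to sketch. The version below is a generic least-squares non-increasing PAV in plain Python; it is an illustrative fragment, not the full permutahedron projection of Sander et al., which wraps such a solver with sorting and a transformation of the input.

```python
def pav_decreasing(y):
    """Pool Adjacent Violators: least-squares fit of a non-increasing
    sequence to y, in O(n). Adjacent blocks that violate monotonicity
    are merged into their weighted mean until the fit is monotone."""
    means, weights = [], []
    for v in y:
        means.append(float(v))
        weights.append(1)
        # Merge while the non-increasing constraint is violated.
        while len(means) > 1 and means[-2] < means[-1]:
            w = weights[-2] + weights[-1]
            m = (means[-2] * weights[-2] + means[-1] * weights[-1]) / w
            means[-2:] = [m]
            weights[-2:] = [w]
    out = []
    for m, w in zip(means, weights):
        out.extend([m] * w)
    return out

print(pav_decreasing([1, 3, 2]))   # [2.0, 2.0, 2.0]
print(pav_decreasing([3, 2, 1]))   # [3.0, 2.0, 1.0]
```

Because each element is merged into a block at most once, the loop is linear after the initial sort that these relaxations already require.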

These methods trade efficiency, parallelizability, and differentiability against possible deviations from hard top-k or strict ordering semantics.
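The bucketed trade-off can be made concrete with a short NumPy sketch (illustrative only: real accelerator kernels keep a small fixed-size queue per bucket rather than calling argpartition, and the function name is ours):

```python
import numpy as np

def bucketed_approx_topk(x, k, b, k_b=2):
    """Approximate top-k: interleave x into b buckets, take the top-k_b
    of each bucket in parallel, then run an exact top-k over the
    b * k_b merged candidates. Assumes len(x) is divisible by b."""
    n = len(x)
    # Interleaved assignment: element i goes to bucket i % b.
    buckets = x[np.arange(n).reshape(n // b, b).T]          # shape (b, n // b)
    idx = np.argpartition(buckets, -k_b, axis=1)[:, -k_b:]  # per-bucket top-k_b
    candidates = np.take_along_axis(buckets, idx, axis=1).ravel()
    return np.sort(candidates)[-k:][::-1]                   # final exact stage

rng = np.random.default_rng(0)
x = rng.normal(size=1024)
approx = bucketed_approx_topk(x, k=8, b=64)
exact = np.sort(x)[-8:][::-1]
# The approximation can only miss values, never overestimate:
# approx[i] <= exact[i] elementwise, and the global maximum is always kept
# (it is necessarily the maximum of its own bucket).
```

The approximation fails only when more than $k_b$ of the true top-$k$ land in the same bucket, which is what the interleaved assignment makes unlikely.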

3. Differentiable Relaxations: Gradients and Optimization

RelaxedTopK operators are designed to be compatible with gradient-based optimization. Key gradient properties include:

  • SOFT Top-k/OT: The dual formulation yields gradients via implicit function differentiation. For $\Gamma^{*,\epsilon}(x)$, the Jacobian $\partial A^\epsilon / \partial x$ is constructed in closed form, supporting efficient reverse-mode AD. As $\epsilon \rightarrow 0$, gradients can become ill-conditioned; careful tuning is required (Xie et al., 2020).
  • Isotonic/Permutahedron-based Relaxations: Jacobians for each isotonic block are obtainable analytically, with full $C^1$ smoothness for $p \in (1,2)$ and sparsity in the mask (Sander et al., 2023).
  • Successive Halving: The pairwise softmax and weighted sums are differentiable throughout. The backpropagation chain matches the forward tournament, with gradient propagation cost $O(nd\log(n/k))$ (Pietruszka et al., 2020).
  • DFTopK: All gradients are local except at two thresholded coordinates, dramatically reducing gradient competition seen in permutation-matrix approaches. This reduces train-time conflicts, particularly relevant in large-scale learning-to-rank and recommendation contexts (Zhu et al., 13 Oct 2025).
  • Statistical RelaxedTopK (order constraints): In probabilistic formulations, expectations are computed (exactly or via random walk sampling). Differentiability is less relevant; the surrogate is used for ranking when only incomplete or uncertain information is available (Amarilli et al., 2017).

The choice of regularization parameters ($\epsilon$, $p$, $\lambda$, $\tau$) mediates between bias toward the hard top-k and numerical stability of gradients.
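This bias-stability trade-off can be checked numerically. The sketch below is our own illustration: it uses the sigmoid-threshold surrogate of Section 1 as a representative relaxation and holds the threshold fixed when differentiating. Shrinking $\tau$ drives the relaxed mask toward the hard one, while gradients at coordinates away from the threshold vanish.

```python
import numpy as np

def soft_mask(x, k, tau):
    """Sigmoid-threshold top-k surrogate (representative relaxation)."""
    xs = np.sort(x)[::-1]
    theta = 0.5 * (xs[k - 1] + xs[k])
    return 1.0 / (1.0 + np.exp(-(x - theta) / tau))

x = np.array([2.0, 1.0, 0.0, -1.0])
hard = np.array([1.0, 1.0, 0.0, 0.0])   # exact top-2 mask

# Bias toward the hard mask shrinks as tau -> 0 ...
bias = {tau: float(np.abs(soft_mask(x, 2, tau) - hard).max())
        for tau in (1.0, 0.1, 0.01)}

# ... but the per-coordinate gradient d sigma/dx_i = s(1 - s)/tau
# (at fixed threshold) vanishes for coordinates away from the threshold.
def grad_at(x, k, tau, i):
    s = soft_mask(x, k, tau)[i]
    return s * (1.0 - s) / tau

grads = {tau: float(grad_at(x, 2, tau, i=1)) for tau in (1.0, 0.1, 0.01)}
# Both bias and these off-threshold gradients decrease monotonically
# across tau = 1.0, 0.1, 0.01 on this example.
```

The same qualitative behavior governs $\epsilon$ in the OT formulation and $\lambda$, $p$ in the permutahedron formulation.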

4. Practical Applications and Empirical Performance

RelaxedTopK operators have been successfully incorporated into a variety of machine learning and systems applications:

  • Differentiable $k$-Nearest Neighbors: SOFT Top-k used in end-to-end learnable kNN yields 99.4% (MNIST) and 92.6% (CIFAR-10) accuracy, outperforming Gumbel, NeuralSort, softmax-k, and two-stage relaxations (Xie et al., 2020).
  • Training of Deep Sparse Models: Sparsification of attention weights in transformers (SOFT Top-k, permutahedron relaxations) improves BLEU score (e.g., 37.3 vs 36.5 on WMT EN→DE) and enables more hardware-efficient models (Xie et al., 2020, Sander et al., 2023).
  • Vision Transformers and Sparse Mixture of Experts: RelaxedTopK forms enable efficient pruning (90% sparsity in MLPs), improved generalization, and precision gains in attention routing (Sander et al., 2023).
  • Differentiable Beam Search: Use of SOFT Top-k in beam search increases BLEU (from ∼35.4 to 36.3 on WMT EN→FR) by allowing gradient flow through selection (Xie et al., 2020).
  • Large-scale Recommendation: In RecFlow and industrial A/B tests, DFTopK outperforms NeuralSort, SoftSort, and LapSum (Recall@10@20: 0.4040 vs 0.3988), with +1.77% revenue lift at lower or equivalent computational cost (Zhu et al., 13 Oct 2025).
  • Highly Parallel Top-k for Accelerators: Approximate bucketed RelaxedTopK yields substantial GPU/TPU-accelerated Top-k throughput gains, with negligible recall drop across LLM and retrieval workloads (Key et al., 2024).
  • Lock-free Concurrent Data Structures: The $k$-LSM priority queue achieves scalable, lock-free prioritized scheduling with analytically bounded deviation from exact top-k semantics in highly concurrent environments (Wimmer et al., 2015).
  • Database/Uncertain Data Querying: RelaxedTopK under partial order and uncertainty returns the expected top-k, supporting tree-structured constraints in $O(n^2)$ time and full generality via an FPRAS (Amarilli et al., 2017).

These results demonstrate the versatility and empirical strength of RelaxedTopK approaches across regimes of differentiability, scale, and parallelism.

5. Limitations, Tuning, and Design Trade-offs

RelaxedTopK variants require careful parameter selection and exhibit characteristic limitations:

  • Parameter Selection:
    • For SOFT Top-k, $\epsilon$ controls bias and smoothness; too small leads to vanishing gradients, too large to diffuse allocations. $L$ (Sinkhorn iterations) balances accuracy and runtime (Xie et al., 2020).
    • In permutahedron relaxations, $p=2$ is fast but may lack $C^1$ smoothness; $p\approx 4/3$ delivers sparse, differentiable masks but with increased overhead per block (Sander et al., 2023).
    • DFTopK's temperature $\tau$ must be adapted to the model and dataset; failing to do so can yield poor mask quality or unstable gradients (Zhu et al., 13 Oct 2025).
  • Approximation Error and Failure Modes:
    • All continuous relaxations only recover the hard mask as the gap $\delta = x_{[k]} - x_{[k+1]}$ grows; small gaps or ties yield higher bias (Xie et al., 2020, Sander et al., 2023).
    • Ties may break uniqueness even under smoothing; entropy regularization or block pooling resolves most such cases.
  • Complexity-Accuracy Tradeoffs:
    • Exact $O(n\log n)$ selection is often unnecessary if task metrics tolerate small error; bucketed or successive-halving schemes offer better hardware utilization (Pietruszka et al., 2020, Key et al., 2024).
    • Systems relaxations ($k$-LSM) trade selection strictness for massive scalability; for priority scheduling, a high relaxation degree $k$ often has negligible practical effect (Wimmer et al., 2015).
  • Implementation Recommendations:
    • Prefer analytic Jacobians for block-based relaxations over unrolling; avoid unnecessary sorting or matrix operations (Sander et al., 2023).
    • For architecture-aware approximate top-k, assign elements to buckets in an interleaved fashion and keep the per-bucket queue size $k_b \leq 4$ for efficient register usage (Key et al., 2024).
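The interleaving advice in the last point can be seen in a tiny NumPy experiment (the helper below is our own, with the per-bucket queue collapsed to $k_b = 1$ for clarity): when large values arrive clustered, contiguous chunking loses most of the true top-$k$, while interleaved assignment keeps it.

```python
import numpy as np

def bucket_recall(x, k, b, interleaved):
    """Fraction of the true top-k that survives when each of b buckets
    keeps only its single maximum (per-bucket queue size k_b = 1)."""
    n = len(x)
    if interleaved:
        buckets = x[np.arange(n).reshape(n // b, b).T]  # element i -> bucket i % b
    else:
        buckets = x.reshape(b, n // b)                  # contiguous chunks
    survivors = buckets.max(axis=1)
    return np.intersect1d(survivors, np.sort(x)[-k:]).size / k

# Adversarial but common case: input arrives roughly sorted, so the
# largest values are clustered together.
x = np.arange(1024, dtype=float)
r_inter = bucket_recall(x, k=8, b=64, interleaved=True)   # 1.0
r_chunk = bucket_recall(x, k=8, b=64, interleaved=False)  # 0.125
```

With chunking, the eight largest values share one bucket and only its maximum survives; interleaving spreads them across eight different buckets.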

These guidance points are essential for deploying RelaxedTopK in high-throughput, differentiable, or large-scale systems contexts.

6. Connections and Comparisons Across Frameworks

RelaxedTopK concepts unify disparate advances across optimization, machine learning, and systems research:

  • Optimal Transport vs Convex Analysis: OT-based SOFT Top-k yields dense but smooth approximations, whereas convex analysis over the permutahedron can yield exactly sparse, $C^1$-smooth masks for $1 < p < 2$ (Xie et al., 2020, Sander et al., 2023).
  • Sorting/Permutation-matrix vs. Direct Mask Approaches: Methods adhering strictly to permutation matrices (NeuralSort, SoftSort) suffer $O(n^2)$ or $O(n\log n)$ cost and high gradient coupling due to row/column normalization (Zhu et al., 13 Oct 2025). Mask-based approaches with adaptive thresholds (DFTopK) avoid these costs by relaxing normalization in favor of per-index monotonicity and partial-sum constraints.
  • Learning-to-Rank (LTR) and Recommendation: DFTopK and its relatives mitigate objective misalignment and gradient conflicts seen in LTR frameworks (e.g., LambdaLoss, ARF, LCRON), supporting large-scale end-to-end recommendation model training (Zhu et al., 13 Oct 2025).
  • Parallel Scalability: Bucketed and relaxation-tuned approaches directly exploit high-bandwidth, many-core accelerators, critical in LLMs and batch scoring (Key et al., 2024).
  • Uncertainty-Aware Querying: RelaxedTopK generalizes to non-differentiable but tractable schemes for estimating top-k answers in the presence of order/uncertainty constraints, admitting exact, dynamic-programming-based, or FPRAS solutions (Amarilli et al., 2017).

A plausible implication is that new methods that flexibly interpolate between mask-based, isotonic, and threshold-based relaxations can further expand the efficiency and robustness envelope of RelaxedTopK in modern machine learning systems.
