NeuralSort: Differentiable Sorting Operator

Updated 13 December 2025

NeuralSort is a continuous relaxation of the sorting operator that uses temperature-controlled softmax to approximate permutation matrices.
It converts discrete, non-differentiable rank operations into smooth, row-stochastic matrices for gradient-based optimization.
Practical applications include learning-to-rank, differentiable k-nearest neighbor, and direct optimization of ranking metrics like NDCG.

NeuralSort is a continuous, temperature-controlled relaxation of the sorting operator that enables differentiable sorting within neural computation graphs. It belongs to a body of work on making rank-based operations, which are typically non-differentiable and thus incompatible with gradient-based optimization, tractable for end-to-end learning systems. NeuralSort replaces the discrete permutation matrix corresponding to sorting with a unimodal row-stochastic matrix whose rows approximate soft assignments to ranks; as the temperature approaches zero, this matrix converges to the exact permutation matrix. This property has enabled applications in learning-to-rank (LTR), differentiable k-nearest neighbor algorithms, and direct optimization of ranking metrics such as NDCG, bridging discrete combinatorial objectives and continuous optimization (Grover et al., 2019, Swezey et al., 2020, Pobrotyn et al., 2021).

1. Mathematical Definition and Core Construction

Let $s \in \mathbb{R}^n$ denote a vector of real-valued scores to be sorted in descending order. The discrete sort can be represented as a permutation matrix $P_{\mathrm{sort}(s)} \in \{0, 1\}^{n \times n}$, where row $i$ corresponds to rank $i$ and selects the $i$-th largest entry of $s$. NeuralSort constructs a continuous relaxation $\widehat{P}_\tau(s)$, parameterized by a scalar temperature $\tau > 0$, as follows:

Define the pairwise absolute-difference matrix:

$(A_s)_{ij} = |s_i - s_j|$

Set the rank offset $o_i = n+1-2i$ for $i = 1, \ldots, n$.
For each row $i$, compute:

$\widehat{P}_\tau(s)[i, :] = \mathrm{softmax}\left( \frac{o_i \cdot s - A_s 1}{\tau} \right)$

where $1 \in \mathbb{R}^n$ is the all-ones vector and softmax is row-wise.
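The construction above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `neural_sort`, the example scores, and the choice of $\tau = 0.05$ are assumptions made here, and NumPy provides no gradients, so an autodiff framework would be needed for training.

```python
import numpy as np

def neural_sort(s, tau=1.0):
    """Relaxed descending-sort matrix for a 1-D score vector s."""
    s = np.asarray(s, dtype=float)
    n = s.shape[0]
    # A_s[i, j] = |s_i - s_j|; its row sums give the vector A_s 1.
    A = np.abs(s[:, None] - s[None, :])
    row_sums = A.sum(axis=1)
    # Rank offsets o_i = n + 1 - 2i for i = 1..n.
    o = n + 1 - 2 * np.arange(1, n + 1)
    # Row i of the logits is (o_i * s - A_s 1) / tau.
    logits = (o[:, None] * s[None, :] - row_sums[None, :]) / tau
    # Numerically stable row-wise softmax.
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

s = np.array([0.3, 2.0, -1.0, 1.1])
P_hat = neural_sort(s, tau=0.05)
print(np.round(P_hat, 3))
print(P_hat @ s)  # close to s sorted in descending order
```

Multiplying the relaxed matrix by $s$ gives a "soft sorted" vector; at this low temperature it is numerically very close to the hard sort.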

By design, $\widehat{P}_\tau(s)$ is row-stochastic: its rows are non-negative and sum to one, each with a unique peak corresponding to a “soft” rank. As $\tau \to 0^+$, the softmax sharpens into a hard argmax, so $\widehat{P}_\tau(s) \to P_{\mathrm{sort}(s)}$ whenever $s$ has distinct entries, which holds almost surely for scores drawn from a continuous distribution (Grover et al., 2019, Swezey et al., 2020, Pobrotyn et al., 2021).

2. Theoretical Properties and Unimodality

NeuralSort guarantees several key properties:

Unimodality: Each row of $\widehat{P}_\tau(s)$ has a unique maximum, and these row-wise maxima fall in distinct columns, so taking the argmax of every row recovers a valid permutation.
Row-stochasticity: All entries are non-negative and rows sum to one.
Consistency: Under mild conditions (distinct scores), $\lim_{\tau \to 0^+} \widehat{P}_\tau(s) = P_{\mathrm{sort}(s)}$.
Differentiability: The mapping $s \mapsto \widehat{P}_\tau(s)$ is everywhere continuous and (almost everywhere) differentiable for $\tau > 0$.
Permissibility for backpropagation: The operator can be inserted anywhere a permutation or sort would be required in a computational graph, enabling end-to-end optimization via automatic differentiation (Grover et al., 2019, Pobrotyn et al., 2021).
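These properties can be checked numerically on a small example. The sketch below is illustrative only: the helper `neural_sort`, the random scores, and the chosen temperatures are assumptions made here, and a numerical check on one input is not a proof.

```python
import numpy as np

def neural_sort(s, tau):
    # Same construction as in Section 1: softmax((o_i * s - A_s 1) / tau) per row.
    n = len(s)
    row_sums = np.abs(s[:, None] - s[None, :]).sum(axis=1)
    o = n + 1 - 2 * np.arange(1, n + 1)
    z = (o[:, None] * s[None, :] - row_sums[None, :]) / tau
    z -= z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
s = rng.normal(size=6)
P_hard = np.eye(6)[np.argsort(-s)]   # exact descending-sort permutation matrix

P = neural_sort(s, tau=1.0)
row_ok = bool(np.allclose(P.sum(axis=1), 1.0) and (P >= 0).all())  # row-stochastic
argmax_is_sort = bool((P.argmax(axis=1) == np.argsort(-s)).all())  # unimodal
col_sums = P.sum(axis=0)             # generally not all equal to one

# Consistency: the gap to the hard permutation shrinks as tau decreases.
errors = [np.abs(neural_sort(s, t) - P_hard).max() for t in (1.0, 0.1, 0.01)]
print(row_ok, argmax_is_sort, np.round(col_sums, 3), errors)
```

Note that the row-wise argmax already matches the hard sort at $\tau = 1$; lowering the temperature only concentrates more mass on those entries.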

3. Applications in Learning to Rank and Ranking Metrics

A principal motivation for NeuralSort is to allow direct optimization of rank-based objectives such as NDCG and ARP. Let $y \in \mathbb{R}^n$ denote graded relevances, $g = 2^{y} - 1$ the gain vector (computed elementwise), and $f_\theta$ a scoring model producing predictions $s = f_\theta(x)$ for input features $x$. The ideal (non-differentiable) NDCG@k metric is

$\mathrm{NDCG}@k(s, y) = \frac{1}{\mathrm{maxDCG}@k} \sum_{j=1}^{k} \frac{[P_{\mathrm{sort}(s)}\, g]_j}{\log_2(1 + j)}$

where $\mathrm{maxDCG}@k = \sum_{j=1}^{k} \frac{[P_{\pi^*}\, g]_j}{\log_2(1 + j)}$ and $\pi^*$ is the ideal permutation, which orders items by decreasing relevance.

NeuralSort provides a differentiable surrogate by replacing $P_{\mathrm{sort}(s)}$ with $\widehat{P}_\tau(s)$:

$\widehat{\mathrm{NDCG}}@k(s, y; \tau) = \frac{1}{\mathrm{maxDCG}@k} \sum_{j=1}^{k} \frac{[\widehat{P}_\tau(s)\, g]_j}{\log_2(1 + j)}$

This framework underpins several LTR surrogates, notably in PiRank and NeuralNDCG, where the loss is defined as $1 - \widehat{\mathrm{NDCG}}@k$ or, equivalently up to an additive constant, its negation (Swezey et al., 2020, Pobrotyn et al., 2021).
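The relationship between the exact metric and its relaxed surrogate can be sketched as follows. The code assumes the common conventions of gains $2^{y} - 1$ and discounts $1/\log_2(1 + j)$; the function names and example data are hypothetical, and the published PiRank and NeuralNDCG losses include details (such as extra normalization) that are not reproduced here.

```python
import numpy as np

def neural_sort(s, tau):
    n = len(s)
    row_sums = np.abs(s[:, None] - s[None, :]).sum(axis=1)
    o = n + 1 - 2 * np.arange(1, n + 1)
    z = (o[:, None] * s[None, :] - row_sums[None, :]) / tau
    z -= z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dcg_at_k(sorted_gains, k):
    # Discount 1 / log2(1 + j) for ranks j = 1..k.
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    return float(sorted_gains[:k] @ discounts)

def ndcg_at_k(s, y, k, tau=None):
    gains = 2.0 ** y - 1.0
    ideal = dcg_at_k(np.sort(gains)[::-1], k)
    if tau is None:
        ranked = gains[np.argsort(-s)]        # hard sort of the gains by score
    else:
        ranked = neural_sort(s, tau) @ gains  # relaxed sort of the gains
    return dcg_at_k(ranked, k) / ideal

y = np.array([3.0, 0.0, 2.0, 1.0, 0.0])      # graded relevances
s = np.array([1.2, 0.4, 2.5, -0.3, 0.9])     # model scores
exact = ndcg_at_k(s, y, k=3)
soft = ndcg_at_k(s, y, k=3, tau=0.05)
loss = 1.0 - ndcg_at_k(s, y, k=3, tau=1.0)   # surrogate loss at a smoother temperature
print(exact, soft, loss)
```

Because the relaxed matrix is only row-stochastic, the surrogate at high temperature is not guaranteed to stay below one; it approaches the exact metric as $\tau$ decreases.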

4. Algorithmic Implementation and Extensions

A vectorized “forward pass” for a batch of $B$ score-vectors $S \in \mathbb{R}^{B \times n}$ proceeds as:

Compute pairwise-difference tensors: $A_{b,i,j} = |S_{b,i} - S_{b,j}|$.
Precompute rank-offset vector $o = (n+1-2i)_{i=1}^{n}$.
Build pre-softmax logits: $Z_{b,i,j} = (o_i S_{b,j} - \sum_{l} A_{b,j,l}) / \tau$.
Apply row-wise softmax over $j$ to obtain $\widehat{P}_{b,i,:} = \mathrm{softmax}(Z_{b,i,:})$.
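The four steps map directly onto broadcasting operations. The following NumPy sketch assumes a batch array `S` of shape `(B, n)`; the function name and example values are illustrative, not taken from any of the cited implementations.

```python
import numpy as np

def neural_sort_batch(S, tau=1.0):
    """S has shape (B, n); returns relaxed sort matrices of shape (B, n, n)."""
    S = np.asarray(S, dtype=float)
    B, n = S.shape
    # Step 1: pairwise absolute differences, shape (B, n, n).
    A = np.abs(S[:, :, None] - S[:, None, :])
    # Step 2: rank offsets o_i = n + 1 - 2i, shape (n,).
    o = n + 1 - 2 * np.arange(1, n + 1)
    # Step 3: logits[b, i, j] = (o_i * S[b, j] - sum_l A[b, j, l]) / tau.
    logits = (o[None, :, None] * S[:, None, :] - A.sum(axis=2)[:, None, :]) / tau
    # Step 4: softmax over the last axis (one distribution per rank row).
    logits -= logits.max(axis=2, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=2, keepdims=True)

S = np.array([[0.1, 3.0, 1.5],
              [2.0, -1.0, 0.5]])
P = neural_sort_batch(S, tau=0.1)
soft_sorted = np.einsum("bij,bj->bi", P, S)  # each row is approximately sorted descending
print(np.round(soft_sorted, 3))
```

Memory grows with the `(B, n, n)` intermediate tensors, which is the quadratic cost in the list size.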

An optional Sinkhorn normalization can be applied to obtain doubly-stochastic matrices (Pobrotyn et al., 2021).
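Sinkhorn normalization alternates column and row rescaling until both sets of sums are close to one. The sketch below is a generic version of that iteration under assumed settings (a fixed iteration count and a small `eps` for numerical safety); the stopping rule used in NeuralNDCG may differ.

```python
import numpy as np

def sinkhorn(P, n_iters=50, eps=1e-12):
    """Alternately normalise columns and rows of a positive matrix."""
    P = np.asarray(P, dtype=float)
    for _ in range(n_iters):
        P = P / (P.sum(axis=0, keepdims=True) + eps)  # columns sum to ~1
        P = P / (P.sum(axis=1, keepdims=True) + eps)  # rows sum to ~1
    return P

# A row-stochastic matrix whose columns do not sum to one.
P = np.array([[0.7, 0.2, 0.1],
              [0.5, 0.3, 0.2],
              [0.1, 0.1, 0.8]])
D = sinkhorn(P)
print(np.round(D.sum(axis=0), 6), np.round(D.sum(axis=1), 6))
```

In an autodiff framework the same loop is differentiable, so the normalized matrix can replace the raw NeuralSort output inside a loss.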

For large list sizes $n$, a direct application of NeuralSort incurs $O(n^2)$ complexity. PiRank introduces a divide-and-conquer extension:

View the vector as leaves of a $d$-level tree with branching $b$ so $n = b^d$.
At each merge level, apply NeuralSort to blocks, retaining only top-$k$ soft scores per node.
Compose the soft permutation across levels; the total complexity is reduced to $O(n^{1+1/d})$ (Swezey et al., 2020).
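As a rough illustration of why blocking helps, the count below compares the number of relaxed-matrix entries for one flat sort against the leaf level of a two-level tree. The block size `b`, the example sizes, and the helper names are assumptions for illustration; merge levels add further cost that depends on how many soft scores each node retains, and this is not PiRank's actual implementation.

```python
def flat_cost(n):
    # One n x n relaxed permutation matrix.
    return n * n

def leaf_block_cost(n, b):
    # n / b independent blocks, each with a b x b relaxed matrix.
    assert n % b == 0
    return (n // b) * b * b

n, b = 10_000, 100            # two-level tree with n = b ** 2
print(flat_cost(n))           # 100000000
print(leaf_block_cost(n, b))  # 1000000
```

With $n = b^2$ the leaf-level count is $n \cdot b = n^{3/2}$, a factor of $b$ fewer entries than the flat matrix.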

5. Empirical Performance and Benchmarks

In benchmarks on public LTR datasets (MSLR-WEB30K, Yahoo! C14), PiRank’s NeuralSort-based surrogates matched or outperformed established baselines (RankNet, LambdaRank, Softmax-loss, Approximate-NDCG, NeuralSort cross-entropy) on 13/16 metrics, with statistical significance on NDCG@5, 10, and 15. For example, on MSLR-WEB30K, PiRank achieved NDCG@10 = 0.4464 (best), and on Yahoo! C14, NDCG@10 = 0.7385 (best).

An ablation showed that increasing the training list size $n$ substantially improves performance for fixed test list sizes and top-$k$, with relative NDCG@1 gains greater than 10% as $n$ increases from 10 to 100 in the reported configuration. A synthetic experiment on the divide-and-conquer depth parameter $d$ confirmed the theoretical scaling in wall-clock time: $O(n^2)$ for $d = 1$ (flat NeuralSort), $O(n^{3/2})$ for $d = 2$ (binary-merge PiRank) (Swezey et al., 2020).

When applied to differentiable $k$-nearest neighbor classification, NeuralSort achieved accuracy competitive with task-specific convolutional networks and markedly superior to classic $k$NN baselines: 99.5% on MNIST, 93.5% on Fashion-MNIST, and 90.7% on CIFAR-10 (Grover et al., 2019).

6. Connections to Stochastic Optimization and Reparameterized Gradients

NeuralSort enables reparameterized stochastic optimization under permutation-valued distributions. Notably, for the Plackett–Luce distribution over permutations:

A permutation sample from $\mathrm{PL}(s)$ with $s \in \mathbb{R}_{+}^n$ can be reparameterized by adding i.i.d. Gumbel noise to $\log s$ and sorting.
By replacing the discrete sort with the NeuralSort relaxation in the surrogate loss, one obtains a low-variance, reparameterized gradient estimator suitable for policy gradients and variational inference in permutation-structured problems (Grover et al., 2019).
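The reparameterization can be sketched as below: perturb the log-scores with Gumbel noise, take the hard sort as the exact sample, and use the relaxed matrix of the same perturbed scores as its surrogate. The function names, scores, and temperature are assumptions for illustration. NumPy does not track gradients; in an autodiff framework the gradient would flow to $s$ through $\log s + g$ with the noise $g$ held fixed.

```python
import numpy as np

def neural_sort(s, tau):
    n = len(s)
    row_sums = np.abs(s[:, None] - s[None, :]).sum(axis=1)
    o = n + 1 - 2 * np.arange(1, n + 1)
    z = (o[:, None] * s[None, :] - row_sums[None, :]) / tau
    z -= z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sample_pl_relaxed(s, tau, rng):
    """Draw one Plackett-Luce sample and its relaxed permutation matrix."""
    g = rng.gumbel(size=s.shape)          # i.i.d. Gumbel(0, 1) noise
    perturbed = np.log(s) + g
    hard_perm = np.argsort(-perturbed)    # exact sample: item indices by rank
    P_soft = neural_sort(perturbed, tau)  # relaxed version of the same sort
    return hard_perm, P_soft

rng = np.random.default_rng(0)
s = np.array([5.0, 1.0, 3.0, 0.5])        # positive Plackett-Luce scores
perm, P_soft = sample_pl_relaxed(s, tau=0.1, rng=rng)
print(perm)
print(np.round(P_soft, 3))
```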

7. Limitations, Variants, and Practical Considerations

Several implementation aspects affect NeuralSort’s practical deployment:

Temperature selection: The smoothness-accuracy tradeoff is governed by $\tau$: small $\tau$ yields sharper approximations but potentially high gradient variance. Empirically, performance is reported to be robust across a range of $\tau$, and temperature annealing can sharpen the sort progressively (Pobrotyn et al., 2021).
Sinkhorn normalization (optional): For applications requiring doubly-stochastic constraints, post-processing via Sinkhorn scaling is feasible, though not inherently part of the original NeuralSort formulation (Pobrotyn et al., 2021).
Scalability to large lists: Direct $O(n^2)$ cost is prohibitive for large $n$, motivating hierarchical merge-style relaxations as in PiRank (Swezey et al., 2020).
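The effect of temperature annealing can be illustrated with a simple schedule. The geometric decay, its endpoints, and the "sharpness" measure (the smallest row-wise maximum) are hypothetical choices made for this sketch, not settings taken from the cited papers.

```python
import numpy as np

def neural_sort(s, tau):
    n = len(s)
    row_sums = np.abs(s[:, None] - s[None, :]).sum(axis=1)
    o = n + 1 - 2 * np.arange(1, n + 1)
    z = (o[:, None] * s[None, :] - row_sums[None, :]) / tau
    z -= z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def geometric_schedule(tau_start, tau_end, n_steps):
    # Hypothetical schedule: tau decays by a constant factor each step.
    return tau_start * (tau_end / tau_start) ** (np.arange(n_steps) / (n_steps - 1))

s = np.array([0.2, 1.0, 0.5, 0.9])
taus = geometric_schedule(1.0, 0.01, n_steps=5)
# Smallest row-wise peak: approaches 1 as the matrix nears a hard permutation.
sharpness = [neural_sort(s, t).max(axis=1).min() for t in taus]
for t, m in zip(taus, sharpness):
    print(f"tau={t:.4f}  min row max={m:.4f}")
```

Scores that are close together (here 1.0 and 0.9) need a lower temperature before their rows become sharp, which is one reason a single fixed $\tau$ may not suit every list.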

NeuralSort’s unimodal row-stochastic relaxation is distinct from doubly-stochastic approaches (e.g., the Sinkhorn operator), and it demonstrated superior accuracy on sorting and quantile regression tasks for small $n$ (Grover et al., 2019).

References:

(Grover et al., 2019) Stochastic Optimization of Sorting Networks via Continuous Relaxations
(Swezey et al., 2020) PiRank: Scalable Learning To Rank via Differentiable Sorting
(Pobrotyn et al., 2021) NeuralNDCG: Direct Optimisation of a Ranking Metric via Differentiable Relaxation of Sorting