Sinkhorn-Routed Encoder
- The paper shows that the Sinkhorn-Routed Encoder achieves quasi-global receptive fields through data-driven block sorting while significantly reducing memory and compute requirements.
- It employs a differentiable Sinkhorn operator to approximate permutation matrices, enabling efficient local attention and effective sequence truncation via SortCut.
- Empirical results demonstrate strong performance on language modeling and image generation tasks, remaining competitive with full self-attention at a fraction of the cost.
A Sinkhorn-Routed Encoder, also known as Sparse Sinkhorn Attention, is a memory- and compute-efficient attention architecture for Transformer models that leverages differentiable sorting via a Sinkhorn operator to route attention computation through dynamically re-ordered blocks of input sequences. The data-driven block sorting enables quasi-global receptive fields with local attention mechanisms, substantially reducing the memory and computation requirements compared to standard full self-attention while retaining competitive accuracy on tasks such as language modeling, sequence-to-sequence sorting, image generation, and textual entailment (Tay et al., 2020).
1. Architecture Overview
The Sinkhorn-Routed Encoder operates within a modified Transformer encoder block. An input sequence of length ℓ is partitioned into N_B contiguous blocks, each of size b = ℓ / N_B. A meta-sorting network (SortNet) summarizes each block (e.g., via sum-pooling or first-token selection) and scores block-level relationships through a small MLP, producing an N_B × N_B score matrix R. This matrix is transformed into a doubly-stochastic matrix via the differentiable Sinkhorn operator, approximating a permutation matrix.
The (soft) permutation P = S(R) is used to re-order ("sort") the sequence blocks: the i-th sorted block is the weighted mixture X'_i = Σ_j P_{ij} X_j. The sorted sequence is refolded to length ℓ, and standard block-local scaled dot-product attention is applied independently within each sorted block. Optionally, the SortCut operation truncates the sequence to the top-k blocks by importance after sorting, and a mixture with standard (unsorted) attention may be included for enhanced expressivity.
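This routing step can be sketched in NumPy as follows. This is a minimal illustration, not the paper's reference implementation: `sort_blocks`, the single linear SortNet weight `W`, the iteration count, and all shapes are assumptions.

```python
import numpy as np

def sort_blocks(X, block_size, W, n_iters=10):
    """Partition X (seq_len, d) into blocks, score them with a linear
    SortNet, relax the scores into a soft permutation via Sinkhorn
    normalization, and re-order the blocks accordingly."""
    seq_len, d = X.shape
    n_blocks = seq_len // block_size
    blocks = X.reshape(n_blocks, block_size, d)

    summaries = blocks.sum(axis=1)          # block summaries via sum-pooling
    R = summaries @ W                       # (n_blocks, n_blocks) score matrix

    # Sinkhorn normalization: alternate row/column normalization of exp(R)
    # to approach a doubly-stochastic (soft permutation) matrix.
    P = np.exp(R - R.max())
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)    # row normalization
        P = P / P.sum(axis=0, keepdims=True)    # column normalization

    # Soft block re-ordering: each output block is a P-weighted mixture.
    sorted_blocks = np.einsum('ij,jbd->ibd', P, blocks)
    return sorted_blocks.reshape(seq_len, d), P
```

In the full model, block-local attention is then applied within each block of the returned sequence.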
2. Differentiable Sorting and Sinkhorn Operator
Differentiable sorting is accomplished using the Sinkhorn operator S(·), which seeks a matrix P in the Birkhoff polytope B_{N_B} (the set of doubly-stochastic matrices) that minimizes the entropically regularized cost

    min_{P ∈ B_{N_B}} −⟨P, R⟩_F − τ h(P),

where R is the score matrix, h(P) = −Σ_{ij} P_{ij} log P_{ij} is the entropy, and τ is an entropic regularization parameter. In practice, iterative Sinkhorn-Knopp row/column normalization is performed on exp((R + G) / τ), where G is optional Gumbel noise and τ acts as a temperature. Each iteration alternates between row and column normalization:

    S^0(R) = exp(R),    S^k(R) = F_c(F_r(S^{k−1}(R))),

where F_r and F_c denote row-wise and column-wise normalization, respectively. A modest number of iterations (K ≈ 5–10) is typically sufficient for a high-fidelity approximation to a permutation. For numerical stability, the operations are often carried out in the log domain.
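A log-domain version of the normalization loop might look as follows (a sketch; the helper `logsumexp2d` and both function names are assumptions):

```python
import numpy as np

def logsumexp2d(Z, axis):
    """Numerically stable log-sum-exp along one axis, keeping dims."""
    m = Z.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(Z - m).sum(axis=axis, keepdims=True))

def log_sinkhorn(R, tau=1.0, n_iters=8):
    """Sinkhorn normalization carried out entirely in the log domain:
    returns Z = log P, with P approximately doubly stochastic."""
    Z = R / tau                              # temperature-scaled logits
    for _ in range(n_iters):
        Z = Z - logsumexp2d(Z, axis=1)       # row normalization
        Z = Z - logsumexp2d(Z, axis=0)       # column normalization
    return Z
```

Subtracting log-sums in log space is the exact analogue of dividing by row and column sums, but avoids overflow when the logits are large or sharply temperature-scaled.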
Causal Sinkhorn Balancing preserves the autoregressive property when the mechanism is used for decoding: future blocks are masked so that, during normalization, row i is balanced only over columns j ≤ i, implemented via masking and adjusted normalization.
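One way to realize such masking, sketched below, is to fix future-block logits at −∞ so they receive zero mass during both row and column normalization. The function name and the exact masking scheme are assumptions, not the paper's reference implementation.

```python
import numpy as np

def causal_log_sinkhorn(R, n_iters=5):
    """Masked Sinkhorn balancing in the log domain: entry (i, j) with
    j > i is fixed at -inf, so block i can only be mixed with source
    blocks j <= i, preserving the autoregressive property."""
    n = R.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where j > i
    Z = np.where(future, -np.inf, R.astype(float))
    for _ in range(n_iters):
        # exp(-inf) = 0, so masked entries never receive probability mass.
        m = Z.max(axis=1, keepdims=True)
        Z = Z - (m + np.log(np.exp(Z - m).sum(axis=1, keepdims=True)))  # rows
        m = Z.max(axis=0, keepdims=True)
        Z = Z - (m + np.log(np.exp(Z - m).sum(axis=0, keepdims=True)))  # cols
    return Z
```

Note that only a few balancing iterations are used here, matching the text's warning that too many iterations hurt decoder performance.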
3. Attention Routing and SortCut Truncation
After block sorting, each query token attends only to key tokens within its block in the new sorted order, i.e., positions i and j with ⌊i/b⌋ = ⌊j/b⌋, and attention is computed locally:

    A_{ij} = exp(Q_i · K_j / √d) / Σ_{j′ ∈ block(i)} exp(Q_i · K_{j′} / √d),

with the output Y_i = Σ_{j ∈ block(i)} A_{ij} V_j.
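The local attention step amounts to block-diagonal scaled dot-product attention, as in this sketch (names and shapes are illustrative; in the full model Q, K, and V come from the re-ordered sequence):

```python
import numpy as np

def block_local_attention(Q, K, V, block_size):
    """Scaled dot-product attention applied independently within each
    contiguous block of (already sorted) queries, keys, and values."""
    seq_len, d = Q.shape
    nb = seq_len // block_size
    Qb = Q.reshape(nb, block_size, d)
    Kb = K.reshape(nb, block_size, d)
    Vb = V.reshape(nb, block_size, d)
    scores = Qb @ Kb.transpose(0, 2, 1) / np.sqrt(d)      # (nb, b, b) per block
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per query
    return (weights @ Vb).reshape(seq_len, d)
```

With block_size equal to the sequence length this reduces to ordinary full attention, which is a convenient correctness check.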
The SortCut scheme further improves efficiency by selecting, after sorting, the top-k most important blocks and discarding the remainder, reducing attention cost to O(ℓ·k·b), which is linear in the sequence length ℓ for constant k and b, versus O(ℓ·b) for block-local and O(ℓ²) for full attention. The importance ranking is induced by the learned sorting, with attention computed only over the truncated block set.
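A sketch of SortCut under these definitions, assuming a soft permutation `P` has already been computed (`sortcut_attention` and all parameter names are illustrative):

```python
import numpy as np

def sortcut_attention(Q, K, V, P, block_size, k):
    """SortCut: soft-sort the key/value blocks with P, keep only the
    first k sorted blocks, and let every query attend to just those
    k * block_size positions (cost O(seq_len * k * block_size))."""
    seq_len, d = K.shape
    nb = seq_len // block_size
    Kb = K.reshape(nb, block_size, d)
    Vb = V.reshape(nb, block_size, d)
    # Soft block sorting, then truncation to the k highest-ranked blocks.
    Ks = np.einsum('ij,jbd->ibd', P, Kb)[:k].reshape(k * block_size, d)
    Vs = np.einsum('ij,jbd->ibd', P, Vb)[:k].reshape(k * block_size, d)
    scores = Q @ Ks.T / np.sqrt(d)                 # (seq_len, k * block_size)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ Vs
```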
4. Computational Complexity and Memory Usage
The Sinkhorn-Routed Encoder achieves significant reductions in time and memory complexity compared to vanilla Transformers:
| Approach | Time / Memory Complexity | Principal Parameters |
|---|---|---|
| Full attention | O(ℓ²) | Sequence length ℓ |
| Block-local | O(ℓ·b) | Block size b |
| Sinkhorn attention | O(ℓ·b + N_B²) | Num. blocks N_B, Sinkhorn iterations K |
| Sinkhorn + SortCut | O(ℓ·k·b) | Truncation budget k |
As a concrete illustration, partitioning ℓ tokens into N_B blocks means each query scores only b = ℓ / N_B keys instead of ℓ, roughly an N_B-fold reduction in attention memory versus full attention. The sorting network and Sinkhorn normalization scale with the number of blocks (an N_B × N_B score matrix); choosing the block size so that the attention and sorting terms are balanced yields the best overall complexity (Tay et al., 2020).
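A back-of-the-envelope comparison of attention-score entry counts makes the scaling concrete (the specific values of ℓ, b, and k below are arbitrary illustrative choices, not figures from the paper):

```python
# Count attention-score entries for each scheme; values are illustrative.
seq_len, b, k = 4096, 64, 2
nb = seq_len // b                     # 64 blocks
full = seq_len ** 2                   # full attention: one score per token pair
block_local = seq_len * b             # each query scores only its own block
sinkhorn = seq_len * b + nb ** 2      # block attention + N_B x N_B sorting scores
sortcut = seq_len * (k * b)           # each query scores the k retained blocks
print(full // block_local, full // sortcut)   # → 64 32
```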
5. Training and Optimization
The Sinkhorn-Routed Encoder is trained end-to-end with conventional, task-dependent primary loss functions (e.g., cross-entropy for language modeling), with no auxiliary objective for sorting. The Gumbel-Sinkhorn reparameterization enables differentiable sampling of approximate permutations, preserving gradients for backpropagation. Automatic differentiation is applied through the Sinkhorn loop, commonly in the log domain for stability. Gradient clipping (to norm 1 or 5) is applied at both the whole-Transformer and SortNet levels, and the Adam optimizer is standard. Reported hyperparameter choices include a sorting network of depth one (a single linear layer), a moderate temperature τ, and 5–10 Sinkhorn iterations; too many iterations, or a non-causal Sinkhorn in decoders, degrades performance.
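The Gumbel-Sinkhorn step can be sketched as below; the function name, noise clipping, and default temperature are assumptions for illustration.

```python
import numpy as np

def gumbel_sinkhorn(R, tau=1.0, n_iters=8, rng=None):
    """Relaxed permutation sampling: perturb the score matrix with
    Gumbel noise, scale by temperature tau, then Sinkhorn-normalize.
    Lower tau pushes the result closer to a hard permutation."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-12, high=1.0, size=R.shape)
    gumbel = -np.log(-np.log(u))                 # standard Gumbel noise
    logits = (R + gumbel) / tau
    P = np.exp(logits - logits.max())            # stable exponentiation
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)     # row normalization
        P = P / P.sum(axis=0, keepdims=True)     # column normalization
    return P
```

Because the noise enters additively before a differentiable normalization, gradients flow through the sampled soft permutation during backpropagation.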
6. Empirical Performance
Benchmarks show that the Sinkhorn-Routed Encoder matches or surpasses both full and sparse Transformer variants on diverse tasks. On sequence-to-sequence sorting, the Sinkhorn model attains a lower edit distance (0.4054) and higher exact match (49.2%) than a Sparse Transformer. For language modeling on LM1B (base model, 50M parameters), perplexity improves to 40.79 (Sinkhorn) from 41.57 (Transformer), and further to 40.11 with the mixture model. On the larger word-level LM1B setting (430M parameters), Sinkhorn achieves 28.39 (mixture: 27.34) versus 27.59 for the Transformer. Results on character-level LM1B, pixel-wise CIFAR-10 image generation, and document classification with SortCut encoders (IMDb, SNLI) demonstrate competitive accuracy and efficiency. Ablation studies indicate that disabling the Sinkhorn permutation (i.e., attending within unsorted blocks) severely degrades performance (LM1B perplexity rises to 52.4 from 40.8) and validate the necessity of causal Sinkhorn balancing for decoders (Tay et al., 2020).
7. Significance and Related Directions
The Sinkhorn-Routed Encoder introduces a paradigm of data-dependent, differentiable sequence reordering to unlock quasi-global attention with only local computation. By learning the permutation via end-to-end training and employing a principled Sinkhorn relaxation, the model combines the benefits of global receptive fields and scalability. This method relates to broader trends in efficient attention, sparse and block-based attention methods, and neural sorting. Innovations such as Causal Sinkhorn Balancing and SortCut truncation further extend its utility in both encoding and decoding contexts. The approach has become a reference point in subsequent work on learnable routing and efficient Transformer architectures (Tay et al., 2020).