
FastSoftRank: Efficient Large-Scale Ranking via ALS

Updated 22 February 2026
  • FastSoftRank is a ranking framework that introduces quadratic surrogates (RG² and RGˣ losses) to approximate the softmax loss while preserving NDCG consistency.
  • It employs a fast ALS optimization with closed-form updates, achieving global linear convergence and reducing wall-clock time significantly compared to SGD.
  • Empirical evaluations on datasets like MovieLens and Amazon-Electronics show improvements in NDCG@10 and MRR over traditional sampling-based ranking methods.

FastSoftRank is a framework for large-scale ranking tasks, providing NDCG-consistent softmax approximation losses coupled with an accelerated alternating least squares (ALS) optimization scheme. The approach centers on quadratic surrogates derived via Taylor expansion of the standard softmax loss, enabling dramatic improvements in computational efficiency without sacrificing ranking quality on core metrics such as NDCG and MRR. FastSoftRank comprises the Ranking-Generalizable squared (RG²) loss and Ranking-Generalizable interactive (RGˣ) loss, both of which are directly aligned with ranking metrics and permit closed-form ALS updates with global linear convergence rates (Pu et al., 11 Jun 2025).

1. Motivation: Computational Bottlenecks in Softmax for Ranking

Softmax (cross-entropy) loss is Bayes-consistent with Discounted Cumulative Gain (DCG) and its normalized form (NDCG), indirectly maximizing a lower bound on NDCG, and is thus standard in many modern ranking architectures. However, the quadratic computational cost of full softmax becomes prohibitive as object corpora scale. Even sampled softmax surrogates—which randomly sample a subset of negatives—retain significant overhead due to exponential and normalization terms and exhibit slow convergence under standard stochastic gradient descent (SGD). This motivates the search for smooth, quadratic surrogates that retain NDCG-consistency, but permit efficient optimization, preferably with global linear convergence.

2. Quadratic Surrogates via Taylor Expansion: RG² and RGˣ Losses

The per-context softmax loss for a positive pair (x, y) is

\ell_{\rm SM}(o^{(x)}) = -\log \left( \frac{e^{o^{(x)}_y}}{\sum_{j=1}^N e^{o^{(x)}_j}} \right).

A second-order Taylor expansion about the zero vector yields

\ell(o) \approx \log N - o_y + \frac{1}{2N} \| o + \mathbf{1}_N \|^2 - \frac{1}{2N^2} (\mathbf{1}_N^T o)^2.
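To make the expansion explicit (a short check, not additional material from the paper): at o = 0 the softmax loss has value, gradient, and Hessian

\ell_{\rm SM}(\mathbf{0}) = \log N, \qquad \nabla \ell_{\rm SM}(\mathbf{0}) = -e_y + \tfrac{1}{N}\mathbf{1}_N, \qquad \nabla^2 \ell_{\rm SM}(\mathbf{0}) = \tfrac{1}{N} I_N - \tfrac{1}{N^2}\mathbf{1}_N \mathbf{1}_N^T,

so the second-order expansion is \log N - o_y + \frac{1}{N}\mathbf{1}_N^T o + \frac{1}{2N}\|o\|^2 - \frac{1}{2N^2}(\mathbf{1}_N^T o)^2, which equals the completed-square form above up to an additive constant of 1/2.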

This produces two main surrogates by omitting or retaining the “interaction” term:

  • Ranking-Generalizable squared loss (RG²):

\mathcal{L}_{\rm RG^2}(o;y) = -o_y + \frac{1}{2N} \sum_{j=1}^N (o_j + 1)^2

  • Ranking-Generalizable interactive loss (RGˣ):

\mathcal{L}_{\rm RG^\times}(o;y) = -o_y + \frac{1}{2N} \| o + \mathbf{1}_N \|^2 - \frac{1}{2N^2} (\mathbf{1}_N^T o)^2

Gradients with respect to o are, respectively,

\nabla_o\,\mathcal{L}_{\rm RG^2}(o;y) = -e_y + \frac{1}{N}(o + \mathbf{1}_N)

\nabla_o\,\mathcal{L}_{\rm RG^\times}(o;y) = -e_y + \frac{1}{N}(o + \mathbf{1}_N) - \frac{1}{N^2}(\mathbf{1}_N^T o)\,\mathbf{1}_N
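As a concrete illustration, the two surrogates and their gradients translate directly into a few lines of NumPy. This is a minimal sketch of the formulas above for a single context; the function names and the NumPy setting are illustrative, not the authors' implementation.

```python
import numpy as np

def rg2_loss(o, y):
    """RG^2 surrogate: -o_y + (1/2N) * sum_j (o_j + 1)^2.

    o: length-N score vector for one context; y: index of the positive object.
    """
    N = o.shape[0]
    return -o[y] + 0.5 / N * np.sum((o + 1.0) ** 2)

def rgx_loss(o, y):
    """RG^x surrogate: RG^2 plus the global interaction term -(1/2N^2)(1^T o)^2."""
    N = o.shape[0]
    return rg2_loss(o, y) - 0.5 / N**2 * o.sum() ** 2

def rg2_grad(o, y):
    """Gradient of RG^2 w.r.t. o: -e_y + (o + 1)/N."""
    N = o.shape[0]
    g = (o + 1.0) / N
    g[y] -= 1.0
    return g

def rgx_grad(o, y):
    """Gradient of RG^x: RG^2 gradient minus (1^T o)/N^2 * 1."""
    N = o.shape[0]
    return rg2_grad(o, y) - (o.sum() / N**2) * np.ones(N)
```

A quick finite-difference comparison against these gradients is a convenient way to validate any reimplementation.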

3. Relationship to Weighted Squared Losses and Loss Paradigms

RG² can be interpreted as a weighted squared error over all (x, y) pairs,

\sum_{x,y} w_{x,y}\left(o^{(x)}_y + 1 - r_{x,y}\frac{N}{|\mathcal{I}_x|}\right)^2,

where r_{x,y} ∈ {0, 1} are the (click) labels, |\mathcal{I}_x| is the number of observed positives for context x, and the weights w_{x,y} depend on how the surrogate is derived (full versus sampled negatives); a worked check under one natural choice of weights is given below. This directly recovers the "weighted squared loss" (WSL) employed in ALS-based recommenders and unifies sampling-based with non-sampling surrogates, offering a direct ranking-consistency interpretation. RGˣ refines RG² by additionally penalizing the global sum of scores, thereby tracking the original softmax function more closely.
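As a concrete check, take the illustrative uniform choice w_{x,y} = |\mathcal{I}_x|/(2N) (one natural derivation, not necessarily the paper's exact weighting). Expanding the square for a single context x gives

\sum_{j=1}^{N} \frac{|\mathcal{I}_x|}{2N}\left(o^{(x)}_j + 1 - r_{x,j}\frac{N}{|\mathcal{I}_x|}\right)^2 = \frac{|\mathcal{I}_x|}{2N}\sum_{j=1}^{N}\left(o^{(x)}_j + 1\right)^2 - \sum_{y \in \mathcal{I}_x} o^{(x)}_y + \mathrm{const},

which is exactly \sum_{y \in \mathcal{I}_x} \mathcal{L}_{\rm RG^2}(o^{(x)}; y) up to an additive constant.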

4. Optimization via Alternating Least Squares (ALS)

In FastSoftRank, the score matrix of context-object affinities is factored in the standard low-rank form O = PQ^T, with context factors P (an M × d matrix) and object factors Q (an N × d matrix), so that o^{(x)}_j = p_x^T q_j. For RG², the objective summed over observed positive pairs becomes

\mathcal{L}_{\rm RG^2}(P, Q) = \sum_{x} \left[ -\sum_{y \in \mathcal{I}_x} p_x^T q_y + \frac{|\mathcal{I}_x|}{2N} \sum_{j=1}^N \left( p_x^T q_j + 1 \right)^2 \right],

where

\mathcal{I}_x denotes the set of observed positives for context x.

ALS updates each context factor p_x and each object factor q_j in turn by solving d × d normal equations; setting the gradient with respect to p_x to zero gives

\frac{|\mathcal{I}_x|}{N} \left( Q^T Q \right) p_x = Q^T \left( r_x - \frac{|\mathcal{I}_x|}{N} \mathbf{1}_N \right),

with r_x ∈ {0,1}^N the label vector of context x and Q^T Q a single d × d Gram matrix shared by all contexts (the object-side update is analogous). The dominant per-iteration cost is forming these Gram matrices and solving one small d × d linear system per context and per object.

ALS achieves global linear convergence (the suboptimality contracts geometrically, so an ε-accurate solution is reached in O(log(1/ε)) iterations), outperforming the sublinear convergence of SGD applied to the original or sampled softmax losses. Newton-CG achieves locally superlinear rates but incurs high computational cost due to Hessian-vector products.
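The sketch below shows what one such ALS sweep can look like for the RG² objective under the factorization above. It is a minimal dense NumPy rendering derived by setting the gradients to zero; the function name, the L2 regularizer lam, and the dense treatment of the label matrix are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def als_sweep_rg2(R, P, Q, lam=1e-2):
    """One ALS sweep on the RG^2 objective with scores O = P @ Q.T.

    R   : (M, N) binary matrix of observed positives (r_{x,y} in {0, 1}).
    P   : (M, d) context factors, updated in place.
    Q   : (N, d) object factors, updated in place.
    lam : L2 regularization strength (assumed hyperparameter, not from the paper).
    Dense for clarity; a practical version would exploit the sparsity of R.
    """
    M, N = R.shape
    d = P.shape[1]
    c = R.sum(axis=1)                              # |I_x|: positives per context

    # Context sweep: (c_x/N * Q^T Q + lam*I) p_x = Q^T (r_x - c_x/N * 1_N)
    G = Q.T @ Q                                    # shared d x d Gram matrix
    B = (R - np.outer(c, np.ones(N)) / N) @ Q      # row x is Q^T (r_x - c_x/N * 1_N)
    for x in range(M):
        P[x] = np.linalg.solve(c[x] / N * G + lam * np.eye(d), B[x])

    # Object sweep: the d x d system matrix is shared by every object
    A = (P.T * c) @ P / N + lam * np.eye(d)        # (1/N) P^T diag(c) P + lam*I
    rhs = R.T @ P - np.outer(np.ones(N), (P.T @ c) / N)  # row j: P^T r_{:,j} - (1/N) P^T c
    Q[:] = np.linalg.solve(A, rhs.T).T
    return P, Q
```

A full optimization run would simply alternate such sweeps until the validation NDCG plateaus.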

5. Empirical Performance: Ranking Quality, Convergence, and Scalability

Benchmarks include MovieLens-10M, Amazon-Electronics, Steam (recommendation), and Simple-Wiki (link prediction), evaluated with NDCG@10 and MRR@10 (MAP@10 for Simple-Wiki). RG² and RGˣ consistently match or exceed sampling-based baselines (e.g., BPR, BCE, Sampled-Softmax, Sparsemax, UIB, SML) on these ranking metrics and equal or slightly exceed the original full softmax. Observed NDCG@10 gains over softmax are 2–5%, with improvements over weighted-ALS (WRMF) of 4–10%.

In terms of optimization speed, ALS on the RG² or RGˣ objective reaches comparable NDCG plateaus in 10–15 epochs (each comprising one sweep over the context factors P and one over the object factors Q), in 5–10× less wall-clock time than softmax SGD, which needs 50–100 epochs to reach the same plateau. At this scale, a single ALS sweep takes seconds on a modern GPU, whereas softmax SGD requires several minutes.

6. Theoretical Significance and Practical Applicability

FastSoftRank provides a principled means of deriving NDCG-consistent quadratic losses, bridging the gap between sampling-based and non-sampling objectives in similarity learning. Its quadratic structure admits closed-form ALS updates with strong convergence guarantees, offering practical efficiency and theoretical clarity. The approach is algorithmically aligned with direct optimization of ranking metrics and is suitable for a broad class of large-scale ranking, recommendation, and link prediction tasks where both computational cost and metric fidelity are of paramount concern (Pu et al., 11 Jun 2025).

A plausible implication is that FastSoftRank positions quadratic ALS-based rankers as competitive with, and often superior to, deep cross-entropy-based matrix factorization frameworks in large-scale settings where time-to-convergence is critical and NDCG-centric evaluation governs model selection.
