
Dual-Encoder & Probe Classifiers for XMC

Updated 21 January 2026
  • Dual-encoder and probe classifiers are architectures that map queries and labels into a shared embedding space to rank relevant labels in extremely large label sets.
  • Unified loss functions combined with PSL sampling reduce computational costs by focusing on a narrowed label pool while maintaining high precision.
  • Empirical benchmarks demonstrate these methods achieve state-of-the-art performance and significant resource efficiency, enabling scalable XMC on commodity hardware.

Dual-encoder (DE) architectures and probe classifiers are critical components for extreme multi-label classification (XMC), where the task is to predict a subset of relevant labels from an extremely large label space by leveraging both input queries and label text features. Traditionally, models have employed dual encoders to map queries and labels into a shared embedding space, often complemented by one-vs-all (OvA) classifiers that rerank shortlisted labels. This paradigm achieves strong empirical performance but has historically suffered from prohibitive computational costs due to the need for loss evaluation across massive label spaces. Recent work such as UniDEC addresses these challenges by unifying dual-encoder and classifier training with multi-class loss formulations and novel sampling strategies that significantly improve resource efficiency and scalability (Kharbanda et al., 2024).

1. Dual-Encoder Architecture for Extreme Multi-Label Classification

A dual-encoder framework for XMC begins with a shared backbone text encoder, denoted $\Phi$, which maps any textual input (query or label) $t$ into a $d_\Phi$-dimensional space: $\Phi(t) \in \mathbb{R}^{d_\Phi}$. Two projection heads further process this output:

  • DE head: $z = g_1(\Phi(t)) \in \mathbb{R}^{d}$, followed by L2 normalization to yield $\hat{z} = z/\|z\|$.
  • Classifier head: $h = g_2(\Phi(t)) \in \mathbb{R}^{d}$, without normalization.

In the DE tower, both input queries $x$ and label texts $\mathrm{label\_text}_l$ are encoded via $\Phi$ and $g_1$, producing normalized embeddings $\hat{z}_q$ and $\hat{z}_l$. Similarity is defined by the inner product $s_{q2l}(i,l) = \langle \hat{z}_{q_i}, \hat{z}_l \rangle$.

The probe (one-vs-all) classifier head utilizes a learnable lookup matrix $\Psi \in \mathbb{R}^{L \times d}$, where each row $\Psi_l$ serves as the classification weight vector for label $l$. Query representations $h_{q_i}$ are scored against $\Psi_l$ via $s_{clf}(i,l) = \langle h_{q_i}, \Psi_l \rangle$.
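The two-head design above can be sketched in a few lines of NumPy. The toy `phi` encoder, the random projection matrices, and all sizes below are illustrative stand-ins (the actual system uses a transformer backbone), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_phi, d, L = 768, 256, 1000        # encoder dim, projection dim, label count (toy sizes)

def phi(text: str) -> np.ndarray:
    """Toy stand-in for the shared backbone encoder Phi (a transformer in practice):
    hashes tokens into a d_phi-dimensional bag-of-words vector."""
    v = np.zeros(d_phi)
    for tok in text.split():
        v[hash(tok) % d_phi] += 1.0
    return v

# Two projection heads on top of the shared backbone output.
G1 = rng.standard_normal((d, d_phi)) / np.sqrt(d_phi)   # DE head g_1
G2 = rng.standard_normal((d, d_phi)) / np.sqrt(d_phi)   # classifier head g_2

def de_embed(text: str) -> np.ndarray:
    """z_hat = g_1(Phi(t)) / ||g_1(Phi(t))||  (L2-normalized DE embedding)."""
    z = G1 @ phi(text)
    return z / np.linalg.norm(z)

def clf_embed(text: str) -> np.ndarray:
    """h = g_2(Phi(t)), deliberately left unnormalized."""
    return G2 @ phi(text)

Psi = rng.standard_normal((L, d))   # learnable lookup matrix: row l scores label l

query = "wireless noise cancelling headphones"
s_q2l = de_embed(query) @ de_embed("bluetooth headphones")  # DE similarity s_q2l
s_clf = clf_embed(query) @ Psi[42]                          # probe-classifier score s_clf
```

Note that only the DE embeddings are normalized, so $s_{q2l}$ is a cosine similarity in $[-1,1]$, while the classifier score is an unbounded inner product.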

2. Unified Loss Functions and Pick-Some-Labels (PSL) Reduction

Exact loss computation over the entire label space, such as with the PAL-N loss,

$$L_{\mathrm{PAL\text{-}N}}(x_i) = -\frac{1}{|P_i|} \sum_{l \in P_i} \log\left(\frac{\exp(s(i,l)/\tau)}{\sum_{k=1}^{L} \exp(s(i,k)/\tau)}\right)$$

is intractable for label sizes $L \sim 10^6$. The PSL reduction addresses this by sampling a subset of labels per batch. For each query, up to $\beta$ positives and $\eta$ hard negatives (from an external ANNS index) are selected, and all in-batch positives for other queries are included. The resulting label pool $L_B$ typically includes 1500–4000 labels.

Within this reduced pool, multi-class cross-entropy losses are computed for both DE and classifier heads in symmetric “query-to-label” and “label-to-query” directions:

  • For the DE head, PSL loss is

$$L_{q2l}^{DE} = \sum_{i \in Q_B} \left[-\frac{1}{|\mathcal{P}_i^B|} \sum_{p \in \mathcal{P}_i^B} \log \frac{\exp(\langle \hat{z}_{q_i}, \hat{z}_{l_p}\rangle/\tau)}{\sum_{l \in L_B} \exp(\langle \hat{z}_{q_i}, \hat{z}_l\rangle/\tau)}\right]$$

with analogous forms for $L_{l2q}^{DE}$ and for the classifier losses.

  • The total unified loss is

$$L = \lambda \cdot L_{DE} + (1-\lambda) \cdot L_{clf}$$

where $L_{DE}$ and $L_{clf}$ are each themselves symmetric combinations of their query-to-label and label-to-query terms, weighted equally ($\lambda_{DE} = \lambda_{clf} = 0.5$).
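A minimal sketch of the reduced multi-class cross-entropy is shown below, assuming the score matrices over the pooled labels have already been computed by the two heads; the symmetric $l2q$ terms would reuse the same function with the roles of queries and pooled labels swapped. The batch/pool sizes, temperature, and toy positives are illustrative only:

```python
import numpy as np

def psl_multiclass_loss(S, pos_mask, tau=0.05):
    """Multi-class cross-entropy over the reduced pool L_B only.
    S: (|Q_B|, |L_B|) score matrix for one head and one direction.
    pos_mask: boolean matrix of in-pool positives (each row has at least one)."""
    logits = S / tau
    m = logits.max(axis=1, keepdims=True)                  # stabilizer for logsumexp
    log_Z = np.log(np.exp(logits - m).sum(axis=1, keepdims=True)) + m
    log_p = logits - log_Z                                 # row-wise log-softmax
    return float((-(log_p * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)).sum())

rng = np.random.default_rng(0)
Q, LB, lam = 4, 16, 0.5                  # toy batch size, pool size, mixing weight
S_de = rng.standard_normal((Q, LB))      # DE-head scores  <z_hat_q, z_hat_l>
S_clf = rng.standard_normal((Q, LB))     # classifier-head scores <h_q, Psi_l>
pos = np.zeros((Q, LB), dtype=bool)
pos[np.arange(Q), np.arange(Q)] = True   # pretend pooled label i is query i's positive

# L = lambda * L_DE + (1 - lambda) * L_clf
total = lam * psl_multiclass_loss(S_de, pos) + (1 - lam) * psl_multiclass_loss(S_clf, pos)
```

With uniform scores the per-query loss reduces to $\log |L_B|$, which is a handy sanity check when implementing the reduction.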

3. PSL Sampling and In-Batch Optimization Strategy

PSL-based training minimizes memory and computation via focused supervision. At the start of each epoch, or at regular intervals, a global ANNS index over the label embeddings $\{\hat{z}_l\}$ is refreshed to facilitate hard-negative mining. For each batch:

  • Up to $\beta$ positives and $\eta$ hard negatives are sampled per query.
  • The union of these, along with in-batch positives for all queries, defines $L_B$.
  • Losses $L_{DE}$ and $L_{clf}$ are computed over this reduced pool.
  • The entire model—encoder, both projection heads, and probe classifier—is updated end-to-end with a single backward pass.

Typical batch sizes are $|Q_B| = 512$–$2048$ and $|L_B| = 1500$–$4000$. This strategy automatically provides in-batch negatives (drawn from the positives of other queries) and keeps per-batch memory and compute practical even as the total label count reaches $L \approx 1.3\,\mathrm{M}$.
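The pool-construction step above can be sketched as follows. The `positives` and `anns_negatives` mappings are hypothetical inputs standing in for the ground-truth label lists and for hits from the periodically refreshed ANNS index:

```python
import numpy as np

def build_label_pool(batch_queries, positives, anns_negatives, beta=3, eta=2, seed=0):
    """Assemble the reduced PSL label pool L_B for one batch.
    positives[q]      : ground-truth label ids for query q.
    anns_negatives[q] : hard-negative label ids for q, assumed mined from the
                        periodically refreshed ANNS index (hypothetical input)."""
    rng = np.random.default_rng(seed)
    pool = set()
    for q in batch_queries:
        pos = positives[q]
        sampled = rng.choice(pos, size=min(beta, len(pos)), replace=False)
        pool.update(int(l) for l in sampled)     # up to beta positives per query
        pool.update(anns_negatives[q][:eta])     # up to eta hard negatives per query
    # The union across queries already contains every query's in-batch positives,
    # so each query sees the other queries' positives as extra negatives.
    return sorted(pool)

positives = {0: [10, 11, 12], 1: [20], 2: [10, 30]}
hard_negs = {0: [99, 98], 1: [97, 96], 2: [95, 94]}
L_B = build_label_pool([0, 1, 2], positives, hard_negs, beta=2, eta=1)
```

Scores and losses are then computed only over the columns indexed by `L_B`, which is what keeps the per-batch logit matrix small.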

4. Inference and Indexing

At inference time, an approximate nearest neighbor search (ANNS) index is constructed over concatenated label representations $\{\hat{z}_l \oplus \mathrm{normalize}(\Psi_l)\}$. For each incoming query, the concatenated representation $\hat{z}_q \oplus \mathrm{normalize}(h_q)$ is queried against this structure using top-$K$ retrieval, and the retrieved labels are returned as predictions. This unified representation leverages both DE and probe-classifier features for robust label ranking.
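A brute-force NumPy stand-in for the concatenated-representation retrieval is given below; a production system would replace the exact `argsort` with an ANNS library (e.g., HNSW- or FAISS-style indices), and all sizes here are toy values:

```python
import numpy as np

def l2_normalize(v):
    """Normalize along the last axis; works for both (L, d) matrices and (d,) vectors."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
L, d, K = 1000, 64, 5
z_labels = l2_normalize(rng.standard_normal((L, d)))  # DE label embeddings z_hat_l
Psi = rng.standard_normal((L, d))                     # classifier rows Psi_l

# Index over concatenated representations [z_hat_l ; normalize(Psi_l)].
index = np.concatenate([z_labels, l2_normalize(Psi)], axis=1)   # shape (L, 2d)

# Query side mirrors the label side: [z_hat_q ; normalize(h_q)].
z_q = l2_normalize(rng.standard_normal(d))
h_q = rng.standard_normal(d)
query_vec = np.concatenate([z_q, l2_normalize(h_q)])

scores = index @ query_vec            # inner products against all L labels
top_k = np.argsort(-scores)[:K]       # exact top-K; ANNS approximates this step
```

Because both halves are unit-normalized, the concatenated inner product is simply the sum of the DE cosine similarity and the normalized classifier score, which is what lets one index serve both heads.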

5. Computational Complexity and Resource Efficiency

The computational cost per batch is $O(|Q_B| \cdot \mathrm{cost}(\Phi)) + O(|L_B| \cdot \mathrm{cost}(\Phi))$. In practice, this translates to orders-of-magnitude reductions in memory and GPU utilization compared with:

  • Traditional one-vs-all (OvA) classifiers over all $L \approx 10^6$ labels, which require storage and parameter updates for every label.
  • Full-softmax architectures, which are infeasible due to the prohibitive label space.
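A back-of-envelope comparison of the per-batch logit memory makes the gap concrete; the sizes below ($|Q_B| = 1024$, $|L_B| = 3000$, full label space of 1.3M) are illustrative values drawn from the ranges quoted above:

```python
# Per-batch logit memory in fp32: full softmax over all labels vs the PSL pool.
Q_B, L_full, L_pool = 1024, 1_300_000, 3000
bytes_full = Q_B * L_full * 4      # full-softmax logit matrix, 4 bytes per entry
bytes_psl = Q_B * L_pool * 4       # PSL reduced logit matrix
print(f"full softmax: {bytes_full / 2**30:.1f} GiB of logits per batch")
print(f"PSL pool:     {bytes_psl / 2**20:.1f} MiB of logits per batch")
```

The logit matrix alone shrinks by a factor of $L/|L_B| \approx 433$ here, before even counting the activations and gradients that scale with it.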

UniDEC, implementing unified DE and classifier heads with the PSL loss, trains end-to-end on a single NVIDIA A6000 GPU for benchmarks that previously required up to 16 A100 GPUs (as with DEXML) or multiple V100s (as with Renée). Empirically observed reductions in GPU resources and wall-clock time are in the $4\times$–$16\times$ range, with precision-at-$K$ performance matching or exceeding the state of the art (Kharbanda et al., 2024).

6. Empirical Benchmarks and Performance

On public XMC benchmarks, such as LF-Amazon-131K, LF-WikiSeeAlso-320K, LF-AmazonTitles-131K, and LF-AmazonTitles-1.3M, UniDEC demonstrates the following:

| Benchmark | P@1 (UniDEC) | P@1 (Best Prior) | GPUs (Prior) |
| --- | --- | --- | --- |
| LF-Amazon-131K | 48.00 | 48.05 (Renée) | 4×V100 |
| LF-WikiSeeAlso-320K | 47.74 | 47.11 (Dexa(Ψ)) | |
| LF-AmazonTitles-1.3M | 53.79 | 55.76 (Dexa(Ψ)) | 8×V100 |

SupconDE with PSL also outperforms NGAME(Φ) and is competitive with Dexa(Φ) despite reduced model depth and embedding dimension. In industrial settings, such as Query2Bid (450M labels), SupconDE achieves P@1=87.33, outperforming NGAME and SimCSE with favorable A/B test results on impression yield and coverage metrics (Kharbanda et al., 2024).

7. Synthesis and Innovations

UniDEC introduces several advances for scalable XMC:

  • Unified, end-to-end trainable DE and probe classifier architecture in which gradients from both heads jointly optimize shared parameters.
  • Adoption of multi-class PAL-N loss reductions via PSL sampling to enable efficient, memory-scalable optimization.
  • Abandonment of meta-classifiers and multi-stage pipelines, obviating multi-step training.
  • Demonstrated resource efficiency, allowing training with orders-of-magnitude lower hardware requirements than prior SOTA, while matching or exceeding their precision-at-$K$.

A plausible implication is that PSL-based unified DE and probe classifier frameworks present a feasible path for scaling XMC systems to industrially sized label spaces on commodity hardware, while maintaining robust empirical performance (Kharbanda et al., 2024).
