Dual-Encoder & Probe Classifiers for XMC
- Dual-encoder and probe-classifier architectures map queries and labels into a shared embedding space to rank relevant labels in extremely large label sets.
- Unified loss functions combined with pick-some-labels (PSL) sampling reduce computational costs by focusing on a narrowed label pool while maintaining high precision.
- Empirical benchmarks demonstrate these methods achieve state-of-the-art performance and significant resource efficiency, enabling scalable XMC on commodity hardware.
Dual-encoder (DE) architectures and probe classifiers are critical components for extreme multi-label classification (XMC), where the task is to predict a subset of relevant labels from an extremely large label space by leveraging both input queries and label text features. Traditionally, models have employed dual encoders to map queries and labels into a shared embedding space, often complemented by one-vs-all (OvA) classifiers that rerank shortlisted labels. This paradigm achieves strong empirical performance but has historically suffered from prohibitive computational costs due to the need for loss evaluation across massive label spaces. Recent work such as UniDEC addresses these challenges by unifying dual-encoder and classifier training with multi-class loss formulations and novel sampling strategies that significantly improve resource efficiency and scalability (Kharbanda et al., 2024).
1. Dual-Encoder Architecture for Extreme Multi-Label Classification
A dual-encoder framework for XMC begins with a shared backbone text encoder, denoted $\mathcal{E}_\theta$, which maps any textual input $x$ (query or label) into a $d$-dimensional space: $\mathcal{E}_\theta(x) \in \mathbb{R}^d$. Two projection heads further process this output:
- DE head: a linear projection followed by L2 normalization to yield a unit-norm embedding $\hat{x}$.
- Classifier head: a linear projection yielding $\bar{x}$, without normalization.
In the DE tower, both input queries $q$ and label texts $l$ are encoded via the shared backbone and the DE head, producing normalized embeddings $\hat{q}$ and $\hat{l}$. Similarity is defined by the inner product $s(q, l) = \hat{q}^\top \hat{l}$.
The probe (one-vs-all) classifier head utilizes a learnable lookup matrix $W \in \mathbb{R}^{L \times d}$, where each row $w_l$ serves as the classification weight vector for label $l$. Query representations $\bar{q}$ are scored against $W$ via $\bar{q}^\top w_l$.
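The two-head architecture above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the random `enc_q`/`enc_l` matrices stand in for the shared text encoder's outputs, and all dimensions are toy-sized.

```python
import numpy as np

rng = np.random.default_rng(0)
d_enc, d = 8, 4          # encoder output dim, projection dim (toy sizes)
L, B = 6, 3              # number of labels, batch of queries

# Stand-in for the shared backbone encoder's outputs on queries and label texts.
enc_q = rng.normal(size=(B, d_enc))
enc_l = rng.normal(size=(L, d_enc))

W_de = rng.normal(size=(d_enc, d))
def de_head(x):
    # DE head: linear projection followed by L2 normalization.
    z = x @ W_de
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

W_clf = rng.normal(size=(d_enc, d))
def clf_head(x):
    # Classifier head: linear projection, no normalization.
    return x @ W_clf

q_hat, l_hat = de_head(enc_q), de_head(enc_l)   # normalized embeddings
S_de = q_hat @ l_hat.T                          # DE similarities, each in [-1, 1]

# Probe (OvA) classifier: learnable per-label weight matrix W in R^{L x d}.
W_ova = rng.normal(size=(L, d))
S_clf = clf_head(enc_q) @ W_ova.T               # classifier scores, shape (B, L)
```

Because both heads share the backbone, gradients from the DE similarities and the classifier scores jointly update the same encoder parameters during training.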
2. Unified Loss Functions and Pick-Some-Labels (PSL) Reduction
Exact loss computation over the entire label space, such as with the pick-all-labels (PAL-N) loss,
$$\mathcal{L}_{\mathrm{PAL}} = -\frac{1}{B}\sum_{i=1}^{B} \frac{1}{|P_i|} \sum_{l \in P_i} \log \frac{\exp(s(q_i, l))}{\sum_{l' \in [L]} \exp(s(q_i, l'))},$$
is intractable for label sizes $L \sim 10^6$. The PSL reduction addresses this by sampling a subset of labels per batch. For each query, up to $\mu$ positives and $\nu$ hard negatives (from an external ANNS index) are selected, and all in-batch positives for other queries are included. The resulting label pool $\mathcal{L}_B$ typically includes $1500$–$4000$ labels.
Within this reduced pool, multi-class cross-entropy losses are computed for both DE and classifier heads in symmetric “query-to-label” and “label-to-query” directions:
- For the DE head, the PSL loss in the query-to-label direction is
$$\mathcal{L}_{q \to l}^{\mathrm{DE}} = -\frac{1}{B}\sum_{i=1}^{B} \frac{1}{|P_i|} \sum_{l \in P_i} \log \frac{\exp(\hat{q}_i^\top \hat{l})}{\sum_{l' \in \mathcal{L}_B} \exp(\hat{q}_i^\top \hat{l}')};$$
analogous forms apply for the label-to-query direction $\mathcal{L}_{l \to q}^{\mathrm{DE}}$ and for the classifier losses $\mathcal{L}_{q \to l}^{\mathrm{CLF}}$, $\mathcal{L}_{l \to q}^{\mathrm{CLF}}$.
- The total unified loss is
$$\mathcal{L} = \mathcal{L}_{\mathrm{DE}} + \lambda\, \mathcal{L}_{\mathrm{CLF}},$$
where each $\mathcal{L}_{\mathrm{DE}}$ and $\mathcal{L}_{\mathrm{CLF}}$ is itself a symmetric combination ($\mathcal{L}_{\mathrm{DE}} = \tfrac{1}{2}(\mathcal{L}_{q \to l}^{\mathrm{DE}} + \mathcal{L}_{l \to q}^{\mathrm{DE}})$, $\mathcal{L}_{\mathrm{CLF}} = \tfrac{1}{2}(\mathcal{L}_{q \to l}^{\mathrm{CLF}} + \mathcal{L}_{l \to q}^{\mathrm{CLF}})$).
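A minimal sketch of the multi-class PSL cross-entropy and its symmetric combination, assuming scores over the reduced pool are already computed. In the label-to-query direction, only pool labels with at least one in-batch positive are scored (the toy masks below satisfy this by construction); this is an illustrative simplification, not the paper's code.

```python
import numpy as np

def psl_loss(S, pos_mask):
    """Pick-some-labels multi-class cross-entropy over a reduced pool.
    S: (rows, cols) score matrix; pos_mask: boolean matrix, True where
    the column item is a positive for the row item."""
    S = S - S.max(axis=1, keepdims=True)                   # numerical stability
    log_p = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    # Average log-probability over each row's positives, then over rows.
    per_row = np.where(pos_mask, log_p, 0.0).sum(1) / pos_mask.sum(1)
    return -per_row.mean()

# Toy batch: 2 queries scored against a pool of 3 labels.
S_q2l = np.array([[9.0, 0.0, -9.0],
                  [0.0, 8.0,  1.0]])
P = np.array([[True,  False, False],
              [False, True,  True]])   # every pool label is positive somewhere

# Symmetric "query-to-label" / "label-to-query" combination.
loss = 0.5 * (psl_loss(S_q2l, P) + psl_loss(S_q2l.T, P.T))
```

The same function serves both the DE and classifier heads; only the score matrix fed in (normalized-embedding inner products versus classifier logits) differs.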
3. PSL Sampling and In-Batch Optimization Strategy
PSL-based training minimizes memory and computation via focused supervision. At the start of each epoch (or at regular intervals), a global ANNS index over the label embeddings $\hat{l}$ is refreshed to facilitate mining of hard negatives. For each batch:
- Up to $\mu$ positives and $\nu$ hard negatives are sampled per query.
- The union of these, along with in-batch positives for all queries, defines the pool $\mathcal{L}_B$.
- Losses $\mathcal{L}_{\mathrm{DE}}$ and $\mathcal{L}_{\mathrm{CLF}}$ are computed over this reduced pool.
- The entire model—encoder, both projection heads, and probe classifier—is updated end-to-end with a single backward pass.
Typical batch sizes reach $2048$, and $|\mathcal{L}_B|$ reaches $4000$. This strategy automatically provides in-batch negatives (from positives of other queries) and keeps per-batch memory and compute practical even as $L$ grows into the millions.
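The pool-construction step above can be sketched as a small helper. This is a simplified illustration under stated assumptions: `positives` and `hard_negatives` are hypothetical dicts mapping each query id to ranked label-id lists (the latter standing in for ANNS-mined candidates).

```python
def build_label_pool(positives, hard_negatives, mu=2, nu=2):
    """Union of (up to mu) positives and (up to nu) hard negatives per
    query; positives of other queries in the batch automatically serve
    as in-batch negatives for each query."""
    pool = set()
    for q, pos in positives.items():
        pool.update(pos[:mu])                       # up to mu positives
        pool.update(hard_negatives.get(q, [])[:nu]) # up to nu hard negatives
    return sorted(pool)

pool = build_label_pool(
    positives={0: [10, 11, 12], 1: [20]},
    hard_negatives={0: [30, 31, 32], 1: [40]},
)
```

Scoring a batch against `pool` instead of all $L$ labels is what keeps the per-batch cost bounded regardless of the total label-space size.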
4. Inference and Indexing
At inference time, an approximate nearest neighbor search (ANNS) index is constructed over concatenated label representations $[\hat{l}\,;\,w_l]$. Each incoming query is likewise represented by the concatenation $[\hat{q}\,;\,\bar{q}]$ and queried against this structure using top-$k$ retrieval. The retrieved labels are returned as predictions. This unified representation leverages both DE and probe-classifier features for robust label ranking.
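A brute-force stand-in for the concatenated-representation retrieval, assuming the DE embeddings and classifier weights are precomputed; a real system would build an ANNS index (e.g. HNSW) over `label_repr` instead of exact search.

```python
import numpy as np

def topk_labels(q_de, q_clf, l_de, W_ova, k=3):
    """Rank labels by inner product between concatenated query and
    label representations; exact search as a stand-in for ANNS."""
    label_repr = np.concatenate([l_de, W_ova], axis=1)   # [l_hat ; w_l], (L, 2d)
    query_repr = np.concatenate([q_de, q_clf], axis=1)   # [q_hat ; q_bar], (B, 2d)
    scores = query_repr @ label_repr.T                   # (B, L)
    return np.argsort(-scores, axis=1)[:, :k]            # top-k label ids per query

# Toy example with d = 2 and L = 3 labels.
l_de  = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
W_ova = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
preds = topk_labels(np.array([[1.0, 0.0]]), np.array([[1.0, 0.0]]),
                    l_de, W_ova, k=2)
```

Concatenation makes the combined score the sum of the DE similarity and the classifier logit, so both signals contribute to a single retrieval pass.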
5. Computational Complexity and Resource Efficiency
The computational cost per batch is $O(B \cdot |\mathcal{L}_B| \cdot d)$, as opposed to $O(B \cdot L \cdot d)$ for full-label training. In practice, this translates to orders-of-magnitude reductions in memory and GPU utilization compared with:
- Traditional one-vs-all (OvA) classifiers over all labels, which require score evaluation and parameter updates for every label on every batch.
- Full-softmax architectures, which are infeasible because the normalization term runs over the entire label space.
UniDEC, implementing unified DE and classifier heads with PSL loss, operates end-to-end on a single NVIDIA A6000 GPU for benchmarks that previously required up to 16 A100 GPUs (as with DEXML) or multiple V100s (as with Reneé). Empirically observed reductions in GPU resources and wall-clock time are substantial, with precision-at-$k$ performance matching or exceeding state of the art (Kharbanda et al., 2024).
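The per-batch saving follows directly from the cost formula. A back-of-the-envelope check with illustrative values (batch size, embedding dimension, and the LF-AmazonTitles-1.3M label count, with a pool size at the upper end of the range given above):

```python
# Per-batch similarity/loss cost: scales with (batch) x (labels scored) x (dim).
B, d = 1024, 768                 # illustrative batch size and embedding dim
L_full, L_pool = 1_300_000, 4_000  # full label space vs. PSL-reduced pool

full_cost = B * L_full * d       # O(B * L * d): score every label
psl_cost  = B * L_pool * d       # O(B * |L_B| * d): score only the pool

ratio = full_cost // psl_cost    # -> 325x fewer similarity computations
```

The ratio depends only on $L / |\mathcal{L}_B|$, which is why the saving grows with the label space while the per-batch footprint stays constant.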
6. Empirical Benchmarks and Performance
On public XMC benchmarks, such as LF-Amazon-131K, LF-WikiSeeAlso-320K, LF-AmazonTitles-131K, and LF-AmazonTitles-1.3M, UniDEC demonstrates the following:
| Benchmark | P@1 (UniDEC) | P@1 (Best Prior) | GPUs (Prior) |
|---|---|---|---|
| LF-Amazon-131K | 48.00 | 48.05 (Reneé) | 4×V100 |
| LF-WikiSeeAlso-320K | 47.74 | 47.11 (Dexa(Ψ)) | — |
| LF-AmazonTitles-1.3M | 53.79 | 55.76 (Dexa(Ψ)) | 8×V100 |
SupconDE with PSL also outperforms NGAME(Φ) and is competitive with Dexa(Φ) despite reduced model depth and embedding dimension. In industrial settings, such as Query2Bid (450M labels), SupconDE achieves P@1=87.33, outperforming NGAME and SimCSE with favorable A/B test results on impression yield and coverage metrics (Kharbanda et al., 2024).
7. Synthesis and Innovations
UniDEC introduces several advances for scalable XMC:
- Unified, end-to-end trainable DE and probe classifier architecture in which gradients from both heads jointly optimize shared parameters.
- Adoption of multi-class PAL-N loss reductions via PSL sampling to enable efficient, memory-scalable optimization.
- Abandonment of meta-classifiers and multi-stage pipelines, obviating multi-step training.
- Demonstrated resource efficiency, allowing training with orders of magnitude less hardware than prior SOTA, while outperforming or equaling their precision-at-$k$.
A plausible implication is that PSL-based unified DE and probe classifier frameworks present a feasible path for scaling XMC systems to industrially sized label spaces on commodity hardware, while maintaining robust empirical performance (Kharbanda et al., 2024).