
Dual-Encoder & Probe Classifiers for XMC

Updated 21 January 2026
  • Dual-encoder and probe classifiers are architectures that map queries and labels into a shared embedding space to rank relevant labels in extremely large label sets.
  • Unified loss functions combined with PSL sampling reduce computational costs by focusing on a narrowed label pool while maintaining high precision.
  • Empirical benchmarks demonstrate these methods achieve state-of-the-art performance and significant resource efficiency, enabling scalable XMC on commodity hardware.

Dual-encoder (DE) architectures and probe classifiers are critical components for extreme multi-label classification (XMC), where the task is to predict a subset of relevant labels from an extremely large label space by leveraging both input queries and label text features. Traditionally, models have employed dual encoders to map queries and labels into a shared embedding space, often complemented by one-vs-all (OvA) classifiers that rerank shortlisted labels. This paradigm achieves strong empirical performance but has historically suffered from prohibitive computational costs due to the need for loss evaluation across massive label spaces. Recent work such as UniDEC addresses these challenges by unifying dual-encoder and classifier training with multi-class loss formulations and novel sampling strategies that significantly improve resource efficiency and scalability (Kharbanda et al., 2024).

1. Dual-Encoder Architecture for Extreme Multi-Label Classification

A dual-encoder framework for XMC begins with a shared backbone text encoder, denoted $\Phi$, which maps any textual input (query or label) $t$ into a $d_\Phi$-dimensional space: $\Phi(t) \in \mathbb{R}^{d_\Phi}$. Two projection heads further process this output:

  • DE head: $z = g_1(\Phi(t)) \in \mathbb{R}^{d}$, followed by L2 normalization to yield $\hat{z} = z/\|z\|$.
  • Classifier head: $h = g_2(\Phi(t)) \in \mathbb{R}^{d}$, without normalization.

In the DE tower, both input queries $x$ and label texts $\mathrm{label\_text}_l$ are encoded via $\Phi$ and $g_1$, producing normalized embeddings $\hat{z}_q$ and $\hat{z}_l$. Similarity is defined by the inner product $s_{q2l}(i,l) = \langle \hat{z}_{q_i}, \hat{z}_l \rangle$.

The probe (one-vs-all) classifier head utilizes a learnable lookup matrix $\Psi \in \mathbb{R}^{L \times d}$, where each row $\Psi_l$ serves as the classification weight vector for label $l$. Query representations $h_{q_i}$ are scored against $\Psi_l$ via $s_{clf}(i,l) = \langle h_{q_i}, \Psi_l \rangle$.
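The two-head design above can be sketched in a few lines of NumPy. The toy `phi` encoder, the random projection matrices, and all sizes below are illustrative stand-ins (the actual system uses a transformer backbone), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_phi, d, L = 768, 256, 1000        # encoder dim, projection dim, label count (toy sizes)

def phi(text: str) -> np.ndarray:
    """Toy stand-in for the shared backbone encoder Phi (a transformer in practice):
    hashes tokens into a d_phi-dimensional bag-of-words vector."""
    v = np.zeros(d_phi)
    for tok in text.split():
        v[hash(tok) % d_phi] += 1.0
    return v

# Two projection heads on top of the shared backbone output.
G1 = rng.standard_normal((d, d_phi)) / np.sqrt(d_phi)   # DE head g_1
G2 = rng.standard_normal((d, d_phi)) / np.sqrt(d_phi)   # classifier head g_2

def de_embed(text: str) -> np.ndarray:
    """z_hat = g_1(Phi(t)) / ||g_1(Phi(t))||  (L2-normalized DE embedding)."""
    z = G1 @ phi(text)
    return z / np.linalg.norm(z)

def clf_embed(text: str) -> np.ndarray:
    """h = g_2(Phi(t)), deliberately left unnormalized."""
    return G2 @ phi(text)

Psi = rng.standard_normal((L, d))   # learnable lookup matrix: row l scores label l

query = "wireless noise cancelling headphones"
s_q2l = de_embed(query) @ de_embed("bluetooth headphones")  # DE similarity s_q2l
s_clf = clf_embed(query) @ Psi[42]                          # probe-classifier score s_clf
```

Note that only the DE embeddings are normalized, so $s_{q2l}$ is a cosine similarity in $[-1,1]$, while the classifier score is an unbounded inner product.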

2. Unified Loss Functions and Pick-Some-Labels (PSL) Reduction

Exact loss computation over the entire label space, such as with the PAL-N loss,

$$L_{\mathrm{PAL\text{-}N}}(x_i) = -\frac{1}{|P_i|} \sum_{l \in P_i} \log\left(\frac{\exp(s(i,l)/\tau)}{\sum_{k=1}^{L} \exp(s(i,k)/\tau)}\right)$$

is intractable for label sizes $L \sim 10^6$. The PSL reduction addresses this by sampling a subset of labels per batch. For each query, up to $\beta$ positives and $\eta$ hard negatives (from an external ANNS index) are selected, and all in-batch positives for other queries are included. The resulting label pool $L_B$ typically includes 1500–4000 labels.

Within this reduced pool, multi-class cross-entropy losses are computed for both DE and classifier heads in symmetric “query-to-label” and “label-to-query” directions:

  • For the DE head, PSL loss is

$$L_{q2l}^{DE} = \sum_{i \in Q_B} \left[-\frac{1}{|\mathcal{P}_i^B|} \sum_{p \in \mathcal{P}_i^B} \log \frac{\exp(\langle \hat{z}_{q_i}, \hat{z}_{l_p}\rangle/\tau)}{\sum_{l \in L_B} \exp(\langle \hat{z}_{q_i}, \hat{z}_l\rangle/\tau)}\right]$$

with analogous forms for $L_{l2q}^{DE}$ and for the classifier losses.

  • The total unified loss is

$$L = \lambda \cdot L_{DE} + (1-\lambda) \cdot L_{clf}$$

where $L_{DE}$ and $L_{clf}$ are each themselves symmetric combinations of their query-to-label and label-to-query terms, weighted equally ($\lambda_{DE} = \lambda_{clf} = 0.5$).
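A minimal sketch of the reduced multi-class cross-entropy is shown below, assuming the score matrices over the pooled labels have already been computed by the two heads; the symmetric $l2q$ terms would reuse the same function with the roles of queries and pooled labels swapped. The batch/pool sizes, temperature, and toy positives are illustrative only:

```python
import numpy as np

def psl_multiclass_loss(S, pos_mask, tau=0.05):
    """Multi-class cross-entropy over the reduced pool L_B only.
    S: (|Q_B|, |L_B|) score matrix for one head and one direction.
    pos_mask: boolean matrix of in-pool positives (each row has at least one)."""
    logits = S / tau
    m = logits.max(axis=1, keepdims=True)                  # stabilizer for logsumexp
    log_Z = np.log(np.exp(logits - m).sum(axis=1, keepdims=True)) + m
    log_p = logits - log_Z                                 # row-wise log-softmax
    return float((-(log_p * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)).sum())

rng = np.random.default_rng(0)
Q, LB, lam = 4, 16, 0.5                  # toy batch size, pool size, mixing weight
S_de = rng.standard_normal((Q, LB))      # DE-head scores  <z_hat_q, z_hat_l>
S_clf = rng.standard_normal((Q, LB))     # classifier-head scores <h_q, Psi_l>
pos = np.zeros((Q, LB), dtype=bool)
pos[np.arange(Q), np.arange(Q)] = True   # pretend pooled label i is query i's positive

# L = lambda * L_DE + (1 - lambda) * L_clf
total = lam * psl_multiclass_loss(S_de, pos) + (1 - lam) * psl_multiclass_loss(S_clf, pos)
```

With uniform scores the per-query loss reduces to $\log |L_B|$, which is a handy sanity check when implementing the reduction.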

3. PSL Sampling and In-Batch Optimization Strategy

PSL-based training minimizes memory and computation via focused supervision. At the start of each epoch, or at regular intervals, a global ANNS index over the label embeddings $\{\hat{z}_l\}$ is refreshed to facilitate hard-negative mining. For each batch:

  • Up to $\beta$ positives and $\eta$ hard negatives are sampled per query.
  • The union of these, along with in-batch positives for all queries, defines $L_B$.
  • Losses $L_{DE}$ and $L_{clf}$ are computed over this reduced pool.
  • The entire model—encoder, both projection heads, and probe classifier—is updated end-to-end with a single backward pass.

Typical batch sizes are $|Q_B| = 512$–$2048$ and $|L_B| = 1500$–$4000$. This strategy automatically provides in-batch negatives (drawn from the positives of other queries) and keeps per-batch memory and compute practical even as the total label count reaches $L \approx 1.3\,\mathrm{M}$.
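The pool-construction step above can be sketched as follows. The `positives` and `anns_negatives` mappings are hypothetical inputs standing in for the ground-truth label lists and for hits from the periodically refreshed ANNS index:

```python
import numpy as np

def build_label_pool(batch_queries, positives, anns_negatives, beta=3, eta=2, seed=0):
    """Assemble the reduced PSL label pool L_B for one batch.
    positives[q]      : ground-truth label ids for query q.
    anns_negatives[q] : hard-negative label ids for q, assumed mined from the
                        periodically refreshed ANNS index (hypothetical input)."""
    rng = np.random.default_rng(seed)
    pool = set()
    for q in batch_queries:
        pos = positives[q]
        sampled = rng.choice(pos, size=min(beta, len(pos)), replace=False)
        pool.update(int(l) for l in sampled)     # up to beta positives per query
        pool.update(anns_negatives[q][:eta])     # up to eta hard negatives per query
    # The union across queries already contains every query's in-batch positives,
    # so each query sees the other queries' positives as extra negatives.
    return sorted(pool)

positives = {0: [10, 11, 12], 1: [20], 2: [10, 30]}
hard_negs = {0: [99, 98], 1: [97, 96], 2: [95, 94]}
L_B = build_label_pool([0, 1, 2], positives, hard_negs, beta=2, eta=1)
```

Scores and losses are then computed only over the columns indexed by `L_B`, which is what keeps the per-batch logit matrix small.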

4. Inference and Indexing

At inference time, an approximate nearest neighbor search (ANNS) index is constructed over concatenated label representations $\{\hat{z}_l \oplus \mathrm{normalize}(\Psi_l)\}$. For each incoming query, the concatenated representation $\hat{z}_q \oplus \mathrm{normalize}(h_q)$ is queried against this structure using top-$K$ retrieval, and the retrieved labels are returned as predictions. This unified representation leverages both DE and probe-classifier features for robust label ranking.
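A brute-force NumPy stand-in for the concatenated-representation retrieval is given below; a production system would replace the exact `argsort` with an ANNS library (e.g., HNSW- or FAISS-style indices), and all sizes here are toy values:

```python
import numpy as np

def l2_normalize(v):
    """Normalize along the last axis; works for both (L, d) matrices and (d,) vectors."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
L, d, K = 1000, 64, 5
z_labels = l2_normalize(rng.standard_normal((L, d)))  # DE label embeddings z_hat_l
Psi = rng.standard_normal((L, d))                     # classifier rows Psi_l

# Index over concatenated representations [z_hat_l ; normalize(Psi_l)].
index = np.concatenate([z_labels, l2_normalize(Psi)], axis=1)   # shape (L, 2d)

# Query side mirrors the label side: [z_hat_q ; normalize(h_q)].
z_q = l2_normalize(rng.standard_normal(d))
h_q = rng.standard_normal(d)
query_vec = np.concatenate([z_q, l2_normalize(h_q)])

scores = index @ query_vec            # inner products against all L labels
top_k = np.argsort(-scores)[:K]       # exact top-K; ANNS approximates this step
```

Because both halves are unit-normalized, the concatenated inner product is simply the sum of the DE cosine similarity and the normalized classifier score, which is what lets one index serve both heads.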

5. Computational Complexity and Resource Efficiency

The computational cost per batch is $O(|Q_B| \cdot \mathrm{cost}(\Phi)) + O(|L_B| \cdot \mathrm{cost}(\Phi))$. In practice, this translates to orders-of-magnitude reductions in memory and GPU utilization compared with:

  • Traditional one-vs-all (OvA) classifiers over all $L \approx 10^6$ labels, which require storage and parameter updates for every label.
  • Full-softmax architectures, which are infeasible due to the prohibitive label space.
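A back-of-envelope comparison of the per-batch logit memory makes the gap concrete; the sizes below ($|Q_B| = 1024$, $|L_B| = 3000$, full label space of 1.3M) are illustrative values drawn from the ranges quoted above:

```python
# Per-batch logit memory in fp32: full softmax over all labels vs the PSL pool.
Q_B, L_full, L_pool = 1024, 1_300_000, 3000
bytes_full = Q_B * L_full * 4      # full-softmax logit matrix, 4 bytes per entry
bytes_psl = Q_B * L_pool * 4       # PSL reduced logit matrix
print(f"full softmax: {bytes_full / 2**30:.1f} GiB of logits per batch")
print(f"PSL pool:     {bytes_psl / 2**20:.1f} MiB of logits per batch")
```

The logit matrix alone shrinks by a factor of $L/|L_B| \approx 433$ here, before even counting the activations and gradients that scale with it.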

UniDEC, implementing unified DE and classifier heads with the PSL loss, trains end-to-end on a single NVIDIA A6000 GPU for benchmarks that previously required up to 16 A100 GPUs (as with DEXML) or multiple V100s (as with Renée). Empirically observed reductions in GPU resources and wall-clock time are in the $4\times$–$16\times$ range, with precision-at-$K$ performance matching or exceeding the state of the art (Kharbanda et al., 2024).

6. Empirical Benchmarks and Performance

On public XMC benchmarks, such as LF-Amazon-131K, LF-WikiSeeAlso-320K, LF-AmazonTitles-131K, and LF-AmazonTitles-1.3M, UniDEC demonstrates the following:

| Benchmark | P@1 (UniDEC) | P@1 (Best Prior) | GPUs (Prior) |
| --- | --- | --- | --- |
| LF-Amazon-131K | 48.00 | 48.05 (Renée) | 4×V100 |
| LF-WikiSeeAlso-320K | 47.74 | 47.11 (Dexa(Ψ)) | |
| LF-AmazonTitles-1.3M | 53.79 | 55.76 (Dexa(Ψ)) | 8×V100 |

SupconDE with PSL also outperforms NGAME(Φ) and is competitive with Dexa(Φ) despite reduced model depth and embedding dimension. In industrial settings, such as Query2Bid (450M labels), SupconDE achieves P@1=87.33, outperforming NGAME and SimCSE with favorable A/B test results on impression yield and coverage metrics (Kharbanda et al., 2024).

7. Synthesis and Innovations

UniDEC introduces several advances for scalable XMC:

  • Unified, end-to-end trainable DE and probe classifier architecture in which gradients from both heads jointly optimize shared parameters.
  • Adoption of multi-class PAL-N loss reductions via PSL sampling to enable efficient, memory-scalable optimization.
  • Abandonment of meta-classifiers and multi-stage pipelines, obviating multi-step training.
  • Demonstrated resource efficiency, allowing training with orders-of-magnitude lower hardware requirements than prior SOTA, while matching or exceeding their precision-at-$K$.

A plausible implication is that PSL-based unified DE and probe classifier frameworks present a feasible path for scaling XMC systems to industrially sized label spaces on commodity hardware, while maintaining robust empirical performance (Kharbanda et al., 2024).
