
Dynamic Inverted Softmax (DIS)

Updated 31 January 2026
  • Dynamic Inverted Softmax (DIS) is an adaptive normalization technique that dynamically reweights similarities to mitigate hubness and boost retrieval accuracy in cross-modal tasks.
  • DIS leverages codeword-structured, query-dependent sampling in large-class softmax, reducing gradient bias while enabling efficient sublinear sampling.
  • DIS is integrated into end-to-end systems to improve robustness under distributional shifts, achieving significant speedups and enhanced performance in extreme classification scenarios.

Dynamic Inverted Softmax (DIS) encompasses a set of adaptive normalization techniques designed to improve both retrieval accuracy and computational efficiency in high-dimensional embedding spaces. DIS is prominent in two central application domains: Robust cross-modal retrieval with hubness mitigation as formalized in Querybank Normalisation (QB-Norm), and large-scale learning/inference acceleration via adaptive sampled softmax algorithms such as the MIDX-Sampler. Both methods employ dynamic, query-dependent mechanisms to suppress hubs selectively or to construct efficient proposal distributions for importance sampling, yielding robust performance under distributional shift and extreme data cardinality.

1. Mathematical Formulation and Derivation

The DIS mechanism was introduced in two distinct but conceptually aligned settings: embedding-space normalisation for cross-modal retrieval (Bogolin et al., 2021) and adaptive sampled softmax for extreme classification and large-scale neural models (Chen et al., 15 Jan 2025).

DIS in Cross-Modal Retrieval

Given encoders $f_q : m_q \to \mathbb{R}^C$ and $f_g : m_g \to \mathbb{R}^C$ mapping queries and gallery items into a joint $C$-dimensional space, the base similarity is

$$s_q(j) = \mathrm{sim}(f_q(q), f_g(g_j))$$

A querybank $B = \{b_i\}_{i=1}^N$ of $N$ probes is constructed. Each gallery item $g_j$ receives a probe vector $p_j(i) = \mathrm{sim}(f_q(b_i), f_g(g_j))$. The static Inverted Softmax (IS) normalises gallery similarities as

$$\eta_q^{\mathrm{IS}}(j) = \frac{\exp(\beta s_q(j))}{\sum_{i=1}^N \exp(\beta p_j(i))}$$

DIS introduces a data-driven activation set $\mathcal{A}$ comprising gallery items that receive top-$k$ votes from the querybank probes. For a given query, DIS reweights only if the top raw-similarity hit $j^* = \arg\max_j s_q(j)$ is itself a hub ($j^* \in \mathcal{A}$):

$$\eta_q^{\mathrm{DIS}}(j) = \begin{cases} \dfrac{\exp(\beta s_q(j))}{\sum_{i=1}^N \exp(\beta p_j(i))} & j^* \in \mathcal{A} \\ s_q(j) & \text{otherwise} \end{cases}$$
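
The two-branch scoring above can be sketched in NumPy on toy data. Everything here (`querybank`, `activation_set`, `beta`, `top_k`, dot-product similarity on random embeddings in place of cosine similarity on learned ones) is an illustrative assumption, not the paper's exact setup:

```python
# Minimal sketch of QB-Norm DIS scoring on toy embeddings.
import numpy as np

rng = np.random.default_rng(0)
C = 8                                   # joint embedding dimension
gallery = rng.normal(size=(50, C))      # f_g(g_j), assumed precomputed
querybank = rng.normal(size=(200, C))   # f_q(b_i), assumed precomputed
beta, top_k = 20.0, 1

# Probe similarities p_j(i) and the IS denominator, computed offline.
probe = querybank @ gallery.T                       # (N, |gallery|)
denom = np.exp(beta * probe).sum(axis=0)            # one denominator per item

# Activation set: gallery items receiving top-k votes from the probes.
votes = np.argsort(-probe, axis=1)[:, :top_k]
activation_set = set(np.unique(votes))

def dis_scores(query_emb):
    s = gallery @ query_emb                         # raw similarities s_q(j)
    j_star = int(np.argmax(s))
    if j_star in activation_set:                    # top hit is a hub: normalise
        return np.exp(beta * s) / denom
    return s                                        # otherwise keep raw scores

ranking = np.argsort(-dis_scores(rng.normal(size=C)))
```

Note that the branch decision is per-query: only queries whose top raw hit falls in the activation set pay the normalisation.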

DIS in Adaptive Sampled Softmax (MIDX-Sampler)

For large-class softmax, let queries be $z \in \mathbb{R}^D$, classes $q_i \in \mathbb{R}^D$, and scores $o_i = z^\top q_i$. The ideal softmax is

$$p(i \mid z) = \frac{\exp(o_i)}{\sum_{j=1}^N \exp(o_j)}$$

To enable efficient sampling, DIS decomposes $p(i \mid z)$ via product or residual quantization. With $K$ codewords per block and $B$ blocks, each class is mapped to a codeword pair $(k_1, k_2)$, and $\Omega_{k_1,k_2}$ contains the class indices sharing that codeword assignment. The dynamic (DIS) proposal distribution replaces the residual softmax with a uniform distribution:

$$Q_{\mathrm{midx}}(i \mid z) = P^1_z(k_1(i)) \cdot P^2_z(k_2(i) \mid k_1(i)) \cdot \frac{1}{|\Omega_{k_1,k_2}|}$$

Sampling and importance weighting are done in time sublinear in $N$.
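
The factorised proposal and its importance weights can be sketched as follows. The codebooks, the class-to-codeword assignment, and the second-level distribution (simplified here to be independent of $k_1$, dropping the residual structure) are toy stand-ins, not the paper's construction:

```python
# Hedged sketch: sampling from a factorised MIDX-style proposal
# Q(i|z) = P1(k1) * P2(k2) * 1/|Omega_{k1,k2}| with importance weights.
import numpy as np

rng = np.random.default_rng(1)
D, K, N, M = 16, 4, 1000, 32              # dim, codewords, classes, samples
code1 = rng.normal(size=(K, D))           # first-level codebook
code2 = rng.normal(size=(K, D))           # second-level codebook
assign = rng.integers(0, K, size=(N, 2))  # class i -> codeword pair (k1, k2)
classes = rng.normal(size=(N, D))         # class embeddings q_i

# Buckets Omega_{k1,k2}: classes sharing both codeword assignments.
buckets = {}
for i, (k1, k2) in enumerate(assign):
    buckets.setdefault((k1, k2), []).append(i)

def sample_proposal(z):
    p1 = np.exp(code1 @ z); p1 /= p1.sum()   # P1_z over first-level codewords
    k1 = rng.choice(K, p=p1)
    p2 = np.exp(code2 @ z); p2 /= p2.sum()   # toy P2_z, independent of k1
    k2 = rng.choice(K, p=p2)
    omega = buckets[(k1, k2)]
    i = int(rng.choice(omega))               # uniform within the bucket
    q = p1[k1] * p2[k2] / len(omega)         # Q_midx(i | z)
    return i, q

z = rng.normal(size=D)
samples = [sample_proposal(z) for _ in range(M)]

# Importance-weighted estimate of the partition function sum_j exp(o_j):
# E_{i~Q}[exp(o_i) / Q(i|z)] equals the true sum over all N classes.
Zhat = float(np.mean([np.exp(classes[i] @ z) / q for i, q in samples]))
```

Each draw costs two $K$-way samples plus one uniform draw, which is where the sublinearity in $N$ comes from.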

2. Comparison of DIS to Static Methods

DIS techniques provide targeted hub penalisation and adaptivity not present in their static analogues:

  • Static IS divides all similarities by a hubness denominator for every gallery item, regardless of its actual hub status, often over-penalising when the querybank is poor or mismatched (Bogolin et al., 2021).
  • DIS (QB-Norm) applies the softmax only when a query's top result is a flagged hub; otherwise raw scoring is retained, increasing robustness to querybanks/galleries with domain mismatch or low overlap.
  • Static proposals in sampled softmax (uniform, unigram) exhibit large KL and gradient biases, poor convergence, and slow optimisation (Chen et al., 15 Jan 2025). DIS via MIDX-Sampler achieves codeword-structured adaptivity, reducing bias and preserving computational efficiency.

3. Algorithms and Implementation

QB-Norm DIS Inference Pipeline

Precompute probe similarities and the activation set A\mathcal{A}. At inference:

for query in queries:
    s = [sim(f_q(query), f_g(g)) for g in gallery]       # raw similarities s_q(j)
    j_star = argmax(s)                                   # top raw hit
    if j_star in activation_set:                         # hub: apply IS normalisation
        eta = [exp(beta * s[j]) / denom[j] for j in range(len(gallery))]
    else:                                                # non-hub: keep raw scores
        eta = s
    ranking = argsort_descending(eta)

MIDX-Sampler DIS Workflow

Construct product/residual quantization codebooks. For each query, sample negatives per codeword distribution and evaluate the dynamic proposal Qmidx(iz)Q_{midx}(i|z):

for t in range(M):
    k1 = sample(P1_z)                        # first-level codeword
    k2 = sample(P2_z[k1])                    # second-level codeword given k1
    i = uniform_sample(Omega[(k1, k2)])      # uniform within the shared bucket
    Q = P1_z[k1] * P2_z[k1][k2] / len(Omega[(k1, k2)])
    record(i, Q)                             # sample and its proposal probability

4. Hyperparameters and Selection Criteria

Key DIS parameters determine both performance and generalisation properties:

| Hyperparameter | Role in QB-Norm DIS (Bogolin et al., 2021) | Role in MIDX-Sampler DIS (Chen et al., 15 Jan 2025) |
|---|---|---|
| Inverse temperature $\beta$ | Controls peakiness; optimal around 20 (or $1/1.99$ for CLIP models) | Implicit in codeword inner-product capacity |
| Querybank size $N$ | 5000–10,000 probes; saturates at $N \sim 10{,}000$ | Codebook size $K$ (balancing distortion and speed) |
| Top-$k$ selection | $k = 1$ gives maximal robustness | Not applicable; uniform sampling at codeword level |
| Similarity measure | Cosine similarity in $\mathbb{R}^C$ | Inner product in $\mathbb{R}^D$ |

Selection is typically done via held-out validation; increasing NN or KK reduces hubness and quantization distortion but incurs additional memory.
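
A held-out sweep for $\beta$ could look like the following sketch, where `recall_at_1`, the candidate grid, and the validation data are all illustrative placeholders:

```python
# Toy sketch of selecting beta on a held-out validation split.
import numpy as np

def recall_at_1(scores, labels):
    """Fraction of queries whose top-ranked item is the ground truth."""
    return float(np.mean(np.argmax(scores, axis=1) == labels))

rng = np.random.default_rng(2)
val_sims = rng.normal(size=(100, 50))        # raw query-gallery similarities
denom = np.exp(rng.normal(size=50)) + 1.0    # precomputed IS denominators
labels = rng.integers(0, 50, size=100)       # ground-truth gallery indices

# Pick the beta that maximises held-out Recall@1 under IS normalisation.
best_recall, best_beta = max(
    (recall_at_1(np.exp(b * val_sims) / denom, labels), b)
    for b in (1.0, 5.0, 10.0, 20.0, 40.0)
)
```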

5. Integration in End-to-End Systems

QB-Norm Pipeline

  • Training: Encoders fqf_q, fgf_g learned via standard ranking/contrastive loss.
  • Indexing: Gallery embeddings and querybank fixed; probe similarities and activation set precomputed.
  • Inference: Queries scored and ranked via DIS; no retraining or parameter updates required.

Sampled Softmax for Large-Scale Learning

  • Index construction: Product/residual quantization codebooks computed offline.
  • Proposal decomposition: DIS enables sublinear sampling and softmax approximation, reducing both bias and compute.
  • Training/inference: Gradient estimation and prediction performed on dynamically sampled negatives.

6. Empirical Performance and Robustness

  • Video-text retrieval (MSR-VTT): R@1 improves from 29.6 → 33.3; CLIP2Video R@1 improves 45.6 → 47.2.
  • Recall increases: across six video-text datasets, all recall metrics rise by 1–5 points.
  • Image/text/audio tasks: CLIP zero-shot on MSCOCO R@1 improves 37.8 → 41.4; text-audio retrieval rises 23.1 → 23.9.
  • Hubness reduction: on MSR-VTT, $k$-occurrence skewness drops 0.94 → 0.51.
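
The hubness statistic quoted above is the skewness of the $k$-occurrence distribution: how often each gallery item appears among queries' top-$k$ neighbours. A minimal sketch, with `k` and the random similarity matrix as illustrative placeholders:

```python
# Sketch of the k-occurrence skewness hubness measure.
import numpy as np

def k_occurrence_skewness(sims, k=10):
    """sims: (num_queries, num_gallery) similarity matrix."""
    topk = np.argsort(-sims, axis=1)[:, :k]          # k nearest gallery items
    counts = np.bincount(topk.ravel(), minlength=sims.shape[1])
    c = counts - counts.mean()
    return float((c ** 3).mean() / (c ** 2).mean() ** 1.5)

rng = np.random.default_rng(3)
sims = rng.normal(size=(500, 200))
skew = k_occurrence_skewness(sims)   # small for hub-free random scores;
                                     # large positive values indicate hubs
```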

Robustness to Distributional Shift:

  • DIS preserves baseline performance under far-domain querybanks, unlike static IS or CSLS, which collapse (Bogolin et al., 2021).
  • Ablations show smooth degradation away from $\beta \sim 20$, querybank-size saturation, and maximum robustness at $k = 1$.

MIDX-Sampler Efficiency and Convergence:

  • Sampling cost: $O(KD + K^2 + M)$, sublinear in the number of classes.
  • Bias/convergence: KL divergence and gradient bias are provably bounded; smaller residual distortion $\|\tilde o\|_\infty$ yields better generalisation and faster convergence.
  • Empirical speedup: MIDX-Sampler matches full softmax in perplexity, ranking, and recall with 5×–10× sampling speedups and 10–100× memory improvements over kernel methods.

7. Context, Significance, and Current Challenges

Dynamic Inverted Softmax provides modular, data-driven normalisation strategies that bypass limitations of static global transformations. In retrieval, DIS robustly demotes hubs only where necessary, preserving performance under query/resource limitations and domain mismatch. In sampled softmax, DIS enables scalable, theoretically justified learning for extreme class cardinality. These mechanisms require no retraining or backpropagation through the normaliser, making them practical augmentations for existing neural models.

A plausible implication is that future extensions may leverage learnable codeword assignment and additional adaptivity in sampling or hub detection to further reduce bias and accelerate convergence. Current limitations include memory footprint for large querybanks/codebooks and sensitivity to quantization or probe selection. Nevertheless, the empirical consistency and theoretical foundation make DIS central to both retrieval and extreme learning architectures (Bogolin et al., 2021, Chen et al., 15 Jan 2025).
