Dynamic Inverted Softmax (DIS)
- Dynamic Inverted Softmax (DIS) is an adaptive normalization technique that dynamically reweights similarities to mitigate hubness and boost retrieval accuracy in cross-modal tasks.
- DIS leverages codeword-structured, query-dependent sampling in large-class softmax, reducing both bias and computational cost through efficient sublinear sampling.
- DIS is integrated into end-to-end systems to improve robustness under distributional shifts, achieving significant speedups and enhanced performance in extreme classification scenarios.
Dynamic Inverted Softmax (DIS) encompasses a set of adaptive normalization techniques designed to improve both retrieval accuracy and computational efficiency in high-dimensional embedding spaces. DIS is prominent in two central application domains: robust cross-modal retrieval with hubness mitigation, as formalised in Querybank Normalisation (QB-Norm), and large-scale learning/inference acceleration via adaptive sampled-softmax algorithms such as the MIDX-Sampler. Both methods employ dynamic, query-dependent mechanisms, either to suppress hubs selectively or to construct efficient proposal distributions for importance sampling, yielding robust performance under distributional shift and extreme data cardinality.
1. Mathematical Formulation and Derivation
The DIS mechanism was introduced in two distinct but conceptually aligned modalities: embedding space normalisation for cross-modal retrieval (Bogolin et al., 2021) and adaptive sampled softmax for extreme classification and large-scale neural models (Chen et al., 15 Jan 2025).
DIS in Cross-Modal Retrieval
Given encoders $f_q$ and $f_g$ mapping queries and gallery items into a joint $d$-dimensional space, the base similarity is

$$s_{ij} = \operatorname{sim}\big(f_q(q_i), f_g(g_j)\big),$$

typically cosine similarity.
A querybank $\{p_1, \dots, p_B\}$ of $B$ probes is constructed. Each gallery item $g_j$ receives a probe-similarity vector $\big(s(p_1, g_j), \dots, s(p_B, g_j)\big)$. The static Inverted Softmax (IS) normalises gallery similarities as

$$\eta_{ij} = \frac{\exp(\beta\, s_{ij})}{\sum_{b=1}^{B} \exp\big(\beta\, s(p_b, g_j)\big)}.$$
DIS introduces a data-driven activation set $\mathcal{A}$ comprising gallery items receiving top-$k$ votes by the querybank probes. For a given query $q_i$, DIS reweights only if the top raw-similarity hit $j^\star = \arg\max_j s_{ij}$ is itself a hub ($j^\star \in \mathcal{A}$):

$$\eta_{ij} = \begin{cases} \dfrac{\exp(\beta\, s_{ij})}{\sum_{b=1}^{B} \exp\big(\beta\, s(p_b, g_j)\big)} & \text{if } j^\star \in \mathcal{A}, \\[2ex] s_{ij} & \text{otherwise.} \end{cases}$$
DIS in Adaptive Sampled Softmax (MIDX-Sampler)
For large-class softmax, let queries be $q \in \mathbb{R}^d$, classes $i \in \{1, \dots, N\}$, and scores $o_i(q)$. The ideal softmax is

$$P(i \mid q) = \frac{\exp\big(o_i(q)\big)}{\sum_{j=1}^{N} \exp\big(o_j(q)\big)}.$$
To enable efficient sampling, DIS decomposes $P(i \mid q)$ via product or residual quantization. With $K$ codewords per block and $m$ blocks (taking $m = 2$ here, matching the sampler below), each class $i$ is mapped to a codeword pair $(k_1(i), k_2(i))$, and $\Omega_{k_1, k_2}$ contains the class indices sharing codeword assignments $(k_1, k_2)$. The dynamic (DIS) proposal distribution replaces the residual softmax by a uniform:

$$Q(i \mid q) = P_1(k_1 \mid q)\, P_2(k_2 \mid q, k_1)\, \frac{1}{|\Omega_{k_1, k_2}|}.$$
Sampling and importance weighting are then performed in time sublinear in the number of classes $N$.
2. Comparison of DIS to Static Methods
DIS techniques provide targeted hub penalisation and adaptivity not present in their static analogues:
- Static IS divides all similarities by a hubness denominator for every gallery item, regardless of actual hub status, often over-penalising when the querybank is poor or mismatched (Bogolin et al., 2021).
- DIS (QB-Norm) applies the softmax normalisation only when a query's top result is a flagged hub; otherwise, raw scoring is retained, increasing robustness to querybank/gallery domain mismatch or low overlap.
- Static proposals in sampled softmax (uniform, unigram) exhibit large KL and gradient biases, poor convergence, and slow optimisation (Chen et al., 15 Jan 2025). DIS via MIDX-Sampler achieves codeword-structured adaptivity, reducing bias and preserving computational efficiency.
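To make the bias gap concrete, the following sketch compares the KL divergence from a true softmax to a static uniform proposal versus a codeword-structured one. The data are synthetic: a single codebook, made-up per-codeword score offsets, and a small residual term stand in for learned quantized embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 1000, 4                       # classes, codewords (one codebook for brevity)
code = rng.integers(0, K, size=N)    # codeword assignment per class
# class scores: a shared per-codeword component plus a small residual
offsets = np.array([0.0, 2.0, 4.0, 6.0])
scores = offsets[code] + 0.1 * rng.standard_normal(N)

p = np.exp(scores - scores.max()); p /= p.sum()        # true softmax

q_unif = np.full(N, 1.0 / N)                           # static uniform proposal

# structured proposal: softmax over codeword offsets, uniform within a codeword
cw = np.exp(offsets - offsets.max()); cw /= cw.sum()
counts = np.bincount(code, minlength=K)
q_struct = cw[code] / counts[code]

kl = lambda a, b: float(np.sum(a * np.log(a / b)))
print(kl(p, q_unif), kl(p, q_struct))   # structured proposal is far closer to p
```

Because the structured proposal tracks the dominant per-codeword component of the scores, its KL gap (and hence the sampling bias) is much smaller than the uniform proposal's.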
3. Algorithms and Implementation
QB-Norm DIS Inference Pipeline
Precompute probe similarities and the activation set . At inference:
```
for query in queries:
    s = [sim(f_q(query), f_g(g_j)) for g_j in gallery]
    j_star = argmax(s)
    if j_star in activation_set:
        # top raw-similarity hit is a hub: apply inverted-softmax reweighting
        eta = [exp(beta * s[j]) / denom[j] for j in range(len(gallery))]
    else:
        # otherwise keep raw similarities unchanged
        eta = s
    ranking = argsort_descending(eta)
```
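A self-contained NumPy version of this pipeline is sketched below. All embeddings are random placeholders, and the top-decile vote rule used to build the activation set is an illustrative stand-in for the paper's hub criterion.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_gallery, n_bank, beta = 8, 50, 200, 20.0

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

gallery = unit(rng.standard_normal((n_gallery, d)))
querybank = unit(rng.standard_normal((n_bank, d)))

# offline stage: probe similarities, IS denominators, activation (hub) set
probe_sim = querybank @ gallery.T                      # (n_bank, n_gallery)
denom = np.exp(beta * probe_sim).sum(axis=0)           # per-item IS normaliser
votes = np.bincount(probe_sim.argmax(axis=1), minlength=n_gallery)  # top-1 votes
activation_set = set(np.argsort(-votes)[: n_gallery // 10])  # top-decile stand-in

def dis_rank(query):
    s = gallery @ unit(query)
    if int(np.argmax(s)) in activation_set:            # top hit is a hub: normalise
        eta = np.exp(beta * s) / denom
    else:                                              # otherwise keep raw scores
        eta = s
    return np.argsort(-eta)

ranking = dis_rank(rng.standard_normal(d))
```

Note that `denom` and `activation_set` are computed once offline, so per-query cost is one similarity sweep plus, at most, one elementwise normalisation.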
MIDX-Sampler DIS Workflow
Construct product/residual quantization codebooks offline. For each query, sample negatives from the codeword distributions and evaluate the dynamic proposal $Q(i \mid q)$:
```
for t in range(M):
    k1 = sample(P1_z)                       # first-level codeword
    k2 = sample(P2_z_given_k1)              # second-level codeword, conditioned on k1
    i = uniform(Omega[k1][k2])              # uniform draw within the shared-codeword set
    Q = P1_z[k1] * P2_z_given_k1[k2] / len(Omega[k1][k2])
    record(i, Q)                            # sample and its proposal probability
```
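The same workflow can be sketched end-to-end in NumPy. The codeword assignments and the distributions `P1`/`P2` below are random placeholders for quantities a real implementation would derive from quantized class embeddings and codeword inner products; a full implementation would also renormalise over non-empty buckets.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, M = 512, 8, 1000           # classes, codewords per codebook, samples
k1_of = rng.integers(0, K, N)    # first-codebook assignment per class
k2_of = rng.integers(0, K, N)    # second-codebook assignment per class

# query-dependent codeword distributions (placeholders for softmaxes over
# codeword inner products with the query)
P1 = rng.dirichlet(np.ones(K))
P2 = rng.dirichlet(np.ones(K), size=K)   # P2[k1] = P(k2 | k1)

# bucket classes by their codeword pair: the sets Omega_{k1,k2}
buckets = {}
for i in range(N):
    buckets.setdefault((k1_of[i], k2_of[i]), []).append(i)

samples, weights = [], []
for _ in range(M):
    k1 = rng.choice(K, p=P1)
    k2 = rng.choice(K, p=P2[k1])
    omega = buckets.get((k1, k2))
    if not omega:                 # empty bucket: skip (renormalise in practice)
        continue
    i = omega[rng.integers(len(omega))]
    Q = P1[k1] * P2[k1][k2] / len(omega)   # proposal prob, used as importance weight
    samples.append(int(i)); weights.append(float(Q))
```

Each draw touches only the two codebooks and one bucket, which is what makes the per-sample cost independent of $N$.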
4. Hyperparameters and Selection Criteria
Key DIS parameters determine both performance and generalisation properties:
| Hyperparameter | Role in QB-Norm DIS (Bogolin et al., 2021) | Role in MIDX-Sampler DIS (Chen et al., 15 Jan 2025) |
|---|---|---|
| Inverse temperature ($\beta$) | Controls peakiness; optimal around $20$ (or $1/1.99$ for CLIP models) | Implicit in codeword inner-product capacity |
| Querybank size ($B$) | $5000$–$10{,}000$ probes; performance saturates beyond this range | Codebook size $K$ (balancing distortion and speed) |
| Top-$k$ selection | Determines the activation set; tuned for maximal robustness | Not applicable; uniform sampling at codeword level |
| Similarity measure | Cosine similarity in the joint embedding space | Inner product in $\mathbb{R}^d$ |
Selection is typically done via held-out validation; increasing the querybank size $B$ or codebook size $K$ reduces hubness and quantization distortion, respectively, but incurs additional memory.
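Held-out selection can be as simple as a grid search scored by R@1. The sketch below does this for the inverse temperature $\beta$ under static IS scoring; the helper `select_beta` and the toy validation split (noisy copies of gallery items) are illustrative assumptions, not part of either paper.

```python
import numpy as np

def select_beta(val_queries, val_targets, gallery, querybank,
                betas=(10.0, 20.0, 40.0)):
    """Grid-search the inverse temperature on held-out data, scoring by R@1."""
    probe = querybank @ gallery.T                    # (n_bank, n_gallery)
    best_beta, best_r1 = None, -1.0
    for beta in betas:
        denom = np.exp(beta * probe).sum(axis=0)     # IS normaliser per item
        eta = np.exp(beta * (val_queries @ gallery.T)) / denom
        r1 = float((eta.argmax(axis=1) == val_targets).mean())
        if r1 > best_r1:
            best_beta, best_r1 = beta, r1
    return best_beta, best_r1

# toy held-out split: each validation query is a noisy copy of its target item
rng = np.random.default_rng(5)
g = rng.standard_normal((40, 16)); g /= np.linalg.norm(g, axis=1, keepdims=True)
bank = rng.standard_normal((100, 16)); bank /= np.linalg.norm(bank, axis=1, keepdims=True)
vq = g + 0.1 * rng.standard_normal(g.shape)
vq /= np.linalg.norm(vq, axis=1, keepdims=True)
beta_star, r1 = select_beta(vq, np.arange(40), g, bank)
```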
5. Integration in End-to-End Systems
QB-Norm Pipeline
- Training: Encoders $f_q$ and $f_g$ are learned via a standard ranking/contrastive loss.
- Indexing: Gallery embeddings and the querybank are fixed; probe similarities and the activation set $\mathcal{A}$ are precomputed.
- Inference: Queries scored and ranked via DIS; no retraining or parameter updates required.
Sampled Softmax for Large-Scale Learning
- Index construction: Product/residual quantization codebooks computed offline.
- Proposal decomposition: DIS enables sublinear sampling and softmax approximation, reducing both bias and compute.
- Training/inference: Gradient estimation and prediction performed on dynamically sampled negatives.
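The importance-weighted training step can be illustrated with the classical sampled-softmax logit correction $-\log(M\,Q(i))$ on the sampled negatives. The uniform proposal below is a placeholder where the codeword-structured DIS proposal would be substituted; all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, M = 1000, 16, 32                        # classes, embedding dim, negatives

W = rng.standard_normal((N, d)) / np.sqrt(d)  # class embeddings
q = rng.standard_normal(d)                    # query embedding
target = 7

# draw negatives from a proposal Q; uniform here is a stand-in for the
# codeword-structured DIS proposal, which would concentrate on hard classes
neg = rng.choice(N, size=M, replace=False)
neg = neg[neg != target]
Q = np.full(neg.size, 1.0 / N)

# importance-sampled softmax: correct each negative logit by -log(M * Q(i))
logits = np.concatenate(([W[target] @ q],
                         W[neg] @ q - np.log(neg.size * Q)))
logits -= logits.max()                        # numerical stability
loss = -logits[0] + np.log(np.exp(logits).sum())
```

Autodiff through `loss` yields the biased-but-bounded gradient estimate; the closer $Q$ is to the true softmax, the smaller the bias.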
6. Empirical Performance and Robustness
Retrieval tasks (Bogolin et al., 2021):
- Video-text retrieval (MSR-VTT): R@1 improves over the underlying retrieval models; applying DIS to CLIP2Video also lifts R@1.
- Recall increases: across six video-text datasets, all recall metrics rise by $1$–$5$ points.
- Image/text/audio tasks: CLIP zero-shot retrieval on MSCOCO improves in R@1, and text-audio retrieval likewise improves.
- Hubness reduction: for example, the skewness of the $k$-occurrence distribution on MSR-VTT drops.
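Hubness is commonly quantified via the skewness of the $k$-occurrence distribution $N_k$ (how often each gallery item appears in queries' top-$k$ lists). A small sketch of that measurement on random unit embeddings follows; the function name and data are illustrative.

```python
import numpy as np

def k_occurrence_skewness(queries, gallery, k=10):
    """Skewness of N_k: how often each gallery item lands in queries' top-k."""
    sims = queries @ gallery.T
    topk = np.argpartition(sims, -k, axis=1)[:, -k:]
    n_k = np.bincount(topk.ravel(), minlength=gallery.shape[0]).astype(float)
    centred = n_k - n_k.mean()
    return float((centred ** 3).mean() / n_k.std() ** 3)

rng = np.random.default_rng(4)
q = rng.standard_normal((500, 32)); q /= np.linalg.norm(q, axis=1, keepdims=True)
g = rng.standard_normal((200, 32)); g /= np.linalg.norm(g, axis=1, keepdims=True)
sk = k_occurrence_skewness(q, g)   # high positive skew signals hub formation
```

Comparing this statistic before and after DIS scoring is how the reported hubness reductions are measured.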
Robustness to Distributional Shift:
- DIS preserves baseline performance under far-domain querybanks, unlike static IS or CSLS which collapse (Bogolin et al., 2021).
- Ablations show smooth degradation away from the optimal inverse temperature $\beta$, saturation with increasing querybank size, and maximum robustness at the chosen top-$k$ threshold.
Large-Scale Learning (MIDX-Sampler) (Chen et al., 15 Jan 2025):
- Sampling cost: sublinear in the number of classes $N$.
- Bias/convergence: KL-divergence and gradient bias provably bounded; smaller residual distortion leads to superior generalisation and faster convergence.
- Empirical speedup: MIDX-Sampler matches full softmax in perplexity, ranking, and recall while sampling only a small fraction of classes, with memory improvements of $10\times$ or more over kernel methods.
7. Context, Significance, and Current Challenges
Dynamic Inverted Softmax provides modular, data-driven normalisation strategies that bypass limitations of static global transformations. In retrieval, DIS robustly demotes hubs only where necessary, preserving performance under query/resource limitations and domain mismatch. In sampled softmax, DIS enables scalable, theoretically justified learning for extreme class cardinality. These mechanisms require no retraining or backpropagation through the normaliser, making them practical augmentations for existing neural models.
A plausible implication is that future extensions may leverage learnable codeword assignment and additional adaptivity in sampling or hub detection to further reduce bias and accelerate convergence. Current limitations include memory footprint for large querybanks/codebooks and sensitivity to quantization or probe selection. Nevertheless, the empirical consistency and theoretical foundation make DIS central to both retrieval and extreme learning architectures (Bogolin et al., 2021, Chen et al., 15 Jan 2025).