- The paper demonstrates that dense retrieval performance scales with embedding dimension following a power law relationship.
- It employs BERT and Ettin models across diverse benchmarks to empirically validate the scaling patterns for both in-domain and out-of-domain tasks.
- The study establishes a joint scaling law for model size and embedding dimension, guiding cost-aware and efficient design of dense retrieval systems.
Problem Context and Motivation
The paper "Scaling Laws for Embedding Dimension in Information Retrieval" (2602.05062) rigorously studies the relationship between embedding dimension and retrieval performance in dense neural retrieval systems. As dense retrieval expands to tasks requiring high expressivity, such as complex instruction following or multi-step reasoning, two common assumptions become increasingly suspect: that a fixed-dimensional vector representation is adequate, and that the embedding dimension should simply equal the model's native transformer hidden size. The paper emphasizes that embedding dimension is not merely a resource consideration: there are intrinsic theoretical limitations, such as the linear separability of relevant and irrelevant sets in the embedding space, that scale as a function of dimension.
Consequently, understanding the empirical scaling behaviour of retrieval performance with respect to embedding dimension, and its joint dynamics with model size, is critical for both efficient deployment (impacting storage and compute, especially in settings with large corpora or on-device applications) and theoretical advancement of neural retrieval models.
Experimental Design and Methodology
The study is structured as a comprehensive empirical analysis. Two model families are considered—BERT and Ettin—spanning a range of parameterizations and embedding projections, including extensions above and compressions below the encoder’s native hidden size. BERT models are trained with a hybrid contrastive + knowledge distillation loss on MSMARCO, while Ettin models leverage an instruction-augmented variant (MSMARCO Instruct) and contrastive loss.
The core evaluation metric is contrastive entropy, a continuous proxy for retrieval effectiveness that correlates strongly with standard ranking metrics and offers sufficient granularity for scaling-law analysis. Both in-domain tasks (MSMARCO Dev, TREC DL) and out-of-domain benchmarks (CRUMB: Legal QA, Paper Retrieval) are evaluated to analyze both aligned and non-aligned scaling regimes.
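The paper's exact formulation is not reproduced here, but contrastive entropy is conventionally the InfoNCE-style cross-entropy of the positive document under a softmax over inner-product scores against sampled negatives. A minimal numpy sketch under that assumption:

```python
import numpy as np

def contrastive_entropy(q, pos, negs, temperature=1.0):
    """Cross-entropy of the positive under a softmax over inner-product
    scores against sampled negatives (InfoNCE-style). Lower is better.
    q: (d,) query embedding, pos: (d,) positive doc, negs: (n, d)."""
    scores = np.concatenate([[q @ pos], negs @ q]) / temperature
    scores -= scores.max()                       # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[0]                         # -log p(positive)

# toy check: a query aligned with its positive yields lower entropy
# than one whose "positive" is indistinguishable from the negatives
rng = np.random.default_rng(0)
q = np.array([1.0, 0.0, 0.0])
pos = np.array([0.9, 0.1, 0.0])
negs = rng.normal(scale=0.1, size=(8, 3))
assert contrastive_entropy(q, pos, negs) < contrastive_entropy(q, negs[0], negs[1:])
```

Because this quantity is continuous in the embedding geometry, small changes in dimension produce smooth changes in the metric, which is what makes it usable for curve fitting.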
Main Empirical Findings
The results demonstrate that retrieval performance, measured by contrastive entropy, scales with embedding dimension according to a power law, i.e.,
L(D) = A / D^α + δ
where A, α, and δ are fit parameters, and L(D) denotes contrastive entropy. This law holds robustly across both model families, various model sizes, and two core IR benchmarks, with high R² fits indicating strong explanatory power. Notably, for tasks matched to the training distribution, increased embedding dimension leads to continued, but diminishing, performance gains. In regimes where the embedding dimension exceeds the backbone’s native hidden size, further improvements saturate rapidly, and storage/compute returns diminish. When evaluating on unaligned test sets, scaling behaviour becomes less predictable and, for some BERT configurations, over-large embeddings can even yield degraded performance.
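Fitting this form from (dimension, entropy) measurements is straightforward: since the model is linear in A and δ once α is fixed, one can scan α and solve the remaining least-squares problem in closed form. A self-contained sketch on synthetic data (the parameter values are illustrative, not the paper's fits):

```python
import numpy as np

def fit_power_law(dims, losses, alphas=np.linspace(0.05, 2.0, 2000)):
    """Fit L(D) = A / D**alpha + delta by scanning alpha and solving
    the remaining linear least-squares problem for (A, delta)."""
    best = None
    for alpha in alphas:
        X = np.stack([dims ** -alpha, np.ones_like(dims)], axis=1)
        coef, *_ = np.linalg.lstsq(X, losses, rcond=None)
        rss = float(((X @ coef - losses) ** 2).sum())
        if best is None or rss < best[0]:
            best = (rss, coef[0], alpha, coef[1])
    return best[1:]  # (A, alpha, delta)

# synthetic measurements at typical embedding dimensions,
# generated from made-up parameters A=12.0, alpha=0.55, delta=0.3
dims = np.array([64.0, 128, 256, 512, 768, 1024, 2048])
losses = 12.0 / dims**0.55 + 0.3
A, alpha, delta = fit_power_law(dims, losses)
assert abs(alpha - 0.55) < 0.005 and abs(delta - 0.3) < 0.01
```

The fitted δ plays the role of the irreducible loss floor: it is the value the curve approaches as D grows, which is why returns diminish beyond a certain dimension.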
A unified scaling law is also derived by introducing model size as a joint variable:
L(D, N) = A / D^α + B / N^β + δ
This additive form captures both the complementarity and the limits of trading model parameters against embedding dimension: each term bounds how far expanding one factor can compensate for restricting the other.
Importantly, the study finds that neither model size nor embedding dimension alone compensates well for limitations in the other past a critical point; e.g., even very large embeddings cannot approach the effectiveness of a significantly larger model using typical dimensions, and the same holds vice versa at low embedding sizes.
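Taking the joint law in the additive form L(D, N) = A/D^α + B/N^β + δ, the residual B/N^β term is exactly what dimension scaling alone can never remove. A small numeric illustration (all coefficient values below are made up for demonstration, not fitted values from the paper):

```python
def joint_loss(D, N, A=2.0, alpha=0.55, B=200.0, beta=0.4, delta=0.3):
    """Joint scaling law L(D, N) = A/D**alpha + B/N**beta + delta.
    Coefficients are illustrative assumptions, not the paper's fits."""
    return A / D**alpha + B / N**beta + delta

# a 30M-parameter model with a huge 8192-dim embedding still cannot
# match a 300M-parameter model at a typical 768-dim embedding,
# because the B/N**beta term dominates for the small model
small_model_huge_dim = joint_loss(D=8192, N=30e6)
large_model_typical = joint_loss(D=768, N=300e6)
assert small_model_huge_dim > large_model_typical
```

The symmetric failure mode holds too: under this form, a very large model with a tiny embedding is bottlenecked by the A/D^α term no matter how many parameters it has.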
Cost-Aware Analysis and Practical Implications
By instantiating the joint scaling laws, the paper formalizes cost-aware model selection under fixed computational budgets, accounting for both query encoding and corpus scoring (brute-force or ANN). The analysis reveals that optimal performance requires joint scaling: as FLOP budgets grow, both model capacity and embedding dimension should increase, and the optimal embeddings are often significantly smaller than the hidden size for large corpora (or, in ANN settings, can safely grow larger). The practical consequence is that for real-world deployments, the field’s standard of inheriting the backbone’s embedding size is almost never optimal for effectiveness or efficiency. Furthermore, scaling laws can expedite architecture search by enabling accurate ex ante predictions, obviating exhaustive hyperparameter sweeps.
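A minimal sketch of this selection procedure, under an assumed per-query cost model (roughly 2N FLOPs to encode a query with an N-parameter encoder plus 2 · |corpus| · D FLOPs for brute-force inner-product scoring; both the cost model and the scaling-law coefficients are illustrative, not the paper's):

```python
import itertools

def predicted_loss(D, N, A=2.0, alpha=0.55, B=200.0, beta=0.4, delta=0.3):
    # illustrative joint scaling law; coefficients are not the paper's fits
    return A / D**alpha + B / N**beta + delta

def query_flops(D, N, corpus_size):
    # assumed per-query cost: ~2N FLOPs to encode the query, plus
    # 2 * corpus_size * D FLOPs for brute-force corpus scoring
    return 2 * N + 2 * corpus_size * D

def best_config(budget, corpus_size,
                dims=(64, 128, 256, 512, 768, 1024, 2048),
                sizes=(30e6, 100e6, 300e6, 1e9)):
    """Pick the (D, N) pair minimizing predicted loss within budget."""
    feasible = [(D, N) for D, N in itertools.product(dims, sizes)
                if query_flops(D, N, corpus_size) <= budget]
    return min(feasible, key=lambda c: predicted_loss(*c))

# with a large corpus, scoring cost dominates, so the optimum trades
# embedding dimension away in favor of a larger encoder
print(best_config(budget=5e9, corpus_size=1e6))  # → (1024, 1000000000.0)
```

Note how the optimum shrinks D as the corpus grows: with a smaller corpus under the same budget, the same search returns the full 2048-dim embedding, mirroring the paper's observation that the optimal dimension depends on corpus size and scoring regime.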
Out-of-Domain and Ranking Metric Correlation
Scaling patterns are robust in-domain, but for out-of-domain transfer (especially for BERT with knowledge distillation), larger embeddings can have negative or erratic effects, indicating overfitting to teacher behaviour or diminished distributional robustness. This is less pronounced for Ettin models trained directly with contrastive objectives, aligning with other LLM scaling results highlighting the adverse impact of knowledge distillation for generalization. The translation of scaling behaviour from contrastive entropy to discrete IR metrics (RR@10, R@1000) is consistent but, as expected, noisier due to metric discretization.
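For reference, RR@10 illustrates why discrete metrics yield noisier curves: per query it takes only eleven possible values (1, 1/2, …, 1/10, 0), so small embedding-dimension changes often leave it unchanged. A minimal implementation of the standard definition:

```python
def rr_at_10(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document within the
    top 10 results; 0.0 if none appears (standard RR@10)."""
    for rank, doc_id in enumerate(ranked_ids[:10], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

print(rr_at_10(["d3", "d7", "d1"], {"d7"}))  # → 0.5
```

Averaging RR@10 over many queries smooths this discretization, which is why the benchmark-level correlation with contrastive entropy is consistent despite the per-query noise.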
Theoretical and Future Directions
These empirical results close the gap between abstract geometric limits on embedding-based IR (linear separability, orthogonality bounds, capacity theorems) and practical system design, providing a smooth, quantitative bridge to predict capacity constraints and plan for data/model scaling. The trajectory for further research includes:
- Extending scaling law analysis to sparse representations and understanding the interaction with sparsity ratios.
- Investigating methods and objectives that uniformly utilize the geometric capacity of the expanded embedding space (mitigating the diminishing returns of added dimensions).
- Refining scaling laws for similarity functions beyond the inner product, and for tasks involving multiple negatives or compositional structure.
- Analyzing the role of training distribution alignment for robust generalization in the high-dimension regime.
Conclusion
This study empirically establishes and characterizes the power-law relationship between embedding dimension and dense retrieval effectiveness. It presents strong evidence that both model size and embedding dimension must be jointly balanced for cost-efficient, high-performance dense retrieval systems, and provides explicit scaling laws to facilitate this optimization. The insights have both immediate applied value for system architects and theoretical relevance for understanding representational bottlenecks in neural IR. The framework invites further exploration into sparsity, capacity utilization, and embedding geometry, offering a principled foundation for dense retriever scaling strategies.