- The paper demonstrates that dense retrieval performance scales with embedding dimension following a power law relationship.
- It employs BERT and Ettin models across diverse benchmarks to empirically validate the scaling patterns for both in-domain and out-of-domain tasks.
- The study establishes a joint scaling law for model size and embedding dimension, guiding cost-aware and efficient design of dense retrieval systems.
Problem Context and Motivation
The paper "Scaling Laws for Embedding Dimension in Information Retrieval" (2602.05062) rigorously studies the relationship between embedding dimension and retrieval performance in dense neural retrieval systems. As dense retrieval expands to tasks requiring high expressivity, such as complex instruction following or multi-step reasoning, two common assumptions become increasingly suspect: that a fixed-dimensional vector representation is adequate, and that the embedding dimension should simply equal the model's native transformer hidden size. The paper emphasizes that embedding dimension is not merely a resource consideration: there are intrinsic theoretical limitations, such as the linear separability of relevant and irrelevant sets in the embedding space, that scale as a function of dimension.
Consequently, understanding the empirical scaling behaviour of retrieval performance with respect to embedding dimension, and its joint dynamics with model size, is critical for both efficient deployment (impacting storage and compute, especially in settings with large corpora or on-device applications) and theoretical advancement of neural retrieval models.
Experimental Design and Methodology
The study is structured as a comprehensive empirical analysis. Two model families are considered—BERT and Ettin—spanning a range of parameterizations and embedding projections, including extensions above and compressions below the encoder’s native hidden size. BERT models are trained with a hybrid contrastive + knowledge distillation loss on MSMARCO, while Ettin models leverage an instruction-augmented variant (MSMARCO Instruct) and contrastive loss.
The core evaluation metric is contrastive entropy, a continuous proxy for retrieval effectiveness that correlates strongly with standard ranking metrics and offers sufficient granularity for scaling-law analysis. Both in-domain tasks (MSMARCO Dev, TREC DL) and out-of-domain benchmarks (CRUMB: Legal QA, Paper Retrieval) are evaluated to analyze both aligned and non-aligned scaling regimes.
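The paper's exact formulation is not reproduced here, but contrastive entropy is conventionally the InfoNCE-style cross-entropy of the positive document under a softmax over inner-product scores against sampled negatives. A minimal numpy sketch under that assumption:

```python
import numpy as np

def contrastive_entropy(q, pos, negs, temperature=1.0):
    """Cross-entropy of the positive under a softmax over inner-product
    scores against sampled negatives (InfoNCE-style). Lower is better.
    q: (d,) query embedding, pos: (d,) positive doc, negs: (n, d)."""
    scores = np.concatenate([[q @ pos], negs @ q]) / temperature
    scores -= scores.max()                       # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[0]                         # -log p(positive)

# toy check: a query aligned with its positive yields lower entropy
# than one whose "positive" is indistinguishable from the negatives
rng = np.random.default_rng(0)
q = np.array([1.0, 0.0, 0.0])
pos = np.array([0.9, 0.1, 0.0])
negs = rng.normal(scale=0.1, size=(8, 3))
assert contrastive_entropy(q, pos, negs) < contrastive_entropy(q, negs[0], negs[1:])
```

Because this quantity is continuous in the embedding geometry, small changes in dimension produce smooth changes in the metric, which is what makes it usable for curve fitting.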
Main Empirical Findings
The results demonstrate that retrieval performance, measured by contrastive entropy, scales with embedding dimension according to a power law, i.e.,
L(D) = A / D^α + δ
where A, α, and δ are fit parameters, and L(D) denotes contrastive entropy. This law holds robustly across both model families, various model sizes, and two core IR benchmarks, with high R² fits indicating strong explanatory power. Notably, for tasks matched to the training distribution, increased embedding dimension leads to continued, but diminishing, performance gains. In regimes where the embedding dimension exceeds the backbone’s native hidden size, further improvements saturate rapidly, and storage/compute returns diminish. When evaluating on unaligned test sets, scaling behaviour becomes less predictable and, for some BERT configurations, over-large embeddings can even yield degraded performance.
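Fitting this form from (dimension, entropy) measurements is straightforward: since the model is linear in A and δ once α is fixed, one can scan α and solve the remaining least-squares problem in closed form. A self-contained sketch on synthetic data (the parameter values are illustrative, not the paper's fits):

```python
import numpy as np

def fit_power_law(dims, losses, alphas=np.linspace(0.05, 2.0, 2000)):
    """Fit L(D) = A / D**alpha + delta by scanning alpha and solving
    the remaining linear least-squares problem for (A, delta)."""
    best = None
    for alpha in alphas:
        X = np.stack([dims ** -alpha, np.ones_like(dims)], axis=1)
        coef, *_ = np.linalg.lstsq(X, losses, rcond=None)
        rss = float(((X @ coef - losses) ** 2).sum())
        if best is None or rss < best[0]:
            best = (rss, coef[0], alpha, coef[1])
    return best[1:]  # (A, alpha, delta)

# synthetic measurements at typical embedding dimensions,
# generated from made-up parameters A=12.0, alpha=0.55, delta=0.3
dims = np.array([64.0, 128, 256, 512, 768, 1024, 2048])
losses = 12.0 / dims**0.55 + 0.3
A, alpha, delta = fit_power_law(dims, losses)
assert abs(alpha - 0.55) < 0.005 and abs(delta - 0.3) < 0.01
```

The fitted δ plays the role of the irreducible loss floor: it is the value the curve approaches as D grows, which is why returns diminish beyond a certain dimension.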
A unified scaling law is also derived by introducing model size as a joint variable:
L(D, N) = A / D^α + B / N^β + δ
This additive form captures both the complementarity and the limits of trading model parameters against embedding dimension: each term bounds how far expanding one factor can compensate for restricting the other.
Importantly, the study finds that neither model size nor embedding dimension alone compensates well for limitations in the other past a critical point; e.g., even very large embeddings cannot approach the effectiveness of a significantly larger model using typical dimensions, and the same holds vice versa at low embedding sizes.
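Taking the joint law in the additive form L(D, N) = A/D^α + B/N^β + δ, the residual B/N^β term is exactly what dimension scaling alone can never remove. A small numeric illustration (all coefficient values below are made up for demonstration, not fitted values from the paper):

```python
def joint_loss(D, N, A=2.0, alpha=0.55, B=200.0, beta=0.4, delta=0.3):
    """Joint scaling law L(D, N) = A/D**alpha + B/N**beta + delta.
    Coefficients are illustrative assumptions, not the paper's fits."""
    return A / D**alpha + B / N**beta + delta

# a 30M-parameter model with a huge 8192-dim embedding still cannot
# match a 300M-parameter model at a typical 768-dim embedding,
# because the B/N**beta term dominates for the small model
small_model_huge_dim = joint_loss(D=8192, N=30e6)
large_model_typical = joint_loss(D=768, N=300e6)
assert small_model_huge_dim > large_model_typical
```

The symmetric failure mode holds too: under this form, a very large model with a tiny embedding is bottlenecked by the A/D^α term no matter how many parameters it has.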
Cost-Aware Analysis and Practical Implications
By instantiating the joint scaling laws, the paper formalizes cost-aware model selection under fixed computational budgets, accounting for both query encoding and corpus scoring (brute-force or ANN). The analysis reveals that optimal performance requires joint scaling: as FLOP budgets grow, both model capacity and embedding dimension should increase, and the optimal embeddings are often significantly smaller than the hidden size for large corpora (or, in ANN settings, can safely grow larger). The practical consequence is that for real-world deployments, the field’s standard of inheriting the backbone’s embedding size is almost never optimal for effectiveness or efficiency. Furthermore, scaling laws can expedite architecture search by enabling accurate ex ante predictions, obviating exhaustive hyperparameter sweeps.
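A minimal sketch of this selection procedure, under an assumed per-query cost model (roughly 2N FLOPs to encode a query with an N-parameter encoder plus 2 · |corpus| · D FLOPs for brute-force inner-product scoring; both the cost model and the scaling-law coefficients are illustrative, not the paper's):

```python
import itertools

def predicted_loss(D, N, A=2.0, alpha=0.55, B=200.0, beta=0.4, delta=0.3):
    # illustrative joint scaling law; coefficients are not the paper's fits
    return A / D**alpha + B / N**beta + delta

def query_flops(D, N, corpus_size):
    # assumed per-query cost: ~2N FLOPs to encode the query, plus
    # 2 * corpus_size * D FLOPs for brute-force corpus scoring
    return 2 * N + 2 * corpus_size * D

def best_config(budget, corpus_size,
                dims=(64, 128, 256, 512, 768, 1024, 2048),
                sizes=(30e6, 100e6, 300e6, 1e9)):
    """Pick the (D, N) pair minimizing predicted loss within budget."""
    feasible = [(D, N) for D, N in itertools.product(dims, sizes)
                if query_flops(D, N, corpus_size) <= budget]
    return min(feasible, key=lambda c: predicted_loss(*c))

# with a large corpus, scoring cost dominates, so the optimum trades
# embedding dimension away in favor of a larger encoder
print(best_config(budget=5e9, corpus_size=1e6))  # → (1024, 1000000000.0)
```

Note how the optimum shrinks D as the corpus grows: with a smaller corpus under the same budget, the same search returns the full 2048-dim embedding, mirroring the paper's observation that the optimal dimension depends on corpus size and scoring regime.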
Out-of-Domain and Ranking Metric Correlation
Scaling patterns are robust in-domain, but for out-of-domain transfer (especially for BERT with knowledge distillation), larger embeddings can have negative or erratic effects, indicating overfitting to teacher behaviour or diminished distributional robustness. This is less pronounced for Ettin models trained directly with contrastive objectives, aligning with other LLM scaling results highlighting the adverse impact of knowledge distillation for generalization. The translation of scaling behaviour from contrastive entropy to discrete IR metrics (RR@10, R@1000) is consistent but, as expected, noisier due to metric discretization.
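For reference, RR@10 illustrates why discrete metrics yield noisier curves: per query it takes only eleven possible values (1, 1/2, …, 1/10, 0), so small embedding-dimension changes often leave it unchanged. A minimal implementation of the standard definition:

```python
def rr_at_10(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document within the
    top 10 results; 0.0 if none appears (standard RR@10)."""
    for rank, doc_id in enumerate(ranked_ids[:10], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

print(rr_at_10(["d3", "d7", "d1"], {"d7"}))  # → 0.5
```

Averaging RR@10 over many queries smooths this discretization, which is why the benchmark-level correlation with contrastive entropy is consistent despite the per-query noise.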
Theoretical and Future Directions
These empirical results close the gap between abstract geometric limits on embedding-based IR (linear separability, orthogonality bounds, capacity theorems) and practical system design, providing a smooth, quantitative bridge to predict capacity constraints and plan for data/model scaling. The trajectory for further research includes:
- Extending scaling law analysis to sparse representations and understanding the interaction with sparsity ratios.
- Investigating methods and objectives that uniformly utilize the geometric capacity of the expanded embedding space (mitigating the diminishing returns of added dimensions).
- Refining scaling laws for similarity functions beyond the inner product, and for tasks involving multiple negatives or compositional structure.
- Analyzing the role of training distribution alignment for robust generalization in the high-dimension regime.
Conclusion
This study empirically establishes and characterizes the power-law relationship between embedding dimension and dense retrieval effectiveness. It presents strong evidence that both model size and embedding dimension must be jointly balanced for cost-efficient, high-performance dense retrieval systems, and provides explicit scaling laws to facilitate this optimization. The insights have both immediate applied value for system architects and theoretical relevance for understanding representational bottlenecks in neural IR. The framework invites further exploration into sparsity, capacity utilization, and embedding geometry, offering a principled foundation for dense retriever scaling strategies.