To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining

Published 1 Apr 2026 in cs.CL, cs.AI, and cs.LG | (2604.00715v1)

Abstract: Retrieval-augmented generation (RAG) improves LLM (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper establishes a 3D scaling law framework that quantifies the trade-off between pretraining tokens and retrieval data in language models.
It demonstrates that retrieval is most effective in low-to-mid pretraining regimes, with diminishing marginal gains as model size increases.
The study provides practical guidelines for corpus allocation by highlighting the substitutability between pretraining and retrieval under fixed data constraints.

Scaling Laws and Data Allocation in RAG-Aware Pretraining

Introduction

The paper "To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining" (2604.00715) establishes a quantitative framework for optimizing the trade-off between parametric (memorization via pretraining) and non-parametric (retrieval-augmented) knowledge in LLM training under fixed data constraints. By systematically varying model size, pretraining token counts, and external retrieval corpus scale across OLMo-2-based LMs (30M–3B parameters), the authors elucidate the conditions under which retrieval can substitute for pretraining and provide task-specific guidance for RAG-aware LM development.

Scaling Laws for Parametric and Retrieval-Augmented Pretraining

The empirical results demonstrate canonical power-law scaling for parametric pretraining: increasing model parameters and training tokens reliably reduces loss, albeit with diminishing returns. The paper fits loss surfaces using formulations such as

$L(N, D) = A \left(\frac{N}{10^9}\right)^{-\alpha} + B \left(\frac{D}{10^9}\right)^{-\beta} + L_0$

and extends this to include retrieval via an additive logarithmic gain term:

$L(N, D, R) = A \left(\frac{N}{10^9}\right)^{-\alpha} + B \left(\frac{D}{10^9}\right)^{-\beta} - C\log\!\left(1+\eta\frac{R}{10^9}\right) + L_0$

where $N$ is model size, $D$ is pretraining tokens, and $R$ is retrieval corpus size. These 3D scaling surfaces are fit on multiple downstream benchmarks, with low CV ARE (<8% for most tasks), validating the model's capacity to capture retrieval-augmented scaling behavior. Importantly, retrieval provides diminishing returns, both in absolute loss reduction and marginal gains per retrieved token.

Figure 1: Power-law parametric scaling, with iso-loss contours and compute-efficient frontier; scaling fits match empirical observations across model sizes and data budgets.

Trade-Offs and Substitutability Between Pretraining and Retrieval

The crux of the paper is the joint optimization of pretraining and retrieval allocation under a fixed corpus budget. The authors quantify substitutability ( $\sigma$ ) – the number of pretraining tokens that can be replaced per retrieval token – and the marginal benefit of retrieval ( $\kappa$ ) – the loss reduction per unit of retrieved data. The results show clear regime dependence: retrieval is most effective in low-to-mid pretraining regimes and for smaller models, where capacity saturation has not been reached. Substitutability becomes significant (each retrieval token replaces multiple pretraining tokens) when pretraining tokens per parameter exceed $\sim4$ .

As model size increases, both $\sigma$ and $\kappa$ decrease, indicating diminishing marginal utility of retrieval. The efficiency of retrieval is highest for knowledge-centric tasks, and minimal or even negative for reasoning-heavy tasks where internal computational capacity is the limiting factor, not factual recall.

Figure 2: Allocation trade-off curves. Performance varies smoothly along the pretraining–retrieval frontier; smaller models achieve substantial gains with retrieval, large models exhibit saturation.

Figure 3: Substitutability and marginal gains of retrieval. Left: retrieval becomes an efficient substitute for pretraining above critical scale; right: marginal benefit peaks for small models and declines with scale.

Retrieval Quality and Query Formulation

The paper further isolates the impact of retrieval quality by varying the query formulation and evaluating oracle-style retrieval scenarios. Enhanced query constructs (e.g., including gold answers or answer choices) yield incremental gains but do not fundamentally alter the scale-dependent retrieval–pretraining trade-off. In knowledge-heavy tasks, improved retrieval precision provides stronger anchoring of factual cues, while in reasoning-heavy domains retrieval is rarely beneficial and may even distract the model.

Figure 4: Effect of query formulation on retrieval-augmented performance; oracle queries provide incremental improvement, but scaling trends persist.

Benchmark-Specific Scaling and Fit Robustness

Scaling law fits are benchmark-specific but exhibit robust predictability across held-out model sizes (LOMO error) and random seeds, with knowledge-driven tasks such as ARC or OpenBookQA showing stable fits and reasoning-centric benchmarks displaying higher variance. The calibration plots confirm strong predictive alignment between fitted 3D scaling laws and observed perplexity trends, further motivating the joint optimization of parametric and non-parametric data allocation.

Figure 5: Stability of scaling law fits across seeds and model families. Low variance indicates robust scaling structure.

Implications for Model and System Design

The results demonstrate that retrieval and pretraining must be considered as competing mechanisms for corpus utilization, rather than as independent system augmentations. Retrieval is most valuable in data-constrained, undertrained, or small-LM regimes, where parametric capacity is insufficient for comprehensive factual recall. As models scale up or approach pretraining saturation, retrieval's marginal utility declines, and optimization should prioritize internalization via pretraining.

The practical implications are substantial for efficient LM system design, particularly for SLMs or deployments with strict data and computational budgets. Explicit partitioning of corpora for parametric learning versus external retrieval can yield compute-efficient and memory-efficient architectures. Theoretical implications involve a unified treatment of scaling laws, suggesting retrieval is not a monotonic or universal substitute for memorization but a conditional, scale- and task-dependent resource.

Limitations and Future Directions

The study fixes retriever type and chunking strategy, which may understate maximum achievable gains for advanced retrieval pipelines or adaptive methods. The scaling law framework could be extended to incorporate retrieval quality metrics, relevance estimation, or adaptive allocation schemes (e.g., reranking, learned filtering). Unification across benchmarks would also facilitate generalization of scaling law parameters and explainability regarding retrieval sensitivity or reasoning burdens.

Inspired by analogies to human cognition, future research could explore purposeful allocation of abstract reasoning to pretraining versus long-tail factual knowledge to retrieval within hybrid memory architectures. More principled scheduling of data exposure and adaptive retrieval could further enhance sample efficiency.

Conclusion

This paper provides an authoritative quantification of the pretraining–retrieval trade-off in RAG-aware LM training, supported by a 3D scaling law framework and robust empirical analysis across diverse model sizes and tasks. The results delineate the scale-dependent substitutability, marginal gains, and practical limitations of retrieval augmentation as a complement to parametric learning. The insights have immediate practical value for efficient architecture design and corpus allocation, and lay the foundation for future extensions in retrieval pipeline optimization and scaling law theory.