
Search-R1 Codebase for RL-Optimized Reasoning

Updated 2 February 2026
  • Search-R1 Codebase is an open-source reinforcement learning framework that enables large language models to perform retrieval-augmented reasoning via dynamic search interactions.
  • It employs techniques like token-level reward assignment, retrieved-token masking, and group-normalized advantage baselines to optimize multi-turn question answering.
  • Its modular design supports rapid experimentation with interchangeable retrieval backends and demonstrates significant EM gains on datasets such as NQ, HotpotQA, and TriviaQA.

Search-R1 Codebase

The Search-R1 codebase is an open-source reinforcement learning (RL) framework that enables LLMs to learn reasoning workflows that leverage real-time search engine interaction. The key focus of Search-R1 is the joint optimization of multi-turn question answering and dynamic search query generation, employing token-level reward assignment, retrieved-token masking, and highly scalable RL with group-normalized advantage baselines. The codebase and pretrained models are publicly released, providing a foundation for research in tool-augmented LLMs and retrieval-augmented reasoning (Jin et al., 12 Mar 2025).

1. Architecture and Modular Design

The Search-R1 repository is structured to support rapid experimentation with RL for reasoning agents equipped with real-time search. At the core, the system is organized into modules for agent behavior, rollout orchestration, retrieval integration, learning and optimization, and checkpoint management.

  • Agent module (agent.py): Provides PPOAgent and GRPOAgent classes, encapsulating policy sampling, advantage normalization, and surrogate objectives.
  • Rollout module (rollout.py): Implements multi-turn search-augmented rollouts. At each turn, the LLM can emit search queries (bracketed by <search>...</search>) which are routed to a retrieval engine. The top-k results are injected as retrieval spans (wrapped by <retrieval>...</retrieval>) into the context. Trajectories are terminated by <answer>...</answer>.
  • Retriever module (retriever.py): Abstracts dense retrieval with E5 embeddings and a Wikipedia 2018 FAISS index, supporting real-time passage lookup.
  • Masking utilities (masking.py): Ensure gradients are only backpropagated through LLM-generated tokens, not the inserted retrievals.
  • RL Trainer (trainer.py): Orchestrates agent updating, batching, rollout management, optimization, and loss scaling.
  • Reward computation (rewards.py): Implements outcome-based reward functions, typically using exact-match (EM) as the terminal signal.
  • Evaluation and metrics (metrics.py, eval_search_r1.py): Provide EM, macro-average reporting, and benchmark evaluation logic.

The codebase supports domain-agnostic QA datasets (NaturalQuestions, HotpotQA, TriviaQA, etc.), and is backend-agnostic for retrieval, as long as the retriever conforms to the query–results API (Jin et al., 12 Mar 2025).
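The query–results contract can be illustrated with a small protocol. This is a sketch only: the class and method names below (Retriever, RetrievedPassage, search) are illustrative assumptions, not the repository's actual signatures, and the toy term-overlap backend merely stands in for the E5 + FAISS retriever.

```python
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class RetrievedPassage:
    text: str
    score: float

class Retriever(Protocol):
    """The query–results contract: any backend exposing this method
    (dense FAISS index, web API, ...) can be plugged in."""
    def search(self, query: str, top_k: int) -> List[RetrievedPassage]: ...

def _tokens(s: str) -> set:
    return set(s.lower().replace(".", "").replace(",", "").split())

class InMemoryRetriever:
    """Toy backend ranking passages by term overlap with the query."""
    def __init__(self, passages: List[str]):
        self.passages = passages

    def search(self, query: str, top_k: int = 3) -> List[RetrievedPassage]:
        q = _tokens(query)
        scored = [RetrievedPassage(p, float(len(q & _tokens(p))))
                  for p in self.passages]
        scored.sort(key=lambda r: r.score, reverse=True)
        return scored[:top_k]

retriever = InMemoryRetriever([
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Berlin is the capital of Germany.",
])
results = retriever.search("capital of France", top_k=2)
```

Because the trainer only depends on this interface, swapping backends is a local change to retriever.py.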

2. Algorithmic Foundation: RL for Search-Augmented Reasoning

The principal RL objective in Search-R1 is to optimize an LLM policy $\pi_\theta$ over full reasoning trajectories $y$ that include dynamic search interaction:

J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_\theta(\cdot \mid x; R)}\bigl[r_\phi(x, y)\bigr] - \beta\, D_{\mathrm{KL}}\bigl[\pi_\theta(\cdot \mid x; R) \;\|\; \pi_{\mathrm{ref}}(\cdot \mid x; R)\bigr]

where $r_\phi(x, y)$ is the reward function, typically $1$ if the exact-match answer is correct and $0$ otherwise; $\beta$ is a KL penalty coefficient; and $\pi_{\mathrm{ref}}$ is the frozen reference LLM.

Two RL algorithms are implemented: Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), corresponding to the PPOAgent and GRPOAgent classes.

For both, only LLM-generated tokens (excluding search result insertions) contribute to policy gradient updates, stabilizing the optimization by decoupling the model from the unpredictable content of retrieved tokens (Jin et al., 12 Mar 2025).
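The effect of retrieved-token masking on the policy gradient can be shown with a minimal REINFORCE-style surrogate. This is a simplified stand-in for the actual PPO/GRPO losses, assumed here for illustration:

```python
def masked_pg_loss(logprobs, advantages, loss_mask):
    """REINFORCE-style surrogate over one trajectory with a retrieval mask.

    logprobs   : per-token log-probabilities under the current policy
    advantages : per-token advantage estimates
    loss_mask  : 1 for LLM-generated tokens, 0 for injected retrieval tokens
    """
    total = sum(-lp * adv * m
                for lp, adv, m in zip(logprobs, advantages, loss_mask))
    # Normalize by generated tokens only, so long retrieval spans
    # cannot dilute the gradient signal.
    return total / sum(loss_mask)

# Tokens 2-3 form a retrieved span: their log-probs cannot affect the loss.
loss_a = masked_pg_loss([-1.0, -2.0, -0.5, -1.5], [1.0] * 4, [1, 0, 0, 1])
loss_b = masked_pg_loss([-1.0, -9.0, -9.0, -1.5], [1.0] * 4, [1, 0, 0, 1])
```

The two calls differ only in the masked (retrieved) positions and therefore produce identical losses, which is exactly the decoupling described above.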

3. Search Integration and Masked Supervision

The retrieval integration strategy is central to Search-R1:

  • The model is prompted to emit search queries via <search>...</search>.
  • On detecting a completed search token span, the agent extracts the search query, issues it to the dense retriever, obtains top-3 passages, and injects them into the context in the form <retrieval>passage_text</retrieval>.
  • Retrieved tokens are explicitly masked during both the surrogate loss and the KL-divergence computation. This design prevents the LLM from "learning" the answer content from the retrievals themselves, supporting generalization and stable RL (Jin et al., 12 Mar 2025).

The model is required to output its final answer within <answer>...</answer>. This convention ensures that trajectory parsing and terminal reward assignment are robust and dataset-agnostic.
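A minimal parser for these tag conventions might look as follows. This is a sketch under the stated tag format; the repository's actual parsing logic may differ:

```python
import re

# Non-greedy spans so multiple tag pairs in one rollout are split correctly.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_trajectory(text):
    """Extract all search queries and the final answer from a rollout string."""
    queries = [m.strip() for m in SEARCH_RE.findall(text)]
    answers = ANSWER_RE.findall(text)
    answer = answers[-1].strip() if answers else None
    return queries, answer

rollout = (
    "Let me look this up. <search>capital of France</search> "
    "<retrieval>Paris is the capital of France.</retrieval> "
    "<answer>Paris</answer>"
)
queries, answer = parse_trajectory(rollout)
```

A rollout with no <answer> span yields answer None, which the reward function can treat as an incorrect (zero-reward) trajectory.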

4. Training Pipeline, Hyperparameters, and Execution

The training pipeline proceeds as follows:

  • Datasets: Merged NQ + HotpotQA train splits are the default (but any answerable QA dataset can be preprocessed accordingly).
  • Batching and Sampling: Batch size of $512$; for GRPO, $G = 5$ trajectories per example.
  • Sequence Lengths: max_input=4096, max_response=500, max_retrieval=500.
  • RL Hyperparameters: policy_lr = $1 \times 10^{-6}$, clip_epsilon = 0.2, KL_beta = 0.001, GAE $\lambda = 1.0$, $\gamma = 1.0$.
  • Optimization: Gradient checkpointing and FSDP are recommended for stability and memory efficiency.
  • Model Backbones: Qwen2.5-7B and Qwen2.5-3B models are natively supported.
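The group-normalized advantage baseline that GRPO uses (standardizing terminal rewards within each group of $G$ rollouts, with no learned value network) can be sketched as follows; the function name is an illustrative assumption:

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Standardize terminal rewards within one group of G rollouts for the
    same question: advantage_i = (r_i - mean) / (std + eps).
    Successful rollouts get positive advantage, failed ones negative."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# G = 5 rollouts of one question: two reach the correct answer (EM = 1).
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0, 0.0])
```

Note the degenerate case: if every rollout in a group gets the same reward, all advantages are zero and the group contributes no gradient, which is why sufficiently diverse sampling per question matters.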

A typical GRPO invocation:

python scripts/train_search_r1.py \
  --config configs/grpo_qwen3b.yaml \
  --rl_method grpo \
  --model_name qwen-2.5-3b-base \
  --output_dir outputs/grpo_qwen3b
(Jin et al., 12 Mar 2025).

5. Evaluation, Checkpoints, and Empirical Results

Evaluation is performed by running the trained RL agent over held-out datasets using the same multi-turn search rollout logic. Metrics include dataset-specific EM, macro-average EM, and latency statistics.
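Exact match is conventionally computed over normalized strings. The sketch below assumes the common SQuAD-style normalization (lowercase, strip punctuation and articles); the exact recipe in metrics.py may differ:

```python
import string

_ARTICLES = {"a", "an", "the"}

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    return " ".join(t for t in s.split() if t not in _ARTICLES)

def exact_match(prediction, gold_answers):
    """EM = 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))

def macro_average_em(per_dataset_em):
    """Unweighted mean of per-dataset EM scores."""
    return sum(per_dataset_em) / len(per_dataset_em)
```

Macro-averaging weights each dataset equally regardless of size, so a large benchmark such as TriviaQA cannot dominate the headline number.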

Checkpointing:

  • Every save_every steps, a checkpoint is written containing the model state, value network state (PPO only), optimizer state, and random generator states.
  • Checkpoints can be loaded for both training resumption and downstream inference.
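The checkpoint payload described above can be sketched as a round-trippable bundle. The helper names and pickle-based serialization are illustrative assumptions, not the repository's actual format:

```python
import os
import pickle
import random
import tempfile

def save_checkpoint(path, step, model_state, optimizer_state, value_state=None):
    """Bundle everything needed to resume training: model weights,
    value-network state (PPO only; None under GRPO), optimizer state,
    and the RNG state."""
    payload = {
        "step": step,
        "model_state": model_state,
        "value_state": value_state,
        "optimizer_state": optimizer_state,
        "python_rng_state": random.getstate(),
    }
    with open(path, "wb") as f:
        pickle.dump(payload, f)

def load_checkpoint(path):
    """Restore a checkpoint and re-seed the RNG so sampling resumes exactly."""
    with open(path, "rb") as f:
        payload = pickle.load(f)
    random.setstate(payload["python_rng_state"])
    return payload

# Round-trip demo in a temporary directory.
ckpt_path = os.path.join(tempfile.mkdtemp(), "step100.pkl")
save_checkpoint(ckpt_path, 100, {"w": [0.1, 0.2]}, {"momentum": [0.0, 0.0]})
draw_after_save = random.random()      # first draw after checkpointing
payload = load_checkpoint(ckpt_path)   # restores RNG to the saved state
draw_after_load = random.random()      # reproduces the same draw
```

Saving the RNG state alongside the model is what makes resumed rollout sampling bit-identical to an uninterrupted run.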

Empirically, Search-R1 demonstrates large gains over RAG and baseline agentic search paradigms. On the standard benchmark suite (NQ, HotpotQA, TriviaQA, etc.), Search-R1 achieves relative EM improvements of 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) under matched resource and data constraints (Jin et al., 12 Mar 2025). These results validate the advantage of end-to-end RL for tool-augmented reasoning agents.

6. Extensibility, Best Practices, and Community Usage

Search-R1 is designed for modular extension:

  • Retrieval backends can be swapped by adapting retriever.py to new indices or APIs.
  • New RL objectives (e.g., custom reward functions, multi-hop verification) can be implemented by extending rewards.py.
  • The masking and rollout infrastructure is task-agnostic, permitting direct experimentation with new question types or tool integrations.
  • The codebase is compatible with standard model hub (HuggingFace) workflows and can scale across GPUs via accelerate or deepspeed.

Best practices:

  • Always mask retrieved tokens in loss calculation for stability.
  • For competitive domains, tune the KL penalty coefficient $\beta$ carefully to balance exploration and drift control.
  • Monitor reward collapse; tune learning rates to prevent gradient instability.
  • Use multi-turn rollouts for tasks where dynamic search and reasoning are required.
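Two of these practices, masking retrieved tokens and tuning $\beta$, meet in the KL-shaped reward. The sketch below uses the simple per-token estimator $\log \pi_\theta - \log \pi_{\mathrm{ref}}$, which is one common choice and an assumption here, not necessarily what rewards.py implements:

```python
def kl_penalized_reward(em_reward, logp_policy, logp_ref, loss_mask, beta=0.001):
    """Sequence-level reward: terminal EM minus beta times a per-token KL
    estimate, with retrieved tokens masked out of the penalty as well.
    Per-token estimator: logp_policy - logp_ref."""
    kl = sum((lp - lr) * m
             for lp, lr, m in zip(logp_policy, logp_ref, loss_mask))
    return em_reward - beta * kl

# Correct answer (EM = 1); token 1 is generated, token 2 is retrieved.
r = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -9.0], [1, 0], beta=0.1)
```

With $\beta = 0$ the penalty vanishes and the reward reduces to raw EM, which is the exploration end of the trade-off the best practice warns about.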

The codebase, with documentation and detailed scripts, enables rapid research prototyping of retrieval-augmented, RL-trained LLM agents (Jin et al., 12 Mar 2025).
