
SearchGym-RL: Curriculum RL for Search Agents

Updated 22 January 2026
  • SearchGym-RL is a curriculum-driven reinforcement learning framework that operates in a fully synthetic simulation environment built on a verifiable knowledge graph and aligned corpus.
  • It employs a controlled, high-fidelity setup to deliver purified, factually grounded feedback, effectively mitigating corrupted reward signals found in live web API interactions.
  • The methodology uses a multi-stage curriculum and GRPO optimization, enabling progressive skill acquisition and superior performance on complex open-domain and multi-hop QA benchmarks.

SearchGym-RL is a curriculum-driven reinforcement learning (RL) methodology designed to train robust search agents within the SearchGym simulation environment. This approach addresses the prohibitive costs and instability associated with live web API interactions and static data snapshots by constructing a high-fidelity, fully controllable synthetic world consisting of a verifiable knowledge graph and an aligned corpus. SearchGym-RL leverages this artificial environment to deliver purified, factually grounded feedback—eliminating common sources of corrupted reward signals—and to support progressive skill acquisition via curriculum learning. Empirical results demonstrate that agents trained with SearchGym-RL can surpass web-enhanced baselines on diverse open-domain and multi-hop question answering (QA) benchmarks, validating high-fidelity simulation as a cost-effective, scalable paradigm for search agent development (Zhang et al., 21 Jan 2026).

1. Simulation Environment and Data Generation

The SearchGym simulation environment models a synthetic world $\mathcal{W} = \langle \mathcal{G},\,\mathcal{D}\rangle$, where the knowledge graph $\mathcal{G} = (\mathcal{V},\mathcal{E})$ is programmatically synthesized over a schema $\mathcal{S}$. Entities are instantiated by sampling attribute cardinalities (1–1, 1–n, n–1) and linked into a realistic, acyclic topology. Each vertex $v \in \mathcal{V}$ is associated with a document $d_v$ generated by an LLM $M_{\mathrm{gen}}$ conditioned on core entity facts and the neighborhood $\mathcal{N}_v$. To validate an edge $(u,v)$, a set $\mathcal{Q}_e$ of 15 candidate queries is generated, and the edge is retained only if at least five queries yield $d_v$ among the top-$K$ results of a retrieval engine $\mathcal{R}$:

$$(u,v)\in\mathcal{E}^* \iff \Big|\big\{q\in\mathcal{Q}_e : d_v\in\mathrm{Top}\text{-}K(\mathcal{R}(q,\mathcal{D}))\big\}\Big| \ge 5.$$

Only such verified subgraphs $\mathcal{G}^*$ are used for QA synthesis. Paths of length $k$ (under acyclicity and uniqueness constraints) are sampled and classified as Simple, Parallel, or Combo, and $M_{\mathrm{gen}}$ verbalizes each path into $(Q,A)$ pairs. The resulting corpus comprises $\sim$3,600 Wikipedia-style pages and $\sim$41,000 strictly solvable QA tasks covering 1–12 hops.
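The edge-validation predicate above can be sketched as follows. This is a minimal illustration under stated assumptions: `retrieve` is a toy lexical retriever standing in for the paper's actual engine $\mathcal{R}$, and the data structures are hypothetical, not SearchGym's implementation.

```python
# Edge-validation sketch: keep edge (u, v) only if at least MIN_HITS of the
# candidate queries retrieve v's document among the top-K results.
# `retrieve` is a toy stand-in for the real retrieval engine.

MIN_HITS = 5   # threshold from the validation predicate
TOP_K = 10     # illustrative top-K cutoff

def retrieve(query, corpus, k):
    """Toy lexical retriever: rank documents by word overlap with the query."""
    scored = sorted(
        corpus.items(),
        key=lambda item: -len(set(query.split()) & set(item[1].split())),
    )
    return [doc_id for doc_id, _ in scored[:k]]

def edge_is_valid(candidate_queries, target_doc_id, corpus):
    """Return True if at least MIN_HITS queries surface the target document."""
    hits = sum(
        target_doc_id in retrieve(q, corpus, TOP_K)
        for q in candidate_queries
    )
    return hits >= MIN_HITS
```

In the paper's setup, 15 candidate queries are generated per edge, so an edge survives when at least a third of them succeed.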

2. Formalization of the Reinforcement Learning Problem

Within $\mathcal{W}$, a search agent occupies a state $s_t$ encoding the dialogue and interaction history (queries, retrieved snippets, documents). At each step, the agent selects an action $a_t \in \mathcal{A}$, where

$$\mathcal{A} = \{\mathrm{Search}(q),\ \mathrm{Access}(u),\ \mathrm{Answer}(y)\},$$

causing a deterministic transition $s_{t+1} = T(s_t, a_t)$. After at most $T$ turns, the agent outputs an answer and receives a terminal reward

$$R(\tau) = \mathrm{F1}\big(\hat{A}(\tau),\ A_{\mathrm{gt}}\big),$$

with $r_t = 0$ for $t < T$ and $r_T = R(\tau)$. The policy $\pi_\theta(a_t \mid s_t)$ is optimized to maximize the discounted return

$$J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^{T} \gamma^t r_t\Big].$$

Training uses Group Relative Policy Optimization (GRPO), a clipped surrogate objective with an optional KL penalty against a reference policy:

$$\mathcal{L}(\theta) = \mathbb{E}_i\Big[\min\big(\rho_i(\theta)\,\hat{A}_i,\ \mathrm{clip}(\rho_i(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\Big],$$

where $\rho_i(\theta) = \pi_\theta(\tau_i)\,/\,\pi_{\theta_{\mathrm{old}}}(\tau_i)$.

3. Curriculum-Based Training Methodology

To mitigate sparse rewards and accelerate learning on long-horizon tasks, SearchGym-RL introduces a multi-stage curriculum:

  • Stage 1: Training on Simple (linear) QA tasks (≤6 hops) establishes foundational capabilities.
  • Stage 2: Progression to Parallel and Combo tasks (>6 hops) develops advanced decomposition and synthesis strategies.

At each stage, the agent rolls out NN trajectories per query, computes normalized advantages from terminal F1, and updates the policy via the GRPO surrogate loss. This curriculum, undergirded by purified and noise-free feedback, avoids destabilization associated with corrupted negatives and supports stable, monotonic policy improvement.

Curriculum Breakdown

| Stage | Task Type(s) | Path Length | Corpus Subset |
|-------|--------------|-------------|---------------|
| 1 | Simple (linear) | ≤6 hops | $\mathcal{D}_\mathrm{simple}$ |
| 2 | Parallel, Combo (decomposition) | >6 hops | $\mathcal{D}_\mathrm{parallel}$, $\mathcal{D}_\mathrm{combo}$ |
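The stage breakdown above amounts to a filter over the task corpus. A minimal sketch, assuming hypothetical task records with `type` and `hops` fields (not SearchGym's actual data format):

```python
# Curriculum-stage sampling sketch: Stage 1 draws only Simple tasks of at most
# 6 hops; Stage 2 draws Parallel/Combo tasks of more than 6 hops.
# Task records are illustrative stand-ins.

def stage_pool(tasks, stage):
    """Filter the task corpus down to the subset used at the given stage."""
    if stage == 1:
        return [t for t in tasks
                if t["type"] == "simple" and t["hops"] <= 6]
    if stage == 2:
        return [t for t in tasks
                if t["type"] in ("parallel", "combo") and t["hops"] > 6]
    raise ValueError(f"unknown curriculum stage: {stage}")
```

Training then proceeds on `stage_pool(tasks, 1)` until Stage 1 converges before switching to `stage_pool(tasks, 2)`.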

4. Training Regimen, Hyperparameters, and Cost

SearchGym-RL is implemented atop the AReal asynchronous RL framework and executed for five epochs on servers equipped with 8×NVIDIA H800 GPUs. Architectures evaluated include Llama-3.2-3B-Instruct, Qwen-2.5 (1.5B/7B, Base/Instruct), and Qwen-3 (4B/8B). Key hyperparameters:

  • Learning rate: 5×1065\times10^{-6} (AdamW, weight decay 0.01)
  • Global batch size: 128 (micro-batch 16)
  • Rollouts per query: N=8N=8
  • GRPO clip: ϵ=0.4\epsilon=0.4
  • Temperature: 1.0
  • Sequence length: 1024 tokens
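For reference, the hyperparameters listed above can be collected into a single configuration block. The key names are a convenience sketch, not AReal's actual configuration API.

```python
# Hyperparameters from the text gathered into one config dict
# (key names are illustrative, not AReal's API).
TRAIN_CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "global_batch_size": 128,
    "micro_batch_size": 16,
    "rollouts_per_query": 8,    # N in the GRPO group
    "grpo_clip_epsilon": 0.4,
    "sampling_temperature": 1.0,
    "max_sequence_length": 1024,
    "epochs": 5,
}

# Gradient-accumulation steps implied by the global vs. micro batch sizes:
ACCUM_STEPS = (TRAIN_CONFIG["global_batch_size"]
               // TRAIN_CONFIG["micro_batch_size"])
```

With a global batch of 128 and micro-batches of 16, each optimizer step accumulates gradients over 8 micro-batches.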

Rollouts utilize a synthetic Meilisearch index with <50 ms latency, yielding negligible commercial API costs. Total corpus generation cost is reported at $50 for 3,600 pages and 41,000 QA pairs, contrasting favorably with the ~$500 cost per PPO epoch when using live web APIs requiring approximately 720,000 calls.

5. Sim-to-Real Generalization: Benchmarks and Results

SearchGym-trained agents are evaluated across three retrieval settings—local synthetic, Wikipedia-2018, and live Web—on ten QA benchmarks: NQ, TriviaQA, PopQA (single-hop); HotpotQA, 2WikiMultiHop, Musique, Bamboogle (multi-hop); GAIA, xbench-DeepSearch (deep research); and SearchGymBench (held-out). A Qwen-2.5-72B Instruct LLM adjudicates Pass@1 for standard QA and Pass@4 for complex cases.

Key results (Qwen-2.5-7B-Base):

  • Standard and multi-hop QA: 58.56 vs 56.61 for ASearcher-web (+1.95 absolute)
  • GAIA/xbench-DeepSearch: Pass@4 of 42.72/49.00 vs 38.83/32.00 (+3.9/+17.0 absolute)
  • Overall relative margin: +10.6% over all static-snapshot and simulation baselines
  • Search efficiency: 37.3% fewer search actions in live web rollouts

These outcomes demonstrate robust sim-to-real transfer and superior sample efficiency compared to state-of-the-art baselines, with particular gains in deep research tasks.

6. Insights, Limitations, and Future Directions

SearchGym’s high-fidelity, closed-loop environment ensures time-stable, unambiguous, and retrievable QA tasks, thus eliminating the corrupted reward signals prevalent in static corpora. This setting enables stable and monotonic policy learning, a marked contrast to standard snapshot-based approaches that often exhibit reward volatility or collapse (e.g., Search-R1 after ~160 steps). The curriculum decomposes complex behaviors into tractable, sequentially-acquired skills, enabling efficient bootstrapping from basic search to long-range compositional reasoning.

Nonetheless, the strictly synthetic world lacks certain real-world dynamics, such as temporal data drift and retrieval noise. To partially address this, a minimal Wikipedia alignment phase is introduced to reduce domain discrepancy. Anticipated future work includes dynamic environment updates, adversarial retrieval perturbations, and expanded tool support (e.g., tables, code execution) to further bridge the simulation-to-real gap. A plausible implication is that such extensions could foster even more generalizable and capable search agents.

In sum, SearchGym-RL establishes that cost-free, high-fidelity simulation, when paired with refined curriculum design and robust learning protocols, can rival or surpass direct Web-API training for real-world search agent development (Zhang et al., 21 Jan 2026).
