SearchGym-RL: Curriculum RL for Search Agents
- SearchGym-RL is a curriculum-driven reinforcement learning framework that operates in a fully synthetic simulation environment built on a verifiable knowledge graph and aligned corpus.
- It employs a controlled, high-fidelity setup to deliver purified, factually grounded feedback, effectively mitigating corrupted reward signals found in live web API interactions.
- The methodology uses a multi-stage curriculum and GRPO optimization, enabling progressive skill acquisition and superior performance on complex open-domain and multi-hop QA benchmarks.
SearchGym-RL is a curriculum-driven reinforcement learning (RL) methodology designed to train robust search agents within the SearchGym simulation environment. This approach addresses the prohibitive costs and instability associated with live web API interactions and static data snapshots by constructing a high-fidelity, fully controllable synthetic world consisting of a verifiable knowledge graph and an aligned corpus. SearchGym-RL leverages this artificial environment to deliver purified, factually grounded feedback—eliminating common sources of corrupted reward signals—and to support progressive skill acquisition via curriculum learning. Empirical results demonstrate that agents trained with SearchGym-RL can surpass web-enhanced baselines on diverse open-domain and multi-hop question answering (QA) benchmarks, validating high-fidelity simulation as a cost-effective, scalable paradigm for search agent development (Zhang et al., 21 Jan 2026).
1. Simulation Environment and Data Generation
The SearchGym simulation environment models a fully synthetic world consisting of a knowledge graph programmatically synthesized over a predefined schema. Entities are instantiated by sampling attribute cardinalities (1–1, 1–n, n–1) and linked into a realistic, acyclic topology. Each vertex is associated with a document generated by an LLM conditioned on the entity's core facts and its graph neighborhood. For edge validation, a set of 15 candidate queries is generated per edge, and the edge is retained only if at least five of them return the target document among the top-$k$ results of a retrieval engine; only such verified subgraphs are used for QA synthesis. Paths of bounded length (under acyclicity and uniqueness constraints) are sampled and classified as Simple, Parallel, or Combo, with an LLM verbalizing each path into question–answer pairs. The resulting corpus comprises Wikipedia-style pages and strictly solvable QA tasks covering 1–12 hops.
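The edge-validation filter described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `validate_edge`, `retrieve_top_k`, and the toy retriever are hypothetical stand-ins for the paper's retrieval engine and query-generation components.

```python
# Sketch of the edge-validation filter: an edge survives only if enough of
# its candidate queries retrieve the target document among the top-k results.
# `retrieve_top_k` is a hypothetical stand-in for the retrieval engine.

def validate_edge(candidate_queries, target_doc_id, retrieve_top_k,
                  k=10, min_hits=5):
    """Return True if at least `min_hits` queries surface the target doc."""
    hits = sum(
        1 for q in candidate_queries
        if target_doc_id in retrieve_top_k(q, k=k)
    )
    return hits >= min_hits

# Toy usage: a fake retriever that always returns the same top-k list.
fake_retriever = lambda q, k=10: ["doc_a", "doc_b", "doc_target"]
queries = [f"query {i}" for i in range(15)]  # 15 candidates, as in the paper
print(validate_edge(queries, "doc_target", fake_retriever))  # True
```

The 15-query / 5-hit threshold follows the text; the value of $k$ is not stated and is left as a parameter.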
2. Formalization of the Reinforcement Learning Problem
Within the simulated environment, a search agent occupies a state $s_t$ encoding the dialogue and interaction history (queries, retrieved snippets, documents). At each step, the agent selects an action $a_t$ (for example, issuing a search query or opening a retrieved document), causing a deterministic transition $s_{t+1} = \mathcal{T}(s_t, a_t)$. After a maximum of $T$ turns, the agent outputs an answer and receives a terminal reward $r$ given by the F1 overlap between the predicted and gold answers. The policy $\pi_\theta$ is optimized to maximize the expected discounted return

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \gamma^{t} r_t\Big].$$

Training utilizes Group Relative Policy Optimization (GRPO): for each query, $G$ trajectories are rolled out, and the policy is updated with a clipped surrogate objective and an optional KL penalty against a reference policy $\pi_{\text{ref}}$,

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\big(\rho_i A_i,\ \operatorname{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,A_i\big)\right] - \beta\,\mathbb{D}_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big],$$

where $\rho_i = \pi_\theta(\tau_i)/\pi_{\theta_{\text{old}}}(\tau_i)$ is the importance ratio and $A_i = \big(r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})\big)/\operatorname{std}(\{r_j\}_{j=1}^{G})$ is the group-normalized advantage.
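A minimal sketch of the terminal reward and the GRPO group statistics, under the assumption (consistent with the curriculum description, which computes advantages from terminal F1) that the reward is the standard token-level F1 between predicted and gold answers:

```python
import math

def f1_reward(prediction, gold):
    """Token-level F1 between predicted and gold answers (terminal reward)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common, remaining = 0, list(gold_tokens)
    for t in pred_tokens:
        if t in remaining:
            remaining.remove(t)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def grpo_advantages(rewards):
    """Group-normalized advantages: (r_i - mean) / std over G rollouts."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var) + 1e-8  # epsilon guards a zero-variance group
    return [(r - mean) / std for r in rewards]

# Toy group of G = 3 rollouts for the same query.
rewards = [f1_reward(p, "emmanuel macron")
           for p in ["emmanuel macron", "macron", "paris"]]
print([round(r, 2) for r in rewards])  # [1.0, 0.67, 0.0]
print([round(a, 2) for a in grpo_advantages(rewards)])
```

By construction, the advantages of a group sum to (approximately) zero, so rollouts are rewarded only relative to their siblings for the same query.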
3. Curriculum-Based Training Methodology
To mitigate sparse rewards and accelerate learning on long-horizon tasks, SearchGym-RL introduces a multi-stage curriculum:
- Stage 1: Training on Simple (linear) QA tasks (≤6 hops) establishes foundational capabilities.
- Stage 2: Progression to Parallel and Combo tasks (>6 hops) develops advanced decomposition and synthesis strategies.
At each stage, the agent rolls out $G$ trajectories per query, computes group-normalized advantages from the terminal F1 reward, and updates the policy via the GRPO surrogate loss. This curriculum, undergirded by purified, noise-free feedback, avoids the destabilization associated with corrupted negatives and supports stable, monotonic policy improvement.
Curriculum Breakdown
| Stage | Task Type(s) | Path Length |
|---|---|---|
| 1 | Simple (linear) | ≤6 hops |
| 2 | Parallel, Combo (decomposition) | >6 hops |
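The two-stage schedule can be expressed as a simple task filter. The stage boundaries follow the table above; the task-record format is hypothetical, since the paper's corpus schema is not specified here.

```python
# Sketch of the curriculum filter: Stage 1 trains only on short linear
# paths; Stage 2 unlocks longer Parallel/Combo tasks.

def stage_tasks(tasks, stage):
    """Select the QA tasks visible to the agent at a given curriculum stage."""
    if stage == 1:
        return [t for t in tasks
                if t["type"] == "Simple" and t["hops"] <= 6]
    return [t for t in tasks
            if t["type"] in ("Parallel", "Combo") and t["hops"] > 6]

corpus = [
    {"type": "Simple", "hops": 3},
    {"type": "Simple", "hops": 8},   # excluded from both stages in this sketch
    {"type": "Parallel", "hops": 9},
    {"type": "Combo", "hops": 12},
]
print(len(stage_tasks(corpus, 1)))  # 1
print(len(stage_tasks(corpus, 2)))  # 2
```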
4. Training Regimen, Hyperparameters, and Cost
SearchGym-RL is implemented atop the AReal asynchronous RL framework and executed for five epochs on servers equipped with 8×NVIDIA H800 GPUs. Architectures evaluated include Llama-3.2-3B-Instruct, Qwen-2.5 (1.5B/7B, Base/Instruct), and Qwen-3 (4B/8B). Key hyperparameters:
- Learning rate: (AdamW, weight decay 0.01)
- Global batch size: 128 (micro-batch 16)
- Rollouts per query:
- GRPO clip:
- Temperature: 1.0
- Sequence length: 1024 tokens
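The stated hyperparameters collected into one configuration sketch. Values elided in the source text (learning rate, rollouts per query, clip ratio) are deliberately left as `None` rather than guessed.

```python
# Hyperparameters as reported in the text; None marks values not
# recoverable from this summary.
config = {
    "optimizer": "AdamW",
    "weight_decay": 0.01,
    "learning_rate": None,       # elided in the source
    "global_batch_size": 128,
    "micro_batch_size": 16,
    "rollouts_per_query": None,  # elided in the source
    "grpo_clip_epsilon": None,   # elided in the source
    "temperature": 1.0,
    "max_sequence_length": 1024,
    "epochs": 5,
}

# Implied gradient-accumulation factor from the batch settings.
grad_accum_steps = config["global_batch_size"] // config["micro_batch_size"]
print(grad_accum_steps)  # 8
```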
Rollouts utilize a synthetic Meilisearch index with millisecond-scale latency, incurring negligible commercial API costs. Total corpus generation cost is reported at \$50–\$500 for approximately 720,000 LLM calls.
5. Sim-to-Real Generalization: Benchmarks and Results
SearchGym-trained agents are evaluated across three retrieval settings—local synthetic, Wikipedia-2018, and live Web—on ten QA benchmarks: NQ, TriviaQA, PopQA (single-hop); HotpotQA, 2WikiMultiHop, Musique, Bamboogle (multi-hop); GAIA, xbench-DeepSearch (deep research); and SearchGymBench (held-out). A Qwen-2.5-72B Instruct LLM adjudicates Pass@1 for standard QA and Pass@4 for complex cases.
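The Pass@1/Pass@4 aggregation used for evaluation can be sketched as below. The per-attempt boolean verdicts stand in for the Qwen-2.5-72B adjudicator's judgments, whose exact prompt and interface are not specified here.

```python
def pass_at_k(attempt_verdicts, k):
    """A task passes at k if any of its first k judged attempts is correct."""
    return any(attempt_verdicts[:k])

def benchmark_pass_rate(all_verdicts, k):
    """Fraction of tasks solved within k attempts, as a percentage."""
    solved = sum(pass_at_k(v, k) for v in all_verdicts)
    return 100.0 * solved / len(all_verdicts)

# Toy verdicts for 4 tasks, 4 attempts each (True = judged correct).
verdicts = [
    [True, False, False, False],
    [False, False, True, False],
    [False, False, False, False],
    [False, True, True, True],
]
print(benchmark_pass_rate(verdicts, 1))  # 25.0
print(benchmark_pass_rate(verdicts, 4))  # 75.0
```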
Key results (Qwen-2.5-7B-Base):
- Standard and multi-hop QA: 58.56 vs 56.61 for ASearcher-web (+1.95 absolute)
- GAIA/xbench-DeepSearch: Pass@4 of 42.72/49.00 vs 38.83/32.00 (+3.9/+17.0 absolute)
- Overall relative margin: +10.6% over all static-snapshot and simulation baselines
- Search efficiency: 37.3% fewer search actions in live web rollouts
These outcomes demonstrate robust sim-to-real transfer and superior sample efficiency compared to state-of-the-art baselines, with particular gains in deep research tasks.
6. Insights, Limitations, and Future Directions
SearchGym’s high-fidelity, closed-loop environment ensures time-stable, unambiguous, and retrievable QA tasks, thus eliminating the corrupted reward signals prevalent in static corpora. This setting enables stable and monotonic policy learning, a marked contrast to standard snapshot-based approaches that often exhibit reward volatility or collapse (e.g., Search-R1 after ~160 steps). The curriculum decomposes complex behaviors into tractable, sequentially-acquired skills, enabling efficient bootstrapping from basic search to long-range compositional reasoning.
Nonetheless, the strictly synthetic world lacks certain real-world dynamics, such as temporal data drift and retrieval noise. To partially address this, a minimal Wikipedia alignment phase is introduced to reduce domain discrepancy. Anticipated future work includes dynamic environment updates, adversarial retrieval perturbations, and expanded tool support (e.g., tables, code execution) to further bridge the simulation-to-real gap. A plausible implication is that such extensions could foster even more generalizable and capable search agents.
In sum, SearchGym-RL establishes that cost-free, high-fidelity simulation, when paired with refined curriculum design and robust learning protocols, can rival or surpass direct Web-API training for real-world search agent development (Zhang et al., 21 Jan 2026).