
Reinforcement Learning Foundations for Deep Research Systems: A Survey

Published 8 Sep 2025 in cs.AI, cs.CL, and cs.IR | (2509.06733v1)

Abstract: Deep research systems, agentic AI that solve complex, multi-step tasks by coordinating reasoning, search across the open web and user files, and tool use, are moving toward hierarchical deployments with a Planner, Coordinator, and Executors. In practice, training entire stacks end-to-end remains impractical, so most work trains a single planner connected to core tools such as search, browsing, and code. While SFT imparts protocol fidelity, it suffers from imitation and exposure biases and underuses environment feedback. Preference alignment methods such as DPO are schema and proxy-dependent, off-policy, and weak for long-horizon credit assignment and multi-objective trade-offs. A further limitation of SFT and DPO is their reliance on human defined decision points and subskills through schema design and labeled comparisons. Reinforcement learning aligns with closed-loop, tool-interaction research by optimizing trajectory-level policies, enabling exploration, recovery behaviors, and principled credit assignment, and it reduces dependence on such human priors and rater biases. This survey is, to our knowledge, the first dedicated to the RL foundations of deep research systems. It systematizes work after DeepSeek-R1 along three axes: (i) data synthesis and curation; (ii) RL methods for agentic research covering stability, sample efficiency, long context handling, reward and credit design, multi-objective optimization, and multimodal integration; and (iii) agentic RL training systems and frameworks. We also cover agent architecture and coordination, as well as evaluation and benchmarks, including recent QA, VQA, long-form synthesis, and domain-grounded, tool-interaction tasks. We distill recurring patterns, surface infrastructure bottlenecks, and offer practical guidance for training robust, transparent deep research agents with RL.

Summary

  • The paper identifies reinforcement learning as a strategic improvement over supervised fine-tuning in deep research systems.
  • It outlines innovative methodologies in data synthesis, reward design, and hierarchical agent coordination for managing multi-step tasks.
  • The survey offers practical insights on curriculum design and multimodal integration, underpinning scalable and efficient RL training frameworks.

Reinforcement Learning Foundations for Deep Research Systems: A Survey

The study titled "Reinforcement Learning Foundations for Deep Research Systems: A Survey" provides a comprehensive examination of reinforcement learning (RL) methodologies for training deep research systems within agentic AI frameworks. It systematically categorizes existing work along the axes of data synthesis, RL methods for agentic research, and training frameworks, while also tackling hierarchical agent coordination and evaluation benchmarks. This essay elucidates the core components, methodologies, and future implications discussed in the paper (2509.06733).

Introduction to Deep Research Systems

Deep research systems are envisioned as autonomous AI entities capable of executing complex, multi-step inquiries across digital information landscapes (Figure 1). These systems are architecturally framed with a hierarchical structure featuring a Planner, Coordinator, and a suite of Executors to manage strategic reasoning, task decomposition, and actionable follow-through.

Figure 1: Illustration of the hierarchical deep research system architecture.

Supervised Fine-Tuning (SFT) is employed to lay the foundation for these systems, but its limitations—imitation and exposure biases, along with underutilization of dynamic environment feedback—highlight the potential of RL methodologies. Reinforcement learning, by optimizing trajectory-level policies, offers a strategic advantage in complex task domains: it minimizes dependence on human priors and improves resilience through exploration and principled credit assignment.

Figure 2: Illustration of QA Task Complexity Levels.

The authors categorize the literature into three main axes: (i) data synthesis and curation; (ii) RL methods for agentic research, including stability, reward design, and multimodal integration; and (iii) RL training systems. These axes are analyzed to present a cohesive view of the current landscape and to extract practical insights for advancing the field.

Data Synthesis and Curation

The success of deep research systems is intricately tied to the quality of data used for training. Synthetic data generation, consequently, plays a pivotal role. Current research segments this domain into three primary strategies: cross-document composition, structure-driven path growth, and difficulty staging by transformation/rollouts. Each approach targets eliciting and refining model capabilities for complex, multi-step reasoning tasks (Table 1).
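One of these strategies, cross-document composition, can be illustrated with a minimal sketch: two documents are chained through a shared "bridge" entity so the resulting question cannot be answered from either source alone. The record layout, field names, and question template below are hypothetical illustrations, not the pipeline of any surveyed system.

```python
# Toy cross-document QA composition: chain a bridge entity shared by two
# documents into one multi-hop question. All field names are illustrative.

def compose_multihop(doc_a, doc_b, bridge_entity):
    """Build a two-hop question whose answer requires both documents."""
    if bridge_entity not in doc_a["facts"] or bridge_entity not in doc_b["facts"]:
        return None  # no shared entity to chain through
    hop1 = doc_a["facts"][bridge_entity]   # fact stated only in doc_a
    hop2 = doc_b["facts"][bridge_entity]   # fact stated only in doc_b
    question = f"What is known about the person who {hop1}?"
    return {"question": question,
            "evidence": [doc_a["id"], doc_b["id"]],
            "answer": hop2}

doc_a = {"id": "d1", "facts": {"J. Smith": "founded Acme Corp"}}
doc_b = {"id": "d2", "facts": {"J. Smith": "was born in 1956"}}
task = compose_multihop(doc_a, doc_b, "J. Smith")
```

Because the answer (`hop2`) lives in a different document than the clue (`hop1`), a single retrieval step cannot resolve the question, which is exactly the property cross-document composition targets.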

The paper distinguishes RL training data from SFT/DPO data by purpose: RL data prioritize end-to-end improvement driven by closed-loop, verifiable environment signals, rather than imitation (SFT) or relative preference alignment (DPO). RL data are designed to reward the system for trajectory-level performance, leveraging both outcome and step-level feedback. This reduces reliance on human priors and rater biases by permitting exploration and principled trade-offs over long horizons. The authors categorize QA tasks into four complexity levels (Figure 2) to guide dataset construction and curriculum design.
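One way to operationalize those complexity levels for curriculum scheduling is a simple probe-based labeler. The boolean probe flags below (whether a closed-book or one-search rollout succeeds, whether a clear multi-hop path exists, whether other modalities are needed) are an assumed operationalization, not a procedure prescribed by the survey.

```python
# Assumed probe-based labeler mapping rollout outcomes onto the survey's
# four QA complexity levels; the probe flags themselves are hypothetical.

def qa_complexity_level(memory_or_one_search, clear_multi_hop_path, multimodal):
    """Return the QA complexity level (1-4) implied by probe outcomes."""
    if memory_or_one_search:
        return 1  # Level 1: answerable from memory or a single search
    if clear_multi_hop_path:
        return 2  # Level 2: multi-hop, but with a clear path
    return 4 if multimodal else 3  # Level 3: open exploration; Level 4: + modalities

levels = [qa_complexity_level(*flags) for flags in
          [(True, False, False), (False, True, False),
           (False, False, False), (False, False, True)]]  # → [1, 2, 3, 4]
```

Sorting a task pool by this label yields the easy-to-hard curriculum the survey repeatedly finds helpful.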

RL Methods for Agentic Research

Deep research systems operate in multi-step, tool-rich environments and therefore require advanced RL training pipelines (example works are tabulated in the paper). Building on the established DeepSeek-R1-style pipeline, recent innovations enhance stability, efficiency, and scalability. Critical themes include cold-start strategies, curriculum design, cost and latency control in training, token-level policy optimization (PPO/GRPO with token masking and KL anchors), guided exploration, and verifiable, outcome-first rewards that ensure stable optimization without tool avoidance or reward hacking, as illustrated in Figure 1.

Training Regimes

Fundamental to long-horizon learning is the training regime itself. The standard recipe combines a cold-started policy (optional SFT/RSFT); templated rollouts with explicit tool tags and budgets; outcome and format rewards; and PPO/GRPO with KL penalties, which together provide anchoring stability. Beyond this baseline, research introduces improvements in curriculum learning and search necessity, optimizing sample efficiency, exploration, and multi-objective trade-offs through warm starts and dynamic, task-specific curricula.
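The two anchoring ingredients of that recipe, group-relative advantages (GRPO-style) and a KL penalty toward a reference policy, can be sketched in a few lines. The mean/std normalization and the simplified per-token surrogate below are illustrative assumptions, not the exact objective of any specific surveyed system.

```python
# Minimal sketch of a GRPO-style signal: normalize each rollout's reward
# against its sampling group, then penalize drift from a reference policy.
import math

def group_relative_advantages(rewards):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

def kl_anchored_objective(logp_new, logp_ref, advantage, beta=0.04):
    """Simplified per-token surrogate: policy-gradient term minus a KL
    penalty (estimated from the log-ratio) anchoring to the reference."""
    kl = logp_new - logp_ref
    return advantage * logp_new - beta * kl
```

Because the advantage is computed relative to sibling rollouts, GRPO needs no learned value model, which is one reason it is popular in these long-horizon pipelines.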

Reward Design

Recent research illuminates methodologies for both outcome-level and step-level credit (tabulated in the paper). While verifiable outcome rewards anchor instruction alignment, novel signals (gain-beyond-RAG, group-relative efficiency, knowledge-boundary checks) and fine-grained, step-level process rewards (tool execution, evidence utility) effectively bias search and reasoning. These strategies enhance performance on multi-step tasks, though the choice of reward ultimately affects stability and policy effectiveness. Open questions remain on composing and scheduling multiple objectives without inducing reward hacking, and on learning budget-aware, risk-sensitive policies.
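A minimal composite reward in this spirit combines a verifiable outcome check with small format and step-level bonuses. The specific weights (0.1, 0.2) and the "fraction of useful searches" process signal are illustrative assumptions; surveyed systems differ widely in their exact mixes.

```python
# Sketch of a composite reward: verifiable outcome term, small format
# bonus, and a step-level process bonus. Weights are illustrative only.

def composite_reward(pred, gold, well_formatted, useful_searches, total_searches):
    """Score one trajectory from its final answer plus auxiliary signals."""
    outcome = 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0
    fmt = 0.1 if well_formatted else 0.0
    # step-level signal: fraction of searches that contributed evidence
    process = 0.2 * (useful_searches / total_searches) if total_searches else 0.0
    return outcome + fmt + process
```

Keeping the outcome term dominant, as here, reflects the survey's observation that verifiable final-answer rewards are the most reliable anchor, with process terms kept small to limit reward hacking.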

Multimodal Integration

Deep research systems extend to multimodal settings, necessitating solutions for tasks involving diverse data types (tabulated in the paper). The survey delineates evolving models that integrate vision-language models (VLMs) to unify perception and reasoning in token space, emphasizing action-initiated perception strategies (crop/zoom, edit-reason cycles) for high-entropy tasks. These agents demand observation engineering to foster verifiable evidence utilization and to discern modality preferences, offering vital paths toward efficient reasoning over complex, heterogeneous inputs.
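The crop/zoom idea can be shown with a toy perception action: the agent selects a region of its observation to inspect more closely before reasoning again. Real systems operate on pixel tensors through a VLM; the 2D grid here is a deliberately simplified stand-in.

```python
# Toy action-initiated perception step: the agent "zooms" into a chosen
# region of its observation (a 2D grid standing in for an image).

def crop(image, top, left, height, width):
    """Return the sub-grid the agent chose to inspect."""
    return [row[left:left + width] for row in image[top:top + height]]

image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11]]
patch = crop(image, 1, 1, 2, 2)  # → [[5, 6], [9, 10]]
```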

Agent Architecture and Coordination

The hierarchical architecture of deep research systems emphasizes the separation between planning and execution, enabling strategic tool use, task delegation, and division of labor through the Coordinator and Executors. The survey highlights various system architectures and task-orchestration methodologies, showing varied choices in planner roles, tool structures, and human observability. These strategies inform scalable, reliable AI solutions for real-world challenges while considering adaptation to distinct task volumes.

Conclusion

This paper distills the RL foundations essential for training and deploying deep research systems. By addressing RL's scalability, data curation, reward design, and coordination intricacies, it maps a pathway for enhancing AI task proficiency in multi-step environments. Potential advancements lie in refined reward models, multimodal unification, and further optimization of long-horizon agent behavior, all essential for expanding AI's scope across domains with shared agent roles and principled decision-making in dynamic operational environments.


Explain it Like I'm 14

What is this paper about?

This paper is a survey (a big review) about how to train “deep research systems” — AI agents that can plan and carry out multi-step investigations on the internet and in files, use tools like search and code, and then write good answers with citations. The authors argue that reinforcement learning (RL) is the best way to train these agents end to end, and they organize recent research into a clear map so others can build better systems.

What questions does the paper try to answer?

In simple terms, the paper asks:

  • What are deep research agents, and how should they be built?
  • Why do common training methods like “copying from examples” (SFT) or “picking the better answer” (DPO) fall short?
  • How does reinforcement learning help an agent learn from trial and error across many steps and tools?
  • How should we create training tasks, design rewards, and set up systems so these agents learn well?
  • How do we evaluate whether these agents are actually good at real research?

How did the authors study it?

This is a survey, so the authors read many papers (mainly from 2025) and organized them into a few big themes. They also explain key ideas in easy-to-understand ways and point out patterns that keep showing up.

The pieces of a deep research agent

Think of a school project team:

  • Planner: the “brain” that breaks the big problem into smaller steps and decides what to do next.
  • Coordinator: the “team lead” who assigns tasks, collects results, and checks them.
  • Executors: the “specialists” who do specific jobs — search the web, browse pages, run code, read images, etc.

In real systems, training the whole team at once is too hard. So most work trains just the Planner to get very good at reasoning and using a few core tools. Later, this stronger Planner can plug into the full team.

Why SFT and DPO aren’t enough

  • SFT (Supervised Fine-Tuning) = “learn by copying examples.” Great for learning formats (like how to call a tool or cite sources), but weak for long tasks where early mistakes snowball.
  • DPO (preference learning) = “learn by choosing between two drafts.” Helpful for picking better text, but it doesn’t truly connect actions (like which thing to search) to final success, and it depends a lot on hand-made labels and step designs.

Both miss the feedback loop of real-world tools and web changes (pages can move, sites can block you, prices change).

Why reinforcement learning (RL) helps

RL = “learn by doing, get a score, adjust.” It:

  • Learns from the whole journey, not just the final sentence.
  • Figures out which earlier steps deserve credit or blame (credit assignment).
  • Encourages exploring different strategies and recovery behaviors (like trying a new query when one fails).
  • Lets you balance goals: accuracy, time, cost, safety.

How to get training data for RL

To train by trial and error, you need good tasks and clear success checks. The survey explains two key levers:

  • Construct: Create hard questions that require multiple steps, recent info, or multiple webpages (so the model can’t just guess or memorize). For example:
    • Cross-document questions that need combining info from several sources.
    • Browsing over link graphs (like clicking through Wikipedia) to find evidence.
    • Easy-to-hard rewrites that gradually add more steps to solve.
    • Obfuscation (hiding obvious clues) to prevent one-shot answers.
  • Curate: Filter and schedule those tasks so learning is effective:
    • Remove questions that are answerable from memory or one page.
    • Keep tasks that can be checked automatically (exact answer, a test, or a reliable judge model).
    • Use difficulty labels and curricula (start easier, ramp up).
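The "Curate" checklist above can be operationalized as a simple filter over a task pool. The `closed_book_answer` probe and the task record fields below are hypothetical stand-ins for whatever shortcut-detection tooling a real pipeline would use.

```python
# Sketch of the curate step: drop memory-answerable tasks, keep only
# automatically checkable ones, and schedule easy-to-hard. Field names
# and the closed-book probe are hypothetical.

def curate(tasks, closed_book_answer):
    """Filter a task pool and sort the survivors by difficulty."""
    kept = []
    for t in tasks:
        if closed_book_answer(t["question"]) == t["answer"]:
            continue  # answerable from memory: a shortcut, remove it
        if not t.get("checkable", False):
            continue  # no automatic success check: can't reward reliably
        kept.append(t)
    return sorted(kept, key=lambda t: t["difficulty"])  # easy → hard

tasks = [
    {"question": "q1", "answer": "a1", "checkable": True, "difficulty": 3},
    {"question": "q2", "answer": "memorized", "checkable": True, "difficulty": 1},
    {"question": "q3", "answer": "a3", "checkable": False, "difficulty": 2},
    {"question": "q4", "answer": "a4", "checkable": True, "difficulty": 1},
]
kept = curate(tasks, lambda q: "memorized" if q == "q2" else None)
# → q2 (memory shortcut) and q3 (uncheckable) removed; q4 before q1
```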

The authors also propose four “levels” of task difficulty:

  • Level 1: Simple, answerable from memory or one search.
  • Level 2: Multi-hop but with a clear path (classic multi-step QA).
  • Level 3: Messy, uncertain, requires broad exploration and cross-checking across texts.
  • Level 4: Like Level 3, but multimodal (text + images/audio/code) and multiple tools.

How RL training usually works

Typical training pipeline (in plain language):

  • Cold start (optional): Give the model a short “practice phase” to learn formats and tool-calling basics so early RL doesn’t crash.
  • Rollouts: For each question, the agent thinks, decides to search or use a tool, reads results, thinks again, and eventually answers — all in a structured script with tags like <think>, <search>, <information>, <answer>.

  • Rewards: Score the final answer (e.g., exact match), and sometimes also give small rewards for correct formatting. Some works add step-level signals (like “did this search actually help?”).
  • Update: Use a training recipe (like PPO or GRPO) to improve the policy from these scores. Many systems “ignore” tool-generated text when updating so the model learns only from its own tokens.

Common improvements the survey highlights:

  • Cold start helps stability and faster learning.
  • Curriculum learning (start easy, go harder) improves results.
  • Outcome rewards (final correctness) are most common; format rewards help structure; some add retrieval-usefulness rewards.
  • Multiple training recipes exist (PPO, GRPO, REINFORCE); each trades off stability vs. speed, but they all work with the same overall loop.
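The "ignore tool-generated text when updating" trick amounts to building a loss mask over the trajectory. The simplified (token, source) record below is an illustrative stand-in; real systems apply this mask to tensors inside the loss computation.

```python
# Sketch of tool-token masking: only tokens the policy itself emitted
# receive gradient; tool/observation tokens are zeroed out of the loss.

def loss_mask(trajectory):
    """Return 1 for policy-emitted tokens, 0 for tool-injected tokens."""
    return [1 if src == "policy" else 0 for _, src in trajectory]

trajectory = [("<search>", "policy"), ("capital of France", "policy"),
              ("</search>", "policy"), ("Paris is the capital...", "tool"),
              ("<answer>", "policy"), ("Paris", "policy")]
mask = loss_mask(trajectory)  # → [1, 1, 1, 0, 1, 1]
```

Without this mask, the model would be trained to imitate search-result text it never generated, which destabilizes the update.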

Systems, architectures, and evaluation

  • Training frameworks: Open-source toolkits are emerging to run many long, tool-based rollouts at scale. The survey lists these and notes bottlenecks (cost, logging, reproducibility).
  • Architectures: The Planner–Coordinator–Executors setup is common. Multi-agent and hierarchical designs make it easier to swap tools, parallelize, and audit decisions.
  • Benchmarks: A wide range of tests exist — from question answering, to browsing-heavy tasks, to long-form writing with citations, to domain-specific setups (finance, etc.), and multimodal benchmarks (text + images).

What did they find, in simple takeaways?

Here are the main lessons the authors distill:

  • Training the whole stack end-to-end is impractical today. Focus on training the Planner well with RL, then plug it into a larger system.
  • SFT and DPO are useful scaffolding (teach format and basic moves), but they don’t truly optimize multi-step, tool-using behavior in changing web environments.
  • RL matches the problem: it learns strategies across many steps, uses real feedback from tools, and supports exploration and recovery.
  • Good data is key: build tasks that force multi-step reasoning and recent info; curate them to remove shortcuts and schedule by difficulty.
  • A practical curriculum: start with Level 1–2 to learn formats and basic planning, move to Level 3 for exploration and cross-checking, then Level 4 for multimodal skills.
  • Common training patterns: cold start helps; curriculum helps; final-answer rewards plus small format checks are strong baselines; standard RL optimizers (PPO/GRPO) work reliably.
  • Infrastructure matters: long rollouts, tool calls, and logging require solid systems. Masking tool tokens during training and keeping consistent formats improve stability.
  • Evaluation is expanding: beyond short answers, tests now include browsing, long reports with citations, multimodal questions, and domain-grounded tasks.

Why is this important?

If we want AI that can do real research — not just write nice-sounding text — it must plan, search, read, verify, and synthesize across many steps and tools. This survey explains how RL can train those skills directly:

  • Better reliability: Agents learn recovery strategies when searches fail or sources disagree.
  • Less hand-holding: Fewer human-made step labels and rigid schemas.
  • Clearer trade-offs: Agents can learn to balance accuracy, time, and cost.
  • More transparency: Structured roles (Planner, Coordinator, Executors) make it easier to log, audit, and assign credit to steps.

The paper also points to promising future directions:

  • Active task generation: Let the agent help create and pick the next best training tasks based on what it’s weak at.
  • Better step-level judges: Cheap, reliable ways to score process quality (not just final answers).
  • Stronger multimodal and multi-objective training: Integrate images, tables, code, and clear goals (accuracy, safety, cost) over long horizons.
  • Scalable systems: Faster, cheaper infrastructure for many long, tool-rich training runs.

In short, this survey offers a roadmap for building robust, transparent “AI researchers” using reinforcement learning — moving from copying examples to truly learning how to investigate and decide.
