
WideSeek-R1: Multi-Agent MARL Framework

Updated 5 February 2026
  • WideSeek-R1 is a multi-agent reinforcement learning system that decomposes complex, broad queries into independent subtasks via a lead and subagent hierarchy.
  • The framework uses a hierarchical architecture to ensure context isolation and efficient parallel processing, achieving near ultra-large model performance with fewer resources.
  • It leverages end-to-end MARL with PPO to optimize task decomposition, aggregation, and structured output generation for complex table-based queries.

WideSeek-R1 is a multi-agent reinforcement learning (MARL) framework targeting broad information-seeking tasks where traditional single-agent "depth scaling" LLM paradigms run into organizational and context limits. The system introduces a hierarchical architecture that leverages width scaling: a lead agent decomposes complex, table-generative queries into parallel subtasks handled by subagents initialized from a shared LLM, enabling effective parallelism, context isolation, and scalable orchestration. WideSeek-R1, when trained on a curated dataset of diverse, broad information-seeking problems, approaches the performance of single-agent, ultra-large models with substantially fewer compute and parameter resources, thus redefining the frontier for information-centric multi-agent systems (Xu et al., 4 Feb 2026).

1. Motivation: Organizational Bottlenecks in Broad Information Seeking

Conventional LLM-based agents perform well on depth-oriented reasoning—the “deep research” paradigm—by extending chain-of-thought lengths or increasing context window sizes. However, for broad tasks that require extracting structured information about large numbers of entities simultaneously (e.g., generating a table summarizing attributes for 30–50 countries), two principal pathologies emerge:

  • Context pollution: A monolithic agent tasked with assimilating hundreds of facts across unrelated entities rapidly exceeds its context capacity, degrading focus and accuracy.
  • Sequential inefficiency: Single-agent protocols process each row or column in series, wasting wall-clock time and context capacity and failing to capitalize on available parallelism.

WideSeek-R1 addresses these issues with width scaling. The lead agent decomposes a broad input query $q$ (with $R$ entities and $C$ attributes) into $N$ independent subqueries, each handled by a subagent. Subagents execute information-seeking actions in isolated contexts using external tools, after which the lead agent aggregates their findings into a coherent, structured output.

This paradigm enables the system to efficiently produce complex tables of the form

$$\{\, x_{i,1}, \dots, x_{i,C} \,\}_{i=1}^{R},$$

where workload is distributed across subagents, preserving context clarity and maximizing wall-clock parallelism (Xu et al., 4 Feb 2026).
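As a minimal sketch of this decomposition, the following partitions $R$ entities into contiguous batches, one per subagent, each batch paired with the full attribute list (the names `split_query` and `Subtask` are illustrative, not from the paper):

```python
# Hypothetical sketch: splitting a broad table query (R entities x C
# attributes) into independent sub-queries, one batch of entities per
# subagent. Names here are illustrative, not the paper's actual API.
from dataclasses import dataclass

@dataclass
class Subtask:
    entities: list      # rows assigned to this subagent
    attributes: list    # columns every subagent must fill

def split_query(entities, attributes, n_subagents):
    """Partition R entities into at most n_subagents contiguous batches."""
    batch = -(-len(entities) // n_subagents)  # ceiling division
    return [
        Subtask(entities[i:i + batch], attributes)
        for i in range(0, len(entities), batch)
    ]

countries = [f"country_{i}" for i in range(30)]   # R = 30
cols = ["capital", "population", "gdp"]           # C = 3
tasks = split_query(countries, cols, n_subagents=10)
```

Each resulting `Subtask` can then be handed to a subagent with its own isolated context, so no subagent ever sees rows outside its batch.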

2. System Architecture: Lead-Subagent Hierarchy and Workflow

WideSeek-R1 instantiates both the lead agent and all subagents from a shared LLM backbone (e.g., Qwen3-4B), enforcing context isolation and granting access to specialized tool interfaces.

Workflow:

  1. Lead agent initialization: Receives a system prompt and the user’s broad query qq.
  2. Task decomposition (`create_sub_agents` tool): Lead agent identifies decomposition points and generates prompts for $k$ subagents.
  3. Parallel subagent rollout: Each subagent receives a distinct subtask prompt:
    `[SYSTEM: You are sub-agent; SUBTASK: prompt_a]`
    Subagents iteratively employ external tools (search, access) and maintain private chains of thought until subtask completion.
  4. Aggregation and reflection: The lead agent collects and integrates subagent outputs, periodically repeating decomposition and aggregation steps for up to $T$ turns.
  5. Finalization: The system emits a unified Markdown table output.

Critical features include strict context isolation (subagents cannot access each other’s internal state) and strict parallelism (all subagents spawn and execute concurrently within system-imposed limits, e.g., $M = 10$ per turn). The architecture eliminates context collision and enables highly scalable orchestration (Xu et al., 4 Feb 2026).
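The strict-parallelism constraint can be sketched as follows, with a stand-in `run_subagent` in place of the real tool-using rollout; at most $M$ subagents spawn per turn, each with a private context:

```python
# Illustrative sketch of the strict-parallelism constraint: at most M
# subagents spawn per turn, each with a private, unshared context.
# run_subagent is a placeholder for the real tool-using rollout loop.
from concurrent.futures import ThreadPoolExecutor

M = 10  # per-turn subagent cap quoted in the text

def run_subagent(prompt):
    # Each call builds its own local context; nothing is shared
    # between subagents (strict context isolation).
    context = [f"[SYSTEM: You are sub-agent; SUBTASK: {prompt}]"]
    context.append(f"result for {prompt}")  # placeholder for the tool loop
    return context[-1]

def rollout_turn(subtask_prompts):
    prompts = subtask_prompts[:M]  # enforce the per-turn limit
    with ThreadPoolExecutor(max_workers=M) as pool:
        return list(pool.map(run_subagent, prompts))

results = rollout_turn([f"subtask_{i}" for i in range(12)])
```

Excess subtasks beyond the cap would be deferred to a later turn of the lead agent's decompose-aggregate loop.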

3. Multi-Agent Reinforcement Learning Formulation

WideSeek-R1 employs an end-to-end multi-agent variant of Proximal Policy Optimization (PPO) for MARL across the lead agent and parallel subagents.

Components:

  • State space for agent aa at rollout ii, turn tt:

$$s_{i,a}^{t} = \left[\, p_a,\; q_a,\; o_{i,a}^{1},\; tc_{i,a}^{1},\; \dots,\; o_{i,a}^{t-1},\; tc_{i,a}^{t-1} \,\right]$$

where $p_a$ is the role prompt (lead or subagent), $q_a$ the assigned query, $o$ previous output tokens, and $tc$ tool-call results.

  • Action space: At each generation step, the policy $\pi_\theta$ selects either a language token or a structured tool call.
  • Reward function:

$$R_i = r_{\mathrm{ans}} + r_{\mathrm{format}} + r_{\mathrm{tool}} - r_{\mathrm{len}}$$

  • $r_{\mathrm{ans}}$: item-level F1 between output and ground truth.
  • $r_{\mathrm{format}}$: +0.1 for a valid Markdown table.
  • $r_{\mathrm{tool}}$: +0.05 for efficient tool usage.
  • $r_{\mathrm{len}}$: penalty for overlong outputs.
  • Unified MARL loss: All tokens of all agents in rollout $i$ share a group-normalized advantage:

$$\hat{A}_i = \frac{R_i - \mu}{\sigma}$$

PPO’s clipped loss is computed over all agent-token pairs, employing token- and agent-level reweighting to ensure that increased agent spawning improves $R_i$ rather than merely increasing capacity (Xu et al., 4 Feb 2026).
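A minimal sketch of the rollout reward and group-normalized advantage, using the reward weights quoted above (helper names are illustrative, not the paper's implementation):

```python
# Sketch of the per-rollout reward R_i and the shared group-normalized
# advantage (R_i - mu) / sigma. Weights +0.1 and +0.05 follow the text;
# function names are illustrative.
import statistics

def rollout_reward(f1, valid_table, efficient_tools, length_penalty):
    r = f1                                 # r_ans: item-level F1
    r += 0.1 if valid_table else 0.0       # r_format
    r += 0.05 if efficient_tools else 0.0  # r_tool
    r -= length_penalty                    # r_len
    return r

def group_normalized_advantages(rewards):
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

rewards = [rollout_reward(0.6, True, True, 0.0),
           rollout_reward(0.3, True, False, 0.05),
           rollout_reward(0.9, True, True, 0.0)]
advantages = group_normalized_advantages(rewards)
```

Because the advantage is normalized within the group, rollouts that beat the group mean get positive credit for every one of their agents' tokens, which is what pushes the lead agent toward decompositions that actually raise $R_i$.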

This protocol induces the lead agent to optimize decomposition strategies while ensuring subagents increase factual accuracy and efficiency via parallel execution.

4. Dataset Construction and Optimization Protocol

WideSeek-R1 is trained on a 20,000-instance corpus of broad information-seeking tasks generated via an automated pipeline:

  • Query synthesis: HybridQA seeds and Gemini-3 prompts yield schema-constrained Markdown table queries, sampling row counts $R \in [10, 50]$.
  • Answer verification: Gemini-3 is prompted independently twice per query, accepting only those where corresponding cells match $\geq 90\%$.
  • Filtering: Instances with poor inter-run consistency or fewer than 3 rows are discarded; 73.3% are retained.
  • Statistics: Median table $\approx$ 30 rows, 6 columns; the dataset is split equally with deep-QA data for hybrid training.
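The two-run verification filter might look like the following hedged sketch, treating cell-level agreement between two independent runs as the consistency signal (the paper's exact matching procedure may differ):

```python
# Hedged sketch of the answer-verification filter: a synthesized query
# is kept only when corresponding cells from two independent runs agree
# on at least 90% of cells and the table has at least 3 rows.
def cell_consistency(table_a, table_b):
    cells_a = [c for row in table_a for c in row]
    cells_b = [c for row in table_b for c in row]
    matches = sum(a == b for a, b in zip(cells_a, cells_b))
    return matches / max(len(cells_a), 1)

def keep_instance(table_a, table_b, min_rows=3, threshold=0.90):
    return (len(table_a) >= min_rows
            and cell_consistency(table_a, table_b) >= threshold)

run1 = [["Paris", "67M"], ["Berlin", "84M"], ["Rome", "59M"]]
run2 = [["Paris", "67M"], ["Berlin", "83M"], ["Rome", "59M"]]
# 5 of 6 cells match (~83%), below the 90% threshold, so this
# instance would be discarded.
```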

Optimization details:

  • Backbone: Qwen3-4B.
  • Batch size: 128 (split evenly across “wide” and “deep” batches).
  • Learning rate: $1 \times 10^{-6}$; PPO clip $\epsilon = 0.2$.
  • Max context: 32K tokens.
  • Training: 150 RL steps on 32 NVIDIA H100 GPUs, totaling $\sim$3,000 GPU-hours (Xu et al., 4 Feb 2026).
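The optimization details above can be collected into a single config sketch (field names are illustrative, not the paper's actual configuration format):

```python
# Hyperparameters quoted in the text, gathered into an illustrative
# config dict; field names are assumptions, not the paper's schema.
config = {
    "backbone": "Qwen3-4B",
    "batch_size": 128,            # split evenly across wide and deep batches
    "learning_rate": 1e-6,
    "ppo_clip_epsilon": 0.2,
    "max_context_tokens": 32_000,
    "rl_steps": 150,
    "gpus": 32,                   # NVIDIA H100
}

# The even wide/deep split implies 64 instances of each per batch.
wide_batch = deep_batch = config["batch_size"] // 2
```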

5. Empirical Evaluation and Scaling Results

Evaluation is conducted on the WideSearch benchmark (Wong et al. 2025), comprising 200 queries balanced across English and Chinese.

Metrics:

  • Item F1: cell-level F1 score.
  • Row F1: row-unit match.
  • Success Rate: perfect-table match.

| Model | Item F1 (Avg@4) |
| --- | --- |
| SingleSeek-R1-4B | 28.1% |
| DeepSeek-R1-671B | 41.3% |
| WideSeek-R1-4B | 40.0% |
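One plausible reading of the cell-level Item F1 metric treats each table cell as an item and compares multisets of predicted and gold cells (the benchmark's exact matching rules may differ):

```python
# Illustrative item-level F1 between a predicted table and ground
# truth, treating each cell value as an item. This is an assumed
# definition, not WideSearch's official scorer.
from collections import Counter

def item_f1(pred_cells, gold_cells):
    pred, gold = Counter(pred_cells), Counter(gold_cells)
    tp = sum((pred & gold).values())  # multiset intersection
    if tp == 0:
        return 0.0
    precision = tp / sum(pred.values())
    recall = tp / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

score = item_f1(["Paris", "67M", "Berlin"],
                ["Paris", "67M", "Rome", "59M"])
```

Here two of three predicted cells are correct (precision 2/3) and two of four gold cells are recovered (recall 1/2), giving F1 = 4/7.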

WideSeek-R1-4B, despite its moderate parameter count, approaches the performance of DeepSeek-R1-671B (a 167× larger model) and surpasses all multi-agent 4B/8B baselines. Scaling experiments reveal:

  • Depth scaling (single agent, more turns) saturates at 30–35% Item F1.
  • Width scaling (increased number of subagents per task) yields steady gains up to 40% Item F1 with 10 subagents.
  • Naïve multi-agent methods (without MARL) incur interference and performance decay beyond 6 subagents; WideSeek-R1-4B maintains gains up to 10 subagents.

On seven standard QA datasets, WideSeek-R1-4B achieves Avg@4 $\approx$ 59.0%, outperforming both its single-agent version and multi-agent systems with substantially larger LLMs (e.g., OWL-8B, MiroFlow-8B) (Xu et al., 4 Feb 2026).

6. Limitations and Prospective Directions

Identified Limitations

  • Coarse credit assignment: The use of a single rollout score obscures errors attributable to decomposition versus local subagent execution.
  • Fixed two-level hierarchy: Subagent recursion is disallowed; arbitrary-depth hierarchies may enable further organizational scaling.
  • Training cost: Synchronous, collocated rollouts are latency-bound, impeding scalability.

Future Work

  • Dynamic agent allocation: Adaptive subagent spawning for task-dependent load balancing.
  • Recursive hierarchies: Permitting multi-layer agent structures for ultra-broad or nested tasks.
  • Role-aware rewards: Providing targeted feedback to specific agent roles (e.g., decomposition accuracy for the lead agent, local F1s for subagents).
  • Multi-domain generalization: Extending width scaling to code synthesis, GUI automation, and real-time information domains.

This suggests a plausible trajectory toward universal information-seeking systems that maximize not only individual competence but also organizational intelligence via explicit parallelism and robust MARL optimization (Xu et al., 4 Feb 2026, Huang et al., 2 Feb 2026).

WideSeek-R1 addresses the same class of General Broad Information Seeking (GBIS) tasks as WideSeek and is evaluated on related or overlapping benchmarks (WideSearch, WideSeekBench). Both frameworks employ hierarchical planner-executor models, context isolation, and joint MARL. However, WideSeek-R1 demonstrates that end-to-end parallel orchestration with robust MARL leads to state-of-the-art item-level F1 without reliance on extra-large LLMs or hand-crafted workflow logic (Xu et al., 4 Feb 2026, Huang et al., 2 Feb 2026).

The shift from deep, sequential research architectures to dynamic, scalable “wide research” protocols marks a structural inflection point in information-seeking AI, opening a new axis for scaling by organizational capacity rather than raw model size.
