WideSeek-R1: A Multi-Agent Reinforcement Learning Framework
- WideSeek-R1 is a multi-agent reinforcement learning system that decomposes complex, broad queries into independent subtasks via a lead and subagent hierarchy.
- The framework uses a hierarchical architecture to ensure context isolation and efficient parallel processing, approaching ultra-large-model performance with far fewer resources.
- It leverages end-to-end MARL with PPO to optimize task decomposition, aggregation, and structured output generation for complex table-based queries.
WideSeek-R1 is a multi-agent reinforcement learning (MARL) framework targeting broad information-seeking tasks where traditional single-agent “depth scaling” LLM paradigms reach organizational and context limitations. The system introduces a hierarchical architecture that leverages width scaling: a lead agent that decomposes complex, table-generative queries into parallel subtasks handled by subagents initialized from a shared LLM, enabling effective parallelism, context isolation, and scalable orchestration. WideSeek-R1, when trained on a curated dataset of diverse, broad information-seeking problems, approaches the performance of single-agent, ultra-large models with substantially fewer compute and parameter resources, thus redefining the frontier for information-centric multi-agent systems (Xu et al., 4 Feb 2026).
1. Motivation: Organizational Bottlenecks in Broad Information Seeking
Conventional LLM-based agents perform well on depth-oriented reasoning—the “deep research” paradigm—by extending chain-of-thought lengths or increasing context window sizes. However, for broad tasks that require extracting structured information about large numbers of entities simultaneously (e.g., generating a table summarizing attributes for 30–50 countries), two principal pathologies emerge:
- Context pollution: A monolithic agent tasked with assimilating hundreds of facts across unrelated entities rapidly exceeds its context capacity, degrading focus and accuracy.
- Sequential inefficiency: Single-agent protocols process each row or column in series, squandering wall-clock time and context utilization, and failing to capitalize on potential parallelism.
WideSeek-R1 addresses these issues with width scaling. The lead agent decomposes a broad input query (with entities and attributes) into independent subqueries, each handled by a subagent. Subagents execute information-seeking actions in isolated contexts using external tools, after which the lead agent aggregates their findings into a coherent, structured output.
This paradigm enables the system to efficiently produce complex tables mapping a set of entities (rows) to a set of attributes (columns), where the workload is distributed across subagents, preserving context clarity and maximizing wall-clock parallelism (Xu et al., 4 Feb 2026).
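The decomposition step at the heart of width scaling can be illustrated with a minimal sketch: a broad entity-times-attribute query is split into one self-contained subquery per entity, each answerable by a subagent in its own context. Function and entity names here are illustrative, not from the paper.

```python
# Width-scaling sketch: one independent subquery per entity, so each
# subagent works in an isolated context on a small, focused task.
def decompose(entities: list[str], attributes: list[str]) -> list[str]:
    """Turn a broad table query into per-entity subqueries."""
    return [f"For {e}, find: {', '.join(attributes)}" for e in entities]

subqueries = decompose(
    entities=["France", "Japan", "Brazil"],
    attributes=["population", "capital", "GDP per capita"],
)
for q in subqueries:
    print(q)
```

Each subquery can then be dispatched to a subagent in parallel, and the lead agent only needs to reassemble rows, never to hold all entities' facts in one context.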
2. System Architecture: Lead-Subagent Hierarchy and Workflow
WideSeek-R1 instantiates both the lead agent and all subagents from a shared LLM backbone (e.g., Qwen3-4B), enforcing context isolation and granting access to specialized tool interfaces.
Workflow:
- Lead agent initialization: Receives a system prompt and the user’s broad query.
- Task decomposition: Via a dedicated decomposition tool, the lead agent identifies decomposition points and generates prompts for subagents.
- Parallel subagent rollout: Each subagent receives a distinct subtask prompt (e.g., `[SYSTEM: You are sub-agent; SUBTASK: prompt_a]`), then iteratively employs external tools (search, access) and maintains a private chain-of-thought until subtask completion.
- Aggregation and reflection: The lead agent collects and integrates subagent outputs, repeating decomposition and aggregation steps for up to a fixed number of turns.
- Finalization: The system emits a unified Markdown table output.
Critical features include strict context isolation (subagents cannot access each other’s internal state) and strict parallelism (all subagents spawn and execute concurrently, subject to a system-imposed cap on subagents per turn). The architecture eliminates context collision and enables highly scalable orchestration (Xu et al., 4 Feb 2026).
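The lead–subagent workflow can be sketched as follows. `run_subagent` is a hypothetical stand-in for an LLM plus tool loop, and the per-turn cap of 10 is an assumption; the point is that each subagent sees only its own subtask (context isolation) and all subagents in a turn run concurrently before the lead agent aggregates rows into a Markdown table.

```python
# Minimal sketch of the lead/subagent workflow: isolated subagents run
# concurrently; the lead agent aggregates their rows into one table.
from concurrent.futures import ThreadPoolExecutor

MAX_SUBAGENTS_PER_TURN = 10  # assumed cap; the paper imposes a per-turn limit

def run_subagent(subtask: str) -> list[str]:
    # Placeholder: a real subagent would call search/access tools here,
    # seeing only its own subtask prompt (context isolation).
    return [subtask, "<answer>"]

def lead_agent(subtasks: list[str]) -> str:
    batch = subtasks[:MAX_SUBAGENTS_PER_TURN]
    with ThreadPoolExecutor(max_workers=len(batch)) as pool:
        rows = list(pool.map(run_subagent, batch))  # concurrent rollout
    header = "| entity | value |\n|---|---|"
    body = "\n".join("| " + " | ".join(r) + " |" for r in rows)
    return header + "\n" + body

print(lead_agent(["France", "Japan"]))
```

Because subagent state never leaks across workers, adding more subagents widens the search without polluting any single context.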
3. Multi-Agent Reinforcement Learning Formulation
WideSeek-R1 employs an end-to-end multi-agent variant of Proximal Policy Optimization (PPO) for MARL across the lead agent and parallel subagents.
Components:
- State space: for each agent in a rollout, the state at a given turn comprises the role prompt (lead or subagent), the assigned query, previously generated tokens, and accumulated tool-call results.
- Action space: At each generation token, the policy selects either a language token or a structured tool call.
- Reward function, composed of:
  - Accuracy: item-level F1 between output and ground truth.
  - Format bonus: +0.1 for a valid Markdown table.
  - Tool bonus: +0.05 for efficient tool usage.
  - Length penalty for overlong outputs.
- Unified MARL loss: All tokens of all agents in a rollout share a group-normalized advantage, computed by standardizing the rollout’s scalar reward against the mean and standard deviation of its rollout group.
PPO’s clipped loss is computed over all agent-token pairs, employing token- and agent-level reweighting so that spawning additional agents improves task performance rather than merely inflating token counts (Xu et al., 4 Feb 2026).
This protocol induces the lead agent to optimize decomposition strategies while ensuring subagents increase factual accuracy and efficiency via parallel execution.
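A minimal numeric sketch of the two ingredients above: group-normalized advantages over a batch of rollout rewards, and the standard PPO clipped surrogate applied per token. This uses plain Python floats for clarity (a real implementation operates on token-level tensors), and the clip ratio of 0.2 is the common PPO default, assumed here rather than taken from the paper.

```python
# Group-normalized advantage + PPO clipped surrogate, scalar version.
import math

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize each rollout reward against its group's mean/std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def ppo_clip_term(logp_new: float, logp_old: float, adv: float,
                  clip: float = 0.2) -> float:
    """Per-token clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip), 1 - clip)
    return min(ratio * adv, clipped * adv)
```

Because every token of every agent in the rollout shares the same normalized advantage, the lead agent's decomposition choices and the subagents' searches are credited (or penalized) jointly, which is what couples decomposition quality to final table quality.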
4. Dataset Construction and Optimization Protocol
WideSeek-R1 is trained on a 20,000-instance corpus of broad information-seeking tasks generated via an automated pipeline:
- Query synthesis: HybridQA seeds and Gemini-3 prompts yield schema-constrained Markdown table queries with varied row counts.
- Answer verification: Gemini-3 is prompted independently twice per query; only queries whose corresponding cells match across the two runs are accepted.
- Filtering: Instances with poor inter-run consistency or too few rows are discarded; 73.3% are retained.
- Statistics: Median table size of 30 rows × 6 columns; the dataset is split equally with deep-QA data for hybrid training.
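The answer-verification step can be sketched as a cell-level agreement check between the two independent runs; the 0.9 acceptance threshold below is illustrative, not a value reported in the paper.

```python
# Sketch of the answer-verification filter: answer each query twice
# independently and keep it only if the two tables agree cell-by-cell.
def consistency(table_a: list[list[str]], table_b: list[list[str]]) -> float:
    """Fraction of corresponding cells that match across two runs."""
    cells_a = [c for row in table_a for c in row]
    cells_b = [c for row in table_b for c in row]
    if not cells_a or len(cells_a) != len(cells_b):
        return 0.0
    matches = sum(a == b for a, b in zip(cells_a, cells_b))
    return matches / len(cells_a)

def keep(table_a, table_b, threshold: float = 0.9) -> bool:  # threshold assumed
    return consistency(table_a, table_b) >= threshold
```

Requiring agreement between independent generations filters out queries whose answers the generator cannot reproduce stably, which is what yields the 73.3% retention rate reported above.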
Optimization details:
- Backbone: Qwen3-4B.
- Batch size: 128 (split evenly across “wide” and “deep” batches).
- Standard PPO hyperparameters (learning rate, clip ratio).
- Max context: 32K tokens.
- Training: 150 RL steps on 32 NVIDIA H100 GPUs (Xu et al., 4 Feb 2026).
5. Empirical Evaluation and Scaling Results
Evaluation is conducted on the WideSearch benchmark (Wong et al. 2025), comprising 200 queries balanced across English and Chinese.
Metrics:
- Item F1: cell-level F1 score
- Row F1: row-unit match
- Success Rate: perfect-table match
| Model | Item F1 Avg@4 |
|---|---|
| SingleSeek-R1-4B | 28.1% |
| DeepSeek-R1-671B | 41.3% |
| WideSeek-R1-4B | 40.0% |
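Item F1, the headline metric in the table above, can be sketched as a multiset comparison between predicted and gold table cells; the multiset treatment is an assumption about the benchmark's exact matching rule.

```python
# Item-level F1 sketch: compare predicted and gold cells as multisets,
# then compute the harmonic mean of precision and recall.
from collections import Counter

def item_f1(pred_cells: list[str], gold_cells: list[str]) -> float:
    pred, gold = Counter(pred_cells), Counter(gold_cells)
    overlap = sum((pred & gold).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)
```

Row F1 applies the same idea at row granularity, and Success Rate counts only tables where every cell matches.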
WideSeek-R1-4B, despite its moderate parameter count, approaches the performance of DeepSeek-R1-671B (a 167× larger model) and surpasses all multi-agent 4B/8B baselines. Scaling experiments reveal:
- Depth scaling (single agent, more turns) saturates at 30–35% Item F1.
- Width scaling (increased number of subagents per task) yields steady gains up to 40% Item F1 with 10 subagents.
- Naïve multi-agent methods (without MARL) incur interference and performance decay beyond 6 subagents; WideSeek-R1-4B maintains gains up to 10 subagents.
On seven standard QA datasets, WideSeek-R1-4B achieves Avg@4 59.0%, outperforming both its single-agent version and multi-agent systems with substantially larger LLMs (e.g. OWL-8B, MiroFlow-8B) (Xu et al., 4 Feb 2026).
6. Limitations and Prospective Directions
Identified Limitations
- Coarse credit assignment: The use of a single rollout score obscures errors attributable to decomposition versus local subagent execution.
- Fixed two-level hierarchy: Subagent recursion is disallowed; arbitrary-depth hierarchies may enable further organizational scaling.
- Training cost: Synchronous, collocated rollouts are latency-bound, impeding scalability.
Future Work
- Dynamic agent allocation: Adaptive subagent spawning for task-dependent load balancing.
- Recursive hierarchies: Permitting multi-layer agent structures for ultra-broad or nested tasks.
- Role-aware rewards: Providing targeted feedback to specific agent roles (e.g., decomposition accuracy for the lead agent, local F1s for subagents).
- Multi-domain generalization: Extending width scaling to code synthesis, GUI automation, and real-time information domains.
This suggests a plausible trajectory toward universal information-seeking systems that maximize not only individual competence but also organizational intelligence via explicit parallelism and robust MARL optimization (Xu et al., 4 Feb 2026, Huang et al., 2 Feb 2026).
7. Position within Wide Research and Related Work
WideSeek-R1 addresses the same class of General Broad Information Seeking (GBIS) tasks as WideSeek and is evaluated on related or overlapping benchmarks (WideSearch, WideSeekBench). Both frameworks employ hierarchical planner-executor models, context isolation, and joint MARL. However, WideSeek-R1 demonstrates that end-to-end parallel orchestration with robust MARL leads to state-of-the-art item-level F1 without reliance on extra-large LLMs or hand-crafted workflow logic (Xu et al., 4 Feb 2026, Huang et al., 2 Feb 2026).
The shift from deep, sequential research architectures to dynamic, scalable “wide research” protocols marks a structural inflection point in information-seeking AI, opening a new axis for scaling by organizational capacity rather than raw model size.