WideSeek-R1: A Multi-Agent Reinforcement Learning Framework
- WideSeek-R1 is a multi-agent reinforcement learning system that decomposes complex, broad queries into independent subtasks via a lead and subagent hierarchy.
- The framework uses a hierarchical architecture to ensure context isolation and efficient parallel processing, approaching ultra-large-model performance with far fewer resources.
- It leverages end-to-end MARL with PPO to optimize task decomposition, aggregation, and structured output generation for complex table-based queries.
WideSeek-R1 is a multi-agent reinforcement learning (MARL) framework targeting broad information-seeking tasks where traditional single-agent “depth scaling” LLM paradigms reach organizational and context limitations. The system introduces a hierarchical architecture that leverages width scaling: a lead agent that decomposes complex, table-generative queries into parallel subtasks handled by subagents initialized from a shared LLM, enabling effective parallelism, context isolation, and scalable orchestration. WideSeek-R1, when trained on a curated dataset of diverse, broad information-seeking problems, approaches the performance of single-agent, ultra-large models with substantially fewer compute and parameter resources, thus redefining the frontier for information-centric multi-agent systems (Xu et al., 4 Feb 2026).
1. Motivation: Organizational Bottlenecks in Broad Information Seeking
Conventional LLM-based agents perform well on depth-oriented reasoning—the “deep research” paradigm—by extending chain-of-thought lengths or increasing context window sizes. However, for broad tasks that require extracting structured information about large numbers of entities simultaneously (e.g., generating a table summarizing attributes for 30–50 countries), two principal pathologies emerge:
- Context pollution: A monolithic agent tasked with assimilating hundreds of facts across unrelated entities rapidly exceeds its context capacity, degrading focus and accuracy.
- Sequential inefficiency: Single-agent protocols process each row or column in series, squandering wall-clock time and context utilization, and failing to capitalize on potential parallelism.
WideSeek-R1 addresses these issues with width scaling. The lead agent decomposes a broad input query (with entities and attributes) into independent subqueries, each handled by a subagent. Subagents execute information-seeking actions in isolated contexts using external tools, after which the lead agent aggregates their findings into a coherent, structured output.
This paradigm enables the system to efficiently produce complex tables mapping a set of entities (rows) to a set of attributes (columns), where the workload is distributed across subagents, preserving context clarity and maximizing wall-clock parallelism (Xu et al., 4 Feb 2026).
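The decomposition step at the heart of width scaling can be illustrated with a minimal sketch: a broad entity-times-attribute query is split into one self-contained subquery per entity, each answerable by a subagent in its own context. Function and entity names here are illustrative, not from the paper.

```python
# Width-scaling sketch: one independent subquery per entity, so each
# subagent works in an isolated context on a small, focused task.
def decompose(entities: list[str], attributes: list[str]) -> list[str]:
    """Turn a broad table query into per-entity subqueries."""
    return [f"For {e}, find: {', '.join(attributes)}" for e in entities]

subqueries = decompose(
    entities=["France", "Japan", "Brazil"],
    attributes=["population", "capital", "GDP per capita"],
)
for q in subqueries:
    print(q)
```

Each subquery can then be dispatched to a subagent in parallel, and the lead agent only needs to reassemble rows, never to hold all entities' facts in one context.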
2. System Architecture: Lead-Subagent Hierarchy and Workflow
WideSeek-R1 instantiates both the lead agent and all subagents from a shared LLM backbone (e.g., Qwen3-4B), enforcing context isolation and granting access to specialized tool interfaces.
Workflow:
- Lead agent initialization: Receives a system prompt and the user’s broad query.
- Task decomposition: Via a dedicated decomposition tool, the lead agent identifies decomposition points and generates prompts for subagents.
- Parallel subagent rollout: Each subagent receives a distinct subtask prompt (e.g., `[SYSTEM: You are sub-agent; SUBTASK: prompt_a]`), then iteratively employs external tools (search, access) and maintains a private chain-of-thought until subtask completion.
- Aggregation and reflection: The lead agent collects and integrates subagent outputs, repeating decomposition and aggregation steps for up to a fixed number of turns.
- Finalization: The system emits a unified Markdown table output.
Critical features include strict context isolation (subagents cannot access each other’s internal state) and strict parallelism (all subagents spawn and execute concurrently, subject to a system-imposed cap on subagents per turn). The architecture eliminates context collision and enables highly scalable orchestration (Xu et al., 4 Feb 2026).
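The lead–subagent workflow can be sketched as follows. `run_subagent` is a hypothetical stand-in for an LLM plus tool loop, and the per-turn cap of 10 is an assumption; the point is that each subagent sees only its own subtask (context isolation) and all subagents in a turn run concurrently before the lead agent aggregates rows into a Markdown table.

```python
# Minimal sketch of the lead/subagent workflow: isolated subagents run
# concurrently; the lead agent aggregates their rows into one table.
from concurrent.futures import ThreadPoolExecutor

MAX_SUBAGENTS_PER_TURN = 10  # assumed cap; the paper imposes a per-turn limit

def run_subagent(subtask: str) -> list[str]:
    # Placeholder: a real subagent would call search/access tools here,
    # seeing only its own subtask prompt (context isolation).
    return [subtask, "<answer>"]

def lead_agent(subtasks: list[str]) -> str:
    batch = subtasks[:MAX_SUBAGENTS_PER_TURN]
    with ThreadPoolExecutor(max_workers=len(batch)) as pool:
        rows = list(pool.map(run_subagent, batch))  # concurrent rollout
    header = "| entity | value |\n|---|---|"
    body = "\n".join("| " + " | ".join(r) + " |" for r in rows)
    return header + "\n" + body

print(lead_agent(["France", "Japan"]))
```

Because subagent state never leaks across workers, adding more subagents widens the search without polluting any single context.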
3. Multi-Agent Reinforcement Learning Formulation
WideSeek-R1 employs an end-to-end multi-agent variant of Proximal Policy Optimization (PPO) for MARL across the lead agent and parallel subagents.
Components:
- State space: for each agent in a rollout, the state at a given turn comprises the role prompt (lead or subagent), the assigned query, previously generated tokens, and accumulated tool-call results.
- Action space: At each generation token, the policy selects either a language token or a structured tool call.
- Reward function, composed of:
  - Accuracy: item-level F1 between output and ground truth.
  - Format bonus: +0.1 for a valid Markdown table.
  - Tool bonus: +0.05 for efficient tool usage.
  - Length penalty for overlong outputs.
- Unified MARL loss: All tokens of all agents in a rollout share a group-normalized advantage, computed by standardizing the rollout’s scalar reward against the mean and standard deviation of its rollout group.
PPO’s clipped loss is computed over all agent-token pairs, employing token- and agent-level reweighting so that spawning additional agents improves task performance rather than merely inflating token counts (Xu et al., 4 Feb 2026).
This protocol induces the lead agent to optimize decomposition strategies while ensuring subagents increase factual accuracy and efficiency via parallel execution.
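A minimal numeric sketch of the two ingredients above: group-normalized advantages over a batch of rollout rewards, and the standard PPO clipped surrogate applied per token. This uses plain Python floats for clarity (a real implementation operates on token-level tensors), and the clip ratio of 0.2 is the common PPO default, assumed here rather than taken from the paper.

```python
# Group-normalized advantage + PPO clipped surrogate, scalar version.
import math

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize each rollout reward against its group's mean/std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def ppo_clip_term(logp_new: float, logp_old: float, adv: float,
                  clip: float = 0.2) -> float:
    """Per-token clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip), 1 - clip)
    return min(ratio * adv, clipped * adv)
```

Because every token of every agent in the rollout shares the same normalized advantage, the lead agent's decomposition choices and the subagents' searches are credited (or penalized) jointly, which is what couples decomposition quality to final table quality.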
4. Dataset Construction and Optimization Protocol
WideSeek-R1 is trained on a 20,000-instance corpus of broad information-seeking tasks generated via an automated pipeline:
- Query synthesis: HybridQA seeds and Gemini-3 prompts yield schema-constrained Markdown table queries with varied row counts.
- Answer verification: Gemini-3 is prompted independently twice per query; only queries whose corresponding cells match across the two runs are accepted.
- Filtering: Instances with poor inter-run consistency or too few rows are discarded; 73.3% are retained.
- Statistics: Median table size of 30 rows × 6 columns; the dataset is split equally with deep-QA data for hybrid training.
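The answer-verification step can be sketched as a cell-level agreement check between the two independent runs; the 0.9 acceptance threshold below is illustrative, not a value reported in the paper.

```python
# Sketch of the answer-verification filter: answer each query twice
# independently and keep it only if the two tables agree cell-by-cell.
def consistency(table_a: list[list[str]], table_b: list[list[str]]) -> float:
    """Fraction of corresponding cells that match across two runs."""
    cells_a = [c for row in table_a for c in row]
    cells_b = [c for row in table_b for c in row]
    if not cells_a or len(cells_a) != len(cells_b):
        return 0.0
    matches = sum(a == b for a, b in zip(cells_a, cells_b))
    return matches / len(cells_a)

def keep(table_a, table_b, threshold: float = 0.9) -> bool:  # threshold assumed
    return consistency(table_a, table_b) >= threshold
```

Requiring agreement between independent generations filters out queries whose answers the generator cannot reproduce stably, which is what yields the 73.3% retention rate reported above.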
Optimization details:
- Backbone: Qwen3-4B.
- Batch size: 128 (split evenly across “wide” and “deep” batches).
- Standard PPO hyperparameters (learning rate, clip ratio).
- Max context: 32K tokens.
- Training: 150 RL steps on 32 NVIDIA H100 GPUs (Xu et al., 4 Feb 2026).
5. Empirical Evaluation and Scaling Results
Evaluation is conducted on the WideSearch benchmark (Wong et al. 2025), comprising 200 queries balanced across English and Chinese.
Metrics:
- Item F1: cell-level F1 score
- Row F1: row-unit match
- Success Rate: perfect-table match
| Model | Item F1 Avg@4 |
|---|---|
| SingleSeek-R1-4B | 28.1% |
| DeepSeek-R1-671B | 41.3% |
| WideSeek-R1-4B | 40.0% |
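Item F1, the headline metric in the table above, can be sketched as a multiset comparison between predicted and gold table cells; the multiset treatment is an assumption about the benchmark's exact matching rule.

```python
# Item-level F1 sketch: compare predicted and gold cells as multisets,
# then compute the harmonic mean of precision and recall.
from collections import Counter

def item_f1(pred_cells: list[str], gold_cells: list[str]) -> float:
    pred, gold = Counter(pred_cells), Counter(gold_cells)
    overlap = sum((pred & gold).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)
```

Row F1 applies the same idea at row granularity, and Success Rate counts only tables where every cell matches.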
WideSeek-R1-4B, despite its moderate parameter count, approaches the performance of DeepSeek-R1-671B (a 167× larger model) and surpasses all multi-agent 4B/8B baselines. Scaling experiments reveal:
- Depth scaling (single agent, more turns) saturates at 30–35% Item F1.
- Width scaling (increased number of subagents per task) yields steady gains up to 40% Item F1 with 10 subagents.
- Naïve multi-agent methods (without MARL) incur interference and performance decay beyond 6 subagents; WideSeek-R1-4B maintains gains up to 10 subagents.
On seven standard QA datasets, WideSeek-R1-4B achieves Avg@4 59.0%, outperforming both its single-agent version and multi-agent systems with substantially larger LLMs (e.g. OWL-8B, MiroFlow-8B) (Xu et al., 4 Feb 2026).
6. Limitations and Prospective Directions
Identified Limitations
- Coarse credit assignment: The use of a single rollout score obscures errors attributable to decomposition versus local subagent execution.
- Fixed two-level hierarchy: Subagent recursion is disallowed; arbitrary-depth hierarchies may enable further organizational scaling.
- Training cost: Synchronous, collocated rollouts are latency-bound, impeding scalability.
Future Work
- Dynamic agent allocation: Adaptive subagent spawning for task-dependent load balancing.
- Recursive hierarchies: Permitting multi-layer agent structures for ultra-broad or nested tasks.
- Role-aware rewards: Providing targeted feedback to specific agent roles (e.g., decomposition accuracy for the lead agent, local F1s for subagents).
- Multi-domain generalization: Extending width scaling to code synthesis, GUI automation, and real-time information domains.
This suggests a plausible trajectory toward universal information-seeking systems that maximize not only individual competence but also organizational intelligence via explicit parallelism and robust MARL optimization (Xu et al., 4 Feb 2026, Huang et al., 2 Feb 2026).
7. Position within Wide Research and Related Work
WideSeek-R1 addresses the same class of General Broad Information Seeking (GBIS) tasks as WideSeek and is evaluated on related or overlapping benchmarks (WideSearch, WideSeekBench). Both frameworks employ hierarchical planner-executor models, context isolation, and joint MARL. However, WideSeek-R1 demonstrates that end-to-end parallel orchestration with robust MARL leads to state-of-the-art item-level F1 without reliance on extra-large LLMs or hand-crafted workflow logic (Xu et al., 4 Feb 2026, Huang et al., 2 Feb 2026).
The shift from deep, sequential research architectures to dynamic, scalable “wide research” protocols marks a structural inflection point in information-seeking AI, opening a new axis for scaling by organizational capacity rather than raw model size.