Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction
Abstract: Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth-oriented tasks demand schema-aligned outputs with wide coverage and cross-entity consistency, while depth-oriented tasks require coherent reasoning over long, branching search trajectories. We introduce Web2BigTable, a multi-agent framework for web-to-table search that supports both regimes. Web2BigTable adopts a bi-level architecture in which an upper-level orchestrator decomposes the task into sub-problems and lower-level worker agents solve them in parallel. Through a closed-loop run-verify-reflect process, the framework jointly improves decomposition and execution over time via persistent, human-readable external memory, with self-evolving updates to each individual agent. During execution, workers coordinate through a shared workspace that makes partial findings visible, allowing them to reduce redundant exploration, reconcile conflicting evidence, and adapt to emerging coverage gaps. Web2BigTable sets a new state of the art on WideSearch, reaching an Avg@4 Success Rate of 38.50 (7.5× the second best at 5.10), Row F1 of 63.53 (+25.03 over the second best), and Item F1 of 80.12 (+14.42 over the second best). It also generalises to depth-oriented search on XBench-DeepSearch, achieving 73.0 accuracy. Code is available at https://github.com/web2bigtable/web2bigtable.
Explain it Like I'm 14
Explaining "Web2BigTable" in simple terms
1) What is this paper about?
This paper introduces Web2BigTable, an AI system that can search the internet and turn what it finds into neat, fact-checked tables. It's built for two kinds of jobs:
- "Deep" search: carefully following clues to answer one tricky question.
- "Wide" search: gathering lots of facts about many things and organizing them into a big table.
Think of it like a well-run group project: one "coach" plans the work, and many "team members" collect and check information in parallel, then combine it all into a clean, consistent table.
2) What were the main goals?
The researchers wanted to solve two everyday problems for AI that searches the web:
- Wide search: Can the AI cover a lot of items (like all concerts or all products), keep the format consistent, and avoid missing pieces?
- Deep search: Can the AI think through a chain of clues to get one correct answer?
They asked: Can a team-based AI, with good planning and shared memory, do both jobs better than a single AI working alone?
3) How does the system work? (With simple analogies)
Web2BigTable uses a "bi-level" setup, with two levels that work together:
- The orchestrator (coach): Breaks a big task into smaller, clear subtasks. For example, if the goal is to list all concerts from 2010–2025, the coach might assign different years or regions to different team members.
- Worker agents (team members): Each worker searches the web for their assigned part, checks facts, and fills in their portion of the table.
Two kinds of "memory" help them cooperate:
- Short-term "workboard" (shared whiteboard): A live document everyone can read. It shows what's done, what's missing, and partial findings. This lets workers:
- Avoid doing the same work twice,
- Spot gaps (like missing columns or years),
- Share good sources and fix conflicts (if two sources disagree).
- Long-term "skills" (playbook): The system saves successful strategies in human-readable notes and tools. Over time, it learns better ways to split tasks (coach skills) and better ways to search and verify information (worker skills). Importantly, the AI models themselves aren't retrained; instead, the system improves by updating these external, text-based skill notes and small tools.
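The coach-and-team setup described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `decompose`, `worker`, and `run` are hypothetical, the workboard is an in-memory dictionary rather than a Markdown file, and the "search" step is a placeholder that does no real web access.

```python
import asyncio


def decompose(start_year: int, end_year: int, n_workers: int) -> list[list[int]]:
    """Orchestrator step: split an inclusive year span into contiguous chunks."""
    years = list(range(start_year, end_year + 1))
    size = -(-len(years) // n_workers)  # ceiling division
    return [years[i:i + size] for i in range(0, len(years), size)]


async def worker(name: str, years: list[int], workboard: dict, lock: asyncio.Lock) -> None:
    """Worker step: fill in the rows for the assigned years, then publish them."""
    rows = []
    for year in years:
        await asyncio.sleep(0)  # placeholder for real search and verification
        rows.append({"year": year, "found_by": name})
    async with lock:
        # The workboard is shared, so every worker can see what is already done.
        workboard["done"].update(years)
        workboard["rows"].extend(rows)


async def run(start_year: int, end_year: int, n_workers: int = 4):
    workboard = {"done": set(), "rows": []}
    lock = asyncio.Lock()
    chunks = decompose(start_year, end_year, n_workers)
    await asyncio.gather(
        *(worker(f"w{i}", chunk, workboard, lock) for i, chunk in enumerate(chunks))
    )
    # Coverage check: any year nobody reported is a gap to reassign.
    missing = set(range(start_year, end_year + 1)) - workboard["done"]
    return sorted(workboard["rows"], key=lambda r: r["year"]), missing


rows, missing = asyncio.run(run(2010, 2025))
print(len(rows), missing)
```

The final coverage check mirrors the idea that the workboard shows what is done and what is missing, so gaps can be spotted rather than silently dropped.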
Training vs. using the system:
- Training (practice): The system runs, checks its output against correct answers, and then "reflects" on mistakes. This run-verify-reflect cycle updates the playbook.
- Inference (game day): The system uses the playbook as-is to answer new questions without changing the models.
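A toy sketch of the practice-versus-game-day split follows. Everything here is an assumption made for illustration: `SkillBank`, `train`, and the three toy callables are hypothetical names, and the "skill" learned is a trivial one-line note. The sketch only shows the shape of the loop: notes are appended during training and the bank becomes read-only afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class SkillBank:
    """Append-only playbook of human-readable notes (hypothetical structure)."""
    notes: list = field(default_factory=list)
    frozen: bool = False

    def append(self, note: str) -> None:
        if self.frozen:
            raise RuntimeError("skill bank is read-only at inference")
        if note not in self.notes:
            self.notes.append(note)


def train(tasks, bank, run, verify, reflect):
    """Run each practice task, score it against the gold answer, learn from misses."""
    for query, gold in tasks:
        predicted = run(query, bank.notes)                 # run
        score = verify(predicted, gold)                    # verify
        if score < 1.0:
            bank.append(reflect(query, predicted, gold))   # reflect
    bank.frozen = True  # game day: the playbook is used as-is
    return bank


# Toy stand-ins: the only "skill" to learn is lower-casing entity names.
def toy_run(query, notes):
    return {q.lower() for q in query} if "normalise case" in notes else set(query)


def toy_verify(predicted, gold):
    return 1.0 if predicted == gold else 0.0


def toy_reflect(query, predicted, gold):
    return "normalise case"


bank = train(
    [(["Paris", "ROME"], {"paris", "rome"}), (["Oslo"], {"oslo"})],
    SkillBank(), toy_run, toy_verify, toy_reflect,
)
print(bank.notes)
```

The first task fails and adds a note; the second task succeeds because the note is already in the playbook, so nothing new is written.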
Technical terms in everyday language:
- Schema: The column layout of the table (e.g., Date, City, Venue).
- Verification: Double-checking each cell against web sources.
- Decomposition: Breaking the big task into smaller, manageable chunks.
- F1 score and Success Rate: Ways of measuring correctness. Success Rate is an all-or-nothing score for entire tables. F1 is a balanced measure of how many correct items were found and how many mistakes were avoided, either per row or per cell.
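To make the three scores concrete, here is a small sketch that computes them on a two-row table. It assumes exact string matching on every cell and a hypothetical key column (`city`) to identify rows; the benchmark's own scoring may be more lenient (for example, semantic matching for free text), so treat this as an illustration of the definitions only.

```python
def f1(predicted: set, gold: set) -> float:
    """Balanced score: harmonic mean of precision and recall over two sets."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def score_table(pred_rows: list, gold_rows: list, key: str) -> dict:
    """Score a predicted table at table, row, and cell level (exact match only)."""
    as_rows = lambda rows: {tuple(sorted(r.items())) for r in rows}
    as_items = lambda rows: {(r[key], col, val) for r in rows for col, val in r.items()}
    return {
        "success": float(as_rows(pred_rows) == as_rows(gold_rows)),  # all-or-nothing
        "row_f1": f1(as_rows(pred_rows), as_rows(gold_rows)),
        "item_f1": f1(as_items(pred_rows), as_items(gold_rows)),
    }


gold = [
    {"city": "Paris", "venue": "A", "date": "2020-01-01"},
    {"city": "Rome", "venue": "B", "date": "2021-05-05"},
]
pred = [
    {"city": "Paris", "venue": "A", "date": "2020-01-01"},
    {"city": "Rome", "venue": "B", "date": "2021-05-06"},  # one wrong cell
]
scores = score_table(pred, gold, key="city")
print(scores)
```

One wrong cell out of six leaves Item F1 at 5/6, halves Row F1 (one of two rows is fully correct), and drops Success Rate to zero, which is why the three reported numbers differ so much in scale.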
4) What did they find, and why is it important?
Main results:
- On a wide-search benchmark called WideSearch, Web2BigTable set a new state-of-the-art:
- Success Rate: 38.50, which is about 7.5 times higher than the next best system (5.10).
- Row F1: 63.53, which is 25 points higher than the next best.
- Item (cell) F1: 80.12, which is over 14 points higher than the next best.
- On a deep-search benchmark (XBench-DeepSearch), it achieved 73.0 accuracy, showing it also handles clue-following tasks well.
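The comparisons above follow from simple arithmetic on the figures quoted in the abstract. The snippet below only re-derives them; the implied second-best Row F1 and Item F1 are computed from the stated margins, not quoted from the paper's tables.

```python
# Figures as stated in the abstract.
sr, sr_second = 38.50, 5.10
row_f1, row_gain = 63.53, 25.03
item_f1, item_gain = 80.12, 14.42

sr_ratio = round(sr / sr_second, 2)          # how many times higher the Success Rate is
row_second = round(row_f1 - row_gain, 2)     # implied second-best Row F1
item_second = round(item_f1 - item_gain, 2)  # implied second-best Item F1

print(sr_ratio, row_second, item_second)
```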
Why this matters:
- Most earlier systems struggled either with carefully reasoning through long chains (deep search) or with covering many items consistently (wide search). Web2BigTable excels at both by combining smart planning with teamwork and shared memory.
What made the biggest difference (from their tests that "ablate" features to see what's essential):
- Learned orchestrator skills (the coach's playbook) mattered most. Removing them caused the largest drop in performance.
- The shared workboard (the team's whiteboard) also mattered a lot; without it, workers couldn't see each other's progress or fill gaps well.
- Worker skill evolution (improving search-and-verify tools) provided steady extra gains.
A key insight:
- The strong performance came from the framework's design (the coach-team setup, shared memory, and reflection loop), not just from using powerful AI models. Even with smaller, cheaper models, the framework beat other systems that used bigger models.
5) What's the impact?
- Better web-to-table tools: This system can help build accurate, large tables from the live web, useful for research, business reports, event listings, product comparisons, and more.
- Reliable AI teamwork: It shows how AI "teams" with a clear leader, a shared workspace, and a growing playbook can outperform solo AI, especially on big, real-world tasks.
- Safer, more transparent improvements: Because the system improves using human-readable notes and tools (not by retraining models), it's easier to audit, edit, and control how it learns.
- General approach for future agents: The bi-level plan + shared memory + run-verify-reflect pattern could be applied to many other AI tasks that need both careful reasoning and broad coverage.
In short, Web2BigTable demonstrates that good organization, collaboration, and steady learning from mistakes can make AI much better at finding and structuring information from the internet, just like a well-coached team beats a talented solo player.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a focused list of what remains unresolved or insufficiently explored in the paper, formulated to guide concrete follow-up work:
- Benchmark scope and generalization
- How well do the learned orchestrator and worker skills transfer to domains beyond WideSearch/XBench (e.g., scientific literature, finance filings, biomedical sites), new schemas, and languages outside English/Chinese?
- Sensitivity to the number and diversity of training queries (only 20 used): What's the sample-efficiency curve, and how does domain shift after training affect performance?
- Scalability and systems limits
- Coordination scalability beyond ~10 workers and hundreds of rows: throughput, latency, and contention when scaling to thousands of rows or 50–100+ concurrent workers.
- Workboard as a single Markdown file with file locks: risks of I/O bottlenecks, lock contention, and failure modes in distributed/multi-machine settings; need for sharding or transactional storage.
- Context growth management: strategies for pruning/summarizing the workboard as it grows (to avoid context overflow) and the effect on quality.
- Skill bank growth and retrieval
- Retrieval quality and latency as skill banks grow (skill bloat): how to prevent retrieval noise, redundancy, or conflicts among overlapping skills; policies for de-duplication, ranking, decay, and archival.
- Negative transfer: whether larger, noisier skill banks degrade performance; mechanisms for skill validation, rollback, and automated tests before promotion to the shared bank.
- Robustness to real-world web variability
- Handling dynamic, JavaScript-heavy, paywalled, or anti-bot sites (no headless browser or rendering engine described); success rates and fallbacks in those scenarios.
- Temporal robustness: stability over time as web pages change (no longitudinal evaluation).
- Resilience to network errors, rate limits, captchas, and intermittent tool failures; retry policies and caching strategies are unspecified.
- Safety and security
- Prompt injection and data-poisoning defenses for web content (e.g., instruction-hijacking from pages); currently no threat model or mitigation protocols.
- Secure execution of auto-generated code and "bash" tools: sandboxing, permission boundaries, dependency control, and prevention of arbitrary command execution beyond AST validation.
- Supply-chain risks from "cloud skills" catalogs (8,000+ preconfigured skills): provenance, trust, and vetting processes.
- Verification, provenance, and trust
- Inference-time verification is disabled (no run/verify loop): impact on factuality and error rates without post-hoc checks; feasibility of a lightweight per-cell verification pass at inference.
- Per-cell provenance delivery: how URLs/evidence are captured and presented to users for auditability and reproducibility.
- Confidence/uncertainty estimation per cell/row and calibrated trust signals for end users.
- Conflict resolution and consistency
- Explicit mechanisms for reconciling conflicting sources and ties (beyond "validation pass"): how to adjudicate disagreements, weight sources, or enforce schema-level constraints and cross-attribute consistency.
- Duplicate detection and removal at aggregation (authors state no post-processing): rates of duplication and concrete methods to eliminate them without harming recall.
- Evaluation design and metrics
- Cost, latency, and energy usage vs. baselines (absent): end-to-end efficiency profiles and trade-offs for different worker counts and model choices.
- Reproducibility under nondeterminism from asynchronous execution; variance across runs beyond Avg@4/Max@4 and how to tighten variance.
- Reliability of LLM-as-judge for free-text/item scoring: sensitivity analyses, inter-judge consistency, and impact on conclusions.
- Error taxonomy beyond aggregate F1/SR: granular analysis of failure modes (coverage gaps, extraction errors, mis-typing, temporal drift, source conflicts).
- Model and tool choices
- Impact of stronger/weaker LLMs and different tool stacks (e.g., headless browsers, structured extractors); systematic scaling laws for model capability vs. framework gains.
- Ablations on retrieval pipelines (BM25 vs. embeddings vs. reranker), top-k settings, and their interactions with worker performance.
- Human factors and ethics
- User-facing usability and interpretability studies: do workboard traces and outputs support human oversight and rapid error correction?
- Ethical and legal considerations for live web scraping (robots.txt compliance, ToS, jurisdictional constraints) and handling of sensitive content.
- Lifecycle management
- Skill rot and drift: detecting outdated or brittle skills as sites evolve; scheduling re-validation and automated refitting of skills.
- Governance for editing SKILL.md (human-in-the-loop review, versioning policies, audit trails) and criteria for promoting skills from local to shared/global banks.
- Extensibility
- Support for richer target structures beyond flat tables (nested schemas, graphs, temporal series, units/normalization) and constraints-aware extraction (e.g., integrity constraints, type systems).
- Integration with structured knowledge bases to improve recall/precision and to enforce global consistency.
- Failure containment and recovery
- Systematic strategies for recovering from partial failures (tool timeouts, worker crashes) without losing global progress; checkpointing and idempotent aggregation.
These gaps suggest concrete next steps: scale and latency stress tests, adversarial robustness evaluations, inference-time verification/citation mechanisms, security-hardening with sandboxing and injection defenses, skill lifecycle governance, richer conflict-resolution logic, and broader cross-domain/language benchmarks with cost-quality trade-off analyses.
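The lock-contention concern raised above for a single-file workboard is easier to reason about with a concrete picture. The sketch below appends lines to one Markdown file under an exclusive lock. It is an assumption-laden illustration: it uses POSIX advisory locks via Python's `fcntl` module (so it will not run on Windows), the function name `append_to_workboard` and the line format are hypothetical, and the paper's actual locking scheme may differ.

```python
import fcntl
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def append_to_workboard(path: str, worker_tag: str, line: str) -> None:
    """Append one tagged line while holding an exclusive advisory lock on the file."""
    with open(path, "a", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other writer holds the lock
        try:
            f.write(f"- [{worker_tag}] {line}\n")
            f.flush()
            os.fsync(f.fileno())
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def demo(n_workers: int = 8, lines_each: int = 25) -> int:
    """Have several writers append concurrently and count the resulting lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "workboard.md")

        def work(i: int) -> None:
            for j in range(lines_each):
                append_to_workboard(path, f"worker-{i}", f"row {j} done")

        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            list(pool.map(work, range(n_workers)))
        with open(path, encoding="utf-8") as f:
            return sum(1 for _ in f)


line_count = demo()
print(line_count)
```

Every write serialises on the same lock, which is exactly why throughput becomes a question as the worker count grows; sharding the workboard per tag or moving to transactional storage would relax that bottleneck.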
Practical Applications
Overview
Web2BigTable introduces a bi-level, multi-agent web-to-table framework that automates large-scale, schema-aligned information extraction from the live web. Its key innovations, namely task-adaptive decomposition (orchestrator skills), parallel worker coordination via a shared workboard, and self-evolving, human-readable skill banks (no model fine-tuning), enable both broad-coverage extraction and deep, multi-hop search. These capabilities translate into practical workflows for building and maintaining structured datasets with provenance, at scale.
Below are concrete applications across industry, academia, policy, and daily life, grouped by immediate deployability versus longer-term opportunities that require additional engineering, scaling, or domain adaptation.
Immediate Applications
These can be deployed with current capabilities (as described in the paper), assuming access to web search tools/APIs, basic scraping infrastructure, and lightweight LLMs for orchestration and worker roles.
- Competitive intelligence dashboards – sector: software/enterprise, finance
- What: Continuously compile tables of competitors' product launches, pricing tiers, feature matrices, store locations, job postings, or press releases across brands/regions.
- How it maps: Orchestrator splits by company/product lines; workers extract attributes with cell-level verification; workboard avoids redundant sources and fills coverage gaps.
- Tools/products/workflows: "Web-to-Table CI Agent"; scheduled agentic ETL into Snowflake/BigQuery; versioned SKILL.md for domain-specific decomposition.
- Assumptions/dependencies: Respect robots.txt/ToS; handle anti-bot pages; maintain search API quotas; curate 10–20 gold tasks to seed strategy memory.
- E-commerce catalog enrichment – sector: retail, marketplaces
- What: Aggregate product specs, SKUs, prices, availability, and warranty terms from manufacturers, retailers, and distributors into a unified catalog.
- How it maps: Decompose by brand/category; workers standardize attributes and verify via multi-source evidence; shared workboard propagates high-quality source URLs and templates.
- Tools/products/workflows: "Catalog Extractor" connectors; schema validators using Item-F1-like comparators; CI/CD on skill banks.
- Assumptions/dependencies: Dynamic pages, variants, and region-specific pricing; compliance with scraping; structured schema design.
- ESG and event data extraction – sector: finance
- What: Pull ESG metrics, filings, earnings calendars, insider transactions, M&A events, and sanctions updates into structured tables with provenance.
- How it maps: Orchestrator partitions by issuer and data type; workers retrieve from regulators, company IR pages, trusted aggregators; run-verify-reflect improves coverage over time.
- Tools/products/workflows: "Provenance-grade Data Feeds" for BI; lineage storage per cell (URLs, timestamps).
- Assumptions/dependencies: Disambiguation of entities; paywalled content; frequent web changes (cache and snapshot for auditability).
- Public procurement and grants consolidation – sector: government/policy
- What: Aggregate tenders, awards, suppliers, amounts, and timelines across agencies and jurisdictions.
- How it maps: Decompose by geography/agency; workers coordinate to reconcile duplicates or missing fields.
- Tools/products/workflows: "GovSpend Table Builder"; open-data enrichment pipeline.
- Assumptions/dependencies: Heterogeneous portals; rate limits; legal constraints on scraping.
- Pharmacovigilance signals and clinical trial tracking – sector: healthcare
- What: Build tables of clinical trials (status, endpoints, locations) and adverse event reports from registries and safety communications.
- How it maps: Partition by molecule/indication; workers standardize medical terminology and verify cross-source evidence.
- Tools/products/workflows: "Clinical Trials Extractor" feeding analytics; compliance-grade provenance logs.
- Assumptions/dependencies: Medical ontology mapping; careful handling of ambiguous free text; data-use policies.
- Academic literature scaffolding – sector: academia/education
- What: Create structured bibliographies (authors, venue, year, DOI), conference schedules, dataset indexes, and benchmarks across fields.
- How it maps: Orchestrator splits by venue/time; workers use search and archive APIs; workboard shares canonical sources and formatting.
- Tools/products/workflows: "Web-to-Table Lit Review Assistant"; export to Zotero/CSV.
- Assumptions/dependencies: PDF parsing often required (extend skills); access to metadata APIs.
- Real estate and infrastructure registries – sector: real estate, energy
- What: Aggregate listings or project registries (e.g., renewable energy projects: capacity, location, status) from public portals, utilities, and developers.
- How it maps: Decompose by region/asset class; workers normalize units and resolve conflicts.
- Tools/products/workflows: "Asset Registry Builder" with scheduled refresh.
- Assumptions/dependencies: Geocoding integration; frequent updates and de-duplication.
- Cybersecurity knowledge tables – sector: software/security
- What: Build and maintain tables of CVEs (severity, affected versions), exploit PoCs, and patch availability from NVD, vendor advisories, and trusted feeds.
- How it maps: Partition by vendor/product; workers verify across advisory and NVD IDs; workboard coordinates edge cases.
- Tools/products/workflows: "Vuln Table Feed" for SOC dashboards; alerting when new rows appear.
- Assumptions/dependencies: Rate-limited APIs; identity resolution across trackers.
- Journalism/fact-checking datasets – sector: media
- What: Compile structured timelines of events, public statements, and sources for investigative pieces.
- How it maps: Split by person/event; workers record sources per cell; orchestrator validates row counts and consistency.
- Tools/products/workflows: "Fact-Checked Timeline Builder" with citation exports.
- Assumptions/dependencies: Editorial review remains required; manage dynamic and conflicting sources.
- Personal comparison tables – sector: daily life/consumer
- What: Create up-to-date tables of phone/plans, travel options (routes, baggage policies), or credit cards (fees, perks), grounded in live web sources.
- How it maps: Decompose by brand/route; workers standardize attributes; workboard avoids duplicate vendor checks.
- Tools/products/workflows: Consumer "Compare-Anything" assistant with provenance.
- Assumptions/dependencies: Frequent changes; paywalls; ensure transparent source links.
- Internal knowledge base indexing – sector: enterprise IT
- What: Convert intranet policies, FAQs, and service catalogs into schema-aligned tables for search and governance.
- How it maps: Use the same framework on private web/docs; skills stored as SKILL.md to codify decomposition per department.
- Tools/products/workflows: "Agentic KB Curator" with access control; audit-ready lineage.
- Assumptions/dependencies: Authentication to internal systems; privacy and security controls; no external scraping required.
Long-Term Applications
These are promising but require further research, scaling, or integration (e.g., richer browsers, stronger provenance tracking, domain ontologies, legal agreements, or robust continuous learning at scale).
- Enterprise knowledge graph construction and refresh – sector: cross-industry
- What: From repeated web-to-table extractions, build and maintain knowledge graphs (entities, relations) with cell-level provenance.
- Enablers: Extend workers with entity linking, ontologies, and deduplication; schedule run-verify-reflect for continual updates.
- Dependencies: Scalable storage/graph DBs; entity resolution; advanced provenance/versioning; robust diffing over the live web.
- Regulatory monitoring and automated compliance reporting – sector: finance, healthcare, energy, telecom
- What: Track new rules/guidance, map obligations to entities/processes, and produce structured compliance matrices.
- Enablers: Decomposition by jurisdiction/topic; skill banks per regulator; escalation workflows for human review.
- Dependencies: Legal review; subscription/paywalled sources; auditable change logs.
- Scientific knowledge base curation at scale – sector: academia/biotech
- What: Extract structured facts (methods, datasets, results) across literature to power meta-analyses and discovery.
- Enablers: Robust PDF/HTML parsing skills, domain schemas, and semantic comparators beyond Item-F1.
- Dependencies: Publisher licenses; accurate schema/ontology mapping; handling figures/tables; evaluation beyond LLM-as-judge.
- Real-time market and risk intelligence – sector: finance/supply chain
- What: Agents ingest news, filings, social signals to update tables and trigger alerts (e.g., supply chain disruptions, credit events).
- Enablers: Streaming ingestion; incremental extraction; priority scheduling per topic; confidence scoring.
- Dependencies: Low-latency pipelines; dedup at high velocity; governance for false positives.
- Agentic ETL for data warehouses and BI – sector: software/data platforms
- What: A managed "Web-to-Table" connector that schedules, extracts, validates, and loads structured web data with lineage into warehouses.
- Enablers: Productizing the orchestrator/worker/workboard pattern with retries, caching, and schema drift detection.
- Dependencies: Enterprise-grade auth, observability, sandboxing for function skills, SLAs.
- Skill bank marketplaces and Org memory ops (AgentOps) – sector: software tooling
- What: Share, version, and govern SKILL.md strategies and execution skills across teams; route tasks to best skills.
- Enablers: Registry, semantic retrieval, approval workflows; telemetry-driven refinement.
- Dependencies: Security review for executable skills; IP/licensing for shared skills; compatibility across LLM backbones.
- Search engines with structured "wide" answers – sector: search
- What: Answer complex, breadth-oriented queries with verified tables (e.g., "all grants for X between 2015–2025") and provenance.
- Enablers: Tight integration with indexing, caching, and citation UX; robust coverage guarantees.
- Dependencies: Costs, latency constraints, quality thresholds; safe handling of dynamic/long-tail pages.
- Civic tech and open-data enrichment – sector: government/civil society
- What: Fill gaps in public datasets (e.g., schools, transit, environmental permits) and validate records against the web.
- Enablers: Community-managed skill banks; transparent workboard logs; reproducible pipelines.
- Dependencies: Legal compliance; volunteer/human-in-the-loop validation; sustainable hosting.
- RPA-triggered business workflows – sector: operations
- What: Use extracted tables to drive downstream actions (e.g., procurement shortlist creation, vendor outreach).
- Enablers: Confidence thresholds, human approvals, and integration with RPA tools.
- Dependencies: Strong provenance and audit trails; error mitigation; secure action execution.
- Personal digital steward with persistent tables – sector: consumer
- What: Maintain up-to-date tables of subscriptions, bills, travel, school activities, and renewals; notify on changes.
- Enablers: OAuth to user accounts; privacy-preserving local skill execution; periodic sync.
- Dependencies: Data privacy/security; multi-source auth; UX for consent and transparency.
Cross-Cutting Assumptions and Dependencies
- Data access and legality: Respect robots.txt/ToS, obtain needed licenses, manage paywalls, and adhere to privacy/security policies.
- Tooling stack: Search/browse APIs, headless browser for dynamic pages, BM25/embedding stores (e.g., ChromaDB), file locking for workboard, and LLMs capable of tool use.
- Skill bank seeding: Small but curated set of gold-standard tasks (e.g., ~20 per domain) to learn decomposition strategies; ongoing governance of SKILL.md content.
- Provenance and evaluation: Store per-cell evidence (URLs, timestamps); define comparators for numeric and free text; manage reliance on LLM-as-judge where needed.
- Reliability and scalability: Caching, retries, sandboxed execution for function skills, observability/telemetry, and concurrency controls; plan for web drift and schema evolution.
- Human-in-the-loop: Critical for high-stakes domains (healthcare, finance, policy) to review outputs and approve changes to skills and schemas.
These applications leverage the paper's key contributions (bi-level strategy learning, external memory skill banks, and workboard-based coordination) to transform unstructured, heterogeneous web content into trustworthy, usable tables that integrate directly into products, analyses, and decisions.
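Several of the workflows above depend on storing per-cell evidence (URLs, timestamps). One possible shape for that record is sketched below. The `Cell` class, the helper names, and the `example.org` URLs are all hypothetical; the source does not specify how provenance is stored.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Cell:
    """One table cell plus the evidence that supports it (hypothetical layout)."""
    value: str
    source_url: str
    retrieved_at: datetime


def plain_row(row: dict) -> dict:
    """Strip provenance to get the row as it would appear in the final table."""
    return {column: cell.value for column, cell in row.items()}


def evidence(row: dict) -> list:
    """List the distinct source URLs backing a row, for audit or citation export."""
    return sorted({cell.source_url for cell in row.values()})


fetched = datetime(2025, 1, 1, tzinfo=timezone.utc)  # placeholder timestamp
row = {
    "product": Cell("Widget", "https://example.org/catalog", fetched),
    "price": Cell("19.99", "https://example.org/pricing", fetched),
}
print(plain_row(row))
print(evidence(row))
```

Keeping the evidence attached to each cell, rather than to the row or table, is what allows a single disputed value to be traced and re-checked without re-running the whole extraction.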
Glossary
- action-observation loop: a sequential decision-making pattern where the agent takes an action, receives an observation, and repeats until termination. Example: "Any policy T solving this task unfolds as an action-observation loop."
- asynchronous consensus: a shared agreement that emerges from non-blocking, parallel updates to a common state. Example: "yielding a form of asynchronous consensus that scales with the worker pool whilst preserving the simplicity of a plain Markdown document."
- asynchronous dispatch: launching multiple workers or tasks concurrently without waiting for each other to complete. Example: "via asynchronous dispatch"
- asynchronous worker loop: a parallel execution phase where multiple workers proceed independently and concurrently. Example: "Stage 2: Execute (asynchronous worker loop)"
- BAAI/bge-m3: a family of embedding models used for vector search and retrieval. Example: "embedding store (BAAI/bge-m3)"
- bi-level architecture: a two-layer system design separating high-level planning from low-level execution. Example: "Web2BigTable adopts a bi-level architecture in which an upper-level orchestrator decomposes the task into sub-problems and lower-level worker agents solve them in parallel."
- BM25: a classic probabilistic information retrieval ranking function for keyword search. Example: "BM25 keyword matching"
- ChromaDB: an open-source vector database used to store and query embeddings. Example: "ChromaDB embedding search"
- cross-encoder: a model that jointly encodes query–document pairs to refine ranking quality. Example: "optionally refined by a cross-encoder"
- deep search: iterative, multi-hop retrieval and reasoning to answer a single complex query. Example: "In deep search, agents iteratively retrieve, read, and reason to resolve a single complex query"
- epistemic state: a representation of shared knowledge and beliefs maintained during execution. Example: "The workboard me is not merely a message relay but a shared epistemic state."
- external memory: storage outside model parameters used to persist skills or intermediate state. Example: "Both features operate entirely over external memory, leaving the underlying LLMs frozen throughout."
- file locks: mechanisms that serialize or protect concurrent writes to shared files. Example: "protected by file locks and tag partitioning."
- global utility: a scalar score summarizing overall solution quality. Example: "we adopt a scalar global utility U(X) ∈ [0, 1]"
- gold reference: the ground-truth target used for evaluation or training supervision. Example: "compares the predicted table X against the gold reference Xgold"
- Item-level F1: a metric assessing correctness at the individual cell level in a table. Example: "Item-level F1: The most granular metric"
- LLM-as-judge: an evaluation protocol where an LLM assesses correctness. Example: "Accuracy evaluated via LLM-as-judge."
- LLM-based semantic judgement: using an LLM to determine semantic equivalence for free-text cells. Example: "LLM-based semantic judgement for free text"
- long-term semantic memory: persistent skills or strategies learned across episodes and reused at inference. Example: "Long-term semantic memory: a persistent store of skills that evolves only during training and is frozen at inference."
- Memento-Skills: a mechanism for retrieving and applying reusable execution skills. Example: "the Memento-Skills mechanism [32]"
- Model Context Protocol (MCP): a protocol/server setup for managing tools and agents in context. Example: "An MCP (Model Context Protocol) server manages the worker pool"
- monotone updates: skill or memory updates that only append, never overwrite or delete past knowledge. Example: "distils each episode's trajectories into monotone updates to S. and Sw."
- orchestrator: the upper-level agent that plans and partitions tasks into subtasks. Example: "an upper-level orchestrator decomposes the task into sub-problems"
- ReAct loop: an agent pattern interleaving reasoning steps with tool-use actions. Example: "executing a ReAct loop of reasoning and tool use"
- Reciprocal Rank Fusion (RRF): a method to combine multiple ranked lists to improve retrieval. Example: "Reciprocal Rank Fusion (RRF)"
- read-write asymmetry: a coordination design where all workers can read global state but write only to scoped regions. Example: "Dynamic coordination through read-write asymmetry."
- Row-level F1: a metric evaluating whether entire rows (records) are correctly retrieved. Example: "Row-level F1: Treats each table row as a unit"
- run-verify-reflect: a closed-loop learning process that executes, evaluates, and refines skills. Example: "closed-loop run-verify-reflect process"
- schema-aligned: adhering to a predefined table structure with specified columns and types. Example: "a schema-aligned table"
- semantic retrieval: embedding-based search that matches queries and documents by meaning rather than keywords. Example: "semantic retrieval across both local and cloud-based catalogues"
- shared workboard: a globally visible, markdown-based scratchpad for coordination among agents. Example: "The shared workboard is a structured Markdown document"
- short-term working memory: transient, per-episode state used during a single run. Example: "Short-term working memory: a scratchpad that is transient within a single episode"
- singleton application context: a single shared process-level context to avoid redundant resource loading. Example: "shared across workers via a singleton application context"
- skill banks: repositories of reusable planning and execution skills consumed by agents. Example: "the two skill banks are consumed read-only"
- SkillCreator: a component that synthesizes new executable or knowledge skills on demand. Example: "the SkillCreator module leverages the worker's LLM to synthesise a novel skill"
- SkillResolver: a component that locates appropriate skills via exact match and semantic search. Example: "the SkillResolver executes a strictly prioritised search"
- Success Rate (SR): a stringent metric requiring the entire output table to match ground truth. Example: "Success Rate (SR): The most stringent metric"
- task decomposition: splitting a complex query into smaller, manageable subtasks. Example: "partially address this issue through task decomposition"
- task-router: a skill that maps query characteristics to the appropriate decomposition strategy. Example: "a task-router skill that evaluates structural properties"
- tool-call: an action invoking an external tool (e.g., search, file operation) during agent execution. Example: "Each action is either a tool-call such as a search query or file operation"
- Two-phase pipeline: a separation of training (skill learning) and inference (skill consumption). Example: "Two-phase pipeline: training and inference."
- URL normalisation: canonicalizing URLs to compare or score them consistently. Example: "URL normalisation"
- web-to-table search: constructing structured tables from open-web sources according to a given schema. Example: "web-to-table search that supports both breadth-oriented and depth-oriented instances"
- WideSearch: a benchmark for broad-coverage, structured extraction from the live web. Example: "On WideSearch [19]"
- worker agent: a lower-level agent that executes a specific subtask in parallel with peers. Example: "lower-level worker agents solve them in parallel."
- XBench-DeepSearch: a benchmark focused on deep, multi-hop web research and reasoning. Example: "XBench-DeepSearch"
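As an illustration of one glossary entry, Reciprocal Rank Fusion can be written in a few lines: each item scores 1 / (k + rank) in every list where it appears, and the scores are summed. The constant k = 60 is the value commonly used in the RRF literature and is an assumption here, as are the skill names; the source does not state which constant or inputs the system uses.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists; items ranked high in many lists rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical results from a keyword ranker and an embedding ranker.
keyword_hits = ["skill_a", "skill_b", "skill_c"]
embedding_hits = ["skill_b", "skill_d", "skill_a"]

fused = reciprocal_rank_fusion([keyword_hits, embedding_hits])
print([name for name, _ in fused])
```

Here `skill_b` edges out `skill_a` because it ranks first and second across the two lists, while `skill_a` ranks first and third.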