Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction

Published 29 Apr 2026 in cs.AI | (2604.27221v1)

Abstract: Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth-oriented tasks demand schema-aligned outputs with wide coverage and cross-entity consistency, while depth-oriented tasks require coherent reasoning over long, branching search trajectories. We introduce Web2BigTable, a multi-agent framework for web-to-table search that supports both regimes. Web2BigTable adopts a bi-level architecture in which an upper-level orchestrator decomposes the task into sub-problems and lower-level worker agents solve them in parallel. Through a closed-loop run-verify-reflect process, the framework jointly improves decomposition and execution over time via persistent, human-readable external memory, with self-evolving updates to each single-agent. During execution, workers coordinate through a shared workspace that makes partial findings visible, allowing them to reduce redundant exploration, reconcile conflicting evidence, and adapt to emerging coverage gaps. Web2BigTable sets a new state of the art on WideSearch, reaching an Avg@4 Success Rate of 38.50 (7.5× the second best at 5.10), Row F1 of 63.53 (+25.03 over the second best), and Item F1 of 80.12 (+14.42 over the second best). It also generalises to depth-oriented search on XBench-DeepSearch, achieving 73.0 accuracy. Code is available at https://github.com/web2bigtable/web2bigtable.

Summary

  • The paper introduces a bi-level multi-agent framework where an orchestrator decomposes queries and worker agents perform parallel extraction.
  • It employs a run-verify-reflect loop to self-evolve skills, significantly improving extraction success rates over static baselines.
  • Experimental results show a 7.5x higher Success Rate than the second-best system on breadth-oriented tasks (38.50 vs. 5.10 on WideSearch) and 73.0 accuracy on depth-oriented, multi-hop search (XBench-DeepSearch).

Web2BigTable: Bi-Level Multi-Agent Framework for Internet-Scale Web-to-Table Extraction

Introduction and Problem Landscape

Web2BigTable addresses the increasing demand for agentic web search systems capable of both deep, multi-hop reasoning over complex queries (depth-oriented search) and large-scale, schema-aligned aggregation across many heterogeneous sources (breadth-oriented search). Prior frameworks, whether monolithic or hierarchical, suffer from context window bottlenecks, compounded error propagation, or static, non-adaptive decomposition strategies, which constrains their efficacy in structuring information at internet scale. The essential challenge is to design a system that maintains high inter-row consistency and wide entity coverage while supporting robust, adaptive coordination between concurrent agents under an evolving utility landscape.

System Architecture

Web2BigTable introduces a bi-level, memory-mediated multi-agent framework with two principal layers:

  1. Orchestrator: An upper-level reasoning agent executes task decomposition, classifying user queries into structural archetypes (e.g., split-by-entity, split-by-category, split-by-source) and partitioning the target schema across sub-problems, leveraging a persistent orchestrator skill bank ($S_o$). For each decomposition, the orchestrator instantiates natural-language subtask specifications and output constraints, and initializes the shared working memory (workboard).
  2. Worker Agents: A pool of asynchronous, parallel LLM workers, each with an independent trajectory, resolves sub-tasks using a shared, evolving repository of execution skills ($S_w$). Workers coordinate via a globally readable, regionally writable Markdown workboard, which preserves all intermediate states and partial outputs and enables live adaptation to emerging evidence and detected coverage gaps.
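The two layers above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `SubTask`, `Workboard`, `decompose`, `run_worker`, and `orchestrate` are hypothetical, the split-by-entity rule is a simple round-robin, and each "worker" is a stand-in coroutine that emits placeholder rows instead of calling an LLM or a search tool.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SubTask:
    slot: str             # workboard slot this worker may write to
    entities: list[str]   # entities assigned to this worker
    columns: list[str]    # target schema columns to fill


@dataclass
class Workboard:
    # Per-query shared state: slot name -> rows written by that slot's worker.
    slots: dict[str, list[dict]] = field(default_factory=dict)


def decompose(entities: list[str], columns: list[str], n_workers: int) -> list[SubTask]:
    """Split-by-entity decomposition (assumed rule): deal entities round-robin."""
    buckets = [entities[i::n_workers] for i in range(n_workers)]
    return [SubTask(f"worker-{i}", b, columns) for i, b in enumerate(buckets) if b]


async def run_worker(task: SubTask, board: Workboard) -> None:
    """Stand-in for an LLM worker: writes one placeholder row per entity."""
    rows = []
    for entity in task.entities:
        await asyncio.sleep(0)  # yield control, as a real tool call would
        rows.append({"entity": entity, **{c: None for c in task.columns}})
    board.slots[task.slot] = rows


async def orchestrate(entities: list[str], columns: list[str], n_workers: int = 3) -> list[dict]:
    board = Workboard()
    tasks = decompose(entities, columns, n_workers)
    await asyncio.gather(*(run_worker(t, board) for t in tasks))
    # Aggregate slots in a fixed order so the merged table is deterministic.
    return [row for slot in sorted(board.slots) for row in board.slots[slot]]


table = asyncio.run(orchestrate(["A", "B", "C", "D"], ["date", "city"], n_workers=2))
```

With two workers, the first receives entities A and C and the second receives B and D, so the merged table lists rows in the order A, C, B, D. A learned decomposition would replace the round-robin rule with whatever strategy the orchestrator skill bank prescribes.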

Adaptation throughout occurs solely through external, persistent, human-readable memory (including all acquired skills and strategies), eschewing any fine-tuning or gradient-based updates on the underlying LLMs. This separation of reasoning and learning, and the decoupling of working and long-term memory, allows for both rapid coordination and slow-timescale skill evolution.

Self-Evolving Skill Learning

The core innovation lies in the run-verify-reflect training loop:

  • Run: For every task, the orchestrator and workers execute decomposition and extraction, recording all tool calls, reasoning chains, and partial tables.
  • Verify: Outputs are evaluated against gold-standard tables using cell-type-specific comparators (exact match, numeric tolerance, URL normalization, semantic LLM judgement).
  • Reflect: Structured error reports are aggregated, clustering recurring decomposition or retrieval failures, which are then synthesized into new skill entries by LLM-powered reflection routines. These skills are versioned, human-editable, and criteria-based. Worker-level adaptation includes autonomous discovery (via BM25, embedding search) and dynamic synthesis of new skills when gaps are encountered, supporting instant propagation of improved Python tool scripts and knowledge entries across the agent pool.
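The Verify step can be pictured with a small sketch of cell-type-specific comparators. This is an illustration under assumptions, not the paper's evaluator: the function names, the 1% relative tolerance, and the URL normalization rules (drop scheme, `www.` prefix, trailing slash, query string) are all hypothetical choices, and the semantic LLM judgement comparator is omitted because it requires a model call.

```python
from urllib.parse import urlsplit


def exact_match(pred: str, gold: str) -> bool:
    """Exact string match after trimming surrounding whitespace."""
    return pred.strip() == gold.strip()


def numeric_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """Numeric match within a relative tolerance (1% here, an assumed value)."""
    try:
        p = float(pred.replace(",", ""))
        g = float(gold.replace(",", ""))
    except ValueError:
        return False
    return abs(p - g) <= rel_tol * max(abs(g), 1e-12)


def normalize_url(url: str) -> str:
    """Keep only lowercased host (without 'www.') and path (without trailing '/')."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    return host + parts.path.rstrip("/")


def url_match(pred: str, gold: str) -> bool:
    return normalize_url(pred) == normalize_url(gold)


# Map each column type to its comparator.
COMPARATORS = {"text": exact_match, "number": numeric_match, "url": url_match}


def verify_row(pred: dict, gold: dict, column_types: dict) -> dict:
    """Return a per-column pass/fail report for one predicted row."""
    return {
        col: COMPARATORS[kind](pred.get(col, ""), gold[col])
        for col, kind in column_types.items()
    }


report = verify_row(
    {"city": " Paris ", "capacity": "80,500", "source": "https://www.Example.com/a/"},
    {"city": "Paris", "capacity": "80000", "source": "http://example.com/a"},
    {"city": "text", "capacity": "number", "source": "url"},
)
```

Per-column reports like this are the kind of structured signal that a reflection step could cluster into recurring failure types.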

Persistent memory separation enables a clean distinction between across-task skill consolidation ($S_o$, $S_w$) and per-query state (mem_e), avoiding catastrophic forgetting and amplifying framework adaptability.
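A toy sketch may help separate the two kinds of memory in the run-verify-reflect loop. Everything here is hypothetical: the executor is a stub that only finds all gold rows when a skill exists for the task kind, the promotion threshold of two failures is an assumption, and the skill text is a placeholder rather than anything the system actually generates. The point is only that `skill_bank` persists across tasks while `episode_memory` is rebuilt for each one.

```python
from collections import Counter


def run(task: dict, skill_bank: dict) -> list:
    """Toy executor: finds every gold row only if a skill for the task kind exists."""
    episode_memory = []  # per-query state, rebuilt for every task
    limit = len(task["gold"]) if task["kind"] in skill_bank else 1
    episode_memory.extend(task["gold"][:limit])
    return episode_memory


def verify(rows: list, gold: list) -> list:
    """Return the gold rows that the run missed."""
    return [g for g in gold if g not in rows]


def reflect(failures: Counter, skill_bank: dict, min_count: int = 2) -> None:
    """Promote recurring failure kinds into new, human-readable skill entries."""
    for kind, count in failures.items():
        if count >= min_count and kind not in skill_bank:
            # Placeholder skill text, for illustration only.
            skill_bank[kind] = f"For '{kind}' tasks, split by entity and add a verification pass."


def train(tasks: list, skill_bank: dict) -> Counter:
    """One pass of run -> verify -> reflect over a set of training tasks."""
    failures = Counter()
    for task in tasks:
        rows = run(task, skill_bank)
        if verify(rows, task["gold"]):
            failures[task["kind"]] += 1
    reflect(failures, skill_bank)
    return failures


tasks = [
    {"kind": "concerts", "gold": ["a", "b", "c"]},
    {"kind": "concerts", "gold": ["d", "e"]},
    {"kind": "cpus", "gold": ["x"]},
]
skill_bank = {}                           # long-term memory, persists across passes
first_pass = train(tasks, skill_bank)     # two 'concerts' failures trigger a new skill
second_pass = train(tasks, skill_bank)    # the stored skill removes those failures
```

No model weights change between the two passes; the only thing that differs is the content of the external skill bank.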

Parallel and Asynchronous Coordination

The workboard mechanism is crucial for intra-episode agent collaboration. Workers asynchronously read the global state, observe peer progress, avoid redundant searches, and dynamically redirect their local plans to fill detected coverage or verification gaps. Write access is slot-tag restricted, while the global state remains readable, facilitating asynchronous consensus with minimal coordination overhead and enabling natural information cascades as faster workers supply context for those lagging behind.
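The read-all, write-own-slot rule described above can be sketched as follows. This is an in-memory illustration under assumptions, not the system's actual workboard: the class and method names are hypothetical, a `threading.Lock` stands in for whatever locking the real Markdown file uses, and each worker is assumed to own the slot that carries its own identifier.

```python
import threading


class Workboard:
    """Shared board: any worker may read everything, each may write only its own slot."""

    def __init__(self, slots: list[str]) -> None:
        self._lock = threading.Lock()
        self._sections = {slot: [] for slot in slots}

    def write(self, worker_id: str, slot: str, line: str) -> None:
        if slot not in self._sections:
            raise KeyError(f"unknown slot: {slot}")
        if worker_id != slot:
            # Slot-tag restriction: a worker cannot edit another worker's region.
            raise PermissionError(f"{worker_id} may not write to slot {slot}")
        with self._lock:
            self._sections[slot].append(line)

    def read(self) -> str:
        """Render the whole board as Markdown so any worker can inspect peer progress."""
        with self._lock:
            out = []
            for slot, lines in self._sections.items():
                out.append(f"## {slot}")
                out.extend(f"- {line}" for line in lines)
            return "\n".join(out)


board = Workboard(["w1", "w2"])
board.write("w1", "w1", "found 2019 tour dates")
snapshot = board.read()
```

Restricting writes to a worker's own slot means concurrent writers never touch the same region, so the only shared operation that needs coordination is taking a consistent snapshot for reading.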

This configuration effectively handles high-cardinality schema tasks (hundreds of entities/columns), with dynamic subtask reallocation (e.g., gap detection, secondary verification rounds) triggered directly by orchestrator-learned decomposition rules and workboard evidence.

Experimental Results

WideSearch (Breadth-Oriented)

Web2BigTable delivers a Success Rate of 38.50 (7.5x the nearest multi-agent baseline at 5.10), Row F1 of 63.53 (+25.03 improvement), and Item F1 of 80.12 (+14.42 improvement). These gains are not explained by backbone LLM strength: ablating the learned orchestrator skills collapses performance (Success Rate drops from 38.50 to 7.0). The main source of the improvement is the learned, task-adaptive decomposition, which closes the systematic coverage gaps left by generic or statically designed plans.
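The headline figures can be cross-checked with a few lines of arithmetic. The snippet below uses only numbers quoted above; the second-best Row F1 and Item F1 values are not stated directly and are derived here from the reported margins.

```python
ours = {"success_rate": 38.50, "row_f1": 63.53, "item_f1": 80.12}
second_best_success_rate = 5.10
margins = {"row_f1": 25.03, "item_f1": 14.42}  # reported leads over the second best

# Ratio behind the "7.5x" claim.
ratio = ours["success_rate"] / second_best_success_rate

# Second-best scores implied by the reported margins.
implied_second_best = {k: round(ours[k] - m, 2) for k, m in margins.items()}

# Absolute drop when learned orchestrator skills are ablated (38.50 -> 7.0).
ablation_drop = ours["success_rate"] - 7.0

print(f"Success Rate ratio: {ratio:.2f}x")
print(f"Implied second-best scores: {implied_second_best}")
print(f"Ablation drop: {ablation_drop:.1f} points")
```

The ratio comes out at roughly 7.55, consistent with the rounded 7.5x figure, and the ablation removes 31.5 of the 38.50 Success Rate points.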

Single-agent or non-adaptive multi-agent baselines, including those leveraging models strictly stronger than those in Web2BigTable (e.g., Claude-4.5-Sonnet, GPT-5 High), plateau far below these results. Representative tasks (e.g., extracting all Taylor Swift concert events, or full AMD Zen CPU release tables) see single agents retrieving only ≈20% of the required items, versus 94–96% for the bi-level framework with learned strategies.

XBench-DeepSearch (Depth-Oriented)

Accuracy reaches 73.0, outperforming proprietary and open-source baselines by 1–17 points and demonstrating robust generalization to multi-hop, cross-source reasoning. Disabling learned orchestrator skills results in a 32-point drop (73.0 to 41.0), affirming the centrality of self-evolved decomposition.

Ablation and Case Studies

  • Component ablation reveals orchestrator skill evolution as the dominant contributor, with shared workboard (coordination) and worker skill progress as significant but secondary drivers.
  • Case studies confirm that generic time-based decompositions are insufficient for high-cardinality extraction tasks; learned entity- or category-based splits, with dedicated verification workers, achieve nearly complete schema coverage.
  • Live skill evolution enables not only rapid exploitation of new evidence but resilient error recovery (e.g., auto-repair/reflection on tool failure).

Implications and Theoretical Extensions

Web2BigTable empirically supports the thesis that large-scale web-to-table extraction is a fundamentally dual-memory search-and-aggregation problem in which monolithic, context-limited agents are bottlenecked. The results indicate that persistent, modular skill memories, updated by high-level semantic error analysis rather than gradient descent, enable large gains in extraction coverage, reliability, and adaptability (up to 7.5x on WideSearch Success Rate).

The architecture generalizes to any structured schema search under open-world information constraints and provides a template for multi-agent, coordination-centric LLM collectives that can be extended to more complex Stackelberg or potential games with rigorous convergence guarantees. The authors propose future work (Memento-Team) to formalize bilevel Stackelberg games with memory-based adaptation, offering theoretical convergence of the system under stochastic reflection and bounded communication.

From a practical standpoint, the memory-banked, script-enhanced, multi-agent paradigm provides an immediately usable blueprint for next-generation agentic systems in enterprise intelligence, scientific discovery, and any domain requiring comprehensive, schema-consistent, verifiable information integration.

Conclusion

Web2BigTable establishes a new state of the art for internet-scale, agentic web search and extraction by integrating self-evolving, memory-driven task decomposition and robust, asynchronously coordinated tool use. The bilevel design, which combines orchestrator-level strategy evolution, per-worker skill adaptation, and persistent memory mediation, reaches performance that neither the monolithic nor the static multi-agent baselines attain in the reported experiments, even when those baselines use stronger LLM backbones. This framework advances the design principles for scalable, verifiable, and self-improving agentic systems, providing direct pathways for future development in both research and applied AI deployment.

Explain it Like I'm 14

Explaining “Web2BigTable” in simple terms

1) What is this paper about?

This paper introduces Web2BigTable, an AI system that can search the internet and turn what it finds into neat, fact-checked tables. It’s built for two kinds of jobs:

  • “Deep” search: carefully following clues to answer one tricky question.
  • “Wide” search: gathering lots of facts about many things and organizing them into a big table.

Think of it like a well-run group project: one “coach” plans the work, and many “team members” collect and check information in parallel, then combine it all into a clean, consistent table.

2) What were the main goals?

The researchers wanted to solve two everyday problems for AI that searches the web:

  • Wide search: Can the AI cover a lot of items (like all concerts or all products), keep the format consistent, and avoid missing pieces?
  • Deep search: Can the AI think through a chain of clues to get one correct answer?

They asked: Can a team-based AI, with good planning and shared memory, do both jobs better than a single AI working alone?

3) How does the system work? (With simple analogies)

Web2BigTable uses a “bi-level” setup, meaning two levels that work together:

  • The orchestrator (coach): Breaks a big task into smaller, clear subtasks. For example, if the goal is to list all concerts from 2010–2025, the coach might assign different years or regions to different team members.
  • Worker agents (team members): Each worker searches the web for their assigned part, checks facts, and fills in their portion of the table.

Two kinds of “memory” help them cooperate:

  • Short-term “workboard” (shared whiteboard): A live document everyone can read. It shows what’s done, what’s missing, and partial findings. This lets workers:
    • Avoid doing the same work twice,
    • Spot gaps (like missing columns or years),
    • Share good sources and fix conflicts (if two sources disagree).
  • Long-term “skills” (playbook): The system saves successful strategies in human-readable notes and tools. Over time, it learns better ways to split tasks (coach skills) and better ways to search and verify information (worker skills). Importantly, the AI models themselves aren’t retrained; instead, the system improves by updating these external, text-based skill notes and small tools.

Training vs. using the system:

  • Training (practice): The system runs, checks its output against correct answers, and then “reflects” on mistakes. This run-verify-reflect cycle updates the playbook.
  • Inference (game day): The system uses the playbook as-is to answer new questions without changing the models.

Technical terms in everyday language:

  • Schema: The column layout of the table (e.g., Date, City, Venue).
  • Verification: Double-checking each cell against web sources.
  • Decomposition: Breaking the big task into smaller, manageable chunks.
  • F1 score and Success Rate: Ways of measuring correctness. Success Rate is an all-or-nothing score for entire tables. F1 is a balanced measure of how many correct items were found and how many mistakes were avoided, either per row or per cell.
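To make the three scores concrete, here is a simplified sketch of how they could be computed for one small table. It assumes exact matching of rows and cells and treats the first column as the row key; the benchmark's actual scoring uses type-specific comparators, so the function names and matching rules here are illustrative assumptions only.

```python
def f1(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over two sets."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def cells(rows: list) -> set:
    """Flatten rows into (row_key, column_index, value) items, skipping the key column."""
    return {(row[0], i, value) for row in rows for i, value in enumerate(row) if i > 0}


def table_scores(pred_rows: list, gold_rows: list) -> dict:
    return {
        # All-or-nothing: the whole table must match exactly.
        "success": set(pred_rows) == set(gold_rows),
        # A row counts only if every cell in it is right.
        "row_f1": f1(set(pred_rows), set(gold_rows)),
        # Each cell is scored on its own, so partial rows still earn credit.
        "item_f1": f1(cells(pred_rows), cells(gold_rows)),
    }


gold = [("A", "2020", "x"), ("B", "2021", "y"), ("C", "2022", "z")]
pred = [("A", "2020", "x"), ("B", "2021", "WRONG")]
scores = table_scores(pred, gold)
```

In this example the table is not a full success, one of two predicted rows is fully correct (Row F1 of 0.4), and three of four predicted cells are correct (Item F1 of 0.6). This is why Item F1 is usually the highest of the three numbers and Success Rate the lowest.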

4) What did they find, and why is it important?

Main results:

  • On a wide-search benchmark called WideSearch, Web2BigTable set a new state-of-the-art:
    • Success Rate: 38.50, which is about 7.5 times higher than the next best system (5.10).
    • Row F1: 63.53, which is 25 points higher than the next best.
    • Item (cell) F1: 80.12, which is over 14 points higher than the next best.
  • On a deep-search benchmark (XBench-DeepSearch), it achieved 73.0 accuracy, showing it also handles clue-following tasks well.

Why this matters:

  • Most earlier systems struggled either with carefully reasoning through long chains (deep search) or with covering many items consistently (wide search). Web2BigTable excels at both by combining smart planning with teamwork and shared memory.

What made the biggest difference (from their tests that “ablate” features to see what’s essential):

  • Learned orchestrator skills (the coach’s playbook) mattered most. Removing them caused the largest drop in performance.
  • The shared workboard (the team’s whiteboard) also mattered a lot; without it, workers couldn’t see each other’s progress or fill gaps well.
  • Worker skill evolution (improving search-and-verify tools) provided steady extra gains.

A key insight:

  • The strong performance came from the framework’s design (the coach-team setup, shared memory, and reflection loop), not just from using powerful AI models. Even with smaller, cheaper models, the framework beat other systems that used bigger models.

5) What’s the impact?

  • Better web-to-table tools: This system can help build accurate, large tables from the live web, useful for research, business reports, event listings, product comparisons, and more.
  • Reliable AI teamwork: It shows how AI “teams” with a clear leader, a shared workspace, and a growing playbook can outperform solo AI, especially on big, real-world tasks.
  • Safer, more transparent improvements: Because the system improves using human-readable notes and tools (not by retraining models), it’s easier to audit, edit, and control how it learns.
  • General approach for future agents: The bi-level plan + shared memory + run-verify-reflect pattern could be applied to many other AI tasks that need both careful reasoning and broad coverage.

In short, Web2BigTable demonstrates that good organization, collaboration, and steady learning from mistakes can make AI much better at finding and structuring information from the internet, just like a well-coached team beats a talented solo player.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of what remains unresolved or insufficiently explored in the paper, formulated to guide concrete follow-up work:

  • Benchmark scope and generalization
    • How well do the learned orchestrator and worker skills transfer to domains beyond WideSearch/XBench (e.g., scientific literature, finance filings, biomedical sites), new schemas, and languages outside English/Chinese?
    • Sensitivity to the number and diversity of training queries (only 20 used): What’s the sample-efficiency curve, and how does domain shift after training affect performance?
  • Scalability and systems limits
    • Coordination scalability beyond ~10 workers and hundreds of rows: throughput, latency, and contention when scaling to thousands of rows or 50–100+ concurrent workers.
    • Workboard as a single Markdown file with file locks: risks of I/O bottlenecks, lock contention, and failure modes in distributed/multi-machine settings; need for sharding or transactional storage.
    • Context growth management: strategies for pruning/summarizing the workboard as it grows (to avoid context overflow) and the effect on quality.
  • Skill bank growth and retrieval
    • Retrieval quality and latency as skill banks grow (skill bloat): how to prevent retrieval noise, redundancy, or conflicts among overlapping skills; policies for de-duplication, ranking, decay, and archival.
    • Negative transfer: whether larger, noisier skill banks degrade performance; mechanisms for skill validation, rollback, and automated tests before promotion to the shared bank.
  • Robustness to real-world web variability
    • Handling dynamic, JavaScript-heavy, paywalled, or anti-bot sites (no headless browser or rendering engine described); success rates and fallbacks in those scenarios.
    • Temporal robustness: stability over time as web pages change (no longitudinal evaluation).
    • Resilience to network errors, rate limits, captchas, and intermittent tool failures; retry policies and caching strategies are unspecified.
  • Safety and security
    • Prompt injection and data-poisoning defenses for web content (e.g., instruction-hijacking from pages); currently no threat model or mitigation protocols.
    • Secure execution of auto-generated code and “bash” tools: sandboxing, permission boundaries, dependency control, and prevention of arbitrary command execution beyond AST validation.
    • Supply-chain risks from “cloud skills” catalogs (8,000+ preconfigured skills): provenance, trust, and vetting processes.
  • Verification, provenance, and trust
    • Inference-time verification is disabled (no run/verify loop): impact on factuality and error rates without post-hoc checks; feasibility of a lightweight per-cell verification pass at inference.
    • Per-cell provenance delivery: how URLs/evidence are captured and presented to users for auditability and reproducibility.
    • Confidence/uncertainty estimation per cell/row and calibrated trust signals for end users.
  • Conflict resolution and consistency
    • Explicit mechanisms for reconciling conflicting sources and ties (beyond “validation pass”): how to adjudicate disagreements, weight sources, or enforce schema-level constraints and cross-attribute consistency.
    • Duplicate detection and removal at aggregation (authors state no post-processing): rates of duplication and concrete methods to eliminate them without harming recall.
  • Evaluation design and metrics
    • Cost, latency, and energy usage vs. baselines (absent): end-to-end efficiency profiles and trade-offs for different worker counts and model choices.
    • Reproducibility under nondeterminism from asynchronous execution; variance across runs beyond Avg@4/Max@4 and how to tighten variance.
    • Reliability of LLM-as-judge for free-text/item scoring: sensitivity analyses, inter-judge consistency, and impact on conclusions.
    • Error taxonomy beyond aggregate F1/SR: granular analysis of failure modes (coverage gaps, extraction errors, mis-typing, temporal drift, source conflicts).
  • Model and tool choices
    • Impact of stronger/weaker LLMs and different tool stacks (e.g., headless browsers, structured extractors); systematic scaling laws for model capability vs. framework gains.
    • Ablations on retrieval pipelines (BM25 vs. embeddings vs. reranker), top-k settings, and their interactions with worker performance.
  • Human factors and ethics
    • User-facing usability and interpretability studies: do workboard traces and outputs support human oversight and rapid error correction?
    • Ethical and legal considerations for live web scraping (robots.txt compliance, ToS, jurisdictional constraints) and handling of sensitive content.
  • Lifecycle management
    • Skill rot and drift: detecting outdated or brittle skills as sites evolve; scheduling re-validation and automated refitting of skills.
    • Governance for editing SKILL.md (human-in-the-loop review, versioning policies, audit trails) and criteria for promoting skills from local to shared/global banks.
  • Extensibility
    • Support for richer target structures beyond flat tables (nested schemas, graphs, temporal series, units/normalization) and constraints-aware extraction (e.g., integrity constraints, type systems).
    • Integration with structured knowledge bases to improve recall/precision and to enforce global consistency.
  • Failure containment and recovery
    • Systematic strategies for recovering from partial failures (tool timeouts, worker crashes) without losing global progress; checkpointing and idempotent aggregation.

These gaps suggest concrete next steps: scale and latency stress tests, adversarial robustness evaluations, inference-time verification/citation mechanisms, security-hardening with sandboxing and injection defenses, skill lifecycle governance, richer conflict-resolution logic, and broader cross-domain/language benchmarks with cost-quality trade-off analyses.

Practical Applications

Overview

Web2BigTable introduces a bi-level, multi-agent web-to-table framework that automates large-scale, schema-aligned information extraction from the live web. Its key innovations (task-adaptive decomposition via orchestrator skills, parallel worker coordination via a shared workboard, and self-evolving, human-readable skill banks with no model fine-tuning) enable both broad-coverage extraction and deep, multi-hop search. These capabilities translate into practical workflows for building and maintaining structured datasets with provenance, at scale.

Below are concrete applications across industry, academia, policy, and daily life, grouped by immediate deployability versus longer-term opportunities that require additional engineering, scaling, or domain adaptation.

Immediate Applications

These can be deployed with current capabilities (as described in the paper), assuming access to web search tools/APIs, basic scraping infrastructure, and lightweight LLMs for orchestration and worker roles.

  • Competitive intelligence dashboards (sector: software/enterprise, finance)
    • What: Continuously compile tables of competitors’ product launches, pricing tiers, feature matrices, store locations, job postings, or press releases across brands/regions.
    • How it maps: Orchestrator splits by company/product lines; workers extract attributes with cell-level verification; workboard avoids redundant sources and fills coverage gaps.
    • Tools/products/workflows: “Web-to-Table CI Agent”; scheduled agentic ETL into Snowflake/BigQuery; versioned SKILL.md for domain-specific decomposition.
    • Assumptions/dependencies: Respect robots.txt/ToS; handle anti-bot pages; maintain search API quotas; curate 10–20 gold tasks to seed strategy memory.
  • E-commerce catalog enrichment (sector: retail, marketplaces)
    • What: Aggregate product specs, SKUs, prices, availability, and warranty terms from manufacturers, retailers, and distributors into a unified catalog.
    • How it maps: Decompose by brand/category; workers standardize attributes and verify via multi-source evidence; shared workboard propagates high-quality source URLs and templates.
    • Tools/products/workflows: “Catalog Extractor” connectors; schema validators using Item-F1-like comparators; CI/CD on skill banks.
    • Assumptions/dependencies: Dynamic pages, variants, and region-specific pricing; compliance with scraping; structured schema design.
  • ESG and event data extraction (sector: finance)
    • What: Pull ESG metrics, filings, earnings calendars, insider transactions, M&A events, and sanctions updates into structured tables with provenance.
    • How it maps: Orchestrator partitions by issuer and data type; workers retrieve from regulators, company IR pages, trusted aggregators; run-verify-reflect improves coverage over time.
    • Tools/products/workflows: “Provenance-grade Data Feeds” for BI; lineage storage per cell (URLs, timestamps).
    • Assumptions/dependencies: Disambiguation of entities; paywalled content; frequent web changes (cache and snapshot for auditability).
  • Public procurement and grants consolidation (sector: government/policy)
    • What: Aggregate tenders, awards, suppliers, amounts, and timelines across agencies and jurisdictions.
    • How it maps: Decompose by geography/agency; workers coordinate to reconcile duplicates or missing fields.
    • Tools/products/workflows: “GovSpend Table Builder”; open-data enrichment pipeline.
    • Assumptions/dependencies: Heterogeneous portals; rate limits; legal constraints on scraping.
  • Pharmacovigilance signals and clinical trial tracking (sector: healthcare)
    • What: Build tables of clinical trials (status, endpoints, locations) and adverse event reports from registries and safety communications.
    • How it maps: Partition by molecule/indication; workers standardize medical terminology and verify cross-source evidence.
    • Tools/products/workflows: “Clinical Trials Extractor” feeding analytics; compliance-grade provenance logs.
    • Assumptions/dependencies: Medical ontology mapping; careful handling of ambiguous free text; data-use policies.
  • Academic literature scaffolding (sector: academia/education)
    • What: Create structured bibliographies (authors, venue, year, DOI), conference schedules, dataset indexes, and benchmarks across fields.
    • How it maps: Orchestrator splits by venue/time; workers use search and archive APIs; workboard shares canonical sources and formatting.
    • Tools/products/workflows: “Web-to-Table Lit Review Assistant”; export to Zotero/CSV.
    • Assumptions/dependencies: PDF parsing often required (extend skills); access to metadata APIs.
  • Real estate and infrastructure registries (sector: real estate, energy)
    • What: Aggregate listings or project registries (e.g., renewable energy projects: capacity, location, status) from public portals, utilities, and developers.
    • How it maps: Decompose by region/asset class; workers normalize units and resolve conflicts.
    • Tools/products/workflows: “Asset Registry Builder” with scheduled refresh.
    • Assumptions/dependencies: Geocoding integration; frequent updates and de-duplication.
  • Cybersecurity knowledge tables (sector: software/security)
    • What: Build and maintain tables of CVEs (severity, affected versions), exploit PoCs, and patch availability from NVD, vendor advisories, and trusted feeds.
    • How it maps: Partition by vendor/product; workers verify across advisory and NVD IDs; workboard coordinates edge cases.
    • Tools/products/workflows: “Vuln Table Feed” for SOC dashboards; alerting when new rows appear.
    • Assumptions/dependencies: Rate-limited APIs; identity resolution across trackers.
  • Journalism/fact-checking datasets (sector: media)
    • What: Compile structured timelines of events, public statements, and sources for investigative pieces.
    • How it maps: Split by person/event; workers record sources per cell; orchestrator validates row counts and consistency.
    • Tools/products/workflows: “Fact-Checked Timeline Builder” with citation exports.
    • Assumptions/dependencies: Editorial review remains required; manage dynamic and conflicting sources.
  • Personal comparison tables (sector: daily life/consumer)
    • What: Create up-to-date tables of phone/plans, travel options (routes, baggage policies), or credit cards (fees, perks), grounded in live web sources.
    • How it maps: Decompose by brand/route; workers standardize attributes; workboard avoids duplicate vendor checks.
    • Tools/products/workflows: Consumer “Compare-Anything” assistant with provenance.
    • Assumptions/dependencies: Frequent changes; paywalls; ensure transparent source links.
  • Internal knowledge base indexing (sector: enterprise IT)
    • What: Convert intranet policies, FAQs, and service catalogs into schema-aligned tables for search and governance.
    • How it maps: Use the same framework on private web/docs; skills stored as SKILL.md to codify decomposition per department.
    • Tools/products/workflows: “Agentic KB Curator” with access control; audit-ready lineage.
    • Assumptions/dependencies: Authentication to internal systems; privacy and security controls; no external scraping required.

Long-Term Applications

These are promising but require further research, scaling, or integration (e.g., richer browsers, stronger provenance tracking, domain ontologies, legal agreements, or robust continuous learning at scale).

  • Enterprise knowledge graph construction and refresh (sector: cross-industry)
    • What: From repeated web-to-table extractions, build and maintain knowledge graphs (entities, relations) with cell-level provenance.
    • Enablers: Extend workers with entity linking, ontologies, and deduplication; schedule run-verify-reflect for continual updates.
    • Dependencies: Scalable storage/graph DBs; entity resolution; advanced provenance/versioning; robust diffing over the live web.
  • Regulatory monitoring and automated compliance reporting (sector: finance, healthcare, energy, telecom)
    • What: Track new rules/guidance, map obligations to entities/processes, and produce structured compliance matrices.
    • Enablers: Decomposition by jurisdiction/topic; skill banks per regulator; escalation workflows for human review.
    • Dependencies: Legal review; subscription/paywalled sources; auditable change logs.
  • Scientific knowledge base curation at scale (sector: academia/biotech)
    • What: Extract structured facts (methods, datasets, results) across literature to power meta-analyses and discovery.
    • Enablers: Robust PDF/HTML parsing skills, domain schemas, and semantic comparators beyond Item-F1.
    • Dependencies: Publisher licenses; accurate schema/ontology mapping; handling figures/tables; evaluation beyond LLM-as-judge.
  • Real-time market and risk intelligence (sector: finance/supply chain)
    • What: Agents ingest news, filings, social signals to update tables and trigger alerts (e.g., supply chain disruptions, credit events).
    • Enablers: Streaming ingestion; incremental extraction; priority scheduling per topic; confidence scoring.
    • Dependencies: Low-latency pipelines; dedup at high velocity; governance for false positives.
  • Agentic ETL for data warehouses and BI (sector: software/data platforms)
    • What: A managed “Web-to-Table” connector that schedules, extracts, validates, and loads structured web data with lineage into warehouses.
    • Enablers: Productizing the orchestrator/worker/workboard pattern with retries, caching, and schema drift detection.
    • Dependencies: Enterprise-grade auth, observability, sandboxing for function skills, SLAs.
  • Skill bank marketplaces and Org memory ops (AgentOps) (sector: software tooling)
    • What: Share, version, and govern SKILL.md strategies and execution skills across teams; route tasks to best skills.
    • Enablers: Registry, semantic retrieval, approval workflows; telemetry-driven refinement.
    • Dependencies: Security review for executable skills; IP/licensing for shared skills; compatibility across LLM backbones.
  • Search engines with structured "wide" answers (sector: search)
    • What: Answer complex, breadth-oriented queries with verified tables (e.g., "all grants for X between 2015–2025") and provenance.
    • Enablers: Tight integration with indexing, caching, and citation UX; robust coverage guarantees.
    • Dependencies: Costs, latency constraints, quality thresholds; safe handling of dynamic/long-tail pages.
  • Civic tech and open-data enrichment (sector: government/civil society)
    • What: Fill gaps in public datasets (e.g., schools, transit, environmental permits) and validate records against the web.
    • Enablers: Community-managed skill banks; transparent workboard logs; reproducible pipelines.
    • Dependencies: Legal compliance; volunteer/human-in-the-loop validation; sustainable hosting.
  • RPA-triggered business workflows (sector: operations)
    • What: Use extracted tables to drive downstream actions (e.g., procurement shortlist creation, vendor outreach).
    • Enablers: Confidence thresholds, human approvals, and integration with RPA tools.
    • Dependencies: Strong provenance and audit trails; error mitigation; secure action execution.
  • Personal digital steward with persistent tables (sector: consumer)
    • What: Maintain up-to-date tables of subscriptions, bills, travel, school activities, and renewals; notify on changes.
    • Enablers: OAuth to user accounts; privacy-preserving local skill execution; periodic sync.
    • Dependencies: Data privacy/security; multi-source auth; UX for consent and transparency.

Cross-Cutting Assumptions and Dependencies

  • Data access and legality: Respect robots.txt/ToS, obtain needed licenses, manage paywalls, and adhere to privacy/security policies.
  • Tooling stack: Search/browse APIs, headless browser for dynamic pages, BM25/embedding stores (e.g., ChromaDB), file locking for workboard, and LLMs capable of tool use.
  • Skill bank seeding: Small but curated set of gold-standard tasks (e.g., ~20 per domain) to learn decomposition strategies; ongoing governance of SKILL.md content.
  • Provenance and evaluation: Store per-cell evidence (URLs, timestamps); define comparators for numeric and free text; manage reliance on LLM-as-judge where needed.
  • Reliability and scalability: Caching, retries, sandboxed execution for function skills, observability/telemetry, and concurrency controls; plan for web drift and schema evolution.
  • Human-in-the-loop: Critical for high-stakes domains (healthcare, finance, policy) to review outputs and approve changes to skills and schemas.
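
The provenance and evaluation item above (per-cell evidence plus comparators for numeric and free text) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `Evidence`, `Cell`, `parse_number`, and `cells_match` are hypothetical, the 1% relative tolerance is an assumed default, and the free-text branch uses plain case-insensitive equality as a stand-in for the LLM-based semantic judgement mentioned in the paper.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Evidence:
    """One supporting source for a cell (hypothetical structure)."""
    url: str
    retrieved_at: datetime


@dataclass
class Cell:
    """A table cell that carries its own provenance."""
    value: str
    evidence: List[Evidence] = field(default_factory=list)


def parse_number(text: str) -> Optional[float]:
    """Return the numeric value of a cell, or None if it is not numeric."""
    cleaned = text.replace(",", "").replace("$", "").replace("%", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def cells_match(predicted: str, gold: str, rel_tol: float = 0.01) -> bool:
    """Compare numerically when both sides parse; otherwise compare text.

    The text branch is a placeholder: a real system would likely call a
    semantic comparator here instead of exact case-insensitive equality.
    """
    p, g = parse_number(predicted), parse_number(gold)
    if p is not None and g is not None:
        return math.isclose(p, g, rel_tol=rel_tol)
    return predicted.strip().casefold() == gold.strip().casefold()


# Example cell with a single (made-up) source attached.
cell = Cell(
    value="1,200",
    evidence=[Evidence("https://example.org/report",
                       datetime(2026, 4, 29, tzinfo=timezone.utc))],
)
```

Keeping the evidence list on the cell itself, rather than in a separate log, means any downstream consumer of the table can audit a single value without replaying the whole run.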

These applications leverage the paper's key contributions (bi-level strategy learning, external memory skill banks, and workboard-based coordination) to transform unstructured, heterogeneous web content into trustworthy, usable tables that integrate directly into products, analyses, and decisions.

Glossary

  • action-observation loop: a sequential decision-making pattern where the agent takes an action, receives an observation, and repeats until termination. Example: "Any policy π solving this task unfolds as an action-observation loop."
  • asynchronous consensus: a shared agreement that emerges from non-blocking, parallel updates to a common state. Example: "yielding a form of asynchronous consensus that scales with the worker pool whilst preserving the simplicity of a plain Markdown document."
  • asynchronous dispatch: launching multiple workers or tasks concurrently without waiting for each other to complete. Example: "via asynchronous dispatch"
  • asynchronous worker loop: a parallel execution phase where multiple workers proceed independently and concurrently. Example: "Stage 2: Execute (asynchronous worker loop)"
  • BAAI/bge-m3: a family of embedding models used for vector search and retrieval. Example: "embedding store (BAAI/bge-m3)"
  • bi-level architecture: a two-layer system design separating high-level planning from low-level execution. Example: "Web2BigTable adopts a bi-level architecture in which an upper-level orchestrator decomposes the task into sub-problems and lower-level worker agents solve them in parallel."
  • BM25: a classic probabilistic information retrieval ranking function for keyword search. Example: "BM25 keyword matching"
  • ChromaDB: an open-source vector database used to store and query embeddings. Example: "ChromaDB embedding search"
  • cross-encoder: a model that jointly encodes query–document pairs to refine ranking quality. Example: "optionally refined by a cross-encoder"
  • deep search: iterative, multi-hop retrieval and reasoning to answer a single complex query. Example: "In deep search, agents iteratively retrieve, read, and reason to resolve a single complex query"
  • epistemic state: a representation of shared knowledge and beliefs maintained during execution. Example: "The workboard me is not merely a message relay but a shared epistemic state."
  • external memory: storage outside model parameters used to persist skills or intermediate state. Example: "Both features operate entirely over external memory, leaving the underlying LLMs frozen throughout."
  • file locks: mechanisms that serialize or protect concurrent writes to shared files. Example: "protected by file locks and tag partitioning."
  • global utility: a scalar score summarizing overall solution quality. Example: "we adopt a scalar global utility U(X) ∈ [0, 1]"
  • gold reference: the ground-truth target used for evaluation or training supervision. Example: "compares the predicted table X against the gold reference X_gold"
  • Item-level F1: a metric assessing correctness at the individual cell level in a table. Example: "Item-level F1: The most granular metric"
  • LLM-as-judge: an evaluation protocol where an LLM assesses correctness. Example: "Accuracy evaluated via LLM-as-judge."
  • LLM-based semantic judgement: using an LLM to determine semantic equivalence for free-text cells. Example: "LLM-based semantic judgement for free text"
  • long-term semantic memory: persistent skills or strategies learned across episodes and reused at inference. Example: "Long-term semantic memory: a persistent store of skills that evolves only during training and is frozen at inference."
  • Memento-Skills: a mechanism for retrieving and applying reusable execution skills. Example: "the Memento-Skills mechanism [32]"
  • Model Context Protocol (MCP): a protocol/server setup for managing tools and agents in context. Example: "An MCP (Model Context Protocol) server manages the worker pool"
  • monotone updates: skill or memory updates that only append, never overwrite or delete past knowledge. Example: "distils each episode's trajectories into monotone updates to S. and Sw."
  • orchestrator: the upper-level agent that plans and partitions tasks into subtasks. Example: "an upper-level orchestrator decomposes the task into sub-problems"
  • ReAct loop: an agent pattern interleaving reasoning steps with tool-use actions. Example: "executing a ReAct loop of reasoning and tool use"
  • Reciprocal Rank Fusion (RRF): a method to combine multiple ranked lists to improve retrieval. Example: "Reciprocal Rank Fusion (RRF)"
  • read-write asymmetry: a coordination design where all workers can read global state but write only to scoped regions. Example: "Dynamic coordination through read-write asymmetry."
  • Row-level F1: a metric evaluating whether entire rows (records) are correctly retrieved. Example: "Row-level F1: Treats each table row as a unit"
  • run-verify-reflect: a closed-loop learning process that executes, evaluates, and refines skills. Example: "closed-loop run-verify-reflect process"
  • schema-aligned: adhering to a predefined table structure with specified columns and types. Example: "a schema-aligned table"
  • semantic retrieval: embedding-based search that matches queries and documents by meaning rather than keywords. Example: "semantic retrieval across both local and cloud-based catalogues"
  • shared workboard: a globally visible, markdown-based scratchpad for coordination among agents. Example: "The shared workboard is a structured Markdown document"
  • short-term working memory: transient, per-episode state used during a single run. Example: "Short-term working memory: a scratchpad that is transient within a single episode"
  • singleton application context: a single shared process-level context to avoid redundant resource loading. Example: "shared across workers via a singleton application context"
  • skill banks: repositories of reusable planning and execution skills consumed by agents. Example: "the two skill banks are consumed read-only"
  • SkillCreator: a component that synthesizes new executable or knowledge skills on demand. Example: "the SkillCreator module leverages the worker's LLM to synthesise a novel skill"
  • SkillResolver: a component that locates appropriate skills via exact match and semantic search. Example: "the SkillResolver executes a strictly prioritised search"
  • Success Rate (SR): a stringent metric requiring the entire output table to match ground truth. Example: "Success Rate (SR): The most stringent metric"
  • task decomposition: splitting a complex query into smaller, manageable subtasks. Example: "partially address this issue through task decomposition"
  • task-router: a skill that maps query characteristics to the appropriate decomposition strategy. Example: "a task-router skill that evaluates structural properties"
  • tool-call: an action invoking an external tool (e.g., search, file operation) during agent execution. Example: "Each action is either a tool-call such as a search query or file operation"
  • Two-phase pipeline: a separation of training (skill learning) and inference (skill consumption). Example: "Two-phase pipeline: training and inference."
  • URL normalisation: canonicalizing URLs to compare or score them consistently. Example: "URL normalisation"
  • web-to-table search: constructing structured tables from open-web sources according to a given schema. Example: "web-to-table search that supports both breadth-oriented and depth-oriented instances"
  • WideSearch: a benchmark for broad-coverage, structured extraction from the live web. Example: "On WideSearch [19]"
  • worker agent: a lower-level agent that executes a specific subtask in parallel with peers. Example: "lower-level worker agents solve them in parallel."
  • XBench-DeepSearch: a benchmark focused on deep, multi-hop web research and reasoning. Example: "XBench-DeepSearch"
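
Several glossary entries above (BM25, ChromaDB embedding search, Reciprocal Rank Fusion) describe a hybrid retrieval setup in which keyword and embedding rankings are merged. A minimal sketch of the standard RRF formula, score(d) = Σ 1 / (k + rank(d)), is given below. The function name `reciprocal_rank_fusion`, the skill identifiers, and the constant k = 60 (a commonly used default) are assumptions for illustration; the summary does not state the paper's actual fusion settings.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each list contributes 1 / (k + rank) for every id it contains, with
    ranks starting at 1. Ids missing from a list contribute nothing for it.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword ranker and an embedding ranker.
bm25_hits = ["skill_a", "skill_b", "skill_c"]
embedding_hits = ["skill_b", "skill_d", "skill_a"]
fused = reciprocal_rank_fusion([bm25_hits, embedding_hits])
```

In this example `skill_b` ends up first because it ranks highly in both lists, while ids that appear in only one list fall to the bottom. Because RRF uses only ranks, it needs no score normalisation between the keyword and embedding retrievers.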
