AGGBench is a benchmark that assesses systems performing entity-level aggregation queries, focusing on exhaustive evidence retrieval and strict completeness.
It combines a core corpus of research papers with noise documents to simulate realistic, large-scale challenges for multi-chunk evidence identification.
The evaluation employs metrics like chunk-level coverage, ACE, and NACE to provide actionable insights into system performance in terms of recall and accuracy.
AGGBench is a benchmark designed to evaluate the completeness and evidence coverage of systems performing entity-level aggregation queries over unstructured text corpora. Unlike typical question answering tasks, aggregation queries require systems to exhaustively identify all entities satisfying complex, compositional conditions, thereby placing stringent demands on evidence retrieval, disambiguation, and aggregation processes. AGGBench is corpus-bounded, explicitly prohibiting the use of external knowledge and focusing evaluation on the ability to “find all” such entities within realistic, large-scale, noisy corpora (Zhu et al., 1 Feb 2026).
1. Formalization of Aggregation over Unstructured Text
AGGBench targets the formal problem of entity-level aggregation querying under a strict completeness regime. The corpus C={c1,…,cM} is partitioned into M text chunks; E(C) denotes all entity mentions across C. Each query q specifies:
an entity type T
a predicate set Φ={ϕ1,…,ϕm}, each a boolean condition on entities of type T, satisfied only if there is explicit evidence in C
The exact answer set is:
Ans(q,C) = { e ∈ E(C) ∣ type(e) = T ∧ ∀ϕᵢ ∈ Φ : ϕᵢ(e) = true }
where Ans(q,C) = ⋃_{c∈C} Ans(q,c). For each e, at least one supporting chunk evidencing each predicate must be found. AGGBench emphasizes strict recall: all satisfying entities and their corresponding evidentiary chunks must be recovered, not merely plausible answers.
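The exact-answer-set definition above can be sketched in a few lines of Python. This is an illustrative toy, not the AGGBench harness: chunks are plain strings and the predicate representation (callables over an entity and a chunk) is an assumption.

```python
# Toy sketch of Ans(q, C): an entity is in the answer set only if every
# predicate phi_i has explicit supporting evidence in at least one chunk.

def answer_set(chunks, entities_of_type, predicates):
    """Return all entities of the query's type for which each predicate
    is evidenced by some chunk in the corpus."""
    answers = set()
    for e in entities_of_type:  # e ∈ E(C) with type(e) = T
        if all(any(phi(e, c) for c in chunks) for phi in predicates):
            answers.add(e)
    return answers

# Hypothetical example: predicates check for explicit in-chunk mentions.
chunks = [
    "HotpotQA is a dataset used for multi-hop question answering.",
    "MuSiQue is a dataset used for multi-hop question answering.",
    "SQuAD is a dataset for single-hop reading comprehension.",
]
datasets = {"HotpotQA", "MuSiQue", "SQuAD"}
multi_hop = lambda e, c: e in c and "multi-hop" in c
print(sorted(answer_set(chunks, datasets, [multi_hop])))  # ['HotpotQA', 'MuSiQue']
```

Note that SQuAD is excluded even though it is mentioned in the corpus: there is no chunk that explicitly evidences the predicate for it, which is exactly the corpus-bounded evidence requirement.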
The key evaluation metric is evidence coverage at the chunk level:
Coverage(q) = ∣R(q) ∩ G(q)∣ / ∣G(q)∣
where G(q) is the gold set of evidence chunks, and R(q) is the set returned by the system.
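The coverage metric is a straightforward set ratio; a minimal sketch (the handling of an empty gold set is an assumption, not specified in the benchmark description):

```python
# Chunk-level evidence coverage: fraction of gold chunks G(q) that appear
# in the system-returned set R(q).

def coverage(returned_ids, gold_ids):
    gold = set(gold_ids)
    if not gold:
        return 1.0  # assumption: an empty gold set counts as fully covered
    return len(set(returned_ids) & gold) / len(gold)

print(coverage({"c1", "c2", "c3"}, {"c2", "c3", "c7", "c9"}))  # 0.5
```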
2. Benchmark Construction and Annotation
The construction of AGGBench is designed to enable completeness-oriented evaluation under realistic, noisy conditions.
Corpus Design
Core corpus: 45 research papers from the “graph retrieval–augmented generation” literature (e.g., NeurIPS, ICLR), chunked into 200–300-token segments, totaling 4,755 chunks.
Expansion: 11,539 unrelated, noise-inducing documents were added, yielding a final corpus of 16,294 chunks. BM25 proximity filtering ensured that no new satisfying entities were introduced by noise documents.
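The fixed-size segmentation described above can be sketched as follows. Whitespace tokenization is an assumption here; the summary does not specify the benchmark's exact tokenizer.

```python
# Illustrative chunker in the spirit of AGGBench's 200-300-token segments.

def chunk_text(text, max_tokens=250):
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

doc = ("word " * 600).strip()          # a 600-token stand-in document
chunks = chunk_text(doc, max_tokens=250)
print(len(chunks))                     # 3 segments: 250 + 250 + 100 tokens
print(len(chunks[-1].split()))         # 100
```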
Query and Condition Generation
Entity types (T) were extracted and ranked by frequency, with manual curation to remove ambiguous categories.
Conditions (ϕ) were mined via high-frequency descriptive phrases (e.g., “used for multi-hop QA,” “applied to legal domain”), then manually refined for compositionality and unambiguous meaning.
Resulting queries are natural-language prompts about entity counts, such as:
“How many datasets are used for multi-hop question answering?”
“How many papers apply to the legal domain?”
The benchmark comprises 362 queries: 100 base (single-condition), and 262 composite (multiple AND/OR conditions).
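Composite queries combine base conditions with boolean connectives. A hedged sketch of that composition, with hypothetical entity-level predicates (in the benchmark itself, each condition must additionally be grounded in chunk evidence):

```python
# Composing base conditions into composite AND/OR queries.

def AND(*preds):
    return lambda e: all(p(e) for p in preds)

def OR(*preds):
    return lambda e: any(p(e) for p in preds)

# Toy entity records standing in for evidence-grounded attributes.
entities = {
    "DatasetA": {"multi_hop": True,  "legal": False},
    "DatasetB": {"multi_hop": True,  "legal": True},
    "DatasetC": {"multi_hop": False, "legal": True},
}
multi_hop = lambda e: entities[e]["multi_hop"]
legal     = lambda e: entities[e]["legal"]

q_and = AND(multi_hop, legal)  # "multi-hop AND legal domain"
q_or  = OR(multi_hop, legal)   # "multi-hop OR legal domain"
print(sum(q_and(e) for e in entities))  # 1
print(sum(q_or(e) for e in entities))   # 3
```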
Evidence Annotation Workflow
Annotation is two-stage:
LLM pre-annotation: An LLM annotates each (query, chunk) pair as positive or negative and extracts candidate entities, filtering out roughly 90% of pairs as clear negatives.
Human verification: Annotators review and correct LLM outputs, ensure adequate evidence grounding for each entity, and consolidate multi-chunk evidence. Only about 10% of LLM annotations require correction.
3. Metrics and Evaluation Protocol
AGGBench provides a modular evaluation protocol targeting both completeness and accuracy:
Evidence completeness is measured by chunk-level recall:
Coverage(q) = ∣R(q) ∩ G(q)∣ / ∣G(q)∣
Result accuracy metrics:
ACE (Absolute Count Error): ACE(q) = ∣ŷ − y∣, where y = ∣Ans(q,C)∣ is the gold count and ŷ is the system's predicted count.
NACE (Normalized ACE): NACE(q) = ∣ŷ − y∣ / (y + ε), with ε preventing division by zero.
Coverage captures whether all relevant evidence is found. High ACE/NACE generally reflects low coverage, underscoring the principal challenge of achieving exhaustive retrieval rather than just plausible responses.
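The two count-error metrics reduce to one-line functions; a minimal sketch (the value of ε is an assumption):

```python
# Result-accuracy metrics: absolute and normalized count error between the
# system count y_hat and the gold count y = |Ans(q, C)|.

def ace(y_hat, y):
    return abs(y_hat - y)

def nace(y_hat, y, eps=1e-9):
    return abs(y_hat - y) / (y + eps)  # eps guards against y = 0

print(ace(5, 8))             # 3
print(round(nace(5, 8), 3))  # 0.375
```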
4. Dataset and Implementation Resources
AGGBench is distributed with both raw and processed data, as well as modular code for evaluation and agentic baseline experiments:
Repository structure:
data/raw_core/: original PDFs/texts of core papers
data/chunks/: tokenized chunk files
data/queries.json: full query set with predicate templates
data/gold_answers.json: gold-standard entity lists and chunk evidence mappings
code/benchmark.py: harness for evaluation and scoring
code/chunk_retriever.py: BM25 and dense retriever implementations
requirements.txt: dependencies, including transformers, faiss, and rank_bm25
Installation and usage: Python 3.9+ is required, with setup via pip install -r requirements.txt. Data and code are downloaded and referenced by setting the DATAPATH variable; evaluation is then run through the code/benchmark.py harness.
Access: Data and code are available at https://anonymous.4open.science/r/DFA-A4C1 (Zhu et al., 1 Feb 2026).
5. Benchmark Statistics
AGGBench is characterized by its scale, evidence density, and compositional query types.
| Statistic | Value/Range | Notes |
| --- | --- | --- |
| Total queries | 362 | 100 base (single-condition), 262 composite |
| Answer set size ∣Ans(q)∣ | 165 queries with >5 answers | Max: 20 (single), 29 (composite) |
| Core corpus | 45 docs → 4,755 chunks | 294 gold-evidence chunks (6.18%) |
| Expanded corpus | 16,294 chunks | 178 gold-evidence chunks (1.09%) |
| Evidence per query (avg.) | ≈8.1 chunks | Varies by query; reflects multi-chunk evidence necessity |
| Query compositionality | 228 double, 34 triple conditions | 42 AND, 220 OR queries |
This evidentiary sparseness, with many queries requiring synthesis of 8 or more distinct chunks, reflects the realistic difficulty of the "find-all" aggregation setting in unstructured text.
6. Comparison to Prior Approaches
AGGBench exposes shortcomings in prevalent methods for QA over text:
Text-to-SQL (schema-first approaches):
Depends on brittle extraction pipelines to convert text into structured databases, often yielding limited coverage.
Fixed schemas prevent on-the-fly synthesis of new, compositional natural-language queries.
Even accurate SQL cannot query for entities missed by the initial extraction, undermining completeness.
Retrieval-Augmented Generation (RAG, rank-then-read):
Scoring functions optimize top-k relevance, not exhaustive recall. When k ≪ ∣G(q)∣, many evidence chunks are omitted and answer counts are under-reported.
Increasing k undermines answer precision by flooding the context with irrelevant or noisy chunks.
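The top-k recall ceiling for rank-then-read systems can be made concrete with a toy calculation: when a retriever returns only k chunks and k < ∣G(q)∣, chunk-level coverage is capped at k/∣G(q)∣ no matter how good the ranking is.

```python
# Toy illustration of the rank-then-read limitation with k < |G(q)|.

gold = {f"g{i}" for i in range(8)}   # |G(q)| = 8 gold evidence chunks
top_k = [f"g{i}" for i in range(5)]  # a perfect ranker, but k = 5

covered = len(set(top_k) & gold) / len(gold)
print(covered)  # 0.625 (= 5/8, the best any k=5 retriever can achieve)
```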
AGGBench is explicitly designed to isolate aggregation-specific failure modes, such as ambiguous entity boundaries, errors in predicate application ("filter roll-backs"), and misalignment of multi-chunk evidence. The evaluation protocol centers on recall and completeness rather than mere plausibility or relevance (Zhu et al., 1 Feb 2026).
7. Applications and Limitations
Use Cases
Legal e-discovery and contract analytics: e.g., “Find all contracts/papers that mention clause X.”
Financial and compliance auditing: e.g., “How many companies exhibit risk-factor Y in their disclosures?”
Investigative journalism: e.g., “List all sources meeting conditions A∧B among thousands of documents.”
Data-analysis agents: for scenarios where exhaustive filtering of entities from large text corpora is required.
Limitations
Domain specificity: The core is limited to research papers in the graph RAG field; adaptation to domains such as law or finance mandates new data curation and annotation.
Query scope: Only entity-count (aggregation) queries are supported; AGGBench does not address sum, average, or other numerical aggregations beyond counting.
Ambiguity handling: C-type entity ambiguities (granularity, deduplication, unknown labels) are rare in AGGBench and only addressed qualitatively.
No external knowledge: The protocol enforces a strict corpus-only (no outside KBs) policy.
Annotation cost: Despite initial LLM filtering, manual correction remains necessary for approximately 10% of labels.
AGGBench thus provides a rigorously defined foundation for testing, diagnosing, and benchmarking completeness-oriented aggregation query methods over unstructured text, with a reference agentic baseline (the DFA agent) that modularizes the disambiguation, filtering, and aggregation pipeline and exposes key system-level failure points (Zhu et al., 1 Feb 2026).