AGGBench is a benchmark that assesses systems performing entity-level aggregation queries, focusing on exhaustive evidence retrieval and strict completeness.
It combines a core corpus of research papers with noise documents to simulate realistic, large-scale challenges for multi-chunk evidence identification.
The evaluation employs metrics like chunk-level coverage, ACE, and NACE to provide actionable insights into system performance in terms of recall and accuracy.
AGGBench is a benchmark designed to evaluate the completeness and evidence coverage of systems performing entity-level aggregation queries over unstructured text corpora. Unlike typical question answering tasks, aggregation queries require systems to exhaustively identify all entities satisfying complex, compositional conditions, thereby placing stringent demands on evidence retrieval, disambiguation, and aggregation processes. AGGBench is corpus-bounded, explicitly prohibiting the use of external knowledge and focusing evaluation on the ability to “find all” such entities within realistic, large-scale, noisy corpora (Zhu et al., 1 Feb 2026).
1. Formalization of Aggregation over Unstructured Text
AGGBench targets the formal problem of entity-level aggregation querying under a strict completeness regime. The corpus C={c1,…,cM} is partitioned into M text chunks; E(C) denotes all entity mentions across C. Each query q specifies:
an entity type T
a predicate set Φ={ϕ1,…,ϕm}, each a boolean condition on entities of type T, satisfied only if there is explicit evidence in C
The exact answer set is:
Ans(q,C) = { e ∈ E(C) ∣ type(e) = T ∧ ∀ϕᵢ ∈ Φ : ϕᵢ(e) = true }
where Ans(q,C) = ⋃_{c∈C} Ans(q,c). For each e, at least one supporting chunk evidencing each predicate must be found. AGGBench emphasizes strict recall: all satisfying entities and their corresponding evidentiary chunks must be recovered, not merely plausible answers.
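The exact-answer-set definition above can be sketched in a few lines of Python. This is an illustrative toy, not the AGGBench harness: chunks are plain strings and the predicate representation (callables over an entity and a chunk) is an assumption.

```python
# Toy sketch of Ans(q, C): an entity is in the answer set only if every
# predicate phi_i has explicit supporting evidence in at least one chunk.

def answer_set(chunks, entities_of_type, predicates):
    """Return all entities of the query's type for which each predicate
    is evidenced by some chunk in the corpus."""
    answers = set()
    for e in entities_of_type:  # e ∈ E(C) with type(e) = T
        if all(any(phi(e, c) for c in chunks) for phi in predicates):
            answers.add(e)
    return answers

# Hypothetical example: predicates check for explicit in-chunk mentions.
chunks = [
    "HotpotQA is a dataset used for multi-hop question answering.",
    "MuSiQue is a dataset used for multi-hop question answering.",
    "SQuAD is a dataset for single-hop reading comprehension.",
]
datasets = {"HotpotQA", "MuSiQue", "SQuAD"}
multi_hop = lambda e, c: e in c and "multi-hop" in c
print(sorted(answer_set(chunks, datasets, [multi_hop])))  # ['HotpotQA', 'MuSiQue']
```

Note that SQuAD is excluded even though it is mentioned in the corpus: there is no chunk that explicitly evidences the predicate for it, which is exactly the corpus-bounded evidence requirement.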
The key evaluation metric is evidence coverage at the chunk level:
Coverage(q) = ∣R(q) ∩ G(q)∣ / ∣G(q)∣
where G(q) is the gold set of evidence chunks, and R(q) is the set returned by the system.
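The coverage metric is a straightforward set ratio; a minimal sketch (the handling of an empty gold set is an assumption, not specified in the benchmark description):

```python
# Chunk-level evidence coverage: fraction of gold chunks G(q) that appear
# in the system-returned set R(q).

def coverage(returned_ids, gold_ids):
    gold = set(gold_ids)
    if not gold:
        return 1.0  # assumption: an empty gold set counts as fully covered
    return len(set(returned_ids) & gold) / len(gold)

print(coverage({"c1", "c2", "c3"}, {"c2", "c3", "c7", "c9"}))  # 0.5
```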
2. Benchmark Construction and Annotation
The construction of AGGBench is designed to enable completeness-oriented evaluation under realistic, noisy conditions.
Corpus Design
Core corpus: 45 research papers from the “graph retrieval–augmented generation” literature (e.g., NeurIPS, ICLR), chunked into 200–300-token segments, totaling 4,755 chunks.
Expansion: 11,539 unrelated, noise-inducing documents were added, yielding a final corpus of 16,294 chunks. BM25 proximity filtering ensured that no new satisfying entities were introduced by noise documents.
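The fixed-size segmentation described above can be sketched as follows. Whitespace tokenization is an assumption here; the summary does not specify the benchmark's exact tokenizer.

```python
# Illustrative chunker in the spirit of AGGBench's 200-300-token segments.

def chunk_text(text, max_tokens=250):
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

doc = ("word " * 600).strip()          # a 600-token stand-in document
chunks = chunk_text(doc, max_tokens=250)
print(len(chunks))                     # 3 segments: 250 + 250 + 100 tokens
print(len(chunks[-1].split()))         # 100
```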
Query and Condition Generation
Entity types (T) were extracted and ranked by frequency, with manual curation to remove ambiguous categories.
Conditions (ϕ) were mined via high-frequency descriptive phrases (e.g., “used for multi-hop QA,” “applied to legal domain”), then manually refined for compositionality and unambiguous meaning.
Resulting queries are natural-language prompts about entity counts, such as:
“How many datasets are used for multi-hop question answering?”
“How many papers apply to the legal domain?”
The benchmark comprises 362 queries: 100 base (single-condition), and 262 composite (multiple AND/OR conditions).
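Composite queries combine base conditions with boolean connectives. A hedged sketch of that composition, with hypothetical entity-level predicates (in the benchmark itself, each condition must additionally be grounded in chunk evidence):

```python
# Composing base conditions into composite AND/OR queries.

def AND(*preds):
    return lambda e: all(p(e) for p in preds)

def OR(*preds):
    return lambda e: any(p(e) for p in preds)

# Toy entity records standing in for evidence-grounded attributes.
entities = {
    "DatasetA": {"multi_hop": True,  "legal": False},
    "DatasetB": {"multi_hop": True,  "legal": True},
    "DatasetC": {"multi_hop": False, "legal": True},
}
multi_hop = lambda e: entities[e]["multi_hop"]
legal     = lambda e: entities[e]["legal"]

q_and = AND(multi_hop, legal)  # "multi-hop AND legal domain"
q_or  = OR(multi_hop, legal)   # "multi-hop OR legal domain"
print(sum(q_and(e) for e in entities))  # 1
print(sum(q_or(e) for e in entities))   # 3
```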
Evidence Annotation Workflow
Annotation is two-stage:
LLM pre-annotation: An LLM annotates each (query, chunk) pair as positive or negative and extracts candidate entities, filtering out roughly 90% of pairs as clear negatives.
Human verification: Annotators review and correct LLM outputs, ensure adequate evidence grounding for each entity, and consolidate multi-chunk evidence. Only about 10% of LLM annotations require correction.
3. Metrics and Evaluation Protocol
AGGBench provides a modular evaluation protocol targeting both completeness and accuracy:
Evidence completeness is measured by chunk-level recall:
Coverage(q) = ∣R(q) ∩ G(q)∣ / ∣G(q)∣
Result accuracy metrics:
ACE (Absolute Count Error): ACE(q) = ∣ŷ − y∣, where y = ∣Ans(q,C)∣ is the gold count and ŷ is the system's predicted count.
NACE (Normalized ACE): NACE(q) = ∣ŷ − y∣ / (y + ε), with ε preventing division by zero.
Coverage captures whether all relevant evidence is found. High ACE/NACE generally reflects low coverage, underscoring the principal challenge of achieving exhaustive retrieval rather than just plausible responses.
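The two count-error metrics reduce to one-line functions; a minimal sketch (the value of ε is an assumption):

```python
# Result-accuracy metrics: absolute and normalized count error between the
# system count y_hat and the gold count y = |Ans(q, C)|.

def ace(y_hat, y):
    return abs(y_hat - y)

def nace(y_hat, y, eps=1e-9):
    return abs(y_hat - y) / (y + eps)  # eps guards against y = 0

print(ace(5, 8))             # 3
print(round(nace(5, 8), 3))  # 0.375
```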
4. Dataset and Implementation Resources
AGGBench is distributed with both raw and processed data, as well as modular code for evaluation and agentic baseline experiments:
Repository structure:
data/raw_core/: original PDFs/texts of core papers
data/chunks/: tokenized chunk files
data/queries.json: full query set with predicate templates
data/gold_answers.json: gold-standard entity lists and chunk evidence mappings
code/benchmark.py: harness for evaluation and scoring
code/chunk_retriever.py: BM25 and dense retriever implementations
requirements.txt: dependencies, including transformers, faiss, and rank_bm25
Installation and usage: Python 3.9+ is required, with setup via pip install -r requirements.txt. Data and code are downloaded and referenced by setting the DATAPATH variable; evaluation is then run through the code/benchmark.py harness.
Access: Data and code are available at https://anonymous.4open.science/r/DFA-A4C1 (Zhu et al., 1 Feb 2026).
5. Benchmark Statistics
AGGBench is characterized by its scale, evidence density, and compositional query types.
| Statistic | Value/Range | Notes |
| --- | --- | --- |
| Total queries | 362 | 100 base (single-condition), 262 composite |
| Answer set size ∣Ans(q)∣ | 165 queries with >5 answers | Max: 20 (single), 29 (composite) |
| Core corpus | 45 docs → 4,755 chunks | 294 gold-evidence chunks (6.18%) |
| Expanded corpus | 16,294 chunks | 178 gold-evidence chunks (1.09%) |
| Evidence per query (avg.) | ≈8.1 chunks | Varies by query; reflects multi-chunk evidence necessity |
| Query compositionality | 228 double, 34 triple conditions | 42 AND, 220 OR queries |
This evidentiary sparseness, with many queries requiring synthesis of 8 or more distinct chunks, reflects the realistic difficulty of the "find-all" aggregation setting in unstructured text.
6. Comparison to Prior Approaches
AGGBench exposes shortcomings in prevalent methods for QA over text:
Text-to-SQL (schema-first approaches):
Depends on brittle extraction pipelines to convert text into structured databases, often yielding limited coverage.
Fixed schemas prevent on-the-fly synthesis of new, compositional natural-language queries.
Even accurate SQL cannot query for entities missed by the initial extraction, undermining completeness.
Retrieval-Augmented Generation (RAG, rank-then-read):
Scoring functions optimize top-k relevance, not exhaustive recall. When k ≪ ∣G(q)∣, many evidence chunks are omitted and answer counts are under-reported.
Increasing k undermines answer precision by flooding the context with irrelevant or noisy chunks.
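The top-k recall ceiling for rank-then-read systems can be made concrete with a toy calculation: when a retriever returns only k chunks and k < ∣G(q)∣, chunk-level coverage is capped at k/∣G(q)∣ no matter how good the ranking is.

```python
# Toy illustration of the rank-then-read limitation with k < |G(q)|.

gold = {f"g{i}" for i in range(8)}   # |G(q)| = 8 gold evidence chunks
top_k = [f"g{i}" for i in range(5)]  # a perfect ranker, but k = 5

covered = len(set(top_k) & gold) / len(gold)
print(covered)  # 0.625 (= 5/8, the best any k=5 retriever can achieve)
```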
AGGBench is explicitly designed to isolate aggregation-specific failure modes, such as ambiguous entity boundaries, errors in predicate application ("filter roll-backs"), and misalignment of multi-chunk evidence. The evaluation protocol centers on recall and completeness rather than mere plausibility or relevance (Zhu et al., 1 Feb 2026).
7. Applications and Limitations
Use Cases
Legal e-discovery and contract analytics: e.g., “Find all contracts/papers that mention clause X.”
Financial and compliance auditing: e.g., “How many companies exhibit risk-factor Y in their disclosures?”
Investigative journalism: e.g., “List all sources meeting conditions A∧B among thousands of documents.”
Data-analysis agents: for scenarios where exhaustive filtering of entities from large text corpora is required.
Limitations
Domain specificity: The core is limited to research papers in the graph RAG field; adaptation to domains such as law or finance mandates new data curation and annotation.
Query scope: Only entity-count (aggregation) queries are supported; AGGBench does not address sum, average, or other numerical aggregations beyond counting.
Ambiguity handling: C-type entity ambiguities (granularity, deduplication, unknown labels) are rare in AGGBench and only addressed qualitatively.
No external knowledge: The protocol enforces a strict corpus-only (no outside KBs) policy.
Annotation cost: Despite initial LLM filtering, manual correction remains necessary for approximately 10% of labels.
AGGBench thus provides a rigorously defined foundation for testing, diagnosing, and benchmarking completeness-oriented aggregation query methods over unstructured text, with a reference agentic baseline (the DFA agent) that modularizes the disambiguation, filtering, and aggregation pipeline and exposes key system-level failure points (Zhu et al., 1 Feb 2026).