
Corporate Filing QA Benchmarks

Updated 13 January 2026
  • Corporate Filing QA Benchmarks are standardized frameworks offering structured datasets, evaluation protocols, and baseline systems to measure QA performance in diverse corporate documents.
  • They incorporate heterogeneous formats—including narratives, tables, and figures—to support extraction, numerical reasoning, and multi-hop logic in complex filings.
  • These benchmarks drive improvements in financial analysis and ESG reporting by enabling reproducible, evidence-based assessments of answer quality.

Corporate Filing QA Benchmarks comprise structured datasets, evaluation protocols, and baseline systems for systematically measuring question-answering (QA) performance over complex corporate filings. Such filings include annual reports, sustainability disclosures, and regulatory documents, and are characterized by heterogeneous formats: intermixed narratives, tables, and figures, often spanning hundreds of pages. These benchmarks address crucial tasks in financial analysis, ESG (Environmental, Social, Governance) reporting, and information disclosure quality assessment. The ecosystem encompasses datasets tailored to both sustainability and non-sustainability domains, with specialized metrics to assess fact extraction, numerical reasoning, cross-document retrieval, and answer faithfulness.

1. Taxonomy of Corporate Filing QA Benchmarks

A diverse set of benchmarks targets the spectrum of real-world corporate-filing QA requirements:

| Name | Domain Scope | Document Sources | Distinctive Features |
|---|---|---|---|
| Climate Finance Bench | Sustainability, ESG | 33 English sustainability reports | RAG baselines, carbon metrics, sectoral diversity (Mankour et al., 28 May 2025) |
| ESGBench | Explainable ESG QA | 12 ESG/TCFD/CSR reports | Fine-grained evidence, table QA, explainability (George et al., 20 Nov 2025) |
| SustainableQA | ESG, EU Taxonomy, Sustainability | 61 German/Austrian annual/sust. reports | 195k QA pairs, hybrid extraction, table transformation (Ali et al., 5 Aug 2025) |
| FinAgentBench | Financial reporting (SEC) | 10-K/10-Q filings (US) | Card-based reranking, constraint satisfaction (Zhou et al., 11 Jan 2026) |
| FinTruthQA | Disclosure quality, Q&A | China SSE/SZSE Q&A platforms | Human-annotated for answer quality, Chinese corpus (Xu et al., 2024) |
| SECQUE | Financial analysis (SEC) | 45 US filings, 29 companies | 565 expert questions, LLM as judge, numerical insight (Yoash et al., 6 Apr 2025) |
| SEC-QA | Multi-document, multi-reasoning | 1,315 SEC filings, tabular DB | Continuous generation, program-of-thought (2406.14394) |

Domain coverage ranges from pure financial metrics and risk assessment to advanced sustainability taxonomies and answer quality. Some benchmarks emphasize explainability or multi-hop reasoning, while others focus on fine-grained retrieval or the ability to process novel, evolving corpora.

2. Dataset Construction and Annotation Methodologies

Benchmark construction methods address the inherent heterogeneity of filings:

  • Document Selection and Preprocessing: Reports, typically PDFs or HTML filings, are parsed and chunked to conserve logical structure (headings, tables, captions). For instance, Climate Finance Bench uses overlapping 2,048-token chunks with special handling for tables (Mankour et al., 28 May 2025). SustainableQA automates PDF-to-Markdown conversion, preserving tables and narrative alignment (Ali et al., 5 Aug 2025). SEC-QA parses HTML filings into JSON with structured tables, supporting fine-grained grounding (2406.14394).
  • Question-Answer Pair Curation: Manual expert annotation is central in Climate Finance Bench (seven ESG analysts, 330 QA pairs), ESGBench (prompt-based, de-duplicated spans with evidence_quote), and SECQUE (565 expert-written questions supported by extracted filing chunks) (Mankour et al., 28 May 2025, George et al., 20 Nov 2025, Yoash et al., 6 Apr 2025). SustainableQA employs a hybrid approach, combining fine-tuned NER, rule-based extraction, and LLM-driven refinement for large-scale QA generation (Ali et al., 5 Aug 2025).
  • Coverage of Reasoning Types: Benchmarks distinguish pure extraction (span lookup), numerical reasoning (arithmetic over extracted numbers), logical reasoning (multi-passage chaining), comparative or trend analysis, and multi-span/grouped answers (Mankour et al., 28 May 2025, George et al., 20 Nov 2025).
  • Ground-truth Evidence and Traceability: Many QA sets are explicitly evidence-grounded, requiring answers to be justified by verbatim or minimal supporting spans. ESGBench mandates a matching evidence_quote, while FinTruthQA encodes answer quality on a 3-level scale (George et al., 20 Nov 2025, Xu et al., 2024).
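The overlapping-chunk preprocessing described above can be sketched as follows. Whitespace splitting stands in for a real tokenizer, and the default chunk size mirrors the 2,048-token setting reported for Climate Finance Bench; the overlap value and function names are illustrative assumptions, not any benchmark's exact pipeline.

```python
# Sketch of sliding-window chunking for long filings. Token counts use
# whitespace splitting as a stand-in for a real tokenizer; the overlap
# default is an assumption, not a published setting.

def chunk_tokens(tokens, chunk_size=2048, overlap=256):
    """Split a token list into fixed-size chunks with a sliding overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final chunk already covers the document tail
    return chunks

def chunk_report(text, chunk_size=2048, overlap=256):
    """Whitespace-tokenize a report and return overlapping text chunks."""
    tokens = text.split()
    return [" ".join(c) for c in chunk_tokens(tokens, chunk_size, overlap)]
```

In practice the chunker would also carry along structural metadata (headings, table boundaries, captions) so that retrieved chunks remain groundable, as the benchmarks above require.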

3. Evaluation Protocols and Metrics

Benchmarks employ a suite of rigorous metrics, often combining span-level, document-level, and answer-level evaluation:

  • Retrieval Metrics: Passage Recall, Recall@K, Evidence Selection Accuracy, nDCG@k, MAP@k, MRR@k (Mankour et al., 28 May 2025, Zhou et al., 11 Jan 2026, George et al., 20 Nov 2025).
    • For instance, nDCG@10 and MRR@10 are used in FinAgentBench to quantify early-rank retrieval quality (Zhou et al., 11 Jan 2026).
  • Answer Spans/Quality:
    • Exact Match (EM): Percentage of predictions matching gold answers exactly (token or span-level).
    • Token-level Precision, Recall, F₁: Computed over overlapping (predicted, gold) tokens.
    • Numeric Accuracy@ε%: For numeric QAs, tolerance-based match on values/units.
    • EM and F₁ Formulas (ESGBench, SEC-QA): F₁ = 2 × (Precision × Recall) / (Precision + Recall)
  • Automated Judging: SECQUE introduces SECQUE-Judge, an ensemble LLM-based panel that assigns fully correct, partially correct, or incorrect labels, with demonstrated high F1 alignment to human experts (Yoash et al., 6 Apr 2025).
  • Quality Annotations: FinTruthQA rates Q&A pairs on structured and semantic dimensions—readability (clarity, logic), direct relevance, with inter-annotator agreement tracked via Kappa (Xu et al., 2024).
  • Additional Metrics: BLEU, ROUGE-L, METEOR for non-factoid and table-based QA (SustainableQA); QWK for ordinal scales (FinTruthQA).
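The core answer- and retrieval-level metrics above can be implemented compactly. Normalization and tolerance conventions differ across benchmarks; the choices below (lowercasing, whitespace tokens, relative epsilon) are assumptions, not any one benchmark's exact recipe.

```python
# Illustrative metric implementations: EM, token-level F1,
# Numeric Accuracy@eps%, and MRR@k. Conventions are assumptions.
from collections import Counter

def normalize(text):
    return " ".join(text.lower().split())

def exact_match(pred, gold):
    """EM: 1 if the normalized prediction equals the gold answer."""
    return int(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Token-level F1 over the overlap of predicted and gold tokens."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def numeric_match(pred_value, gold_value, eps_pct=1.0):
    """Numeric Accuracy@eps%: relative-tolerance match on extracted values."""
    if gold_value == 0:
        return int(pred_value == 0)
    return int(abs(pred_value - gold_value) / abs(gold_value) * 100 <= eps_pct)

def mrr_at_k(ranked_ids, gold_id, k=10):
    """MRR@k: reciprocal rank of the first gold passage within the top k."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid == gold_id:
            return 1.0 / rank
    return 0.0
```

A tolerance-based numeric metric is what lets benchmarks credit an answer of 1.54 against a gold value of 1.55, while EM alone would mark it wrong.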

4. Baseline Architectures and Systematic Findings

A spectrum of retrieval, reranking, and answer-generation architectures has been benchmarked:

  • Retrieval-Augmented Generation (RAG): All major benchmarks employ some form of RAG, typically comparing dense vector retrieval (sentence-transformers, Ada), sparse (BM25), hybrid fusion, and various chunk selection/fusion methods (Mankour et al., 28 May 2025, George et al., 20 Nov 2025, 2406.14394).
  • Structured Reranking: FinCARDS reframes passage scoring as constraint satisfaction, introducing "Cards" encoding explicit schema—entities, financial metrics, periods, numeric spans—to enforce field-level matching and stability (Zhou et al., 11 Jan 2026). This multi-stage reranking outperforms BM25 and zero-shot LLM reranking (+27.3 nDCG@10), reduces ranking variance, and enables auditable decision traces.
  • Program-of-Thought Pipelines: SEC-QA demonstrates that decomposing questions via code-generating LLMs to select documents, retrieve pages, and extract values, with helper functions for fine-grained selection, yields dramatic accuracy improvements in MDQA (up to 80% Exact Match in multi-document QA, compared to 30–55% for generic RAG) (2406.14394).
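The card-based constraint-satisfaction idea can be sketched minimally as below. The card schema (entity / metric / period fields) and the scoring rule are illustrative assumptions in the spirit of the approach, not the FinCARDS implementation.

```python
# Minimal sketch of card-style, field-level passage scoring: a query card
# states explicit constraints, and passages are ranked by how many of
# those fields their own card satisfies. Schema and rule are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Card:
    entity: Optional[str] = None   # e.g. company or segment name
    metric: Optional[str] = None   # e.g. "revenue", "scope 1 emissions"
    period: Optional[str] = None   # e.g. "FY2023", "Q2 2024"

def card_score(query: Card, passage: Card) -> float:
    """Fraction of the query's populated fields the passage card satisfies."""
    constraints = [(q, p) for q, p in (
        (query.entity, passage.entity),
        (query.metric, passage.metric),
        (query.period, passage.period),
    ) if q is not None]
    if not constraints:
        return 0.0
    satisfied = sum(1 for q, p in constraints
                    if p is not None and q.lower() == p.lower())
    return satisfied / len(constraints)

def rerank(query: Card, passages):
    """Order (passage_id, Card) pairs by descending constraint satisfaction."""
    ranked = sorted(passages, key=lambda x: card_score(query, x[1]), reverse=True)
    return [pid for pid, _ in ranked]
```

Because every ranking decision reduces to explicit field matches, this style of reranking yields the auditable decision traces and reduced ranking variance reported above, unlike opaque zero-shot LLM scoring.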

5. Analysis of Experimental Outcomes

Quantitative and qualitative analyses reveal both advances and persistent challenges:

  • Error Analysis identifies systemic weaknesses:
    • Table parsing and unit mismatch errors (ktCO₂e vs tCO₂e, million/billion confusion) (George et al., 20 Nov 2025).
    • Retrieval layer misses gold evidence, especially for multi-hop or composite queries—resulting in hallucinated, partial, or empty answers.
    • Prompt brittleness: paraphrasing where verbatim quotation is required; dropped decimal precision (George et al., 20 Nov 2025, Mankour et al., 28 May 2025).
    • Models adept at structural question identification but less reliable on nuanced answer quality or indirect logic (e.g., cross-sentence entailment) (Xu et al., 2024).
  • Category-Specific Performance:
    • Numerical and extraction tasks consistently outperform logical/multi-hop and analyst insight tasks. For example, Climate Finance Bench reports 69.7% correctness for numerical, 65.7% extraction, but only ~50% on logical reasoning QAs (Mankour et al., 28 May 2025). In SECQUE, “Analyst Insights” represents the hardest class (strict accuracy 0.69 vs. 0.46–0.65 for smaller models) (Yoash et al., 6 Apr 2025).
    • Large-scale benchmarks (e.g., SustainableQA, 195k QA pairs) enable fine-grained span complexity analysis—single-span dominates, but multi-span and reasoning clusters are challenging for pointer-generator models (Ali et al., 5 Aug 2025).
  • Evaluation Automation and Alignment:
    • SECQUE-Judge’s LLM panel closely matches human raters (F1(2) = 0.85), supporting scalable free-text QA evaluation (Yoash et al., 6 Apr 2025).
    • SEC-QA finds a strong empirical correlation between upstream retrieval recall and final QA EM (R²≈0.94), reiterating the primacy of retrieval optimization (2406.14394).
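The recall-to-EM relationship reported above can be quantified with an ordinary least-squares coefficient of determination. The (recall, EM) pairs below are illustrative placeholders, not data from SEC-QA.

```python
# R^2 of a simple linear fit of downstream answer EM on upstream
# retrieval recall. The sample points are illustrative, not SEC-QA data.

def r_squared(xs, ys):
    """Coefficient of determination for a least-squares line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical (retrieval recall, answer EM) pairs for illustration.
recall = [0.30, 0.45, 0.60, 0.75, 0.90]
em = [0.25, 0.40, 0.58, 0.70, 0.88]
```

A high R² on such pairs is what grounds the recommendation to optimize the retrieval layer before the generation layer.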

6. Open Problems, Limitations, and Future Directions

Several limitations and prospective research trajectories are consistently highlighted:

  • Retrieval and Structure: All evidence indicates that type-aware, card-based, or entity-centric retrieval (Typed-RAG, GraphRAG) will be required to overcome multi-hop and cross-document challenges—generic dense or BM25 retrieval is insufficient (Mankour et al., 28 May 2025, Zhou et al., 11 Jan 2026).
  • Explainability and Traceability: Explicit evidence grounding increases transparency, but automated tools for chain-of-thought, rationale extraction, and evidence alignment remain open problems (George et al., 20 Nov 2025).
  • Coverage and Domain Shift: Current datasets may be biased by language (English, Chinese, German), geography, or sector (large multinationals overrepresented), and limited in scale or domain (ESG focus vs. broader filings) (George et al., 20 Nov 2025, Ali et al., 5 Aug 2025).
  • Evaluation Extensions: Most benchmarks lack fine-grained test splits, adversarial/noisy context, or robust multilingual evaluation; integrating layout-aware models (LayoutLM), programmable retrieval, and scalable human validation is prioritized (George et al., 20 Nov 2025, Ali et al., 5 Aug 2025, 2406.14394).
  • Sustainability and Carbon Footprint: Benchmarking must include transparent carbon accounting for LLM inference, as in Climate Finance Bench, to ensure alignment with “AI for climate” objectives (Mankour et al., 28 May 2025).
  • Data Pipeline Reproducibility: Extensive reliance on proprietary LLMs/API limits reproducibility and automation (table transformation, answer verification) (Ali et al., 5 Aug 2025).

7. Representative Examples and Practical Usage

Benchmarks report typical QA pairs covering the full complexity spectrum:

  • Extraction: “Has the company identified significant decarbonization levers?” → “Yes. The company’s decarbonization levers are fleet electrification, renewable procurement, and energy-efficiency upgrades” (Mankour et al., 28 May 2025).
  • Numerical Reasoning: “Carbon intensity (tCO₂/million USD) for FY 2023?” → “1,850,000 tCO₂ ÷ 1,200 million USD ≈ 1,542 tCO₂ per million USD” (Mankour et al., 28 May 2025).
  • Logical Reasoning: “Is the decarbonization trajectory compatible with a 1.5°C scenario?” → “Yes, the −50% target by 2030 aligns with the required ~55% reduction, with offsets possible for the gap” (Mankour et al., 28 May 2025).
  • ESGBench: “How did the percentage of renewable energy in 2021 compare to 2020?” → “It increased from 27 % to 35 %.” Evidence: “Renewables comprised 27 % in 2020 and 35 % in 2021” (George et al., 20 Nov 2025).
  • Factoid/Span Complexity (SustainableQA): Multi-span—“Which three conditions constitute the substantial contribution criterion?” → “verified emission savings,” “climate-neutral energy inputs,” “end-of-life recycling plans” (Ali et al., 5 Aug 2025).

These examples illustrate the benchmarks’ intent: to rigorously stress LLM-based QA systems on real-world, heterogeneous disclosure corpora, supporting reproducible, auditable comparisons for both financial and sustainability applications.
