FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification

Published 26 Apr 2026 in cs.AI, cs.CL, and cs.IR | (2604.23588v1)

Abstract: Financial AI systems must produce answers grounded in specific regulatory filings, yet current LLMs fabricate metrics, invent citations, and miscalculate derived quantities. These errors carry direct regulatory consequences as the EU AI Act's high-risk enforcement deadline approaches (August 2026). Existing hallucination detectors treat all claims uniformly, missing 43% of computational errors that require arithmetic re-verification against structured tables. We present FinGround, a three-stage verify-then-ground pipeline for financial document QA. Stage 1 performs finance-aware hybrid retrieval over text and tables. Stage 2 decomposes answers into atomic claims classified by a six-type financial taxonomy and verified with type-routed strategies including formula reconstruction. Stage 3 rewrites unsupported claims with paragraph- and table-cell-level citations. To cleanly isolate verification value from retrieval quality, we propose retrieval-equalized evaluation as standard methodology for RAG verification research: when all systems receive identical retrieval, FinGround still reduces hallucination rates by 68% over the strongest baseline ($p < 0.01$). The full pipeline achieves a 78% reduction relative to GPT-4o. An 8B distilled detector retains 91.4% F1 at 18x lower per-claim latency, enabling $0.003/query deployment, supported by qualitative signals from a four-week analyst pilot.


Authors (3)

Summary

The paper introduces a finance-aware pipeline that decomposes claims for tailored verification using both textual and tabular evidence.
It employs a novel retrieval component and arithmetic formula reconstruction, achieving 91.4% detection F1 and significant hallucination reduction.
The system demonstrates practical value with cost-effective, near real-time inference, supporting regulatory compliance in financial QA.

FinGround: Finance-aware Hallucination Detection and Grounding via Atomic Claim Verification

Motivation and Background

LLMs powering financial QA systems frequently hallucinate: they fabricate financial metrics, miscalculate derived quantities, and generate unsupported regulatory references. Systematic analysis shows high error rates even for retrieval-augmented models, and the risk of regulatory non-compliance grows under impending high-stakes regimes such as the EU AI Act (Articles 14, 15). Existing hallucination detection methods from general NLP are claim-agnostic and do not address verification challenges unique to finance, particularly computational claims that require arithmetic recomputation over structured tabular data. Prior financial QA systems (e.g., BloombergGPT (Wu et al., 2023), FinGPT (Yang et al., 2023)) and hallucination detectors (e.g., FActScore (Min et al., 2023), SAFE (Wei et al., 2024)) do not deliver integrated claim-level verification with grounding in both unstructured and tabular evidence.

System Architecture

FinGround introduces a three-stage, finance-specific pipeline that addresses the core limitations of claim-agnostic detectors and enables claim-level grounding across unstructured text and table sources:

Finance-aware Retrieval: The pipeline begins with a hybrid retrieval component. A complexity-aware query classifier (RoBERTa-base, 89.3% accuracy) selects among three retrieval strategies: BM25, dense retrieval (E5-large fine-tuned on financial pairs), and a novel table retriever that uses column-header-aware similarity to preserve structured provenance for downstream attribution.
Atomic Claim Verification: Answers are decomposed into atomic claims based on a validated six-type taxonomy (numerical, temporal, entity-attribute, comparative, regulatory, computational). Each claim undergoes type-routed verification, including formula reconstruction for computational claims—identifying implied formulas, extracting operands, and recomputing results within defined tolerances. Evidence alignment is performed using a cross-encoder trained on >8,000 financial NLI examples (87.2% F1 in alignment), and a distilled 8B model delivers verdicts of supported, contradicted, or unverifiable.
Grounded Regeneration: Unsupported or contradicted claims are identified via fuzzy alignment with the answer, and only those spans are regenerated using a research-and-revise strategy anchored in retrieved evidence, with table-cell/paragraph-level inline citations. For high-error or multi-claim hallucinations, full answer regeneration is triggered.
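The type-routed verification in Stage 2 can be illustrated with a minimal sketch of formula reconstruction for computational claims. Everything below is an assumption made for illustration: the `ComputationalClaim` structure, the `FORMULAS` registry, the `verify` router, and the 1% relative tolerance are hypothetical and are not taken from the paper, which does not list its formula set or tolerance values here.

```python
import math
from dataclasses import dataclass
from enum import Enum


class ClaimType(Enum):
    # The six claim types named in the taxonomy.
    NUMERICAL = "numerical"
    TEMPORAL = "temporal"
    ENTITY_ATTRIBUTE = "entity-attribute"
    COMPARATIVE = "comparative"
    REGULATORY = "regulatory"
    COMPUTATIONAL = "computational"


class Verdict(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    UNVERIFIABLE = "unverifiable"


@dataclass
class ComputationalClaim:
    formula: str          # hypothetical formula name, e.g. "pct_change"
    operand_cells: list   # table-cell keys that hold the operands
    stated_value: float   # value asserted in the generated answer


# Hypothetical formula registry (assumption, not the paper's formula set).
FORMULAS = {
    "pct_change": lambda old, new: (new - old) / old * 100.0,
    "ratio": lambda num, den: num / den,
    "difference": lambda a, b: a - b,
}


def verify_computational(claim, table, rel_tol=0.01):
    """Recompute the claimed value from table cells and compare within a tolerance."""
    formula = FORMULAS.get(claim.formula)
    if formula is None:
        return Verdict.UNVERIFIABLE
    try:
        operands = [table[cell] for cell in claim.operand_cells]
        recomputed = formula(*operands)
    except (KeyError, TypeError, ZeroDivisionError):
        # Missing cell, wrong operand count, or undefined arithmetic.
        return Verdict.UNVERIFIABLE
    if math.isclose(recomputed, claim.stated_value, rel_tol=rel_tol):
        return Verdict.SUPPORTED
    return Verdict.CONTRADICTED


def verify(claim_type, claim, table):
    """Route a claim to a type-specific strategy; only one route is sketched."""
    if claim_type is ClaimType.COMPUTATIONAL:
        return verify_computational(claim, table)
    # Other types would go through evidence alignment, which is not sketched here.
    return Verdict.UNVERIFIABLE


# Toy table keyed by (row header, column header), so a verdict can cite its cells.
table = {("Revenue", "FY2022"): 200.0, ("Revenue", "FY2023"): 230.0}
cells = [("Revenue", "FY2022"), ("Revenue", "FY2023")]
good = ComputationalClaim("pct_change", cells, stated_value=15.0)
bad = ComputationalClaim("pct_change", cells, stated_value=13.0)
print(verify(ClaimType.COMPUTATIONAL, good, table))
print(verify(ClaimType.COMPUTATIONAL, bad, table))
```

Keying the table by row and column header is one way to keep the operand cells available for the cell-level citations used in Stage 3; the actual representation used by FinGround is not described in this summary.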

Empirical Performance

FinGround demonstrates significant and robust improvements over SOTA detection methods in financial hallucination detection and mitigation. Under retrieval-equalized conditions, isolating verification from retrieval gains, FinGround reduces hallucination rates by 68% over the strongest baselines ($p<0.01$) and by 78% relative to GPT-4o end-to-end. Key numerical highlights:

Detection F1 (FinHalu, claim level): 91.4% (distilled 8B model), which retains 96.2% of GPT-4o teacher performance at 18x lower per-claim latency.
Computational claim detection: +18.9 F1 improvement over SelfCheckGPT, underscoring the benefit of type-specific, formula-based verification.
End-to-end hallucination rates (FinQA): 3.6% (FinGround) vs. 18.6% (GPT-4o+CoT), with similar advances on TAT-QA and FinanceBench.
Annotation and human validation: Expert-human agreement ($\kappa=0.87$) and direct validation confirm the pipeline's effectiveness, with no evidence of circularity or overfitting to LLM supervision.

Furthermore, an efficient distillation protocol enables cost-effective deployment: the 8B detector model obtains 91.4% F1 at $0.003/query, 18x faster inference per claim, and fits on a single A10G.
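As a quick arithmetic check on the figures above, the sketch below computes the relative hallucination reduction on FinQA and the teacher F1 implied by the 96.2% retention figure. The helper name `relative_reduction` is illustrative, and the second calculation assumes that "teacher performance" refers to the same claim-level F1 metric, which this summary does not state explicitly.

```python
def relative_reduction(baseline_rate, system_rate):
    """Fraction by which system_rate undercuts baseline_rate."""
    return (baseline_rate - system_rate) / baseline_rate


# FinQA end-to-end hallucination rates: 18.6% (GPT-4o+CoT) vs. 3.6% (FinGround).
finqa = relative_reduction(18.6, 3.6)
print(f"FinQA relative reduction: {finqa:.1%}")  # prints "FinQA relative reduction: 80.6%"

# Assumption: 91.4% F1 is 96.2% of the teacher's F1 on the same metric.
implied_teacher_f1 = 91.4 / 0.962
print(f"Implied teacher F1: {implied_teacher_f1:.1f}")  # prints "Implied teacher F1: 95.0"
```

The single-benchmark FinQA figure need not equal the 78% headline reduction, whose aggregation across benchmarks is not specified in this summary.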

Analysis and Ablation

Ablations confirm that architecture, not only domain adaptation, drives performance gains: domain-adapted baselines (HHEM, SelfCheckGPT) close some of the gap, but FinGround's lead persists by 5–12 F1 points at data parity. The largest remaining gap lies in computational and table-dependent claims, validating the importance of both fine-grained claim taxonomy and cross-modal, structure-aware verification. Removing either financial claim typing or table retrieval at least doubles error rates on the relevant QA tasks.

Error analysis highlights:

False negatives: Cluster on numerically-paraphrased claims and near-ground-truth errors, especially for values within ±5% of the correct figure.
False positives: Concentrated in hedged language and restated (ambiguous) figures.
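To study the near-ground-truth false negatives noted above, one option is to bucket erroneous values by their relative deviation from the correct figure. The sketch below is a hypothetical analysis helper, not part of FinGround; the function names are illustrative, and the 5% band simply mirrors the ±5% range mentioned in the error analysis.

```python
def relative_deviation(stated, gold):
    """Signed relative deviation of a stated value from the gold value."""
    if gold == 0:
        raise ValueError("relative deviation is undefined for a zero gold value")
    return (stated - gold) / abs(gold)


def is_near_threshold(stated, gold, band=0.05):
    """True when a wrong value lies within +/- band of the gold value."""
    return stated != gold and abs(relative_deviation(stated, gold)) <= band


# Toy (stated, gold) pairs; only the first is a near-threshold error.
pairs = [(104.0, 100.0), (110.0, 100.0), (100.0, 100.0)]
near = [pair for pair in pairs if is_near_threshold(*pair)]
print(near)
```

Reporting detection recall separately for the near-threshold bucket would make the decline described above directly measurable.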

Cross-generator robustness is strong: F1 drops by only 3–4 points on Llama-3-70B and Claude-3.5-Sonnet outputs, with smallest degradation on computational claims (generator-agnostic arithmetic).

Practical and Theoretical Implications

FinGround affirms that domain-specific, claim-type-aware verification pipelines are necessary for high-stakes financial QA. Such pipelines outperform generic detection and self-consistency-based methods, especially where claims require reasoning over tabular evidence and arithmetic consistency. The work also proposes retrieval-equalized evaluation as a methodological requirement for meaningful RAG verification benchmarking.
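A minimal sketch of what a retrieval-equalized evaluation loop could look like: evidence is retrieved once per question, frozen, and passed unchanged to every system, so any difference in hallucination rate reflects verification rather than retrieval. The function `retrieval_equalized_eval` and its callable arguments (`retrieve`, `systems`, `is_hallucinated`) are hypothetical interfaces assumed for illustration, not the paper's evaluation code.

```python
def retrieval_equalized_eval(questions, retrieve, systems, is_hallucinated):
    """Compare systems on identical, frozen retrieval results.

    questions:        list of unique question strings
    retrieve:         callable(question) -> iterable of evidence passages
    systems:          dict mapping name -> callable(question, evidence) -> answer
    is_hallucinated:  callable(question, answer) -> bool (e.g. gold annotation)
    """
    if not questions:
        raise ValueError("need at least one question")
    # Retrieve once per question so every system sees the same evidence.
    frozen = {q: tuple(retrieve(q)) for q in questions}
    rates = {}
    for name, answer_fn in systems.items():
        flagged = sum(
            bool(is_hallucinated(q, answer_fn(q, frozen[q]))) for q in questions
        )
        rates[name] = flagged / len(questions)
    return rates


# Toy usage with stand-in components (all hypothetical).
calls = []

def toy_retrieve(question):
    calls.append(question)
    return [f"evidence for {question}"]

toy_systems = {
    "grounded": lambda q, ev: ev[0],          # repeats retrieved evidence
    "fabricating": lambda q, ev: "made up",   # ignores the evidence
}
rates = retrieval_equalized_eval(
    ["q1", "q2"], toy_retrieve, toy_systems, lambda q, a: a == "made up"
)
print(rates)
```

Freezing the evidence as tuples guards against a system mutating the shared retrieval results between runs.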

Practically, FinGround offers near real-time deployment (3.8 s p95 latency per query) on commodity hardware. Analyst pilot feedback indicates that table-cell-level grounding and explicit contradiction explanations support real analyst workflows, which is critical for regulatory compliance and oversight.

Theoretically, the architectural approach strengthens arguments for modular, explainable QA pipelines decomposing LLM reasoning into routing, claim decomposition, type-aware verification, and tailored regeneration. Methodological rigor in evaluation—e.g., normalization of granularity, retrieval-equalized baselines—is essential in tightly scoped, high-risk applications.

Limitations and Future Directions

FinGround's validation is concentrated on English-language U.S. SEC filings; transfer to other languages and regulatory regimes requires analogous claim taxonomy validation plus additional domain adaptation. Detection recall declines for near-threshold numeric errors, and error-compounding in multi-claim answers, though mitigated via full regeneration, still warrants deeper process audits. Additional work is needed on confidence calibration and integrating model agreement metrics for borderline decisions.

Possible future extensions include joint training of the whole pipeline, improved self-calibration, enhanced explainability modules, and extending the taxonomy to encompass emerging financial reporting forms (e.g., climate or ESG disclosures).

Conclusion

FinGround delivers a robust, production-scale architecture for hallucination detection and grounding in financial QA, introducing a specialized pipeline that combines claim decomposition, domain-specific alignment, arithmetic verification, and grounded regeneration with explicit evidence attribution. Rigorous benchmarking shows large reductions in hallucination rates and cost under both laboratory and analyst deployment settings, positioning claim-type-aware modular verification as an essential standard for regulated, high-fidelity financial AI systems (2604.23588).
