
Finance Agent Benchmark (FAB)

Updated 6 March 2026
  • FAB is a comprehensive, multi-dimensional evaluation suite that measures real-world competence, alignment, and safety of LLMs and agents in finance.
  • It benchmarks diverse financial use cases—from fundamental analysis and decision-making to end-to-end workflows and compliance—using structured taxonomies and performance metrics.
  • FAB employs agent-environment architectures and tool orchestration to simulate realistic financial tasks, revealing challenges in retrieval accuracy, numerical reasoning, and safety alignment.

A Finance Agent Benchmark (FAB) is a rigorous, multi-dimensional evaluation suite purpose-built to measure the real-world competence, alignment, and safety of LLMs and agents operating in finance and financial research. Contemporary FABs integrate complex, workflow-centric tasks, agent-environment interaction, tool orchestration, and alignment diagnostics, exposing the nuanced requirements and unresolved challenges of automating high-stakes financial reasoning and decision support.

1. Scope and High-Level Taxonomy

FABs target a broad spectrum of financial agent use cases, ranging from fundamental analysis, wealth management, decision-making, risk management, and workflow automation to advanced tasks such as agentic retrieval, compliance, ESG analysis, and spreadsheet-centric enterprise operations. Benchmarks in the literature are organized into several core domains:

| FAB Domain | Representative Benchmarks | Primary Task Types |
|---|---|---|
| Fundamental Analysis | FinAR-Bench (Wu et al., 22 May 2025) | Extract–compute–reason (financial statements) |
| Decision-Making | InvestorBench (Li et al., 2024) | POMDP trading, investing, forecasting |
| Agentic Retrieval | FinAgentBench (Choi et al., 7 Aug 2025) | Document-type reasoning, passage selection |
| Workflow/E2E | Finch (Dong et al., 15 Dec 2025), FinGAIA (Zeng et al., 23 Jul 2025), FAB (Bigeard et al., 20 May 2025) | Workflow simulation, research pipelines, QA |
| Safety & Alignment | FinTrust/FAB (Hu et al., 17 Oct 2025), FinVault (Yang et al., 9 Jan 2026) | Safety/fairness/robustness disclosure |
| ESG & Verticals | ESGAgent/FAB (Zhao et al., 13 Jan 2026) | ESG compliance, professional report synthesis |

Financial Agent Benchmarks formalize rich taxonomies: e.g., nine financial research categories in (Bigeard et al., 20 May 2025), seven research task types in (Sun et al., 22 Jul 2025), or hierarchical scenario depths and domains in (Zeng et al., 23 Jul 2025). These taxonomies emphasize both coverage (across subdomains such as securities, banking, insurance, etc.) and escalating complexity: from atomic retrieval and QA to open-ended, agentic, and judgment-driven synthesis.

2. Agent, Environment, and Workflow Architectures

Modern FABs embrace agent-environment architectures, with agents functioning as planning, reasoning, and interactional units capable of invoking external tools (e.g., EDGAR APIs, CRM systems, data stores), managing persistent memory, and orchestrating multi-step workflows.

  • Decision-Making and Memory: InvestorBench (Li et al., 2024) models single-asset trading as a partially observable MDP, integrating perception, profiling, layered memory (working + long-term), and an action module. Multi-modal state spaces (OHLCV, news, filings) underpin asset selection via flexible LLM "brains."
  • Agentic Tool Use: Benchmarks such as the FAB in (Bigeard et al., 20 May 2025), FinGAIA (Zeng et al., 23 Jul 2025), and Finch (Dong et al., 15 Dec 2025) employ tool-augmented harnesses for scraping, file parsing, spreadsheet editing, and API interaction, pushing agents beyond static prompt-completion toward operational financial automation.
  • Logic Tree Extraction: FinResearchBench (Sun et al., 22 Jul 2025) introduces a logic-tree-based "Agent-as-a-Judge" paradigm—extracting hierarchical argument/evidence trees from generated research for structural and qualitative scoring.
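The decision-making architecture described above (perception, layered memory, action module) can be sketched as a minimal loop; the class, method names, and momentum rule here are illustrative stand-ins, not InvestorBench's actual API or a real LLM "brain":

```python
from collections import deque

class TradingAgent:
    """Illustrative POMDP-style trading agent: perceive -> remember -> act."""

    def __init__(self, memory_size=5):
        self.working_memory = deque(maxlen=memory_size)  # recent observations
        self.long_term_memory = []                       # persisted summaries

    def perceive(self, observation):
        # observation: dict with OHLCV data, news, filings for one step
        self.working_memory.append(observation)

    def act(self):
        # Stand-in for an LLM "brain": a naive momentum rule over closes.
        closes = [o["close"] for o in self.working_memory]
        if len(closes) < 2:
            return "HOLD"
        return "BUY" if closes[-1] > closes[-2] else "SELL"

agent = TradingAgent()
for price in [100.0, 101.5, 101.0]:
    agent.perceive({"close": price, "news": []})
action = agent.act()  # last close fell, so the momentum rule says "SELL"
```

In a full benchmark harness, the `act` heuristic would be replaced by an LLM call conditioned on both memory tiers, and the environment would return a reward (PnL) driving the evaluation metrics.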

Example pipeline for document-centric analysis in FinAgentBench (Choi et al., 7 Aug 2025):

  1. Given a query, rank relevant document types (10-K, 8-K, etc.).
  2. Within the selected document, perform passage-level ranking to extract fine-grained answers.
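The two-stage pipeline above can be sketched as follows, with a toy lexical scorer standing in for the LLM ranker (function names, the corpus, and the scoring rule are all illustrative):

```python
def score(query, text):
    """Toy relevance score: fraction of query terms found in the text."""
    terms = query.lower().split()
    return sum(t in text.lower() for t in terms) / len(terms)

def two_stage_retrieve(query, corpus, top_docs=1, top_passages=2):
    """Stage 1: rank document types; Stage 2: rank passages within the winners."""
    # corpus maps document type to its passages, e.g. {"10-K": [...], "8-K": [...]}
    ranked = sorted(corpus, key=lambda d: score(query, " ".join(corpus[d])),
                    reverse=True)
    best = ranked[:top_docs]
    passages = [p for d in best for p in corpus[d]]
    return best, sorted(passages, key=lambda p: score(query, p),
                        reverse=True)[:top_passages]

corpus = {
    "10-K": ["annual revenue grew 12 percent",
             "risk factors include rate exposure"],
    "8-K": ["the company announced a new CEO"],
}
docs, passages = two_stage_retrieve("annual revenue growth", corpus)
# docs -> ["10-K"]; top passage -> "annual revenue grew 12 percent"
```

The observed gap between near-perfect document-type ranking and weak chunk-level retrieval corresponds to stage 1 versus stage 2 of this pipeline.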

Explicit agent-environment architectures, sandboxed platforms (with Docker orchestration, persistent audit logs (Milsom, 1 Dec 2025)), and adversarial multi-turn threat models (test-mode, permission abuse (Yang et al., 9 Jan 2026)) have become reference standards.

3. Task Formalism, Metrics, and Error Taxonomies

FABs explicitly define task structures, scoring methodologies, and error analyses that reflect real-world finance requirements:

Typical Formalisms:

  • POMDP/MDP Formulation: Agents maximize discounted rewards in simulated trading (Li et al., 2024), with $O_t$ (observation), $B_t$ (belief/memory), $\mathcal{A}$ (action space), $R_t$ (reward, e.g. PnL), and $\pi$ (policy).
  • Table-Driven Financial Analysis: Extraction, indicator calculation, and reasoning defined via precise schemas, e.g., RMS Precision/Recall in FinAR-Bench (Wu et al., 22 May 2025), with alignment on attribute, value, and key-matching.
  • Workflow Completion: Task success as binary (pass/fail) per human-annotator or LLM-as-judge rubric (Dong et al., 15 Dec 2025), with agent trajectories logged for auditability.

Representative Metrics:

| Metric | Formula / Definition | Benchmark Context |
|---|---|---|
| Cumulative Return (CR) | $CR = \prod_{t=1}^{T} (1 + r_t) - 1$ | InvestorBench |
| Sharpe Ratio (SR) | $SR = \frac{\mathbb{E}[r_p]}{\sigma[r_p]}$ | InvestorBench, FAB (Lin et al., 22 Feb 2026) |
| nDCG@k, MAP@k, MRR@k | Rank-based retrieval accuracy | FinAgentBench (Choi et al., 7 Aug 2025) |
| Safety Violation Rate | $V_{safety}^{(k)} = 1 - \frac{1}{N_k} \sum_{i=1}^{N_k} v_i^{(k)}$ | FinTrust/FAB (Hu et al., 17 Oct 2025) |
| Checkpoint Success Rate | $S_{cp} = \frac{\sum_{i=1}^{T} cp\_passed_i}{C_{total}}$ | Wealth Mgt FAB (Milsom, 1 Dec 2025) |
| Workflow Pass Rate | $\mathrm{PassRate} = \frac{\#\text{workflows passed}}{\#\text{total workflows}}$ | Finch (Dong et al., 15 Dec 2025) |
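These metrics are straightforward to compute from a return series and per-case verdicts; a minimal self-contained sketch (risk-free rate omitted from the Sharpe ratio for brevity):

```python
import math

def cumulative_return(returns):
    """CR = prod(1 + r_t) - 1 over the period returns."""
    cr = 1.0
    for r in returns:
        cr *= 1.0 + r
    return cr - 1.0

def sharpe_ratio(returns):
    """SR = mean(r_p) / std(r_p); risk-free rate omitted for brevity."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    return mean / math.sqrt(var)

def violation_rate(verdicts):
    """V = 1 - mean(v_i), where v_i = 1 if case i was handled safely."""
    return 1.0 - sum(verdicts) / len(verdicts)

returns = [0.01, -0.02, 0.03]
cr = cumulative_return(returns)    # (1.01 * 0.98 * 1.03) - 1
vr = violation_rate([1, 1, 0, 1])  # one unsafe case in four -> 0.25
```

Checkpoint success and workflow pass rates are simple ratios of the same shape as `violation_rate` and are omitted here.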

Error taxonomies in (Jiang et al., 7 Feb 2026), (Dong et al., 15 Dec 2025), and (Choi et al., 7 Aug 2025) separate retrieval, generation/hallucination, financial-calculation, and query-understanding errors. Fine-grained categories (comparative stance hallucinations, entity/time mismatch, formula reasoning, code generation) enable diagnostic reporting.
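Diagnostic reporting over such a taxonomy amounts to aggregating labeled failures into per-category rates; a small sketch with hypothetical labels and records:

```python
from collections import Counter

# Illustrative top-level taxonomy following the categories named above;
# the labels and the failure records are hypothetical.
TAXONOMY = {"retrieval", "hallucination", "calculation", "query_understanding"}

def error_profile(failures):
    """Aggregate labeled failures into per-category rates."""
    counts = Counter(f["category"] for f in failures)
    unknown = set(counts) - TAXONOMY
    if unknown:
        raise ValueError(f"unrecognized categories: {unknown}")
    total = sum(counts.values())
    return {cat: counts.get(cat, 0) / total for cat in sorted(TAXONOMY)}

profile = error_profile([
    {"task": "q1", "category": "retrieval"},
    {"task": "q2", "category": "calculation"},
    {"task": "q3", "category": "retrieval"},
])
# retrieval errors dominate this toy sample at 2/3
```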

4. Dataset Construction, Modalities, and Domain Coverage

FABs are sourced from authentic, enterprise-grade corpora, regulatory filings, synthetic agent environments, and real or simulated market data. Domain scope is ensured by:

  • Source Artefacts: Enron spreadsheets and email threads, SEC EDGAR filings (10-K, 10-Q, DEF 14A, earnings calls), synthetic financial ledgers, CRM/advisory platforms, and sustainability/ESG reports (Dong et al., 15 Dec 2025, Bigeard et al., 20 May 2025, Zhao et al., 13 Jan 2026).
  • Data Diversity: Multimodal inputs (text, tables, PDFs, images, audio, excel), with compositional and longitudinal design—identifying cross-entity, multi-period, and multi-tool integration as critical factors (Jiang et al., 7 Feb 2026, Dong et al., 15 Dec 2025).
  • Task Splitting: From atomic (True/False, MCQ) financial literacy to multi-step research, document synthesis, cross-file aggregation, and end-to-end portfolio analysis.

For example, ESGAgent/FAB (Zhao et al., 13 Jan 2026) delivers a tiered benchmark: atomic question answering, intermediate compositional/multimodal tasks, and generative professional report synthesis, with scoring along citation correctness, analysis depth, and chart expressiveness.
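Scoring along several rubric dimensions, as in the report-synthesis tier, reduces to a weighted aggregate; the weights below are hypothetical, since the text does not specify ESGAgent/FAB's actual weighting:

```python
# Hypothetical rubric weights; the actual ESGAgent/FAB weighting is unspecified.
WEIGHTS = {
    "citation_correctness": 0.4,
    "analysis_depth": 0.4,
    "chart_expressiveness": 0.2,
}

def report_score(scores):
    """Weighted aggregate of per-dimension scores, each in [0, 1]."""
    assert set(scores) == set(WEIGHTS), "all rubric dimensions required"
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

s = report_score({
    "citation_correctness": 0.9,
    "analysis_depth": 0.7,
    "chart_expressiveness": 0.5,
})  # 0.4*0.9 + 0.4*0.7 + 0.2*0.5 = 0.74
```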

5. Agentic Safety, Alignment, and Trust Evaluation

Given their high-stakes exposure, FABs systematically measure agent robustness, safety alignment, fairness, and transparency, and test against operational vulnerabilities:

  • Safety, Fairness, Fiduciary Alignment: FinTrust/FAB (Hu et al., 17 Oct 2025) scores per-dimension: safety violation rates under adversarial prompt attacks; fairness disparity (Cohen’s d); fiduciary prompt invariance and explicit disclosure completeness. Empirical findings document severe gaps: genetic-algorithm attacks subvert most models; legal conflict-of-interest disclosure remains unsolved.
  • Execution-Grounded Security: FinVault (Yang et al., 9 Jan 2026) instantiates 31 regulatory sandbox scenarios, with agents controlling state-changing APIs (e.g., credit issuance, payments), and judges compliance violations by inspecting the business-state database rather than output text. Attack success rate (ASR), vulnerability compromise, and defense FPR/TPR quantify risk.
  • AgentOps and Governance: Open FAB suites integrate agent trace logging, glass-box testing, and streaming LLM-as-Judge pipelines (AgentOps framework (Lin et al., 22 Feb 2026)) for in-production benchmarking, continuous feedback, and AI governance (Linux Foundation-compliant modules, MOF openness classification).
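Execution-grounded judging, where compliance is checked against the business-state database the agent mutated rather than its text output, can be sketched as follows; the schema, table name, and policy limit are illustrative, not FinVault's actual environment:

```python
import sqlite3

def judge_credit_issuance(conn, per_customer_limit=10_000):
    """Flag customers whose total issued credit exceeds the policy limit,
    by inspecting the state database rather than the agent's text output."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM credit_issued GROUP BY customer"
    ).fetchall()
    return [customer for customer, total in rows if total > per_customer_limit]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE credit_issued (customer TEXT, amount REAL)")
# Simulated agent actions: two issuances to one customer breach the limit,
# even though each individual call might have looked compliant in isolation.
conn.executemany("INSERT INTO credit_issued VALUES (?, ?)",
                 [("alice", 6000), ("alice", 7000), ("bob", 2000)])
violators = judge_credit_issuance(conn)  # -> ["alice"]
```

The point of this design is that a polite refusal in the transcript cannot mask a harmful state change: the judge sees only what the APIs actually did.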

6. Comparative Results, Bottlenecks, and Analysis

FABs provide extensive empirical results and comparative evaluation guidance:

  • E2E Workflow Performance: On the FAB (Bigeard et al., 20 May 2025), even frontier agents (OpenAI o3) cap at 46.8% class-balanced accuracy on recent SEC filings, with significant failure modes in hallucinations, retrieval, and numerical reasoning.
  • Multi-Task Degradation: In Finch (Dong et al., 15 Dec 2025), GPT-5.1 Pro passes only 38.4% of >150 real-world enterprise workflows (25% for Claude 4.5), with pass rates dropping sharply as workflows interleave more steps, modalities, and cross-file dependencies.
  • Alignment Shortfalls: FinTrust/FAB (Hu et al., 17 Oct 2025) reveals proprietary models nearly solve static safety and fairness, but no system exhibits robust legal disclosure; all models are prompt-sensitive and misalign on fiduciary tasks.
  • Agentic Retrieval Gap: In FinAgentBench (Choi et al., 7 Aug 2025), state-of-the-art LLMs nearly perfect document-type ranking ($\text{MRR@5} \sim 0.89$) but struggle on chunk-level retrieval ($\text{nDCG@5} \sim 0.35$-$0.42$), indicating major IR and context-comprehension bottlenecks.
| FAB Setting | Top Agent Result | Expert/Human | Notable Failure Modes |
|---|---|---|---|
| SEC Report FAB | 46.8% CBA (OpenAI o3) | 84.7% (Finance PhDs) | Hallucination, tool mishandling, tabular extraction |
| ESGAgent/FAB | 84.15% (ESGAgent) | N/A | Citation faithfulness, chart integration |
| Finch (Enterprise WF) | 38.4% (GPT-5.1 Pro) | N/A | Structuring/formatting, cross-file calculation errors |
| Decision-Making | 30–45% CR gain (GPT-4) | Baseline | Small models underperform baseline; volatility gaps |
| Safety/Compliance | 6.7% ASR (Claude Haiku 4.5) | N/A | Execution attacks, multi-turn trust erosion |

7. Limitations, Open Challenges, and Future Directions

Current FABs expose fundamental unmet needs in both benchmarking methodology and agentic model capabilities:

  • Multimodal and Longitudinal Reasoning: Existing agents consistently underperform when integrating cross-document, multi-period, or multimodal sources, with error propagation dominating multi-step workflow evaluation.
  • Safety/Compliance Alignment: Execution-grounded safety, not just content compliance, is unresolved; roleplay, progressive prompt injection, and semantic attacks evade most defenses in realistic settings (Yang et al., 9 Jan 2026).
  • Evaluation Science: Automated grading (LLM-as-judge) is only partially reliable and requires rigorous human calibration; metric aggregation and partial credit for synthesis tasks remain open technical challenges.
  • Domain Expansion and Scenario Breadth: Further coverage is needed for region-specific regulation, derivatives, credit risk, advisory tasks, and counterfactual robustness (data perturbations, adversarial evaluation).
  • System Integration and Open Leadership: Open FAB platforms increasingly emphasize continuous benchmarking (leaderboards (Lin et al., 22 Feb 2026)), agent operational reliability, governance-compliant design, and reproducibility of both code and financial data lineage.

A plausible implication is that future FABs will converge on aligned, execution-grounded, workflow-centric frameworks—anchoring both research and industry deployment of financial AI agents around comprehensive, diagnosis-centric, and regulation-aware evaluation standards.
