FinForge-5k: Finance LM Benchmark
- FinForge-5k is a large-scale benchmark that assesses language models on complex financial reasoning through over 5,000 expert-validated Q/A pairs.
- It utilizes a semi-synthetic pipeline combining expert curation, automated corpus construction, and multi-stage LM validation to ensure high-quality evaluations.
- The benchmark covers 11 financial subdomains and provides detailed performance metrics, highlighting model strengths and areas for improvement.
FinForge-5k is a large-scale, high-fidelity benchmark designed to evaluate LMs in the domain of finance. As a product of the FinForge pipeline, it provides over 5,000 expert- and LM-validated multiple-choice question–answer (Q/A) pairs probing conceptual, quantitative, and integrative financial reasoning. Each Q/A item is self-contained and spans one of eleven rigorously structured subdomains, facilitating granular assessments of domain competence and reasoning fidelity in LMs. FinForge-5k addresses the chronic scarcity of high-quality, domain-specific evaluation resources for high-stakes financial applications (Matlin et al., 11 Jan 2026).
1. Scope, Structure, and Subdomain Coverage
FinForge-5k consists of more than 5,000 multiple-choice Q/A pairs, validated by domain experts and by automated LM-based evaluation, and chosen to systematically probe the nuances of specialized financial reasoning. Each question is designed around a practical or theoretical scenario, combining contextual background, plausible distractors, a correct answer, and difficulty ratings from 1 to 5.
The benchmark’s coverage encompasses the following eleven subdomains:
- Alternative Investments & Real Estate
- Behavioral & Quantitative Finance
- Corporate Finance & Valuation
- FinTech & Innovation
- Financial Accounting & Reporting
- Financial Ethics & Governance
- Markets & Derivatives
- Regulation & Compliance
- Investment & Portfolio Management
- Personal Finance & Wealth Management
- Public & International Finance
Questions range from basic definitions and market knowledge to advanced multi-hop, counterfactual, and quantitative reasoning. For example, a representative quantitative item (difficulty = 3) is:
"A 5-year bond with a face value of \$1,000 pays annual coupons of 6%. If the yield to maturity is 5%, what is its Macaulay duration (to the nearest 0.1 year)? A. 4.33 B. 4.57 C. 4.75 D. 5.00"
This typifies the benchmark’s focus on real-world, calculation-driven, and context-rich evaluation (Matlin et al., 11 Jan 2026).
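The arithmetic behind such an item can be sketched directly. The following is a minimal illustration (not code from the benchmark) of Macaulay duration as the present-value-weighted average time to each cash flow:

```python
def macaulay_duration(face, coupon_rate, ytm, years):
    """Macaulay duration: PV-weighted average time (in years) to each cash flow."""
    cashflows = [face * coupon_rate] * years
    cashflows[-1] += face  # principal is repaid with the final coupon
    pvs = [cf / (1 + ytm) ** t for t, cf in enumerate(cashflows, start=1)]
    price = sum(pvs)
    return sum(t * pv for t, pv in enumerate(pvs, start=1)) / price

d = macaulay_duration(face=1000, coupon_rate=0.06, ytm=0.05, years=5)
print(round(d, 2))  # ≈ 4.48
```

A correct solver must discount each cash flow at the yield to maturity, not the coupon rate, which is exactly the kind of step models tend to fumble.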
2. Semi-Synthetic Benchmark Pipeline
FinForge-5k is constructed through a four-stage hybrid pipeline, uniquely integrating expert-guided and programmatic methodologies:
a. Expert-Guided Data Curation. Domain experts devised a subdomain taxonomy and hand-selected authoritative sources (textbooks, regulatory documents, institutional research). Non-authoritative sources (forums, opinion blogs) were rigorously excluded for fidelity.
b. Programmatic Corpus Construction. Over 100,000 finance documents (>143 million tokens) were extracted and normalized using automated scrapers (Trafilatura, BeautifulSoup, PyMuPDF4LLM). Corpus assembly leveraged heuristics such as keyword co-occurrence and sitemap traversal for source ranking and filtration.
c. Structured Question Generation. Utilizing Gemini 2.5 Flash, a five-stage LM workflow analyzed documents to surface causal relations, core assumptions, alternative hypotheses, and potential counterfactual variants. An ‘answer plan’ specified target concept, embedded data, and difficulty. The LM-generated package included the question, four distractors, correct answer, and concise rationale.
d. Automated Validation (LM-as-Judge). Each Q/A was assessed by Gemini 2.5 Flash along five rubric dimensions: domain relevance, self-sufficiency, logical consistency, clarity, and complexity. Only Q/As passing all five criteria advanced.
This semi-synthetic approach, combining expert oversight and iterative LM refinement, ensures stringent content quality and domain alignment throughout the benchmark (Matlin et al., 11 Jan 2026).
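The LM-as-judge gate in stage (d) can be sketched as a simple all-pass filter over the five rubric dimensions. In this illustration the judge's per-dimension verdicts are precomputed booleans; in the actual pipeline each dimension would be scored by Gemini 2.5 Flash:

```python
# The five rubric dimensions named in the pipeline description.
RUBRIC = ("domain_relevance", "self_sufficiency",
          "logical_consistency", "clarity", "complexity")

def passes_rubric(scores: dict) -> bool:
    """An item advances only if every rubric dimension is satisfied."""
    return all(scores.get(dim, False) for dim in RUBRIC)

# Hypothetical candidates: item 2 fails the clarity criterion.
candidates = [
    {"id": 1, "scores": {dim: True for dim in RUBRIC}},
    {"id": 2, "scores": {**{dim: True for dim in RUBRIC}, "clarity": False}},
]
kept = [q for q in candidates if passes_rubric(q["scores"])]
print([q["id"] for q in kept])  # → [1]
```

The conjunctive gate (all five must pass) is what makes the filter strict: a single failed dimension eliminates the item.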
3. Source Corpus Statistics and Authority
FinForge-5k draws on an authoritative corpus comprising:
| Attribute | Value | Examples |
|---|---|---|
| Number of Documents | >100,000 | Academic textbooks, policy papers, research |
| Total Corpus Size | >143 million tokens | “Core” and emerging financial topics |
| Source Institutions | Regulatory, academic, institutional | Basel Committee, CFA Institute, IMF |
Content coverage stretches from foundational finance (e.g., discounting, risk-return) to advanced or novel domains like decentralized finance, with stringent vetting for authority and relevance. The inclusion of policy documents and official releases ensures up-to-date and application-ready material.
4. Human-in-the-Loop and Automated Validation
The original pool of approximately 10,000 LM-generated items underwent intensive automated filtering, reducing the set to 5,000 by enforcing the five-dimension rubric. This automated stage was followed by an expert review of a stratified 10% sample (500 questions) by three finance practitioners.
- Expert Pass Rate: 70% of reviewed items were judged clear, accurate, and contextually self-contained.
- Expert-Flagged Issues (30%):
- Ambiguous phrasing or missing assumptions.
- Reliance on implicit figures/tables not provided.
This gap indicates that, in high-stakes contexts, exclusively LM-driven validation is over-optimistic. Sustained domain-expert oversight remains necessary for benchmark reliability and validity (Matlin et al., 11 Jan 2026).
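The stratified 10% review sample can be sketched as follows; the pool construction and `subdomain` field here are hypothetical stand-ins for the actual item metadata:

```python
import random
from collections import defaultdict

def stratified_sample(items, key, fraction, seed=0):
    """Draw a fixed fraction from each stratum so every group is represented."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical pool: 5,000 items spread across 11 subdomains.
pool = [{"id": i, "subdomain": f"sub{i % 11}"} for i in range(5000)]
review_set = stratified_sample(pool, key=lambda q: q["subdomain"], fraction=0.10)
print(len(review_set))  # ≈ 500
```

Stratifying by subdomain, rather than sampling uniformly, guarantees that even the smallest subdomains appear in the expert-reviewed set.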
5. Empirical Evaluation and Subdomain-Level Comparisons
The principal evaluation metric is accuracy, computed as

$$A = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{y}_i = y_i\right],$$

where $N$ is the number of items, $y_i$ is the keyed answer to item $i$, and $\hat{y}_i$ is the option selected by the model.
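As a minimal sketch, this metric over a batch of multiple-choice predictions:

```python
def accuracy(predictions, answers):
    """Fraction of items where the model's selected option matches the key."""
    assert len(predictions) == len(answers), "one prediction per item"
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))  # → 0.75
```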
A comparative assessment across leading open- and closed-source LMs on the 5,000-item dataset yields the following overall accuracies:
| Model | Accuracy |
|---|---|
| Qwen 3 235B (open) | 0.771 |
| DeepSeek V3.1 (open) | 0.739 |
| GPT-4o (closed) | 0.734 |
| Qwen3-Next 80B (open) | 0.732 |
| Claude Sonnet 4 (closed) | 0.726 |
| Llama 3.3 70B (open) | 0.725 |
| OLMo 2 7B/32B (open) | 0.608/0.567 |
| Llama 4 Scout (open) | 0.465 |
Subdomain-level breakdowns reveal systematic differences in LM proficiency:
- Strongest: FinTech & Innovation (>0.94), Financial Ethics & Governance (>0.93)
- Weakest: Personal Finance & Wealth Management (~0.61–0.65), Corporate Finance & Valuation (~0.69–0.74)
- Intermediate: Markets & Derivatives (0.82–0.87), Portfolio Management (0.74–0.80)
These results demonstrate substantial heterogeneity both among models and across financial subfields, indicating persistent domain challenges (Matlin et al., 11 Jan 2026).
6. Limitations, Observed Model Failure Modes, and Directions for Improvement
Analysis of model outputs on FinForge-5k surfaces several predominant limitations:
- Quantitative Reasoning Weakness: LM accuracy drops markedly on multi-step arithmetic or data synthesis tasks. Arithmetic slip-ups are common even when the methodology is correct.
- Conceptual Failures: Models misapply financial logic or overlook fundamental assumptions, particularly on tasks requiring integration of tax, liquidity, and risk.
- Multi-Constraint and Multi-Hop Reasoning: Complexity increases dramatically for tasks with multiple constraints or counterfactual elements, especially in Personal Finance and Corporate Valuation.
- LM-as-Judge Shortcomings: Automated rubric validation is over-optimistic relative to human experts, suggesting that rubrics or validation LMs require further fine-tuning.
Recommended improvements include:
- Integration of external programmatic or calculator modules to support arithmetic.
- Implementation of domain-adapted LMs or stricter rubric for LM-as-judge, maintaining human-in-the-loop review.
- Expansion to additional subdomains such as structured products or cryptocurrencies, and regular updating to track temporal market developments.
- Development of continual-learning benchmarks to guard against data leakage and maintain contemporary relevance.
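The first recommendation, routing arithmetic to an external tool, can be sketched with a minimal safe evaluator. This is an illustrative design (not from the paper) that walks a parsed expression tree and permits only whitelisted arithmetic operators, so a model-emitted expression cannot execute arbitrary code:

```python
import ast
import operator

# Whitelisted operators for a minimal, safe arithmetic evaluator.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def calculate(expr: str) -> float:
    """Evaluate a pure-arithmetic expression emitted by a model."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")
    return ev(ast.parse(expr, mode="eval").body)

# e.g. discounting the final cash flow of the bond item in Section 1:
print(calculate("1060 / 1.05**5"))  # ≈ 830.54
```

Delegating such steps to a deterministic calculator targets exactly the "arithmetic slip-ups despite correct methodology" failure mode noted above.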
FinForge-5k thus provides a rigorous, multi-dimensional benchmark for both diagnosing LM performance bottlenecks and informing the design of future finance-specialized LLMs (Matlin et al., 11 Jan 2026).