BizFinBench.v2: Financial LLM Benchmark
- BizFinBench.v2 is a comprehensive, bilingual financial benchmark that evaluates LLM performance using real user queries from major equity markets.
- It systematically integrates offline and online tasks to bridge the gap between simulated evaluations and real-time business-critical financial workflows.
- It employs rigorous human validation, high-dimensional clustering, and dynamic scenario assessments to provide actionable insights for deploying financial LLMs.
BizFinBench.v2 is a large-scale, bilingual, dual-mode benchmark designed to evaluate the financial capabilities of LLMs in business-critical contexts. Comprising 29,578 expert-level Q&A pairs grounded in authentic queries from both Chinese A-share and U.S. equity market platforms, BizFinBench.v2 aims to address the disconnect between traditional, static, simulated benchmarks and the operational realities faced by institutions and retail participants. The benchmark integrates both offline and online task assessments, enforcing rigorous alignment with real-time business needs and domain-specific workflows while enabling precise measurement of LLM performance in financial service deployment (Guo et al., 10 Jan 2026).
1. Motivation and Advancement over Prior Benchmarks
Existing benchmarks targeting financial LLMs predominantly use synthetic or general-purpose data, with an emphasis on narrow, offline Q&A scenarios. The result is a lack of authenticity and no means of assessing real-time responsiveness, both of which are necessary in operational finance. Traditional datasets, such as those based on simulated filings or white papers, omit the noise, complexity, and dynamic feedback loops present in genuine user queries on financial platforms. Consequently, benchmark results poorly predict real-world efficacy in settings like trading arenas, research desks, or advisory platforms.
BizFinBench.v2 responds to these deficits by:
- Utilizing real user queries representing both retail and institutional participants across major markets (China and U.S.).
- Covering both offline (static, document-based) and online (live, time-sensitive) scenarios.
- Offering a taxonomy of fundamental business tasks, reflecting actual workflows encountered in financial services.
- Providing human-validated Q&A pairs after rigorous quality control, clustering, and desensitization—from nearly unfiltered platform logs to high-fidelity, domain-specific data (Guo et al., 10 Jan 2026).
2. Data Collection and Construction Workflow
The dataset construction pipeline commences with the acquisition of millions of raw queries from leading Chinese and U.S. equity market platforms. After automated desensitization, queries are subjected to high-dimensional embedding, followed by clustering via the k-means objective:

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,$$

where $x_i$ are embedding vectors for each query and $\mu_k$ is the centroid for cluster $C_k$. Silhouette metrics guide cluster validity. This process yields semantically coherent groups, subsequently mapped to a ten-task taxonomy by domain experts.
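The clustering step above can be illustrated with a toy k-means implementation that minimizes the objective $J$; this is a self-contained sketch in plain Python, not the paper's pipeline, and the 2-D points below stand in for high-dimensional query embeddings produced by an embedding model:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means; returns (centroids, assignments, objective J)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: dist2(p, centroids[c]))
        # Update step: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    # The k-means objective J: total within-cluster squared distance.
    J = sum(dist2(p, centroids[assign[i]]) for i, p in enumerate(points))
    return centroids, assign, J

# Toy "embeddings": two well-separated groups stand in for query vectors.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assign, J = kmeans(pts, k=2)
```

In the real pipeline the silhouette score over candidate values of $K$ would then guide the choice of cluster count before experts map clusters to tasks.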
Validation involves a tri-stage human annotation workflow:
- Frontline business staff review,
- Expert cross-validation,
- Consistency and logical verification.
The final dataset encompasses 29,578 bilingual Q&A pairs, structured over four business scenarios and representing eight offline and two online financial tasks. All data, scripts, and simulators are open-sourced at https://github.com/HiThink-Research/BizFinBench.v2 (Guo et al., 10 Jan 2026).
3. Task Taxonomy and Scenarios
BizFinBench.v2 defines four business scenarios, each with tailored task types designed to stress critical, expert-level financial reasoning. Tasks are divided as follows:
| Scenario (abbr.) | Task Name (abbr.) | Core Input/Output |
|---|---|---|
| Business Information Provenance | AIT, FMP, FDD | Data tracing, multi-turn dialogue, error detection |
| Financial Logic Reasoning | FQC, ELR, CI | Computation, event ordering, counterfactual reasoning |
| Stakeholder Feature Perception | SA, FRA | Sentiment scoring, report-based financial ranking |
| Real-time Market Discernment | SPP, PAA | Price prediction, live portfolio allocation |
A more detailed breakdown:
- Anomaly Information Tracing (AIT): Identifies causal data points for price anomalies from noisy, multi-source data.
- Financial Multi-turn Perception (FMP): Pinpoints semantically correct responses within extended chat logs.
- Financial Data Description (FDD): Detects logical or numerical inconsistencies within data description sets.
- Financial Quantitative Computation (FQC): Produces precise numerical outputs or marks queries unanswerable, based on relevant/irrelevant input data pools.
- Event Logic Reasoning (ELR): Orders sequences of events by chronology or causality.
- Counterfactual Inference (CI): Derives conclusions from “what if” scenarios leveraging historical industry and policy data.
- User Sentiment Analysis (SA): Assigns sentiment intervals to users based on contextual profiles and market/news data, with success measured by conformal coverage.
- Financial Report Analysis (FRA): Ranks firm performance from multiple quarterly statements.
- Stock Price Prediction (SPP): Gives interval forecasts for closing prices using one month of historical, multi-modal input.
- Portfolio Asset Allocation (PAA): Executes hourly trading decisions using live data and measures success using Total Return (TR), Sharpe Ratio (SR), Max Drawdown (MD), and Profit Factor (PF) (Guo et al., 10 Jan 2026).
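Several of the tasks above (SA, SPP) are scored by interval coverage: a prediction counts as a success when the realized value falls inside the model's forecast interval. A minimal sketch of that criterion follows; the interval construction itself (e.g. conformal calibration) is task-specific and not shown, and the sample values are hypothetical:

```python
def interval_coverage(intervals, truths):
    """Fraction of ground-truth values that land inside their predicted [lo, hi] interval."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, truths))
    return hits / len(truths)

# Hypothetical closing-price intervals vs. realized prices: 3 of 4 are covered.
preds = [(95.0, 105.0), (48.0, 52.0), (10.0, 12.0), (200.0, 210.0)]
actuals = [101.3, 47.5, 11.0, 205.8]
coverage = interval_coverage(preds, actuals)
```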
4. Evaluation Protocols and Metrics
BizFinBench.v2 employs rigorous, scenario-aligned evaluation metrics:
- Offline Tasks: Accuracy is defined as $\text{Accuracy} = N_{\text{correct}} / N_{\text{total}}$, with zero-shot evaluation (no in-context examples).
- Sentiment Analysis and Price Prediction: Conformal prediction intervals are judged successful if the ground truth $y$ satisfies $y \in [\hat{y}_{\text{lo}}, \hat{y}_{\text{hi}}]$, with tolerances of 10% (sentiment) and a task-specified band (price prediction), respectively.
- Portfolio Asset Allocation (PAA): Evaluated on TR, SR, and MD as

$$\text{TR} = \frac{V_T - V_0}{V_0}, \qquad \text{SR} = \frac{\mathbb{E}[R] - R_f}{\sigma_R}, \qquad \text{MD} = \max_t \frac{\max_{s \le t} V_s - V_t}{\max_{s \le t} V_s},$$

where $V_t$ is the portfolio value at time $t$, $R$ the per-period return, and $R_f$ the risk-free rate, with all online evaluation running live through December 24, 2025 and actual fees and slippage incorporated.
- Task-Specific Judgement: Direct human accuracy and ranking for non-quantitative sub-tasks.
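The online PAA metrics above can be computed directly from an equity curve. The following is a minimal sketch using standard textbook definitions (the benchmark's exact annualization, fee, and slippage handling are not reproduced here, and the sample curve is hypothetical):

```python
def portfolio_metrics(equity):
    """TR, per-period Sharpe ratio (risk-free rate 0), max drawdown, and
    profit factor from a list of portfolio values over time."""
    returns = [equity[i] / equity[i - 1] - 1 for i in range(1, len(equity))]
    tr = equity[-1] / equity[0] - 1                      # Total Return
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    sr = mean / var ** 0.5 if var > 0 else float("inf")  # Sharpe Ratio (unannualized)
    peak, md = equity[0], 0.0                            # Max Drawdown
    for v in equity:
        peak = max(peak, v)
        md = max(md, (peak - v) / peak)
    gains = sum(r for r in returns if r > 0)
    losses = -sum(r for r in returns if r < 0)
    pf = gains / losses if losses > 0 else float("inf")  # Profit Factor
    return tr, sr, md, pf

# Hypothetical hourly equity curve for a small account.
curve = [100.0, 103.0, 101.0, 107.0, 105.0]
tr, sr, md, pf = portfolio_metrics(curve)
```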
A key design property is the inclusion of real-time, online tasks, which directly assess an LLM’s ability to adapt to dynamic market states—criteria absent from prior benchmarks (Guo et al., 10 Jan 2026).
5. Empirical Findings and Model Ranking
Major experimental results for both proprietary and open-source LLMs highlight the current capability gap:
| Model | Offline Accuracy (%) | Online PAA TR (%) | Online SPP Coverage (%) | Sentiment Coverage (%) |
|---|---|---|---|---|
| ChatGPT-5 | 61.5 | –0.92 | – | – |
| Gemini-3 | 61.3 | – | – | – |
| Doubao-Seed-1.6 | 59.4 | – | – | – |
| Qwen3-235B-A22B | 53.3 | +8.43 | – | – |
| DeepSeek-R1 | – | +13.46 (SR=1.8) | – | – |
| Market Benchmark (SPY) | – | +4.74 | – | – |
Stock price prediction (SPP) shows best-case interval coverage at 36.9%; sentiment analysis achieves just 23.5% interval coverage. Even the best proprietary LLM, ChatGPT-5, lags financial experts by roughly 23 percentage points (84.8% accuracy for experts vs. 61.5% for the model). In online asset allocation, DeepSeek-R1 achieves the highest TR (+13.46%), outperforming even SPY (+4.74%), while ChatGPT-5 yields net losses (–0.92%). Financial-domain LLMs such as Dianjin-R1 and FinX1 lag considerably, with 35.7% and 27.9% accuracy, respectively (Guo et al., 10 Jan 2026).
6. Error Analysis and Deficiency Taxonomy
Manual categorization (20% random sample) of erroneous outputs surfaces five persistent, business-relevant failure types:
- Financial Semantic Deviation: Domain-specific term misinterpretation.
- Long-term Business Logic Discontinuity: Broken multi-step logical inference chains.
- Multivariate Integrated Analysis Deviation: Inadequate weighting of heterogeneous data sources.
- High-precision Computational Distortion: Numerical or unit errors.
- Financial Time-Series Logical Disorder: Temporal or causal misorderings in time-series.
These findings indicate shortfalls in complex integration, sustained logical coherence, and high-precision reasoning—crucial for financial deployment. A plausible implication is the necessity for targeted pre-training, extended context capacity, and specialized tool integration to bridge the expert-level gap observed in live business settings (Guo et al., 10 Jan 2026).
7. Open Resources and Infrastructure
BizFinBench.v2 data, formal task descriptions, evaluation scripts, and the full Portfolio Asset Allocation simulator are provided openly at https://github.com/HiThink-Research/BizFinBench.v2. This resource enables the broader research community to perform precise, business-aligned evaluations and accelerates transparent progress assessment for LLMs in the global financial domain (Guo et al., 10 Jan 2026).