CFinBench: Chinese Financial NLP Benchmark

Updated 26 January 2026
  • CFinBench is a comprehensive Chinese financial evaluation benchmark that assesses LLMs' domain-specific knowledge and reasoning through 99,100 expert-curated questions.
  • It employs diverse question formats—including single-choice, multiple-choice, and judgment—to test factual accuracy, regulatory understanding, and practical skills.
  • Experimental results show wide variation in model performance, with the top model exceeding 60% average accuracy, underscoring the need for stronger legal and compliance reasoning.

CFinBench is a large-scale, systematic evaluation benchmark targeting the financial knowledge and reasoning abilities of LLMs in a Chinese context. Developed to address limitations in the coverage, granularity, and realism of prior Chinese financial NLP benchmarks, CFinBench sets a new standard in scope and methodological robustness for the assessment of domain-specific LLM competency in financial fields (Nie et al., 2024).

1. Scope and Dataset Structure

CFinBench comprises 99,100 expert-curated questions, hierarchically organized into four primary categories reflecting the skillset and knowledge base required by Chinese financial professionals:

  • Financial Subject: Assesses theoretical knowledge in foundational areas such as political economy, micro/macroeconomics, statistics, auditing, economic history, and finance.
  • Financial Qualification: Focuses on the content of Ministry of Finance, securities, banking, and certifying exams (e.g., Certified Public Accountant, Securities Practitioner, Tax Practitioner).
  • Financial Practice: Evaluates capacities needed for occupational roles including tax consultancy, various accountant and economist grades, asset appraisal, and securities analysis.
  • Financial Law: Encapsulates the requirements of domain-specific legal frameworks: tax law, economic law, civil law, banking law, insurance law, commercial law, and related statutes.

Within these, 43 second-level subcategories cover both horizontal (cross-discipline) and vertical (specialization) facets, mapped closely to real exam blueprints and job requirements. Question sources are primarily mock exams and internal training materials, with PDF/EPUB/Word parsing, fastText filtering, and MinHash de-duplication ensuring dataset integrity (Nie et al., 2024).
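
The MinHash de-duplication step above can be illustrated with a minimal, self-contained sketch: hash character shingles under many salted hash functions and compare the per-salt minima to estimate Jaccard similarity. This is the general MinHash idea only; the paper's actual shingling scheme, permutation count, and similarity threshold are not specified here.

```python
import hashlib
import re

def shingles(text, n=3):
    """Character n-grams after stripping whitespace."""
    text = re.sub(r"\s+", "", text)
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_perm=64):
    """One value per 'permutation', simulated by salting SHA-1 with the seed."""
    sig = []
    for seed in range(num_perm):
        salt = str(seed).encode()
        sig.append(min(
            int(hashlib.sha1(salt + s.encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def est_jaccard(sig_a, sig_b):
    """Fraction of matching minima estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical questions: q2 differs from q1 only by whitespace.
q1 = "下列关于增值税纳税人的说法中,正确的是?"
q2 = "下列关于增值税纳税人的说法中, 正确的是?"
q3 = "企业所得税的法定税率是多少?"
s1, s2, s3 = (minhash_signature(q) for q in (q1, q2, q3))
assert est_jaccard(s1, s2) > 0.9   # near-duplicates collide
assert est_jaccard(s1, s3) < 0.5   # distinct questions do not
```

In practice a locality-sensitive-hashing index over these signatures avoids the quadratic pairwise comparison at corpus scale.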

2. Question Formats and Distribution

CFinBench introduces three question types for coverage of factual, reasoning, and compliance skills:

  • Single-choice: Four options, with exactly one correct; 44,425 questions (44.8%).
  • Multiple-choice: Four or five options, at least two correct; 29,625 questions (29.9%).
  • Judgment: Binary true/false statements; 25,050 questions (25.3%).

Category and subcategory distributions reflect both occupational frequency and regulatory priorities. For example, the Financial Practice section includes high-volume subcategories (e.g., junior/intermediate accountant and economist, asset appraiser), while Financial Law spans domain-specific regulatory fields and exam segments (Nie et al., 2024).

First-level Category     | Total Questions | Example Subcategories
Financial Subject        | 9,106           | Macroeconomics, Statistics
Financial Qualification  | 29,388          | CPA, Securities Practitioner
Financial Practice       | 42,045          | Tax Consultant, Asset Appraiser
Financial Law            | 18,561          | Tax Law, Economic Law
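
The category totals in the table sum exactly to the benchmark's 99,100 questions, which can be checked directly (the share figures below are derived from the table, not reported in the paper):

```python
# Question counts per first-level category, from the table above.
counts = {
    "Financial Subject":       9_106,
    "Financial Qualification": 29_388,
    "Financial Practice":      42_045,
    "Financial Law":           18_561,
}
total = sum(counts.values())
assert total == 99_100  # matches the benchmark's overall size

# Percentage share of each category, rounded to one decimal place.
shares = {k: round(100 * v / total, 1) for k, v in counts.items()}
# Financial Practice alone accounts for roughly 42.4% of all questions.
```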

3. Construction and Quality Control

Question formulations leverage both public and proprietary test sources. Data curation protocols include:

  • Parsing and Cleaning: Format conversion, non-Chinese text screening, and removal of garbled content.
  • Contamination Management: MinHash-based de-duplication with both intra-dataset and cross-dataset matching.
  • Diversity Augmentation: Answer option shuffling (random and "farthest-swap" schemes), GPT-4-assisted surface form rewriting, and conversion of question types to increase robustness against memorization and shallow heuristics.
  • Validation: Multi-stage human-in-the-loop checks for accuracy and linguistic fluency.

These steps minimize redundancy and contamination while promoting data diversity and alignment with authentic practice environments (Nie et al., 2024).
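
The answer-option shuffling step can be sketched as follows: permute the option texts and remap the gold label(s) so correctness is preserved. This shows only plain random shuffling; the "farthest-swap" scheme and the GPT-4 rewriting are not reproduced here, and the example question is hypothetical.

```python
import random

LETTERS = "ABCDE"  # multiple-choice questions may have four or five options

def shuffle_options(options, gold, rng):
    """Randomly permute answer options and remap the gold label(s).

    `options` maps letters to option text; `gold` is a string of
    correct letters (e.g. "A", or "AC" for multiple-choice).
    """
    texts = [options[l] for l in LETTERS[:len(options)]]
    order = list(range(len(texts)))
    rng.shuffle(order)
    # New position i holds the text that was at old position order[i].
    new_options = {LETTERS[i]: texts[j] for i, j in enumerate(order)}
    new_gold = "".join(sorted(
        LETTERS[order.index(LETTERS.index(g))] for g in gold
    ))
    return new_options, new_gold

rng = random.Random(0)
opts = {"A": "增值税", "B": "消费税", "C": "企业所得税", "D": "关税"}
new_opts, new_gold = shuffle_options(opts, "AC", rng)
# The gold texts survive the permutation under the remapped labels.
assert {new_opts[g] for g in new_gold} == {"增值税", "企业所得税"}
```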

4. Evaluation Protocols and Metrics

CFinBench supports both zero-shot and few-shot evaluation in a standardized "answer-only" mode. Decoding uses greedy search (temperature=1.0, top-p=1.0), with input and output length limits tailored to financial questions (maximum input 2048 tokens).
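
As an illustration, a zero-/few-shot "answer-only" prompt might be assembled as below. The template wording ("答案:" as the answer cue, lettered options) is an assumption for illustration, not the paper's exact format.

```python
def build_prompt(question, options, examples=()):
    """Assemble an 'answer-only' prompt: few-shot when `examples` is
    non-empty, zero-shot otherwise."""
    lines = []
    for ex in examples:  # in-context demonstrations with gold answers
        lines.append(ex["question"])
        lines += [f"{l}. {t}" for l, t in ex["options"].items()]
        lines.append(f"答案: {ex['answer']}")
        lines.append("")
    lines.append(question)
    lines += [f"{l}. {t}" for l, t in options.items()]
    lines.append("答案:")  # the model is expected to emit only the letter(s)
    return "\n".join(lines)

# Hypothetical zero-shot example.
p = build_prompt(
    "增值税属于下列哪类税种?",
    {"A": "流转税", "B": "所得税", "C": "财产税", "D": "行为税"},
)
assert p.endswith("答案:")
```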

Scoring per question type:

  • Single-choice: Answer is correct only if the first predicted option matches the gold.
  • Multiple-choice: Score is zero if any chosen option is incorrect; otherwise, the score is the proportion of correct options selected:

$$\mathrm{Score} = \frac{\#\,\text{correct options predicted}}{\#\,\text{true correct options}}$$

  • Judgment: Binary true/false match.
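
The per-question rules above can be sketched as a single scoring function (option letters and "T"/"F" labels are an illustrative encoding):

```python
def score_question(qtype, pred, gold):
    """Score one question per the rules above.

    `pred` and `gold` are strings of option letters (e.g. "AC"),
    or "T"/"F" for judgment questions.
    """
    pred, gold = set(pred), set(gold)
    if qtype in ("single", "judgment"):
        return 1.0 if pred == gold else 0.0
    if qtype == "multiple":
        if not pred or not pred <= gold:  # any wrong pick scores zero
            return 0.0
        return len(pred & gold) / len(gold)
    raise ValueError(f"unknown question type: {qtype}")

assert score_question("single", "B", "B") == 1.0
assert score_question("multiple", "AC", "ABC") == 2 / 3  # partial credit
assert score_question("multiple", "AD", "ABC") == 0.0    # one wrong option
assert score_question("judgment", "T", "T") == 1.0
```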

For each category $j$:

$$\mathrm{Acc}_j = \frac{\text{number of correctly answered questions in } j}{\text{total questions in } j}$$

Aggregate scoring uses a weighted sum:

$$\mathrm{Score}_j = 0.4\,\mathrm{Acc}^{\mathrm{single}}_j + 0.4\,\mathrm{Acc}^{\mathrm{multi}}_j + 0.2\,\mathrm{Acc}^{\mathrm{judg}}_j$$

The overall benchmark accuracy is given by:

$$\mathrm{Acc}_{\mathrm{avg}} = \frac{1}{M} \sum_{j=1}^{M} \mathrm{Acc}_j$$

where $M$ is the number of first-level categories (Nie et al., 2024).
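
The weighted category score and the overall average work out as follows (the accuracy values are illustrative, not reported results):

```python
def category_score(acc_single, acc_multi, acc_judg):
    """Weighted category score: 0.4 / 0.4 / 0.2 across question types."""
    return 0.4 * acc_single + 0.4 * acc_multi + 0.2 * acc_judg

def overall_accuracy(category_accs):
    """Unweighted mean over the M first-level categories."""
    return sum(category_accs) / len(category_accs)

# Hypothetical per-type accuracies for one category:
s = category_score(0.70, 0.50, 0.80)
assert abs(s - 0.64) < 1e-9  # 0.28 + 0.20 + 0.16

# Hypothetical per-category accuracies across M = 4 categories:
avg = overall_accuracy([0.64, 0.58, 0.55, 0.47])
assert abs(avg - 0.56) < 1e-9
```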

5. Model Coverage and Experimental Results

Fifty representative LLMs were evaluated, spanning GPT-4 and Chinese-optimized models (Qwen, Yi, XuanYuan, YunShan, etc.) ranging from under 5B to over 65B parameters. Both API-based and open-source models are included. Primary findings:

  • Best-performing model: Yi1.5-34B with 60.16% average three-shot accuracy.
  • Leading open Chinese models: Qwen-72B (58.56%), Qwen1.5-72B (58.10%), XuanYuan2-70B-Base (56.69%).
  • GPT-4: 54.69% average accuracy.
  • Compact, domain-adapted models: YunShan-7B achieves 52.45%, outperforming many larger baselines.

Law-oriented and compliance subcategories (e.g., Insurance Law, Securities Law) remain particularly challenging across all models, with many scoring below 50%. Chain-of-thought prompting yields little consistent accuracy improvement over the "answer-only" setting (Nie et al., 2024).

6. Limitations and Prospective Directions

CFinBench, while comprehensive, is limited in several respects:

  • Static Coverage: Content reflects regulatory and curricular standards current at release; dynamic updating to reflect new regulations and financial products is planned.
  • Unimodal Focus: Current tasks are text-based; future work aims to incorporate multimodal documents (e.g., tables, charts, scanned forms).
  • Language Scope: Benchmarking is restricted to Chinese; cross-lingual extensions and comparative studies across national financial systems are proposed.
  • Reasoning Integration: Stronger integration of retrieval and symbolic reasoning (e.g., to real-time databases) is a future goal.

Additional directions include the development of benchmarks reflecting job-specific workflows, real-world case analysis, and safety/adversarial robustness, further aligning with evolving practices in financial LLM deployment (Nie et al., 2024).

7. Significance and Impact

CFinBench establishes a new standard for the systematic evaluation of LLMs in high-stakes, domain-specialized environments. Its core contributions include unprecedented scale across categories and question types, alignment with practical professional competency requirements, rigorous contamination control, and openly available data and code for reproducibility. Findings from CFinBench reveal that state-of-the-art models, while strong, leave substantial headroom for improvement in precision, legal compliance, and reasoning depth, especially in the context of Chinese financial practice. The benchmark is positioned to facilitate ongoing research into robust, compliant, and contextually aligned LLMs for the financial industry (Nie et al., 2024).
