CFinBench: Chinese Financial NLP Benchmark
- CFinBench is a comprehensive Chinese financial evaluation benchmark that assesses LLMs' domain-specific knowledge and reasoning through 99,100 expert-curated questions.
- It employs diverse question formats—including single-choice, multiple-choice, and judgment—to test factual accuracy, regulatory understanding, and practical skills.
- Experimental results show wide variation across models, with the best model reaching just over 60% accuracy, underscoring the need for stronger legal and compliance reasoning.
CFinBench is a large-scale, systematic evaluation benchmark targeting the financial knowledge and reasoning abilities of LLMs in a Chinese context. Developed to address limitations in the coverage, granularity, and realism of prior Chinese financial NLP benchmarks, CFinBench sets a new standard in scope and methodological robustness for the assessment of domain-specific LLM competency in financial fields (Nie et al., 2024).
1. Scope and Dataset Structure
CFinBench comprises 99,100 expert-curated questions, hierarchically organized into four primary categories reflecting the skillset and knowledge base required by Chinese financial professionals:
- Financial Subject: Assesses theoretical knowledge in foundational areas such as political economy, micro/macroeconomics, statistics, auditing, economic history, and finance.
- Financial Qualification: Focuses on the content of Ministry of Finance, securities, banking, and certifying exams (e.g., Certified Public Accountant, Securities Practitioner, Tax Practitioner).
- Financial Practice: Evaluates capacities needed for occupational roles including tax consultancy, various accountant and economist grades, asset appraisal, and securities analysis.
- Financial Law: Encapsulates the requirements of domain-specific legal frameworks: tax law, economic law, civil law, banking law, insurance law, commercial law, and related statutes.
Within these, 43 second-level subcategories cover both horizontal (cross-discipline) and vertical (specialization) facets, mapped closely to real exam blueprints and job requirements. Question sources are primarily mock exams and internal training materials, with PDF/EPUB/Word parsing, fastText filtering, and MinHash de-duplication ensuring dataset integrity (Nie et al., 2024).
2. Question Formats and Distribution
CFinBench introduces three question types for coverage of factual, reasoning, and compliance skills:
- Single-choice: Four options, with exactly one correct; 44,425 questions (44.8%).
- Multiple-choice: Four or five options, at least two correct; 29,625 questions (29.9%).
- Judgment: Binary true/false statements; 25,050 questions (25.3%).
Category and subcategory distributions reflect both occupational frequency and regulatory priorities. For example, the Financial Practice section includes high-volume subcategories (e.g., junior/intermediate accountant and economist, asset appraiser), while Financial Law spans domain-specific regulatory fields and exam segments (Nie et al., 2024).
| First-level Category | Total Questions | Example Subcategories |
|---|---|---|
| Financial Subject | 9,106 | Macroeconomics, Statistics |
| Financial Qualification | 29,388 | CPA, Securities Practitioner |
| Financial Practice | 42,045 | Tax Consultant, Asset Appraiser |
| Financial Law | 18,561 | Tax Law, Economic Law |
3. Construction and Quality Control
Question formulations leverage both public and proprietary test sources. Data curation protocols include:
- Parsing and Cleaning: Format conversion, non-Chinese text screening, and removal of garbled content.
- Contamination Management: MinHash-based de-duplication with both intra-dataset and cross-dataset matching.
- Diversity Augmentation: Answer option shuffling (random and "farthest-swap" schemes), GPT-4-assisted surface form rewriting, and conversion of question types to increase robustness against memorization and shallow heuristics.
- Validation: Multi-stage human-in-the-loop checks for accuracy and linguistic fluency.
These steps minimize redundancy and contamination while promoting data diversity and alignment with authentic practice environments (Nie et al., 2024).
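The MinHash de-duplication step above can be sketched as follows. This is a minimal, generic MinHash implementation (character shingles, universal hashing), not the paper's exact pipeline; the parameter choices (64 hash functions, 5-character shingles) are illustrative assumptions.

```python
import hashlib
import random

def minhash_signature(text, num_hashes=64, shingle_size=5, seed=0):
    """Compute a MinHash signature over character shingles of `text`.
    Parameters are illustrative, not CFinBench's actual settings."""
    # Character shingles suit Chinese text, which lacks word boundaries.
    shingles = {text[i:i + shingle_size]
                for i in range(max(1, len(text) - shingle_size + 1))}
    rng = random.Random(seed)
    prime = (1 << 61) - 1
    # One (a, b) pair per hash function: h(x) = (a*x + b) mod prime.
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    sig = []
    for a, b in coeffs:
        sig.append(min(
            (a * int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big") + b) % prime
            for s in shingles
        ))
    return sig

def estimated_jaccard(sig1, sig2):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

Question pairs whose estimated Jaccard similarity exceeds a threshold would be flagged as duplicates, both within CFinBench and against other benchmarks (cross-dataset matching).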
4. Evaluation Protocols and Metrics
CFinBench supports both zero-shot and few-shot evaluations in standardized "answer-only" mode. Decoding settings include greedy search (temperature=1.0, top-p=1.0), with input and output length limits tailored to financial questions (max input 2048 tokens).
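An "answer-only" few-shot prompt of this kind might be assembled as below. The template (A–E option labels, a trailing "答案：" cue) is a hypothetical reconstruction for illustration, not the benchmark's official format.

```python
def format_question(question, options):
    """Render one question with A-E option labels (hypothetical template)."""
    labels = "ABCDE"
    lines = [question] + [f"{labels[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines) + "\n"

def build_prompt(question, options, examples=()):
    """Answer-only prompt: optional few-shot demonstrations with their gold
    answers, then the target question ending at the answer cue."""
    parts = []
    for ex_q, ex_opts, ex_ans in examples:  # few-shot demonstrations
        parts.append(format_question(ex_q, ex_opts) + f"答案：{ex_ans}")
    parts.append(format_question(question, options) + "答案：")
    return "\n\n".join(parts)
```

With `examples=()` this yields the zero-shot prompt; passing three (question, options, answer) triples yields the three-shot setting reported in the results.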
Scoring per question type:
- Single-choice: Answer is correct only if the first predicted option matches the gold.
- Multiple-choice: Score is zero if any chosen option is incorrect; otherwise, the score is the fraction of gold options selected, $|P \cap G| / |G|$ for prediction set $P$ and gold set $G$.
- Judgment: Binary true/false match.
For each second-level subcategory $j$ with $N_j$ questions, the subcategory score is the mean per-question score:

$$S_j = \frac{1}{N_j} \sum_{i=1}^{N_j} s_i$$

Aggregate scoring for a first-level category $c$ uses a question-count-weighted sum over its subcategories:

$$S_c = \frac{\sum_{j \in c} N_j S_j}{\sum_{j \in c} N_j}$$

The overall benchmark accuracy is given by:

$$\mathrm{Acc} = \frac{1}{K} \sum_{c=1}^{K} S_c$$

where $K$ is the number of first-level categories (Nie et al., 2024).
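The three scoring rules and the category-level aggregation can be sketched directly from the definitions above (a minimal sketch; the data-structure layout is an assumption, not the official evaluation code):

```python
def score_question(pred, gold, qtype):
    """Per-question score s_i under the three CFinBench scoring rules."""
    if qtype in ("single", "judgment"):
        return 1.0 if pred == gold else 0.0
    if qtype == "multiple":
        pred_set, gold_set = set(pred), set(gold)
        if pred_set - gold_set:   # any incorrect option chosen -> zero
            return 0.0
        return len(pred_set & gold_set) / len(gold_set)
    raise ValueError(f"unknown question type: {qtype}")

def benchmark_accuracy(results):
    """results: {category: [(pred, gold, qtype), ...]}.
    Each category score is the mean per-question score; the overall
    accuracy averages the category scores with equal weight."""
    cat_scores = {
        cat: sum(score_question(p, g, t) for p, g, t in qs) / len(qs)
        for cat, qs in results.items()
    }
    overall = sum(cat_scores.values()) / len(cat_scores)
    return cat_scores, overall
```

Note the all-or-partial multiple-choice rule: selecting a strict subset of the gold options earns partial credit, but a single wrong selection zeroes the question.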
5. Model Coverage and Experimental Results
Fifty representative LLMs were evaluated, spanning GPT-4 and Chinese-optimized large models (Qwen, Yi, XuanYuan, YunShan, etc.) ranging from under 5B to over 65B parameters. Both API-based and open-source models are included. Primary findings:
- Best-performing model: Yi1.5-34B with 60.16% average three-shot accuracy.
- Leading open Chinese models: Qwen-72B (58.56%), Qwen1.5-72B (58.10%), XuanYuan2-70B-Base (56.69%).
- GPT-4: 54.69% average accuracy.
- Compact, domain-adapted models: YunShan-7B achieves 52.45%, outperforming many larger baselines.
Law-oriented and compliance subcategories (e.g., Insurance Law, Securities Law) remain particularly challenging across model families and sizes, with many models scoring below 50%. Chain-of-thought prompting yields little consistent accuracy improvement over the "answer-only" setting (Nie et al., 2024).
6. Limitations and Prospective Directions
CFinBench, while comprehensive, is limited in several respects:
- Static Coverage: Content reflects regulatory and curricular standards current at release; dynamic updating to reflect new regulations and financial products is planned.
- Unimodal Focus: Current tasks are text-based; future work aims to incorporate multimodal documents (e.g., tables, charts, scanned forms).
- Language Scope: Benchmarking is restricted to Chinese; cross-lingual extensions and comparative studies across national financial systems are proposed.
- Reasoning Integration: Stronger integration of retrieval and symbolic reasoning (e.g., to real-time databases) is a future goal.
Additional directions include the development of benchmarks reflecting job-specific workflows, real-world case analysis, and safety/adversarial robustness, further aligning with evolving practices in financial LLM deployment (Nie et al., 2024).
7. Significance and Impact
CFinBench establishes a new standard for the systematic evaluation of LLMs in high-stakes, domain-specialized environments. Its core contributions include unprecedented scale across category and question types, alignment with practical professional competency requirements, rigorous contamination control, and openly available data and code for reproducibility. Findings from CFinBench reveal that the state-of-the-art models, while strong, leave substantial headroom for improvements in precision, law compliance, and reasoning depth, especially in the context of Chinese financial practice. The benchmark is positioned to facilitate ongoing research into robust, compliant, and contextually aligned LLMs for the financial industry (Nie et al., 2024).