FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models

Published 1 Jan 2024 in cs.CL and cs.AI (arXiv:2401.02982v4)

Abstract: LLMs have demonstrated impressive capabilities across a wide range of tasks. However, their proficiency and reliability in the specialized domain of financial data analysis, particularly focusing on data-driven thinking, remain uncertain. To bridge this gap, we introduce \texttt{FinDABench}, a comprehensive benchmark designed to evaluate the financial data analysis capabilities of LLMs within this context. \texttt{FinDABench} assesses LLMs across three dimensions: 1) \textbf{Foundational Ability}, evaluating the models' ability to perform financial numerical calculation and corporate sentiment risk assessment; 2) \textbf{Reasoning Ability}, determining the models' ability to quickly comprehend textual information and analyze abnormal financial reports; and 3) \textbf{Technical Skill}, examining the models' use of technical knowledge to address real-world data analysis challenges involving analysis generation and chart visualization from multiple perspectives. We will release \texttt{FinDABench} and the evaluation scripts at \url{https://github.com/cubenlp/BIBench}. \texttt{FinDABench} aims to provide a measure for in-depth analysis of LLM abilities and to foster the advancement of LLMs in the field of financial data analysis.
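The benchmark's actual evaluation scripts live in the linked repository; as a rough illustration of how the Foundational Ability dimension (financial numerical calculation) might be scored, the following is a minimal, hypothetical sketch of an exact-match scorer with numeric tolerance. The function name and tolerance are assumptions for illustration, not the paper's implementation.

```python
def evaluate_numerical(predictions, gold, tol=1e-6):
    """Score model outputs against gold answers for numerical-calculation
    items: compare as floats with a small tolerance when both sides parse
    as numbers, otherwise fall back to exact string match."""
    correct = 0
    for pred, ref in zip(predictions, gold):
        try:
            if abs(float(pred) - float(ref)) < tol:
                correct += 1
        except ValueError:
            # Non-numeric answers (e.g. a risk label) use string equality.
            if str(pred).strip() == str(ref).strip():
                correct += 1
    return correct / len(gold)

preds = ["0.125", "4200", "decline"]
refs = ["0.125", "4200.0", "decline"]
print(evaluate_numerical(preds, refs))  # 1.0
```

A tolerance-based comparison avoids penalizing formatting differences such as "4200" versus "4200.0" while still requiring the computed value to be correct.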
