Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams

Published 12 Oct 2023 in cs.CL, cs.AI, and q-fin.GN | (2310.08678v1)

Abstract: LLMs have demonstrated remarkable performance on a wide range of NLP tasks, often matching or even beating state-of-the-art task-specific models. This study aims at assessing the financial reasoning capabilities of LLMs. We leverage mock exam questions of the Chartered Financial Analyst (CFA) Program to conduct a comprehensive evaluation of ChatGPT and GPT-4 in financial analysis, considering Zero-Shot (ZS), Chain-of-Thought (CoT), and Few-Shot (FS) scenarios. We present an in-depth analysis of the models' performance and limitations, and estimate whether they would have a chance at passing the CFA exams. Finally, we outline insights into potential strategies and improvements to enhance the applicability of LLMs in finance. In this perspective, we hope this work paves the way for future studies to continue enhancing LLMs for financial reasoning through rigorous evaluation.

Summary

The paper evaluates ChatGPT and GPT-4 on mock CFA exams and finds that GPT-4 performs better, with a peak Level I accuracy of 74.6% in the Few-Shot setting.
It employs Zero-shot, Chain-of-Thought, and Few-shot prompting to assess financial reasoning across topics like Derivatives and Alternative Investments.
The study highlights limitations in domain-specific knowledge, suggesting the need for integrated expertise to enhance LLMs for advanced financial analysis.

The paper presents a comprehensive evaluation of two prominent LLMs, ChatGPT and GPT-4, on financial reasoning, using mock Chartered Financial Analyst (CFA) exams as the benchmark. The study asks whether these models can answer questions from CFA Levels I and II, which rigorously test finance-related skills.

The experiments examined both models under Zero-Shot (ZS), Chain-of-Thought (CoT), and Few-Shot (FS) prompting scenarios. GPT-4 outperformed ChatGPT across most categories, especially in topics such as Derivatives and Alternative Investments. However, both models struggled with topics that demand extensive financial domain knowledge and nuanced problem-solving, such as Financial Reporting and Portfolio Management.
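To make the three prompting scenarios concrete, here is a minimal sketch of how one multiple-choice question might be turned into a ZS, CoT, or FS prompt. The `MCQ` class, the `build_prompt` function, and the instruction wording are all hypothetical illustrations, not the prompts used in the paper.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class MCQ:
    """A multiple-choice question (hypothetical container, not from the paper)."""
    question: str
    choices: Dict[str, str]   # e.g. {"A": "...", "B": "...", "C": "..."}
    answer: str = ""          # only needed when the item is used as an FS exemplar


def format_question(q: MCQ) -> str:
    """Render the question stem followed by one labelled choice per line."""
    lines = [q.question]
    lines += [f"{label}. {text}" for label, text in q.choices.items()]
    return "\n".join(lines)


def build_prompt(q: MCQ, mode: str, exemplars: Sequence[MCQ] = ()) -> str:
    """Build a prompt for one of the three scenarios: 'zs', 'cot', or 'fs'.

    The instruction sentences are assumed wording chosen for illustration.
    """
    body = format_question(q)
    if mode == "zs":
        # Zero-Shot: ask directly for the answer, with no examples or reasoning.
        return f"{body}\nAnswer with the letter of the correct choice only."
    if mode == "cot":
        # Chain-of-Thought: ask for intermediate reasoning before the answer.
        return (f"{body}\nThink step by step, showing any calculations, "
                "then state the letter of the correct choice.")
    if mode == "fs":
        # Few-Shot: prepend solved exemplars, then leave the target answer open.
        if not exemplars:
            raise ValueError("few-shot mode needs at least one exemplar")
        shots = "\n\n".join(
            f"{format_question(e)}\nAnswer: {e.answer}" for e in exemplars
        )
        return f"{shots}\n\n{body}\nAnswer:"
    raise ValueError(f"unknown mode: {mode!r}")
```

Keeping the question formatting in one helper means the three scenarios differ only in the surrounding instruction or exemplars, which makes accuracy differences easier to attribute to the prompting strategy itself.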

GPT-4 consistently surpasses ChatGPT in test accuracy, which reflects its stronger reasoning and its better grasp of complex problems. For instance, on Level I, GPT-4 reached a peak accuracy of 74.6% in the FS setting, compared with ChatGPT's best score of 63.0% under comparable conditions. On Level II, GPT-4 also performed well: its top score suggests that it could pass the CFA Level II exam, depending on the prompting configuration.
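Accuracy figures like these are simply the share of questions answered correctly, overall or per topic. The sketch below shows one way to compute both from graded answers; the record layout `(topic, predicted, gold)` and the function names are assumptions made for illustration, and the toy data in the usage note is invented, not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (topic, predicted letter, gold letter); assumed layout


def _is_correct(predicted: str, gold: str) -> bool:
    """Compare answer letters, ignoring case and surrounding whitespace."""
    return predicted.strip().upper() == gold.strip().upper()


def accuracy_by_topic(records: Iterable[Record]) -> Dict[str, float]:
    """Return the fraction of correct answers for each topic."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for topic, predicted, gold in records:
        total[topic] += 1
        correct[topic] += int(_is_correct(predicted, gold))
    return {topic: correct[topic] / total[topic] for topic in total}


def overall_accuracy(records: Iterable[Record]) -> float:
    """Return the fraction of correct answers across all records."""
    rows = list(records)
    if not rows:
        raise ValueError("no records to score")
    return sum(_is_correct(p, g) for _, p, g in rows) / len(rows)
```

For example, with the invented records `[("Derivatives", "A", "A"), ("Derivatives", "b", "B"), ("Ethics", "C", "A"), ("Ethics", "A", "A")]`, the per-topic accuracies are 1.0 for Derivatives and 0.5 for Ethics, and the overall accuracy is 0.75. A pass estimate would then compare such scores against a threshold that the caller supplies.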

A notable finding is the limited efficacy of CoT prompting for ChatGPT, which yielded only minor improvements over ZS prompting on Level I exams. Although CoT appears to help models parse complex information by encouraging step-by-step reasoning, it also exposes gaps in domain-specific knowledge and computational accuracy, leading to a range of errors. GPT-4 benefited more from CoT in some contexts, especially on the more elaborate Level II questions, although CoT did not always surpass FS prompting.

The study has implications for the potential use of LLMs in finance. While the models show some competency in financial reasoning under specific conditions, they still have significant limitations on intricate domain-specific problems. This calls for a layered approach to improving LLMs, for example by integrating specialized knowledge bases and computation tools to offset their current deficiencies.

Future work could strengthen the financial reasoning of LLMs by incorporating retrieval-augmented generation, adding domain-specific pre-training, and potentially delegating calculations to external modules. Combining FS and CoT prompting could also improve problem-solving accuracy. The research thus opens a path for further work on optimizing LLMs for specialized financial applications, which could reshape automated financial analysis tools.
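As one illustration of what an external calculation module could look like, the sketch below evaluates an arithmetic expression that a model has written out, instead of trusting the model to do the arithmetic itself. This is an assumed design rather than anything described in the paper: the `evaluate` function is hypothetical, and it accepts only numbers and basic operators.

```python
import ast
import operator

# Whitelisted operators; anything else in the expression is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate(expression: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node: ast.AST) -> float:
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        # Names, calls, attribute access, and so on are not arithmetic.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval").body)


if __name__ == "__main__":
    # Present value of 1000 received in 3 years at a 5% annual discount rate.
    print(round(evaluate("1000 / (1 + 0.05) ** 3"), 2))
```

Walking the parsed syntax tree with an operator whitelist keeps the tool limited to arithmetic, so a model-generated string cannot trigger function calls or name lookups.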