Exploring Zero-Shot App Review Classification with ChatGPT: A Critical Examination
The paper "Exploring Zero-Shot App Review Classification with ChatGPT: Challenges and Potential" by Mohit Chaudhary et al. studies the zero-shot capabilities of ChatGPT, a Large Language Model (LLM), for classifying app reviews. The objective is to categorize each review into one of four requirement categories: functional requirement (FR), non-functional requirement (NFR), both, or neither, thereby addressing challenges that conventional machine learning approaches face in requirement classification.
Methodology
The research uses a curated benchmark dataset of 1,880 manually annotated app reviews drawn from ten diverse applications, which serves as the basis for evaluating ChatGPT's zero-shot learning capability. Five experienced software engineers performed the manual annotation to ensure the dataset's validity. The classification task was carried out with Azure OpenAI's GPT-4o mini model in a zero-shot configuration, and performance was assessed with precision, recall, and micro F1 score.
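The micro F1 score used for evaluation pools true positives, false positives, and false negatives across all four categories before computing precision and recall. A minimal sketch of that computation (the label strings and the example data below are illustrative, not the paper's):

```python
LABELS = ["FR", "NFR", "Both", "Neither"]

def micro_f1(gold, pred):
    """Micro-averaged F1 over the four review categories.

    Counts of true positives, false positives, and false negatives
    are pooled across all classes before precision/recall are computed.
    """
    tp = fp = fn = 0
    for label in LABELS:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1          # correctly predicted this class
            elif p == label:
                fp += 1          # predicted this class, gold disagrees
            elif g == label:
                fn += 1          # missed an instance of this class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Illustrative example (not the paper's data):
gold = ["FR", "NFR", "Both", "Neither", "FR"]
pred = ["FR", "NFR", "FR", "Neither", "NFR"]
print(round(micro_f1(gold, pred), 3))  # → 0.6
```

Since each review receives exactly one label here, micro F1 coincides with overall accuracy; it differs from macro F1, which would average per-class scores and weight the rarer "Both"/"Neither" categories more heavily.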
Several prompt designs were tested to refine ChatGPT's classification accuracy, including Emotion Prompting, Role Prompting, and Chain of Thought (CoT) prompting. The best result came from a composite prompting strategy (Prompt 3) that combined these techniques, yielding an F1 score of 0.842.
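A composite prompt of this kind might be assembled as below; the paper's exact Prompt 3 wording is not reproduced here, so every phrase in this sketch is a hypothetical stand-in that merely illustrates how the three techniques can be layered:

```python
def build_prompt(review: str) -> str:
    """Compose a zero-shot prompt combining role, emotion, and
    chain-of-thought cues, in the spirit of a composite strategy.
    All wording is illustrative, not the paper's actual prompt."""
    role = "You are an experienced requirements engineer."         # Role Prompting
    emotion = "This classification is very important to my work."  # Emotion Prompting
    cot = "Think step by step before giving your final answer."    # Chain of Thought
    task = (
        "Classify the following app review as exactly one of: "
        "FR (functional requirement), NFR (non-functional requirement), "
        "Both, or Neither."
    )
    return f"{role}\n{emotion}\n{task}\n{cot}\n\nReview: {review}\nAnswer:"

print(build_prompt("The app crashes whenever I try to upload a photo."))
```

The resulting string would then be sent as the user message in a chat completion request, with the model's answer parsed back into one of the four labels.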
Results and Analysis
The zero-shot ChatGPT model outperformed traditional ML classifiers such as Random Forests, Decision Trees, and Support Vector Machines. Classification effectiveness depended strongly on prompt design and temperature settings, with lower temperatures yielding higher accuracy. Notably, the model maintained classification accuracy regardless of review length, while being significantly affected by review complexity and readability, measured via the Flesch-Kincaid Grade Level (FKGL).
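The FKGL metric referenced above is a standard readability formula: 0.39 × (words/sentence) + 11.8 × (syllables/word) − 15.59. A minimal sketch, using a simple vowel-group heuristic for syllable counting (real implementations use more careful syllabification):

```python
import re

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level of a text.

    FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    Syllables are approximated by counting vowel groups per word.
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# Short, simple sentences score low (even below zero);
# long sentences with polysyllabic words score much higher.
print(round(fkgl("The cat sat."), 2))
```

Under this metric, higher grade levels correspond to longer sentences and more polysyllabic vocabulary, which is the dimension of review complexity the study found to hurt classification accuracy.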
Certain types of reviews, especially those with overlapping FR and NFR characteristics, were frequently misclassified. The analysis identified bias toward strong sentiment, overlapping requirements, language ambiguity, and emotional content as the main sources of these errors.
Implications and Future Directions
The study highlights the practical implications of using ChatGPT for automated requirement classification in app reviews, potentially streamlining development processes and enhancing developer responsiveness to user feedback without reliance on extensive labeled datasets. This approach offers a time-efficient, scalable solution adaptable to rapidly evolving app ecosystems.
The findings pave the way for further investigation into prompt refinement strategies that might enhance classification precision, particularly for the overlapping or ambiguous reviews identified as problematic. Incorporating NLP techniques such as review simplification could further improve accuracy by mitigating issues arising from complex language. Additionally, adaptive techniques to offset the misclassification biases linked to sentiment and overlapping requirement themes are a promising direction for future research.
In summary, the paper offers a significant contribution to the field of software engineering by effectively leveraging modern AI capabilities to overcome traditional limitations in app review analysis, introducing robust techniques that hold promise for improved requirement understanding and enhanced application development strategies.