
How Effective are Generative Large Language Models in Performing Requirements Classification?

Published 23 Apr 2025 in cs.CL, cs.AI, and cs.SE | (2504.16768v1)

Abstract: In recent years, transformer-based LLMs have revolutionised NLP, with generative models opening new possibilities for tasks that require context-aware text generation. Requirements engineering (RE) has also seen a surge in the experimentation of LLMs for different tasks, including trace-link detection, regulatory compliance, and others. Requirements classification is a common task in RE. While non-generative LLMs like BERT have been successfully applied to this task, there has been limited exploration of generative LLMs. This gap raises an important question: how well can generative LLMs, which produce context-aware outputs, perform in requirements classification? In this study, we explore the effectiveness of three generative LLMs (Bloom, Gemma, and Llama) in performing both binary and multi-class requirements classification. We design an extensive experimental study involving over 400 experiments across three widely used datasets (PROMISE NFR, Functional-Quality, and SecReq). Our study concludes that while factors like prompt design and LLM architecture are universally important, others, such as dataset variations, have a more situational impact, depending on the complexity of the classification task. This insight can guide future model development and deployment strategies, focusing on optimising prompt structures and aligning model architectures with task-specific needs for improved performance.

Summary

An Analysis of the Effectiveness of Generative Large Language Models in Requirements Classification

The paper "How Effective are Generative Large Language Models in Performing Requirements Classification?" investigates the application of generative Large Language Models (LLMs) to the task of requirements classification within Requirements Engineering (RE). This study represents a significant contribution to understanding the potential and limitations of generative LLMs in this domain, typically dominated by non-generative models like BERT.

Experimental Design and Objectives

The research explores three generative LLMs (Bloom, Gemma, and Llama), evaluating their performance through extensive experimentation involving over 400 different conditions across datasets, models, prompts, and tasks. The primary aim is to determine how well these models function compared to non-generative counterparts, especially considering previous success observed with non-generative models in RE tasks.

Key Findings

  1. Performance Metrics:

    • Bloom demonstrated superior precision (up to 0.77), particularly in binary classification tasks.
    • Gemma excelled in recall (average of 0.51 in binary classification), indicating its strength in identifying requirements across varying conditions.
    • Llama provided balanced results, suggesting it is well-suited for both binary and multi-class tasks.
  2. Prompt Engineering:

    • Assertion-based prompts appeared most effective, underscoring the importance of prompt design in leveraging model capabilities fully.
  3. Dataset Robustness:

    • Generative models maintained consistent performance despite variations in dataset format, highlighting their robustness in processing requirements inputs.
  4. Comparative Analysis with Non-Generative Models:

    • Generative models did not uniformly surpass non-generative models like All-Mini and SBERT in multi-class classification scenarios, where non-generative models showed better performance.
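The assertion-based prompting the paper found most effective can be illustrated with a small sketch. The exact prompt wording, label vocabulary, and the `build_assertion_prompt`/`parse_label` helpers below are assumptions for illustration, not the paper's actual templates; the model call itself is omitted so only the prompt construction and response parsing are shown.

```python
# Hypothetical sketch of assertion-based prompting for binary
# requirements classification (e.g., security vs. non-security on
# SecReq-style data). Prompt wording is an assumption, not the
# paper's template.

def build_assertion_prompt(requirement: str) -> str:
    """Frame classification as an assertion for the model to judge."""
    return (
        "Assertion: The following requirement is a security requirement.\n"
        f"Requirement: {requirement}\n"
        "Answer with 'true' or 'false'."
    )

def parse_label(response: str) -> bool:
    """Map a free-form generative response onto a binary label."""
    return response.strip().lower().startswith("true")

prompt = build_assertion_prompt(
    "The system shall encrypt all stored passwords."
)
```

In this framing, the generative model's open-ended output is constrained to a judgement about a fixed assertion, which makes the response easier to map deterministically onto class labels.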

Implications

The paper's findings indicate that while generative LLMs bring promising capabilities to requirements classification, the choice of model and prompt structure can significantly influence outcomes. These models exhibit strength on particular metrics (e.g., recall for Gemma) but have not fully eclipsed non-generative models in broader settings, particularly in complex multi-class tasks.
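The precision and recall figures discussed above follow the standard definitions for binary classification. A minimal sketch, using hypothetical labels rather than the paper's data, makes the trade-off concrete: a model can achieve high precision (like Bloom) while missing positives, which recall (Gemma's strength) would penalise.

```python
# Standard binary precision/recall computation; the label vectors
# here are hypothetical, not drawn from the paper's datasets.

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A conservative classifier: no false positives, but one miss.
p, r = precision_recall([1, 1, 1, 0, 0], [1, 1, 0, 0, 0])
```

Here precision is perfect while recall suffers, mirroring how a model tuned for precision can under-report true requirements of the target class.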

Speculations and Future Directions

Given these results, continued exploration of generative LLMs should focus on optimizing prompts and refining architectural strategies that enhance multi-class discrimination capabilities. As transformer architectures continue to evolve, generative LLMs may integrate more seamlessly into AI-driven requirements engineering frameworks, especially as fine-tuning and task-adaptation methodologies advance.

In summary, this paper contributes a foundational understanding of generative LLMs' role in requirements classification, offering groundwork for future research that could better integrate these models into RE practices.
