- The paper demonstrates ChatGPT-4's ability to classify 408 UX questionnaire items into evolving semantic groups.
- It develops a hierarchical structure of 6 main topics and 15 sub-topics covering both pragmatic and hedonic UX aspects.
- The study reveals that automated LLM analysis can enhance survey design and the creation of coherent UX measurement tools.
This paper investigates the use of ChatGPT-4, an LLM, to analyze and categorize measurement items from established User Experience (UX) questionnaires based on their semantic similarity (arXiv:2411.13118). The core problem addressed is the lack of a common understanding and consistent terminology for UX factors and their corresponding measurement items across different standardized questionnaires. This inconsistency makes it difficult to compare results and to build a unified view of UX.
The primary goal was to determine if ChatGPT-4 could identify meaningful UX topics by clustering semantically similar items and, consequently, help structure the UX research field. The researchers also aimed to see if the LLM could filter items relevant to a specific, predefined UX concept from a large pool.
Methodology:
- Data Collection: Items were extracted from 19 established UX questionnaires (selected from an initial list of 40, excluding those with semantic differential scales or very specific item formats). This resulted in a dataset of 408 measurement items.
- Analysis using ChatGPT-4: The 408 items were fed into ChatGPT-4.
- Prompting Strategy: A sequence of seven prompts guided the analysis:
- Initial prompts requested classification of similar items into topics (`prompt1`), followed by requests for more detailed breakdowns (`prompt2`, `prompt3`) and self-improvement of the categorization (`prompt4`).
- Another prompt asked ChatGPT-4 to compare its generated categories with a predefined list of 16 UX quality aspects from existing literature (`prompt5`).
- A subsequent prompt pushed for more generalized, holistic topics based on its previous categorizations (`prompt6`).
- The final prompt tested the model's ability to select items from the pool that specifically relate to the concept of "Learnability" (or "Perspicuity") (`prompt7`).
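The seven-step sequence can be sketched as a single running conversation in which each prompt builds on the model's previous reply. The prompt wordings below are paraphrased from the summary above, not the authors' exact prompts, and `build_conversation` is a hypothetical helper for assembling a chat-style message history:

```python
# Paraphrased stand-ins for the paper's seven prompts (assumed wordings).
PROMPTS = [
    ("prompt1", "Classify the following questionnaire items into topics of semantically similar items."),
    ("prompt2", "Break the topics down into more detailed sub-topics."),
    ("prompt3", "Refine the breakdown into a more detailed categorization."),
    ("prompt4", "Review and improve your own categorization."),
    ("prompt5", "Compare your categories with this list of 16 UX quality aspects from the literature."),
    ("prompt6", "Generalize your categories into holistic top-level UX topics."),
    ("prompt7", "Select the items from the pool that relate to 'Learnability' (Perspicuity)."),
]

def build_conversation(items, model_replies):
    """Assemble a chat history: the item pool first, then each prompt
    interleaved with the model's reply to it."""
    messages = [{"role": "user",
                 "content": "Here are the UX questionnaire items:\n" + "\n".join(items)}]
    for (_name, text), reply in zip(PROMPTS, model_replies):
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": reply})
    return messages
```

The key design point mirrored here is that the prompts are sequential, not independent: `prompt4`'s self-improvement and `prompt6`'s generalization only make sense against the categorizations already in the conversation history.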
Key Findings:
- Classification Ability (RQ1): ChatGPT-4 successfully classified the 408 items into semantically related groups. The classifications evolved from broad topics (e.g., Usability, Design, Engagement) to more granular sub-topics (e.g., Ease of Use, System Complexity, Visual Attraction). The process showed that the LLM could identify logical structures based purely on the text of the items.
- Identified Topics (RQ2): Through iterative prompting, especially the generalization step (`prompt6`), ChatGPT-4 generated a hierarchical structure of 6 main UX topics and 15 sub-topics (detailed in Appendix A3 of the paper). These topics covered both pragmatic (e.g., Usability, Efficiency, Content Quality) and hedonic (e.g., Engagement, Aesthetics, Novelty) aspects, providing a relatively comprehensive view of UX. However, the authors noted some weaknesses, such as occasional mismatches between items and their assigned category or items fitting multiple categories.
- Comparison with Literature: When compared to existing consolidated UX factors (`prompt5`), ChatGPT-4's categories showed overlap but also differences. The LLM's initial categorizations tended to be more specific or functionally focused than the established, more generalized factors.
- Item Filtering: ChatGPT-4 demonstrated proficiency in identifying and selecting items relevant to a specific UX concept ("Learnability") when prompted (`prompt7`). The top items selected by the model were deemed highly relevant by the researchers (listed in Appendix A4).
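The filtering task amounts to ranking a pool of items by their relevance to a target concept. A minimal, non-LLM sketch of the same idea uses keyword overlap as a crude stand-in for the model's semantic judgment; the item texts and keyword set below are illustrative, not taken from the paper's item pool:

```python
# Keyword-overlap relevance scoring: a toy stand-in for LLM-based filtering.
def relevance(item, concept_keywords):
    """Fraction of concept keywords that appear in the item text."""
    words = set(item.lower().split())
    return len(words & concept_keywords) / len(concept_keywords)

# Illustrative keyword set for "Learnability" (an assumption, not the paper's).
LEARNABILITY = {"easy", "learn", "quickly", "understand"}

items = [
    "It is easy to learn how to use the product",
    "The colors of the interface are attractive",
    "I could quickly understand how the system works",
]

# Rank items by relevance; an aesthetics item should fall to the bottom.
ranked = sorted(items, key=lambda i: relevance(i, LEARNABILITY), reverse=True)
```

An LLM replaces the keyword set with learned semantic knowledge, which is what lets it match items that express learnability without sharing any surface vocabulary.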
Practical Implications:
- Automated Item Analysis: LLMs offer a rapid and low-effort method for exploring semantic structures within large sets of UX measurement items, potentially uncovering relationships that might be missed in manual analysis.
- Developing Measurement Tools: The AI-generated topic structures and item lists can serve as a starting point for developing new, potentially more semantically coherent UX questionnaires or measurement frameworks.
- Item Selection for Surveys: Practitioners creating ad-hoc surveys can use LLMs to quickly find existing, relevant measurement items for specific UX aspects they want to investigate, rather than formulating new ones from scratch. This could improve the quality and consistency of self-made questionnaires.
- Understanding UX Constructs: Analyzing items semantically helps differentiate between the meaning of items (semantic similarity) and how they might correlate in practice due to user perception effects (empirical similarity), leading to a deeper understanding of the UX construct itself.
Limitations and Future Work:
- The study excluded semantic differential items, a common format in UX questionnaires. Future work could explore prompt engineering techniques to include diverse item formats.
- The AI-generated classifications and item lists require further empirical validation to ensure their reliability and usefulness as measurement tools.
- The non-deterministic nature of LLMs means results might vary slightly on repeated runs, although the authors argue this mirrors variability among human experts.
In conclusion, the paper demonstrates the potential of using LLMs like ChatGPT-4 as a tool in UX research to analyze questionnaire items, identify underlying semantic structures, and assist in finding relevant measures for specific UX concepts, thereby contributing to efforts to establish a more common ground in the field.