
What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks

Published 10 Apr 2025 in cs.CL (arXiv:2504.07825v1)

Abstract: Common-sense reasoning is a key LLM capability because it encapsulates not just specific factual knowledge but rather general language and world understanding. Measuring common-sense reasoning, therefore, is crucial for LLMs of different sizes and applications. One of the most widely used benchmarks for evaluating such capabilities is HellaSwag; however, in this paper, we show that it has severe construct validity issues. These issues range from basic ungrammaticality and numerous typos to misleading prompts or equally correct options. Furthermore, we show that if models are evaluated only on answer texts, or with "Lorem ipsum dolor..." instead of the question, more than 65% of model predictions remain the same, and this cannot be attributed merely to contamination. Since benchmark scores are an essential part of model selection in both research and commercial applications, these validity issues can have severe consequences. In particular, because taking benchmark scores at face value is ubiquitous, inadequate evaluation leads to ill-informed decisions about models. In this paper, we thoroughly investigate critical validity issues posed by HellaSwag and illustrate them with various evaluations using generative LLMs of different sizes. We argue that this benchmark does not accurately measure common-sense reasoning and, therefore, should not be used for evaluation in its current state. Based on the results of our study, we propose requirements that should be met by future common-sense reasoning benchmarks. In addition, we release GoldenSwag, a corrected subset of HellaSwag, which we believe enables acceptable common-sense reasoning evaluation.

Summary

On the Validity of Common-Sense Reasoning Benchmarks: A Critical Examination of HellaSwag

The paper "What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks" examines construct validity issues in HellaSwag, a benchmark widely used to evaluate common-sense reasoning capabilities in large language models. The authors, Pavel Chizhov et al., provide a comprehensive analysis indicating that HellaSwag, despite its popularity, may not effectively measure the intended reasoning abilities due to several identified shortcomings.

Core Findings and Analytical Approaches

The authors identify a series of deficiencies within the HellaSwag benchmark:

  1. Ungrammaticality and Typos: The paper points out that ungrammatical sentences and typographical errors are pervasive throughout the dataset, particularly within prompt texts and incorrect options. Such errors may skew model predictions, since models can favor grammatically well-formed options on surface cues alone rather than through reasoning.

  2. Nonsensical Constructions: Both prompts and multiple-choice options were often found to be nonsensical. While nonsensical constructions should theoretically only characterize incorrect choices meant to test model reasoning, their presence even in correct answers undermines the benchmark's accuracy.

  3. Ambiguity in Correct Responses: The study highlights instances where multiple answer options could be considered equally plausible, contravening the benchmark's design premise of having one definitively correct answer.

  4. Length Bias: The authors note a correlation between the length of an answer option and its likelihood of being chosen by a model, suggesting a length-based bias rather than logic-driven selection.
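
A length-bias check of this kind can be sketched in a few lines: measure how often the longest answer option coincides with the labeled correct one. Note this is an illustrative sketch, not the paper's actual analysis code; the field names ("options", "label") and the toy items are hypothetical, not HellaSwag's real schema or data.

```python
# Hypothetical sketch: how often is the longest option also the correct one?
# A rate far from chance level (0.25 for four options) would suggest that
# length alone is an exploitable cue.

def longest_is_correct_rate(items: list[dict]) -> float:
    hits = 0
    for item in items:
        lengths = [len(opt) for opt in item["options"]]
        # Index of the longest option vs. index of the labeled answer.
        hits += lengths.index(max(lengths)) == item["label"]
    return hits / len(items)

# Toy two-option items, invented for illustration only.
items = [
    {"options": ["She falls.", "She lands the jump and the crowd applauds."], "label": 1},
    {"options": ["He keeps stirring the pot slowly.", "He explodes."], "label": 0},
    {"options": ["The cat sleeps.", "The cat knocks the glass off the table."], "label": 0},
]
print(longest_is_correct_rate(items))  # 2 of 3 toy items: ~0.667
```

On a real benchmark, the same loop over the full validation split (with the model's chosen option in place of the label) would quantify how much of a model's accuracy the length cue alone can explain.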

The authors employ a robust methodological framework to underpin their analysis, utilizing both language model annotations and predictions. These assessments include zero-prompt evaluations, which show that a significant subset of questions can be answered without context, calling into question the benchmark's validity for assessing common-sense reasoning. The zero-prompt evaluation showed that for an average of 68% of model predictions, the question prompt did not alter the outcome, indicating that the benchmark is not testing the intended capability.
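The zero-prompt consistency check described above can be sketched as follows. This is a hedged illustration, not the authors' code: the scoring function is a self-contained toy stand-in (it ranks options by length and ignores context entirely), whereas a real run would score each option by the LM's log-likelihood conditioned on the context, e.g. via an evaluation harness.

```python
# Sketch of a zero-prompt consistency check: does the model's choice change
# when the real context is replaced with filler text? The paper reports that
# for HellaSwag, predictions remain the same in a majority of cases.

def score(context: str, answer: str) -> float:
    # Toy stand-in for an LM scorer. A real scorer would return the
    # log-probability of `answer` given `context`; this proxy ignores
    # context and just prefers longer answers.
    return float(len(answer))

def predict(context: str, options: list[str]) -> int:
    scores = [score(context, opt) for opt in options]
    return scores.index(max(scores))

def zero_prompt_agreement(items: list[dict]) -> float:
    """Fraction of items whose predicted option is unchanged when the
    real context is swapped for "Lorem ipsum" filler."""
    unchanged = 0
    for item in items:
        with_ctx = predict(item["ctx"], item["options"])
        without_ctx = predict("Lorem ipsum dolor sit amet.", item["options"])
        unchanged += with_ctx == without_ctx
    return unchanged / len(items)

# Invented toy items for illustration; not real HellaSwag entries.
items = [
    {"ctx": "A man is chopping vegetables.",
     "options": ["He cries.", "He then slides everything into a pan."]},
    {"ctx": "A dog runs into a lake.",
     "options": ["It swims back with the stick.", "It flies."]},
]
print(zero_prompt_agreement(items))
```

With a context-blind scorer like this toy one, agreement is trivially 1.0; the paper's point is that real LMs on HellaSwag behave far closer to this degenerate case (over 65% agreement) than a valid reasoning benchmark should allow.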

Proposal of GoldenSwag

In light of these findings, Chizhov and colleagues propose the development of GoldenSwag, a refined subset of HellaSwag. GoldenSwag addresses the identified issues by filtering questions based on grammar, plausibility, ethical content, and length variability, thus aiming to provide a more reliable basis for evaluating model reasoning over diverse, coherent scenarios.

Implications for AI and Future Directions

The implications of this study unfold on multiple fronts. Practically, the findings urge the NLP community to re-evaluate reliance on traditional benchmarks such as HellaSwag, advocating for benchmarks that truly measure the skills they purport to evaluate. Theoretically, the study stresses the importance of construct validity in benchmark design, aligning with principles of sound empirical evaluation in AI research. Furthermore, this paper calls for the advancement of more sophisticated benchmarks that evolve alongside language model capabilities, ensuring that they remain challenging and informative.

For future research, the paper suggests continued scrutiny of existing benchmarks and development of new ones that embrace nuanced assessments of common-sense reasoning, potentially incorporating dimensions like ethical reasoning and abstract problem solving. As AI models evolve, ensuring these benchmarks adapt to test emergent capabilities will be crucial.

In conclusion, "What the HellaSwag?" confronts significant flaws in common-sense reasoning benchmarks, advocating for a rigorous revisiting of how model capabilities are assessed. The authors' introduction of GoldenSwag could provide a blueprint for future benchmarks, fostering a deeper, more coherent understanding of large language model proficiencies.
