An Analysis of WiCkeD: Enhancing Multiple Choice Benchmarks for Evaluating LLM Competency
The paper presents WiCkeD (Wildcard Distractor), a mechanism for increasing the difficulty of existing multiple-choice question (MCQ) benchmarks, which are commonly used to evaluate large language models (LLMs). The authors propose a simple but impactful modification to the standard MCQ format: one of the choices in each question is replaced with "None of the above." This adjustment, familiar from educational testing, raises the benchmark's difficulty by requiring the model to recognize when the correct answer is absent from the listed options.
According to the paper, WiCkeD can be applied directly to existing benchmarks, avoiding the cost of building entirely new datasets. The authors apply the method to six popular MCQ benchmarks and evaluate eighteen open-weight LLMs, observing an average performance drop of 12.1 percentage points relative to the original datasets. Notably, this degradation persists even with advanced techniques such as chain-of-thought reasoning, underscoring the added challenge the approach poses.
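The transformation itself can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes one plausible convention (drop a randomly chosen option and append the wildcard as the last choice); the paper may place or select options differently. The key invariant is that when the dropped option was the gold answer, "None of the above" becomes the new gold answer.

```python
import random

NOTA = "None of the above"

def wicked_transform(options, answer_idx, rng=random):
    """Apply a WiCkeD-style transform to one MCQ item.

    Drops one randomly chosen option and appends "None of the above".
    Returns the new option list and the index of the new gold answer.
    """
    drop = rng.randrange(len(options))
    new_options = [o for i, o in enumerate(options) if i != drop] + [NOTA]
    if drop == answer_idx:
        # The correct answer was removed, so the wildcard is now gold.
        new_answer = len(new_options) - 1
    else:
        # Indices after the dropped option shift left by one.
        new_answer = answer_idx - (1 if drop < answer_idx else 0)
    return new_options, new_answer
```

Because the drop is random, some transformed items keep their original answer while others force the model to conclude that no listed option is correct, which is precisely the reasoning step the benchmark modification is designed to probe.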
The work is situated within the ongoing discussion of the limitations of current MCQ benchmarks for LLMs. Previous studies have shown that models can answer MCQs correctly by exploiting option biases or memorized surface patterns rather than genuine understanding. With WiCkeD, the authors contribute a tool that mitigates some of these issues, enabling a more accurate assessment of an LLM's reasoning and comprehension.
An important component of the research is the distinction between Single Best Answer (SBA) and Single Correct Answer (SCA) questions within benchmarks. The authors implement an automated classifier, driven by a BERT-based model, to identify SBA questions, which are unsuitable for the WiCkeD transformation: because the "best" answer is defined relative to the options shown, adding "None of the above" can make such questions incoherent.
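To make the filtering step concrete, here is a deliberately simplified stand-in for the paper's BERT-based classifier: a hypothetical keyword heuristic that flags SBA-style phrasing ("best", "most appropriate", and similar). The marker list and function names are illustrative assumptions, not the authors' method, but the pipeline shape (classify, then keep only SCA items for transformation) mirrors what the paper describes.

```python
# Hypothetical keyword heuristic standing in for the paper's
# BERT-based SBA classifier. SBA questions ask for the *best* option
# among those shown, so "None of the above" would be incoherent.
SBA_MARKERS = ("best", "most appropriate", "most likely", "most suitable")

def looks_like_sba(question: str) -> bool:
    """Flag questions whose wording suggests a Single Best Answer format."""
    q = question.lower()
    return any(marker in q for marker in SBA_MARKERS)

def filter_for_wicked(dataset):
    """Keep only Single-Correct-Answer items, which WiCkeD can transform."""
    return [ex for ex in dataset if not looks_like_sba(ex["question"])]
```

A learned classifier replaces the keyword test in practice; the heuristic only illustrates why the filter exists and where it sits in the pipeline.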
The results offer considerable insight. For instance, models with stronger reasoning capabilities, such as the DS-R1 variants of the Qwen and Llama families, degrade less under the WiCkeD transformation. This suggests that WiCkeD not only raises the challenge level but also reveals differences in model robustness that remain hidden in the standard setting.
The paper concludes by emphasizing WiCkeD's applicability as a supplementary evaluation measure for LLMs. It argues that the substantial variability in performance across models reflects the nuanced insight WiCkeD provides into LLM capabilities, especially a model's ability to handle option sets from which the correct answer is missing. The authors acknowledge limitations in the accuracy of their automated SBA detection and note that WiCkeD's impact on closed models such as GPT-4 remains to be explored.
Looking forward, WiCkeD offers both theoretical and practical value for AI development. As a benchmark modification, it better reflects real-world scenarios in which the right answer may not appear among predefined choices, encouraging the development of LLMs with stronger reasoning. Moreover, its methodological simplicity makes it broadly adaptable, which could foster more robust benchmarking standards in the evolving landscape of AI evaluation.