Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?
Abstract: Multiple-choice question answering (MCQA) is often used to evaluate LLMs. To see whether MCQA assesses LLMs as intended, we probe whether LLMs can perform MCQA with choices-only prompts, where models must select the correct answer from the choices alone. Across three MCQA datasets and four LLMs, this prompt beats a majority baseline in 11/12 cases, with accuracy gains of up to 0.33. To help explain this behavior, we conduct an in-depth, black-box analysis of memorization, choice dynamics, and question inference. Our key findings are threefold. First, we find no evidence that choices-only accuracy stems from memorization alone. Second, priors over individual choices do not fully explain choices-only accuracy, hinting that LLMs exploit the group dynamics of choices. Third, LLMs have some ability to infer a relevant question from the choices, and can sometimes even recover the original question. Inferring the original question is an impressive reasoning strategy, but it cannot fully explain the high choices-only accuracy of LLMs in MCQA. Thus, while LLMs are not fully incapable of reasoning in MCQA, we still advocate for stronger baselines in MCQA benchmarks, the design of robust MCQA datasets for fair evaluations, and further efforts to explain LLM decision-making.
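The choices-only probe described above can be sketched in a few lines: format a prompt that withholds the question but keeps the answer options, and compare model accuracy against a majority baseline that always predicts the most frequent gold label. This is a minimal illustration, not the paper's implementation; the helper names and four-way letter format are assumptions.

```python
from collections import Counter

def choices_only_prompt(choices):
    """Build a partial-input prompt: the question is withheld,
    only the answer choices are shown to the model."""
    letters = "ABCDEFGH"
    lines = [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    return ("Question: [withheld]\n"
            + "\n".join(lines)
            + "\nAnswer with the letter of the most plausible choice.")

def majority_baseline_accuracy(gold_labels):
    """Accuracy of always predicting the most frequent gold label;
    the floor that a choices-only prompt must beat to signal artifacts."""
    counts = Counter(gold_labels)
    return counts.most_common(1)[0][1] / len(gold_labels)
```

If a model prompted with `choices_only_prompt` scores well above `majority_baseline_accuracy` on a dataset, the gap is the signal the paper analyzes: it must come from memorization, per-choice priors, group dynamics among the choices, or inferring the question.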