Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?
Abstract: Multiple-choice question answering (MCQA) is often used to evaluate LLMs. To see whether MCQA assesses LLMs as intended, we probe whether LLMs can perform MCQA with choices-only prompts, where models must select the correct answer from the choices alone. Across three MCQA datasets and four LLMs, this prompt beats a majority baseline in 11/12 cases, with accuracy gains of up to 0.33. To help explain this behavior, we conduct an in-depth, black-box analysis of memorization, choice dynamics, and question inference. Our key findings are threefold. First, we find no evidence that choices-only accuracy stems from memorization alone. Second, priors over individual choices do not fully explain choices-only accuracy, hinting that LLMs exploit the group dynamics of choices. Third, LLMs have some ability to infer a relevant question from the choices, and can sometimes even recover the original question. Inferring the original question is an impressive reasoning strategy, but it cannot fully explain the high choices-only accuracy of LLMs in MCQA. Thus, while LLMs are not fully incapable of reasoning in MCQA, we still advocate for stronger baselines in MCQA benchmarks, the design of robust MCQA datasets for fair evaluations, and further efforts to explain LLM decision-making.
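The choices-only probe described above can be sketched in a few lines: format a prompt that withholds the question but keeps the answer options, and compare model accuracy against a majority baseline that always predicts the most frequent gold label. This is a minimal illustration, not the paper's implementation; the helper names and four-way letter format are assumptions.

```python
from collections import Counter

def choices_only_prompt(choices):
    """Build a partial-input prompt: the question is withheld,
    only the answer choices are shown to the model."""
    letters = "ABCDEFGH"
    lines = [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    return ("Question: [withheld]\n"
            + "\n".join(lines)
            + "\nAnswer with the letter of the most plausible choice.")

def majority_baseline_accuracy(gold_labels):
    """Accuracy of always predicting the most frequent gold label;
    the floor that a choices-only prompt must beat to signal artifacts."""
    counts = Counter(gold_labels)
    return counts.most_common(1)[0][1] / len(gold_labels)
```

If a model prompted with `choices_only_prompt` scores well above `majority_baseline_accuracy` on a dataset, the gap is the signal the paper analyzes: it must come from memorization, per-choice priors, group dynamics among the choices, or inferring the question.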