On Memorization of Large Language Models in Logical Reasoning
Abstract: LLMs achieve strong performance on challenging reasoning benchmarks, yet they can also make basic reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs' reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks could be due to memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measure of memorization in reasoning tasks, using a dynamically generated logical reasoning benchmark based on Knights and Knaves (K&K) puzzles. We find that LLMs can interpolate and memorize the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet they struggle with slight variations of these puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. Through in-depth analyses of perturbation tests, cross-difficulty transferability, probing of model internals, and fine-tuning with wrong answers, we establish that LLMs develop reasoning skills on K&K puzzles alongside memorization. Finally, our analysis based on a per-sample memorization score sheds light on how LLMs switch between reasoning and memorization when solving logical puzzles. Our code and data are available at https://memkklogic.github.io.
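For readers unfamiliar with Knights and Knaves puzzles: every inhabitant is either a knight (always tells the truth) or a knave (always lies), and the task is to infer each person's type from their statements. Such puzzles are machine-checkable by brute force, which is what makes a dynamically generated benchmark feasible. The sketch below is our own illustration of this idea, not the paper's implementation; the function `solve_kk` and the example puzzle are hypothetical.

```python
from itertools import product

def solve_kk(names, statements):
    """Brute-force a Knights-and-Knaves puzzle.

    names: list of person names.
    statements: dict mapping each name to a callable that takes an
        assignment (dict name -> bool, True = knight) and returns
        the truth value of that person's claim under the assignment.
    Returns all consistent assignments.
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        a = dict(zip(names, values))
        # Consistency: a knight's statement must be true,
        # a knave's statement must be false.
        if all(statements[p](a) == a[p] for p in names):
            solutions.append(a)
    return solutions

# Classic 2-person puzzle:
#   A says: "B is a knave."
#   B says: "A and I are both knaves."
puzzle = {
    "A": lambda a: not a["B"],
    "B": lambda a: (not a["A"]) and (not a["B"]),
}
print(solve_kk(["A", "B"], puzzle))  # unique solution: A is a knight, B is a knave
```

Because the solution is fully determined by enumeration over 2^n type assignments, a generator can sample statements, keep only puzzles with a unique solution, and apply slight perturbations (e.g. swapping a statement) to produce the variations the paper uses to separate memorization from reasoning.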