On Memorization of Large Language Models in Logical Reasoning
Abstract: LLMs achieve strong performance on challenging reasoning benchmarks, yet they can also make basic reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs' reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks could be due to memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measure of memorization in reasoning tasks, using a dynamically generated logical reasoning benchmark based on Knights and Knaves (K&K) puzzles. We find that LLMs can interpolate and memorize the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet they struggle with slight variations of these puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. Through in-depth analyses of perturbation tests, cross-difficulty transferability, probing of model internals, and fine-tuning with wrong answers, we establish that LLMs develop reasoning skills on K&K puzzles alongside memorization. Finally, our analysis based on a per-sample memorization score sheds light on how LLMs switch between reasoning and memorization when solving logical puzzles. Our code and data are available at https://memkklogic.github.io.
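For readers unfamiliar with Knights and Knaves puzzles: every inhabitant is either a knight (always tells the truth) or a knave (always lies), and the task is to infer each person's type from their statements. Such puzzles are machine-checkable by brute force, which is what makes a dynamically generated benchmark feasible. The sketch below is our own illustration of this idea, not the paper's implementation; the function `solve_kk` and the example puzzle are hypothetical.

```python
from itertools import product

def solve_kk(names, statements):
    """Brute-force a Knights-and-Knaves puzzle.

    names: list of person names.
    statements: dict mapping each name to a callable that takes an
        assignment (dict name -> bool, True = knight) and returns
        the truth value of that person's claim under the assignment.
    Returns all consistent assignments.
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        a = dict(zip(names, values))
        # Consistency: a knight's statement must be true,
        # a knave's statement must be false.
        if all(statements[p](a) == a[p] for p in names):
            solutions.append(a)
    return solutions

# Classic 2-person puzzle:
#   A says: "B is a knave."
#   B says: "A and I are both knaves."
puzzle = {
    "A": lambda a: not a["B"],
    "B": lambda a: (not a["A"]) and (not a["B"]),
}
print(solve_kk(["A", "B"], puzzle))  # unique solution: A is a knight, B is a knave
```

Because the solution is fully determined by enumeration over 2^n type assignments, a generator can sample statements, keep only puzzles with a unique solution, and apply slight perturbations (e.g. swapping a statement) to produce the variations the paper uses to separate memorization from reasoning.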