Disentangling Memory and Reasoning Ability in Large Language Models

Published 20 Nov 2024 in cs.CL | (2411.13504v3)

Abstract: LLMs have demonstrated strong performance in handling complex tasks requiring both extensive knowledge and reasoning abilities. However, the existing LLM inference pipeline operates as an opaque process without explicit separation between knowledge retrieval and reasoning steps, making the model's decision-making process unclear and disorganized. This ambiguity can lead to issues such as hallucinations and knowledge forgetting, which significantly impact the reliability of LLMs in high-stakes domains. In this paper, we propose a new inference paradigm that decomposes the complex inference process into two distinct and clear actions: (1) memory recall, which retrieves relevant knowledge, and (2) reasoning, which performs logical steps based on the recalled knowledge. To facilitate this decomposition, we introduce two special tokens, <memory> and <reason>, guiding the model to distinguish between steps that require knowledge retrieval and those that involve reasoning. Our experimental results show that this decomposition not only improves model performance but also enhances the interpretability of the inference process, enabling users to identify sources of error and refine model responses effectively. The code is available at https://github.com/MingyuJ666/Disentangling-Memory-and-Reasoning.


Summary

The paper introduces a novel inference framework using <memory> and <reason> tokens to distinctly control memory recall and logical reasoning in large language models.
It reports accuracies of 78.6% on StrategyQA, 82.3% on CommonsenseQA, and 86.6% on TruthfulQA, outperforming chain-of-thought (CoT) prompting baselines.
The method enhances interpretability and robustness by clarifying error sources and paving the way for adaptive memory and reasoning strategies in high-stakes domains.

Disentangling Memory and Reasoning Ability in LLMs

Introduction

This paper presents a methodological advance in the inference process of LLMs, addressing the opacity of current models, which do not separate memory recall from logical reasoning in complex tasks. By introducing a novel inference paradigm, the research delineates these processes with two special tokens: $\langle \text{memory} \rangle$ for knowledge retrieval and $\langle \text{reason} \rangle$ for reasoning steps. These tokens aim to enhance model interpretability, accuracy, and robustness by clarifying the decision-making process of LLMs, which is often muddled by hallucinations and knowledge forgetting.

Figure 1: Workflow. The first and second steps synthesize training data, while the third step fine-tunes the LLM on data annotated with $\langle \text{memory} \rangle$ and $\langle \text{reason} \rangle$ tokens.

Methodology

The paper proposes a three-step approach:

Data Generation: Using a powerful LLM such as GPT-4, the authors generate annotated actions that separate memory from reasoning through designated tokens. Each answer to a question-answering task is broken into itemized actions, which gives the model structured guidance during training.
Training with Control Tokens: A custom LLM is trained using data embedded with the special tokens $\langle \text{memory} \rangle$ and $\langle \text{reason} \rangle$. These tokens act as control signals that guide the model in differentiating between knowledge recall and reasoning when it resolves complex queries.
Error Analysis: The model is tested on multiple benchmarks (StrategyQA, CommonsenseQA, and TruthfulQA) to evaluate how well memory and reasoning are disentangled. The analysis traces errors to their source and finds that they arise predominantly in reasoning steps rather than in memory recall.
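The control-token format described in the steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the literal token spellings `<memory>` and `<reason>`, the one-step-per-line layout, and the names `Step`, `format_trace`, `parse_trace`, and `count_by_kind` are all assumptions made for illustration. The sketch serializes typed steps into an annotated trace and parses such a trace back into memory and reasoning steps, which is the kind of separation the error analysis relies on.

```python
import re
from dataclasses import dataclass

# Assumed literal spellings of the control tokens; the paper's tokenizer-level
# representation may differ.
TOKENS = {"memory": "<memory>", "reason": "<reason>"}


@dataclass
class Step:
    kind: str  # "memory" (knowledge recall) or "reason" (logical step)
    text: str


def format_trace(steps):
    """Serialize steps, one per line, each prefixed by its control token."""
    lines = []
    for step in steps:
        if step.kind not in TOKENS:
            raise ValueError(f"unknown step kind: {step.kind!r}")
        lines.append(f"{TOKENS[step.kind]} {step.text}")
    return "\n".join(lines)


# A step runs from its control token up to the next control token or end of text.
_STEP_RE = re.compile(
    r"<(memory|reason)>\s*(.*?)\s*(?=<(?:memory|reason)>|\Z)", re.DOTALL
)


def parse_trace(trace):
    """Split an annotated trace back into typed steps."""
    return [Step(kind, text) for kind, text in _STEP_RE.findall(trace)]


def count_by_kind(steps):
    """Count how many steps are memory recall and how many are reasoning."""
    counts = {"memory": 0, "reason": 0}
    for step in steps:
        counts[step.kind] += 1
    return counts


# Hypothetical example trace, written for this sketch (not taken from the paper).
example = [
    Step("memory", "Paris is the capital of France."),
    Step("memory", "France is a country in Europe."),
    Step("reason", "Therefore Paris is located in Europe."),
]
```

Calling `parse_trace(format_trace(example))` recovers the original list of steps, so a wrong final answer can be inspected step by step and attributed either to a recalled fact or to a reasoning step.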

Experimental Results

The method improves accuracy across three benchmarks:

StrategyQA: Achieves 78.6% accuracy with LLaMA-3.1-8B, a 1.3% improvement over baseline approaches.
CommonsenseQA: Reaches 82.3% accuracy, improving on traditional chain-of-thought (CoT) methods.
TruthfulQA: Achieves 86.6% accuracy, surpassing GPT-4's performance and supporting the utility of the proposed method in high-stakes domains.

Figure 2: Comparison of decoupling results between the proposed algorithm and one-shot CoT prompting on all datasets, both using LLaMA-3.1-8B.

Case Study: Attention Analysis

In LLMs enhanced with the control tokens, attention concentrates on the special tokens during reasoning. This supports the hypothesis that these tokens play a critical role in organizing knowledge recall and inference steps.

Figure 3: Attention heatmaps for two test examples, generated by LLaMA-3.1-8B enhanced with reason and memory control tokens, using the same attention head. The highlighted parts are the special tokens.
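One simple way to quantify the kind of focus shown in such heatmaps is to measure how much attention mass lands on the control-token positions. The sketch below is an assumption-laden illustration, not the analysis used in the paper: the function name `control_token_attention_share`, the toy token list, and the toy attention matrix are all hypothetical, and the matrix is written as nested Python lists where `attn[i][j]` is the weight from query position `i` to key position `j`, with each row summing to 1.

```python
def control_token_attention_share(attn, tokens,
                                  control_tokens=("<memory>", "<reason>")):
    """Mean fraction of each query's attention that falls on control tokens.

    attn[i][j] is the attention weight from query position i to key position j.
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must not be empty")
    if len(attn) != n or any(len(row) != n for row in attn):
        raise ValueError("attn must be a square matrix matching len(tokens)")

    # Key positions occupied by a control token.
    control_idx = [j for j, tok in enumerate(tokens) if tok in control_tokens]

    # For every query row, sum the weights assigned to control-token keys.
    shares = [sum(row[j] for j in control_idx) for row in attn]
    return sum(shares) / n


# Toy causal attention pattern for one head (hypothetical numbers).
toy_tokens = ["<memory>", "Paris", "<reason>", "so"]
toy_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
    [0.4, 0.1, 0.5, 0.0],
    [0.3, 0.1, 0.4, 0.2],
]
share = control_token_attention_share(toy_attn, toy_tokens)
```

A share well above what the number of control-token positions alone would suggest indicates that the head attends disproportionately to those tokens. With a real model, the matrix would come from one head's attention weights for a single example.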

Limitations and Future Work

The research acknowledges one limitation: the approach lengthens sequences at inference time, which adds computational overhead. Future work aims to explore dynamic memory updating and adaptive reasoning steps to further enhance model capability and efficiency.

Conclusion

The paper introduces a structured framework for LLMs that demarcates memory recall and reasoning, improving accuracy and interpretability in complex tasks. The paradigm is applicable to domains that require transparent inference processes, such as healthcare and finance, and it opens the way for future research into adaptive and scalable memory-reasoning mechanisms.

Figure 4: Incorrect samples. The green sections represent the questions, the steps of the answers, and the incorrect answers; the yellow areas indicate the correct answers, and the red highlights mark the causes of the errors.
