- The paper introduces EVR+, which explicitly generates and executes symbolic operations to decompose complex tasks into simpler steps.
- The framework leverages iterative cycles with episodic memory, external memory, and a local variable buffer to boost compositional generalization and data efficiency.
- EVR+ outperforms end-to-end models on synthetic tasks by effectively handling nested loops, conditions, and recursion, and it generalizes better to unseen task depths.
Explainable Verbal Reasoner Plus (EVR+) (2305.00061) is a natural language reasoning framework designed to enhance LLMs' ability to perform diverse compositional reasoning tasks. LLMs often struggle with compositional generalization, meaning they fail to perform well on complex tasks composed of simpler components when trained only on simpler examples. EVR+ tackles this by allowing the model to explicitly generate and execute symbolic operations and decompose complex tasks into simpler, manageable steps in a flexible manner. This framework is an evolution of the earlier Explainable Verbal Reasoner (EVR) but supports a wider variety of reasoning types, including nested loops, conditions, list operations, and different forms of recursion.
The core of the EVR+ framework operates in cycles. In each cycle, a Program Generator receives the current state from an episodic memory, generates a short program, and a Program Interpreter parses and executes this program. The execution results in updates to the episodic memory, a local variable buffer, and potentially an external memory. This iterative process continues until the problem is solved.
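The generate-execute cycle above can be sketched as a simple driver loop. This is an illustrative sketch only; the function and attribute names (solve, execute, done, answer) are assumptions, not the paper's actual API.

```python
# Minimal sketch of the EVR+ execution cycle (names are illustrative).
# Each cycle: the Program Generator reads the episodic memory and emits a
# program string; the Program Interpreter parses and executes it, producing
# an updated state. The loop ends when the program returns a final answer.

def solve(generator, interpreter, episodic_memory, max_cycles=50):
    """Run generate/execute cycles until the program returns an answer."""
    for _ in range(max_cycles):
        program = generator(episodic_memory)       # LLM call
        result = interpreter.execute(program)      # parse + execute
        if result.done:                            # a return() was executed
            return result.answer
        episodic_memory = result.episodic_memory   # carry updated state forward
    raise RuntimeError("no answer within cycle budget")
```

The `max_cycles` guard is a practical addition, not something the paper specifies; it bounds runaway generation loops.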
Three types of memory are utilized:
- Episodic Memory: Stores the results of executed programs and serves as the input for the Program Generator.
- External Memory: Holds static information like the original context or rules, accessed by specific operators but not typically modified during problem-solving.
- Local Variable Buffer: Stores temporary variables used only during the execution of a specific program step.
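A rough data layout for the three memories might look like the following. The field names and types here are assumptions for illustration; the paper does not prescribe this representation.

```python
from dataclasses import dataclass, field

# Illustrative layout of the three EVR+ memories (field names assumed).
@dataclass
class EVRState:
    episodic_memory: list = field(default_factory=list)   # executed-program results; input to the Program Generator
    external_memory: dict = field(default_factory=dict)   # static context/rules, read by operators
    local_variables: dict = field(default_factory=dict)   # temporary #0, #1, ... within one program step

# External memory holds the static rules; local variables and episodic
# memory are filled in as programs execute.
state = EVRState(external_memory={"rules": ["if A then B"]})
state.local_variables["#0"] = ["apple", "pear"]
state.episodic_memory.append("qa result: Alice has 3 apples")
```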
The Program Generator is implemented using an LLM (UnifiedQA-T5-base in the experiments). It takes the textual content of the episodic memory and outputs a program string. The generated programs use a defined set of symbolic operations, which include:
- Constants (numbers, strings, booleans)
- Local Variables
- Control Flow (for loops, while loops, if-else statements)
- Memory Operations (read/write to episodic/external memory, clearing memory, starting a new recursion with new_mem(), returning results with return())
- Utility functions (like appending to lists)
- External Tools (modules like qa for question answering and rewrite for text transformation)
The Program Interpreter is responsible for parsing the generated program string into executable instructions and executing them sequentially. The interpreter is designed to handle the specific syntax and semantics of the defined operations. For instance, a loop construct for #1 in #0; ... end_for; iterates through a list stored in variable #0, assigning each element to #1 for processing within the loop body. The new_mem(#0) operation is a key feature enabling recursive decomposition; it creates a new execution context (with a new episodic memory initialized from #0) for solving a sub-problem.
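The loop construct can be illustrated with a toy interpreter for just that one piece of syntax. This sketch is heavily simplified and assumed: the real interpreter supports the full operation set (if-else, new_mem, memory reads/writes), and here the loop body is merely echoed with the loop variable substituted rather than actually executed.

```python
import re

# Toy interpreter for one EVR+-style construct:
#   for #1 in #0; <body> end_for;
# It iterates over the list bound to #0, binding each element to #1.

def run_for_loop(program, variables):
    """Execute a single for-loop over a list variable, collecting outputs."""
    m = re.match(r"for (#\d+) in (#\d+); (.*) end_for;", program)
    loop_var, list_var, body = m.groups()
    outputs = []
    for element in variables[list_var]:
        variables[loop_var] = element  # bind the loop variable
        # Stand-in for real operator execution: substitute the loop
        # variable into the body text.
        outputs.append(body.replace(loop_var, str(element)))
    return outputs

result = run_for_loop("for #1 in #0; qa(#1); end_for;", {"#0": ["apple", "pear"]})
# → ["qa(apple);", "qa(pear);"]
```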
To evaluate EVR+'s practical performance on compositional reasoning, the authors introduced the SynthCompR dataset, consisting of five synthetic tasks:
- Chaining: Requires tracking changes to an object's state through a sequence of events (e.g., tracking item quantities transferred between people).
- Cartesian Product: Requires enumerating all combinations of elements from multiple lists (e.g., listing all items each person in a group possesses).
- Tree Search: Involves logical deduction by applying rules to facts, potentially requiring backtracking to find a valid proof path (e.g., proving a statement using if-then rules and initial facts).
- Chaining Tree Search: Combines chaining and tree search, requiring initial state tracking followed by logical deduction based on the final states.
- Cartesian Tree Search: Combines Cartesian product and tree search, requiring the enumeration of facts derived from a Cartesian product statement followed by logical deduction.
These tasks are structured with varying "depths" to measure compositional generalization, where higher depth implies a more complex problem structure requiring more reasoning steps or deeper composition of operations.
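To give intuition for how problem size scales, consider the Cartesian Product task: if depth is treated as the number of lists being combined (the paper's exact depth definition may differ), the number of combinations grows multiplicatively. The example below is illustrative only.

```python
from itertools import product

# Illustrative only: combination count for a Cartesian-product task,
# treating "depth" as the number of lists being combined.

def cartesian_size(lists):
    """Number of element combinations across the given lists."""
    return len(list(product(*lists)))

lists = [["Alice", "Bob"], ["apple", "pear"], ["red", "green"]]
cartesian_size(lists)  # 2 * 2 * 2 = 8 combinations
```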
Implementing EVR+ involves training the LLM that serves as the Program Generator, and potentially the tool models (like qa and rewrite), to produce the correct sequence of programs and intermediate results given the episodic memory. The paper describes a training approach that uses hand-crafted rules and templates to generate paired data (episodic memory state, desired program/output) for the various interaction patterns within the framework (e.g., generate_program, qa, rewrite). A single UnifiedQA-T5-base model was fine-tuned on a mixed dataset containing examples for all these patterns.
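Template-based pair generation of this kind can be sketched as below. The template text, field names, and the specific chaining question are invented for illustration; the paper's actual rules and templates differ.

```python
import random

# Sketch of template-based training-pair generation for one interaction
# pattern (here, qa). Templates and field names are illustrative, not the
# paper's actual generation rules.

def make_chaining_example(rng):
    """Produce one (episodic-memory state, target output) training pair."""
    people = rng.sample(["Alice", "Bob", "Carol"], 2)
    n = rng.randint(1, 5)
    context = f"{people[0]} gave {n} apples to {people[1]}."
    # Paired input/output for the qa pattern:
    memory = f"{context} How many apples did {people[1]} receive?"
    target = str(n)
    return {"pattern": "qa", "input": memory, "output": target}

example = make_chaining_example(random.Random(0))
```

Generating examples for every interaction pattern and mixing them into one dataset mirrors the paper's setup of fine-tuning a single model on all patterns jointly.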
The experimental results on SynthCompR demonstrate practical advantages of EVR+ compared to an end-to-end trained UnifiedQA-T5-large baseline:
- Improved Generalization: EVR+ showed considerably better compositional generalization to unseen depths (out-of-domain data) with less degradation in performance compared to the end-to-end model, even when trained on significantly fewer examples.
- Data Efficiency: EVR+ was less data-hungry, both for learning individual tasks (only 500 examples sufficed for good performance on chaining and tree search) and for transferring to combined tasks (matching or beating the end-to-end model on chaining-tree-search and Cartesian-tree-search with significantly fewer fine-tuning examples).
- Compatibility with LLMs: An exploration into using few-shot prompted GPT-3 as the Program Generator showed potential, with many interaction patterns achieving high accuracy. However, generating grammatically correct programs in the custom language proved challenging from only a few examples, suggesting that adapting general-purpose LLMs may require modifications to the framework or more task-specific training data.
While promising, the practical deployment of EVR+ has limitations. The framework was evaluated only on synthetic tasks; real-world applications might face challenges with noisy or ambiguous natural language input, potentially leading to cascading errors in the reasoning steps. Furthermore, the iterative nature of executing multiple program cycles per problem significantly increases inference time compared to a single forward pass of an end-to-end model. For example, a depth 4 chaining problem required running the LLM 26-27 times.
In summary, EVR+ offers a practical approach to improving compositional reasoning in LLMs by externalizing the reasoning process into generated and executed programs. Its ability to handle diverse operations and flexible decomposition allows it to generalize better and learn more efficiently on complex, structured tasks compared to standard end-to-end training, albeit with increased computational cost during inference. Future work could explore bridging the gap to real-world data and optimizing inference efficiency.