
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning

Published 7 Feb 2025 in cs.CL, cs.AI, and cs.LG | (2502.04689v3)

Abstract: LLMs have demonstrated impressive capabilities on complex evaluation benchmarks, many of which are formulated as question-answering (QA) tasks. Enhancing the performance of LLMs in QA contexts is becoming increasingly vital for advancing their development and applicability. This paper introduces ARR, an intuitive, effective, and general QA solving method that explicitly incorporates three key steps: analyzing the intent of the question, retrieving relevant information, and reasoning step by step. Notably, this paper is the first to introduce intent analysis in QA, which plays a vital role in ARR. Comprehensive evaluations across 10 diverse QA tasks demonstrate that ARR consistently outperforms the baseline methods. Ablation and case studies further validate the positive contributions of each ARR component. Furthermore, experiments involving variations in prompt design indicate that ARR maintains its effectiveness regardless of the specific prompt formulation. Additionally, extensive evaluations across various model sizes, LLM series, and generation settings solidify the effectiveness, robustness, and generalizability of ARR.

Authors (2)

Summary

  • The paper introduces Analyzing, Retrieving, and Reasoning (ARR), a novel zero-shot prompting method guiding LLMs through explicit steps to improve question answering performance.
  • Evaluated across 10 diverse QA datasets, ARR consistently outperforms baseline and zero-shot CoT methods, showing an average improvement of +4.1% over the baseline.
  • Ablation studies confirm each ARR component contributes positively, with 'Analyzing' yielding the largest gains, and the method generalizes across various LLMs and configurations.

The paper introduces the Analyzing, Retrieving, and Reasoning (ARR) method, a novel zero-shot prompting technique designed to enhance the performance of LLMs in question-answering (QA) tasks. The ARR method explicitly incorporates three key steps: analyzing the intent of the question, retrieving relevant information, and step-by-step reasoning. The authors posit that by guiding LLMs through these steps, the ARR prompt can improve accuracy across various QA tasks.

The ARR method is motivated by the observation that while zero-shot Chain-of-Thought (CoT) prompting improves reasoning in LLMs, it only provides vague guidance. The authors argue that answering complex questions involves understanding the question's intent, retrieving relevant information, and applying inductive and deductive reasoning. The ARR method operationalizes this by using the prompt: "Let's analyze the intent of the question, find relevant information, and answer the question with step-by-step reasoning."
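The trigger sentences can be dropped into a standard zero-shot QA template. Below is a minimal Python sketch; the ARR trigger sentence is quoted from the paper, while the surrounding template (field labels, option lettering) is an illustrative assumption rather than the paper's exact format:

```python
# The ARR trigger is quoted from the paper; the template layout is assumed.
ARR_TRIGGER = (
    "Let's analyze the intent of the question, find relevant information, "
    "and answer the question with step-by-step reasoning."
)
COT_TRIGGER = "Let's think step by step."

def build_prompt(question: str, options: list[str], trigger: str = ARR_TRIGGER) -> str:
    """Assemble a zero-shot multiple-choice QA prompt ending with a trigger."""
    option_lines = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return f"Question: {question}\n{option_lines}\nAnswer: {trigger}"
```

Swapping `trigger` for `COT_TRIGGER` (or the empty string) recovers the CoT and baseline conditions compared in the paper.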

The method's efficacy was evaluated using open-weights LLMs on 10 multiple-choice QA datasets, encompassing reading comprehension, commonsense reasoning, world knowledge, and multitask understanding. These datasets include BoolQ, LogiQA, CommonsenseQA (CSQA), SocialIQA (SIQA), SciQ, OpenBookQA (OBQA), AI2 Reasoning Challenge (ARC), BIG-Bench Hard (BBH), Multitask Language Understanding (MMLU), and MMLU-Pro.

The paper details the two-stage process used for QA: reasoning generation and option selection. In the first stage, the LLM generates a response $r_i$ given the input prompt $x_i$:

$$r_i = \mathcal{M}(\tilde{x}_i)$$

where:

  • $\mathcal{M}$ is the LLM
  • $\tilde{x}_i$ is the tokenized representation of the input text $x_i$

In the second stage, the original input $x_i$, the generated response $r_i$, and each choice $o_i^j$ are combined to form a new prompt $z_i^j$:

$$z_i^j = \mathbf{P}(x_i, r_i, o_i^j)$$

where:

  • $\mathbf{P}$ is a prompt function that concatenates the string objects
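Since $\mathbf{P}$ simply concatenates strings, a minimal sketch might look like the following (the newline separators are an assumption; the paper does not specify the exact joining format):

```python
def build_option_prompt(x_i: str, r_i: str, o_ij: str) -> str:
    """z_i^j = P(x_i, r_i, o_i^j): concatenate the original input, the
    generated reasoning, and one candidate option into a scoring prompt.
    The newline separators are an illustrative assumption."""
    return f"{x_i}\n{r_i}\n{o_ij}"
```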

The cross-entropy loss $\mathcal{L}_i^j$ for each $z_i^j$ is then computed:

$$\mathcal{L}_i^j = - \sum_{k} \log \Pr(t_i^{j;k} \mid t_i^{j;<k}; \Theta)$$

where:

  • $t_i^{j;k}$ is the $k$-th token
  • $\Theta$ represents the parameters of $\mathcal{M}$

The option with the lowest loss is selected as the answer:

$$\hat{y}_i = \underset{j \in \{1, 2, \dots, m\}}{\arg\min} \, \mathcal{L}_i^j$$
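Given per-token log-probabilities for each option prompt, the loss and arg-min selection reduce to a few lines. This sketch takes the log-probs as plain lists rather than running an actual LLM forward pass, which is an assumption made for illustration:

```python
def option_loss(token_log_probs: list[float]) -> float:
    """Cross-entropy loss L_i^j for one option: the negative sum of the
    log-probabilities of the tokens in z_i^j under the model."""
    return -sum(token_log_probs)

def select_option(per_option_log_probs: list[list[float]]) -> int:
    """y_hat_i: index of the candidate whose prompt z_i^j has the lowest
    loss. In practice the log-probs would come from a forward pass of the
    LLM over each z_i^j; here they are supplied directly."""
    losses = [option_loss(lps) for lps in per_option_log_probs]
    return min(range(len(losses)), key=losses.__getitem__)
```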

The overall accuracy $\alpha$ is then calculated as:

$$\alpha = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(y_i = \hat{y}_i)$$

where:

  • $\mathbb{I}$ is an indicator function that returns $1$ if $y_i = \hat{y}_i$ and $0$ otherwise
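The accuracy metric is a direct translation of the formula; `gold` and `predicted` are hypothetical label lists introduced here for illustration:

```python
def accuracy(gold: list[int], predicted: list[int]) -> float:
    """alpha = (1/n) * sum_i I(y_i == y_hat_i)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(int(y == y_hat) for y, y_hat in zip(gold, predicted)) / len(gold)
```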

The main results show that ARR consistently improves QA performance across all datasets compared to a baseline method without a specific trigger sentence and the zero-shot CoT method. For example, ARR achieves an average improvement of $+4.1\%$ over the baseline.

Ablation studies were conducted to assess the contribution of each component of ARR (Analyzing, Retrieving, and Reasoning). The results indicate that each component individually outperforms both the Baseline and CoT methods, confirming their positive impact. The Analyzing-only setting yielded the largest performance gain on average, suggesting that intent analysis plays a critical role in question answering.

The generalizability of ARR was evaluated across various settings, including different model sizes, LLM series, generation temperatures, and few-shot scenarios. The LLMs used in these experiments include LLaMA3-Chat models at 1B, 3B, and 8B parameters, as well as models from the Qwen2.5, Gemma, and Mistral series. ARR consistently outperformed the alternative methods across these diverse configurations.

The authors also present a case study illustrating how ARR can lead to correct answers by avoiding intent misunderstanding, context misuse, and faulty reasoning, which can occur in Baseline and CoT methods.

In summary, the paper makes the following claims:

  1. ARR is an effective zero-shot prompting method that improves LLM performance in various question-answering tasks.
  2. ARR consistently outperforms the Baseline and CoT methods, with each component (Analyzing, Retrieving, and Reasoning) contributing positively.
  3. ARR's effectiveness and generalizability are validated across different model sizes, LLM series, and generation configurations.
