
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning

Published 7 Feb 2025 in cs.CL, cs.AI, and cs.LG | (2502.04689v3)

Abstract: LLMs have demonstrated impressive capabilities on complex evaluation benchmarks, many of which are formulated as question-answering (QA) tasks. Enhancing the performance of LLMs in QA contexts is becoming increasingly vital for advancing their development and applicability. This paper introduces ARR, an intuitive, effective, and general QA solving method that explicitly incorporates three key steps: analyzing the intent of the question, retrieving relevant information, and reasoning step by step. Notably, this paper is the first to introduce intent analysis in QA, which plays a vital role in ARR. Comprehensive evaluations across 10 diverse QA tasks demonstrate that ARR consistently outperforms the baseline methods. Ablation and case studies further validate the positive contributions of each ARR component. Furthermore, experiments involving variations in prompt design indicate that ARR maintains its effectiveness regardless of the specific prompt formulation. Additionally, extensive evaluations across various model sizes, LLM series, and generation settings solidify the effectiveness, robustness, and generalizability of ARR.

Authors (2)

Summary

  • The paper introduces Analyzing, Retrieving, and Reasoning (ARR), a novel zero-shot prompting method guiding LLMs through explicit steps to improve question answering performance.
  • Evaluated across 10 diverse QA datasets, ARR consistently outperforms baseline and zero-shot CoT methods, showing an average improvement of +4.1% over the baseline.
  • Ablation studies confirm each ARR component contributes positively, with 'Analyzing' yielding the largest gains, and the method generalizes across various LLMs and configurations.

The paper introduces the Analyzing, Retrieving, and Reasoning (ARR) method, a novel zero-shot prompting technique designed to enhance the performance of LLMs in question-answering (QA) tasks. The ARR method explicitly incorporates three key steps: analyzing the intent of the question, retrieving relevant information, and step-by-step reasoning. The authors posit that by guiding LLMs through these steps, the ARR prompt can improve accuracy across various QA tasks.

The ARR method is motivated by the observation that while zero-shot Chain-of-Thought (CoT) prompting improves reasoning in LLMs, it only provides vague guidance. The authors argue that answering complex questions involves understanding the question's intent, retrieving relevant information, and applying inductive and deductive reasoning. The ARR method operationalizes this by using the prompt: "Let's analyze the intent of the question, find relevant information, and answer the question with step-by-step reasoning."
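The trigger sentences can be dropped into a standard zero-shot QA template. Below is a minimal Python sketch; the ARR trigger sentence is quoted from the paper, while the surrounding template (field labels, option lettering) is an illustrative assumption rather than the paper's exact format:

```python
# The ARR trigger is quoted from the paper; the template layout is assumed.
ARR_TRIGGER = (
    "Let's analyze the intent of the question, find relevant information, "
    "and answer the question with step-by-step reasoning."
)
COT_TRIGGER = "Let's think step by step."

def build_prompt(question: str, options: list[str], trigger: str = ARR_TRIGGER) -> str:
    """Assemble a zero-shot multiple-choice QA prompt ending with a trigger."""
    option_lines = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return f"Question: {question}\n{option_lines}\nAnswer: {trigger}"
```

Swapping `trigger` for `COT_TRIGGER` (or the empty string) recovers the CoT and baseline conditions compared in the paper.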

The method's efficacy was evaluated using open-weights LLMs on 10 multiple-choice QA datasets, encompassing reading comprehension, commonsense reasoning, world knowledge, and multitask understanding. These datasets include BoolQ, LogiQA, CommonsenseQA (CSQA), SocialIQA (SIQA), SciQ, OpenBookQA (OBQA), AI2 Reasoning Challenge (ARC), BIG-Bench Hard (BBH), Multitask Language Understanding (MMLU), and MMLU-Pro.

The paper details the two-stage process used for QA: reasoning generation and option selection. In the first stage, the LLM generates a response $r_i$ given the input prompt $x_i$:

$$r_i = \mathcal{M}(\tilde{x}_i)$$

where:

  • $\mathcal{M}$ is the LLM
  • $\tilde{x}_i$ is the tokenized representation of the input text $x_i$

In the second stage, the original input $x_i$, the generated response $r_i$, and each choice $o_i^j$ are combined to form a new prompt $z_i^j$:

$$z_i^j = \mathbf{P}(x_i, r_i, o_i^j)$$

where:

  • $\mathbf{P}$ is a prompt function that concatenates the string objects
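Since $\mathbf{P}$ simply concatenates strings, a minimal sketch might look like the following (the newline separators are an assumption; the paper does not specify the exact joining format):

```python
def build_option_prompt(x_i: str, r_i: str, o_ij: str) -> str:
    """z_i^j = P(x_i, r_i, o_i^j): concatenate the original input, the
    generated reasoning, and one candidate option into a scoring prompt.
    The newline separators are an illustrative assumption."""
    return f"{x_i}\n{r_i}\n{o_ij}"
```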

The cross-entropy loss $\mathcal{L}_i^j$ for each $z_i^j$ is then computed:

$$\mathcal{L}_i^j = - \sum_{k} \log \Pr(t_i^{j;k} \mid t_i^{j;<k}; \Theta)$$

where:

  • $t_i^{j;k}$ is the $k$-th token
  • $\Theta$ represents the parameters of $\mathcal{M}$

The option with the lowest loss is selected as the answer:

$$\hat{y}_i = \underset{j \in \{1, 2, \dots, m\}}{\arg\min} \, \mathcal{L}_i^j$$
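Given per-token log-probabilities for each option prompt, the loss and arg-min selection reduce to a few lines. This sketch takes the log-probs as plain lists rather than running an actual LLM forward pass, which is an assumption made for illustration:

```python
def option_loss(token_log_probs: list[float]) -> float:
    """Cross-entropy loss L_i^j for one option: the negative sum of the
    log-probabilities of the tokens in z_i^j under the model."""
    return -sum(token_log_probs)

def select_option(per_option_log_probs: list[list[float]]) -> int:
    """y_hat_i: index of the candidate whose prompt z_i^j has the lowest
    loss. In practice the log-probs would come from a forward pass of the
    LLM over each z_i^j; here they are supplied directly."""
    losses = [option_loss(lps) for lps in per_option_log_probs]
    return min(range(len(losses)), key=losses.__getitem__)
```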

The overall accuracy $\alpha$ is then calculated as:

$$\alpha = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(y_i = \hat{y}_i)$$

where:

  • $\mathbb{I}$ is an indicator function that returns $1$ if $y_i = \hat{y}_i$ and $0$ otherwise
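The accuracy metric is a direct translation of the formula; `gold` and `predicted` are hypothetical label lists introduced here for illustration:

```python
def accuracy(gold: list[int], predicted: list[int]) -> float:
    """alpha = (1/n) * sum_i I(y_i == y_hat_i)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(int(y == y_hat) for y, y_hat in zip(gold, predicted)) / len(gold)
```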

The main results show that ARR consistently improves QA performance across all datasets compared to a baseline method without a specific trigger sentence and the zero-shot CoT method. For example, ARR achieves an average improvement of $+4.1\%$ over the baseline.

Ablation studies were conducted to assess the contribution of each component of ARR (Analyzing, Retrieving, and Reasoning). The results indicate that each component individually outperforms both the Baseline and CoT methods, confirming their positive impact. The Analyzing-only setting yielded the largest performance gain on average, suggesting that intent analysis plays a critical role in question answering.

The generalizability of ARR was evaluated across various settings, including different model sizes, LLM series, generation temperatures, and few-shot scenarios. The LLMs used in these experiments include LLaMA3-Chat models at 1B, 3B, and 8B parameters, as well as models from the Qwen2.5, Gemma, and Mistral series. ARR consistently outperformed the alternative methods across these diverse configurations.

The authors also present a case study illustrating how ARR can lead to correct answers by avoiding intent misunderstanding, context misuse, and faulty reasoning, which can occur in Baseline and CoT methods.

In summary, the paper makes the following claims:

  1. ARR is an effective zero-shot prompting method that improves LLM performance in various question-answering tasks.
  2. ARR consistently outperforms the Baseline and CoT methods, with each component (Analyzing, Retrieving, and Reasoning) contributing positively.
  3. ARR's effectiveness and generalizability are validated across different model sizes, LLM series, and generation configurations.
