- The paper introduces Analyzing, Retrieving, and Reasoning (ARR), a novel zero-shot prompting method guiding LLMs through explicit steps to improve question answering performance.
- Evaluated across 10 diverse QA datasets, ARR consistently outperforms baseline and zero-shot CoT methods, showing an average improvement of +4.1% over the baseline.
- Ablation studies confirm each ARR component contributes positively, with 'Analyzing' yielding the largest gains, and the method generalizes across various LLMs and configurations.
The paper introduces the Analyzing, Retrieving, and Reasoning (ARR) method, a novel zero-shot prompting technique designed to enhance the performance of LLMs in question-answering (QA) tasks. The ARR method explicitly incorporates three key steps: analyzing the intent of the question, retrieving relevant information, and step-by-step reasoning. The authors posit that by guiding LLMs through these steps, the ARR prompt can improve accuracy across various QA tasks.
The ARR method is motivated by the observation that while zero-shot Chain-of-Thought (CoT) prompting improves reasoning in LLMs, it only provides vague guidance. The authors argue that answering complex questions involves understanding the question's intent, retrieving relevant information, and applying inductive and deductive reasoning. The ARR method operationalizes this by using the prompt: "Let's analyze the intent of the question, find relevant information, and answer the question with step-by-step reasoning."
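To make the comparison concrete, the three prompting conditions can be assembled as in the sketch below. The ARR trigger is quoted from the paper and the CoT trigger is the standard zero-shot CoT phrase; the surrounding prompt template and the `build_prompt` helper are illustrative assumptions, not the authors' exact format.

```python
# Trigger sentences for the three zero-shot prompting conditions.
# ARR trigger quoted from the paper; CoT is the standard zero-shot CoT phrase;
# the baseline condition uses no trigger sentence.
TRIGGERS = {
    "baseline": "",
    "cot": "Let's think step by step.",
    "arr": ("Let's analyze the intent of the question, find relevant "
            "information, and answer the question with step-by-step reasoning."),
}

def build_prompt(question: str, options: list[str], method: str = "arr") -> str:
    """Assemble a zero-shot multiple-choice QA prompt (illustrative template)."""
    lines = [f"Question: {question}"]
    lines += [f"({chr(ord('A') + k)}) {opt}" for k, opt in enumerate(options)]
    lines.append(f"Answer: {TRIGGERS[method]}".rstrip())
    return "\n".join(lines)

print(build_prompt("Is the sky blue?", ["yes", "no"], "arr"))
```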
The method's efficacy was evaluated using open-weights LLMs on 10 multiple-choice QA datasets, encompassing reading comprehension, commonsense reasoning, world knowledge, and multitask understanding. These datasets include BoolQ, LogiQA, CommonsenseQA (CSQA), SocialIQA (SIQA), SciQ, OpenBookQA (OBQA), AI2 Reasoning Challenge (ARC), BIG-Bench Hard (BBH), Multitask Language Understanding (MMLU), and MMLU-Pro.
The paper details the two-stage process used for QA: reasoning generation and option selection. In the first stage, the LLM generates a response $r_i$ given the input prompt $x_i$:

$$r_i = M(\tilde{x}_i)$$

where:
- $M$ is the LLM
- $\tilde{x}_i$ is the tokenized representation of the input text $x_i$
In the second stage, the original input $x_i$, the generated response $r_i$, and each choice $o_{ij}$ are combined to form a new prompt $z_{ij}$:

$$z_{ij} = P(x_i, r_i, o_{ij})$$

where:
- $P$ is a prompt function that concatenates the string objects
The cross-entropy loss $L_{ij}$ for each $z_{ij}$ is then computed:

$$L_{ij} = -\sum_{k} \log \Pr(t_{ij;k} \mid t_{ij;<k}; \Theta)$$

where:
- $t_{ij;k}$ is the $k$-th token of $z_{ij}$
- $\Theta$ represents the parameters of $M$
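Under this definition, the per-option loss is simply the negative sum of the log-probabilities the model assigns to the option's tokens. A minimal sketch, with toy conditional probabilities standing in for actual model outputs:

```python
import math

def option_loss(token_probs: list[float]) -> float:
    """Cross-entropy loss L_ij = -sum_k log Pr(t_k | t_<k):
    the negative log-likelihood of the option's tokens."""
    return -sum(math.log(p) for p in token_probs)

# Toy conditional probabilities a model might assign to an option's tokens.
loss = option_loss([0.5, 0.25])  # -(log 0.5 + log 0.25) = log 8
```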
The option with the lowest loss is selected as the answer:

$$\hat{y}_i = \arg\min_{j \in \{1, 2, \dots, m\}} L_{ij}$$
The overall accuracy $\alpha$ is then calculated as:

$$\alpha = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(y_i = \hat{y}_i)$$

where:
- $\mathbb{I}$ is an indicator function that returns $1$ if $y_i = \hat{y}_i$ and $0$ otherwise
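The selection and scoring steps above can be combined into a short self-contained sketch; the losses here are toy values, whereas in the paper they come from the LLM as described above:

```python
def select_option(losses: list[float]) -> int:
    """Pick the option index j with the lowest loss L_ij."""
    return min(range(len(losses)), key=lambda j: losses[j])

def accuracy(gold: list[int], losses_per_question: list[list[float]]) -> float:
    """alpha = (1/n) * sum_i I(y_i == y_hat_i)."""
    preds = [select_option(l) for l in losses_per_question]
    return sum(int(y == yhat) for y, yhat in zip(gold, preds)) / len(gold)

# Two toy questions with three options each; gold labels are option indices.
losses = [[2.3, 1.1, 4.0], [0.9, 1.5, 1.2]]
gold = [1, 0]
print(accuracy(gold, losses))  # → 1.0
```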
The main results show that ARR consistently improves QA performance across all datasets compared to a baseline method without a specific trigger sentence and the zero-shot CoT method. For example, ARR achieves an average improvement of +4.1% over the baseline.
Ablation studies were conducted to assess the contribution of each component of ARR (Analyzing, Retrieving, and Reasoning). The results indicate that each component individually outperforms both the Baseline and CoT methods, confirming their positive impact. The Analyzing-only setting yielded the largest performance gain on average, suggesting that intent analysis plays a critical role in question answering.
The generalizability of ARR was evaluated across various settings, including different model sizes, LLM series, generation temperatures, and few-shot scenarios. The LLMs used in these experiments include LLaMA3-Chat models at 1B, 3B, and 8B parameters, as well as Qwen2.5, Gemma, and Mistral models. ARR consistently outperformed alternative methods across these diverse configurations.
The authors also present a case study illustrating how ARR can lead to correct answers by avoiding intent misunderstanding, context misuse, and faulty reasoning, which can occur in Baseline and CoT methods.
In summary, the paper makes the following claims:
- ARR is an effective zero-shot prompting method that improves LLM performance in various question-answering tasks.
- ARR consistently outperforms the Baseline and CoT methods, with each component (Analyzing, Retrieving, and Reasoning) contributing positively.
- ARR's effectiveness and generalizability are validated across different model sizes, LLM series, and generation configurations.