
Two-Hop Question Answering

Updated 17 January 2026
  • Two-hop QA is a complex reasoning paradigm that decomposes a query into two sequential subquestions, enabling enhanced compositional reasoning and interpretability.
  • Successive prompting, hierarchical decomposition, and algorithmic strategies drive performance improvements, with methods like SUBQ and PC-SubQ significantly boosting F1 scores.
  • Modular architectures and independent module training, combined with synthetic data generation, support robust fact verification and causal reasoning across various domains.

Two-hop question answering (QA) characterizes a form of complex reasoning wherein an answer is produced only after sequentially addressing at least two dependent subquestions over a structured or unstructured context. Unlike single-hop QA, where a single evidence fragment suffices, two-hop QA necessitates intermediate reasoning steps, each answer potentially informing the next subquery. This paradigm is gaining importance due to its alignment with natural human inquiry and its requirements for compositional reasoning, factual verification, and interpretability in LLM outputs.

1. Formalization and Notation

Two-hop QA can be described as a special case of complex multi-step reasoning. Formally, for a given context passage $p \in \mathcal{P}$ and a complex question $Q$, the process decomposes $Q$ into a sequence of $s$ simple subquestions $\{q_1, q_2, \ldots, q_s\}$. Each subquestion $q_k$ yields an answer $a_k$ such that the interaction chain is $z = \left[(q_1, a_1), (q_2, a_2), \ldots, (q_s, a_s)\right]$, and the final answer $y = a_s$ is returned once the final subquestion is answered or a termination signal (e.g., "EOQ") is emitted (Dua et al., 2022).

For two-hop QA specifically, $s = 2$ and the interaction chain is $z = \big((q_1, a_1), (q_2, a_2)\big)$ with $y = a_2$. This framework generalizes to $n$-hop reasoning, but two-hop is foundational in compositional QA benchmarks and prompting strategies.
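As a concrete illustration, the chain above can be sketched in Python. Here `decompose` and `answer` are hypothetical stand-ins for the QD and QA steps, and the bridge question and toy facts are invented for illustration, not drawn from any benchmark.

```python
# Sketch of the two-hop interaction chain z = [(q1, a1), (q2, a2)] with
# final answer y = a2. `decompose` and `answer` are hypothetical stand-ins
# for the decomposition (QD) and answering (QA) steps.

TOY_FACTS = {  # invented single-hop facts for illustration
    "Who directed Inception?": "Christopher Nolan",
    "When was Christopher Nolan born?": "1970",
}

def decompose(question: str, chain):
    # Hop 1 asks for the bridge entity; hop 2 substitutes it back in.
    if not chain:
        return "Who directed Inception?"
    return f"When was {chain[-1][1]} born?"

def answer(subquestion: str) -> str:
    return TOY_FACTS[subquestion]

def two_hop(question: str):
    chain = []
    for _ in range(2):          # s = 2 hops
        q_k = decompose(question, chain)
        chain.append((q_k, answer(q_k)))
    return chain, chain[-1][1]  # (z, y = a_2)
```

Each answer $a_1$ feeds the second subquestion, which is exactly the dependency that distinguishes two-hop from single-hop QA.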

2. Methodological Approaches

2.1 Successive Prompting ("SUBQ")

The Successive Prompting (SUBQ) paradigm iteratively decomposes $Q$ into subquestions using an LM $\mathbb{L}$, alternating between question-decomposition (QD) and question-answering (QA) steps. At each iteration:

  • QD step:

    $q_k \leftarrow \mathbb{L}\left(p, q_1, a_1, \ldots, q_{k-1}, a_{k-1}, \mathcal{D}_k\right)$

    where $\mathcal{D}_k$ is a set of in-context exemplars for decomposition.

  • QA step:

    $a_k \leftarrow \mathbb{L}\left(p, q_k, \mathcal{A}\right)$

    where $\mathcal{A}$ is a set of QA demonstrations (Dua et al., 2022).

This alternation continues until the model completes the reasoning chain. In practical pipelines (e.g., HiSS in fact-checking (Zhang et al., 2023) or PC-SubQ for causal reasoning (Sgouritsa et al., 2024)), this alternation is implemented via sequential prompting, careful management of demonstration selection, and explicit control tokens to signal the decomposition or answering operations.
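The QD/QA alternation with an "EOQ" termination token can be sketched as follows. The `lm(prompt)` callable, the prompt layout, and the `max_hops` guard are assumptions for illustration; the control tokens "QD:", "QA:", and "EOQ" follow the text.

```python
# Minimal sketch of the Successive Prompting (SUBQ) loop, assuming a
# hypothetical `lm(prompt)` callable standing in for the language model L.

EOQ = "EOQ"  # termination token emitted by the QD step

def successive_prompting(lm, passage, question, qd_exemplars, qa_exemplars,
                         max_hops=8):
    chain = []  # interaction chain z = [(q_1, a_1), ..., (q_s, a_s)]
    while len(chain) < max_hops:
        # QD step: propose the next subquestion given the partial chain.
        qd_prompt = "\n".join([qd_exemplars, passage, question]
                              + [f"Q: {q} A: {a}" for q, a in chain])
        q_k = lm("QD: " + qd_prompt)
        if q_k == EOQ:
            break
        # QA step: answer the subquestion in isolation; only the answer
        # (no rationale) is propagated back into the chain.
        a_k = lm("QA: " + "\n".join([qa_exemplars, passage, q_k]))
        chain.append((q_k, a_k))
    return chain, (chain[-1][1] if chain else None)
```

For two-hop questions the loop runs exactly twice before the QD step emits "EOQ".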

2.2 Hierarchical and Algorithmic Decomposition

Hierarchical step-by-step (HiSS) methods extend SUBQ logic by organizing subquestion generation into explicit hierarchies, as in multi-granularity fact-checking (claim → subclaims → subquestions) (Zhang et al., 2023). For algorithmic domains (e.g., causal discovery), the process is mapped to the steps of a formal algorithm (e.g., the PC algorithm), where each logical step becomes one subquestion in a fixed prompt chain (PC-SubQ) (Sgouritsa et al., 2024). This deep structuring enforces reproducible, interpretable subquestion chains and enables robust performance on algorithmically constrained reasoning.
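An algorithm-mapped fixed prompt chain in the spirit of PC-SubQ can be sketched as below. The step wording and the `{a1}`/`{a2}` substitution scheme are illustrative assumptions, not the paper's actual prompts; the key idea is that the chain is fixed by the algorithm, with each step consuming the previous answer.

```python
# Sketch of an algorithm-mapped fixed prompt chain: each template is one
# step of the underlying algorithm, filled with the previous step's answer.
# The step texts are invented placeholders, not PC-SubQ's real prompts.
PC_CHAIN = [
    "Step 1: List all variable pairs and their (in)dependence relations.",
    "Step 2: Given {a1}, state which edges remain in the skeleton.",
    "Step 3: Given {a2}, orient the remaining edges.",
]

def run_fixed_chain(lm, query, chain_templates):
    answers = {}
    for i, template in enumerate(chain_templates, start=1):
        prompt = query + "\n" + template.format(**answers)
        answers[f"a{i}"] = lm(prompt)
    return answers[f"a{len(chain_templates)}"]
```

Because the chain is fixed in advance, no QD model is needed; only the QA step varies with the input, which is what makes the resulting chains reproducible.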

3. Data Indexing and Example Selection

Modern two-hop QA frameworks rely on vector-based retrieval over demonstration indices for effective few-shot prompting:

  • FAISS-style indices are maintained for both decomposition and answering tasks: $\mathcal{I}_D$ stores pairs mapping the embedding of $[Q, \text{partial chain}]$ to the next $q_k$; $\mathcal{I}_A$ stores pairs mapping the embedding of a simple $q$ to its $a$ (Dua et al., 2022).
  • At each step, nearest neighbors are retrieved and concatenated into the prompt with task-specific control tokens, maximizing LM alignment to the current reasoning step.
  • Example complexity is dynamically balanced to avoid context-window overflows and confusion, typically constraining decomposition to 2–4 partial steps and QA demonstrations to single operations over small item sets.
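The nearest-neighbour exemplar lookup can be illustrated with a toy stand-in for a FAISS index. The bag-of-words `embed` function and the index layout are assumptions for the sketch; a real pipeline would use dense sentence embeddings and a FAISS index.

```python
# Toy nearest-neighbour exemplar retrieval, standing in for a FAISS index.
# `embed` is a hypothetical bag-of-words embedding used only for illustration.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(index, query: str, k: int = 2):
    """index: list of (key_text, demonstration) pairs; return top-k demos."""
    scored = sorted(index, key=lambda kv: cosine(embed(query), embed(kv[0])),
                    reverse=True)
    return [demo for _, demo in scored[:k]]
```

The retrieved demonstrations are then concatenated into the prompt ahead of the current subquestion, with the appropriate control token.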

4. Model Training, Fine-Tuning, and Synthetic Data

The decoupling of decomposition from answering allows for independent module training:

  • QD and QA are trained as separate functions, $(x, \text{chain}_{<k}) \to q_k$ and $q \to a$ respectively; end-to-end triples are unnecessary.
  • Synthetic datasets are generated by defining atomic operations (e.g., COUNT, SUM, DIFF, FILTER) and composing them into higher-order multi-hop questions; this supports scalable bootstrapping of both QD and QA modules and injection of symbolic or specialized heads for complex subquestion types (Dua et al., 2022).
  • Contrastive estimation and dynamic re-sampling ensure balanced and effective fine-tuning, boosting F1 on end-task datasets.
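Synthetic two-hop examples of the kind described above can be generated by composing atomic operations. The operation names COUNT, SUM, and DIFF follow the text; the question templates and the FILTER-then-aggregate composition are illustrative assumptions.

```python
# Sketch of synthetic two-hop data generation by composing atomic operations
# over a toy value list. Templates are invented for illustration.

ATOMIC = {
    "COUNT": lambda items: len(items),
    "SUM":   lambda items: sum(items),
    "DIFF":  lambda items: max(items) - min(items),
}

def make_two_hop(field, values, op):
    """Compose FILTER + one atomic op into a (question, chain, answer) triple."""
    q1 = f"FILTER: list the {field} values"
    a1 = list(values)                        # hop 1: extract the item set
    q2 = f"{op}: {op} of the {field} values"
    a2 = ATOMIC[op](a1)                      # hop 2: aggregate over hop 1
    question = f"What is the {op} of the {field} values?"
    return question, [(q1, a1), (q2, a2)], a2
```

Because each generated example carries its full subquestion chain, the same data can supervise the QD module (question → subquestions) and the QA module (subquestion → answer) independently.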

5. Performance, Evaluation, and Comparative Results

Two-hop QA methods are evaluated using macro-averaged metrics such as precision, recall, and F1 over standard benchmarks (DROP, Corr2Cause, RAWFC, LIAR):

| Setting (Model/Prompt) | F1 (DROP, two-hop) | F1 (Corr2Cause, causal) | F1 (RAWFC, fact-check) |
|---|---|---|---|
| Standard LM (few-shot, no rationale) | 24.9 (Dua et al., 2022) | 0.30 (few-shot + CoT) (Sgouritsa et al., 2024) | 52.0 (CofCED) (Zhang et al., 2023) |
| Chain-of-Thought (CoT) | 27.6 | – | – |
| SUBQ w/ symbolic calc | 31.9 | – | – |
| SUBQ + fine-tuned modules | 51.3 | – | – |
| PC-SubQ (alg. decomposition) | – | 0.64 (PaLM 2 L) | – |
| HiSS (hierarchical SUBQ) | – | – | 53.9 |

These results show that explicit subquestion-based (SUBQ) methods, including successive prompting, hierarchical decomposition (HiSS), and algorithm-mapped PC-SubQ, outperform baseline chain-of-thought and standard prompting methods, often by sizable F1 margins. In PC-SubQ, robustness to variable refactoring and query paraphrasing is demonstrated (Sgouritsa et al., 2024), and in HiSS, hallucination and omission rates are reduced to 5% and 13%, respectively (Zhang et al., 2023).

6. Implementation Best Practices and Interpretability

Key principles for robust two-hop QA implementations include:

  • Explicit control tokens (e.g., "QD:", "QA:", "EOQ") to delimit decomposition and answering stages.
  • Separation of modules: symbolic QA heads or algorithmic solvers can be seamlessly injected for subquestions unsuited for free-form language modeling (COUNT, SUM, DIFF, logic).
  • Iterative, modular chaining: the subquestion-answer record $z$ is fully auditable for transparency, error diagnosis, and downstream interpretability (Zhang et al., 2023).
  • Search augmentation: information retrieval is dynamically invoked when model confidence is low on a subquestion (HiSS paradigm).
  • Chain-of-thought isolation: only propagate answers (not rationales) to conserve context space and minimize error cascade.

A plausible implication is that these modular, explicit strategies are likely necessary as models are scaled to harder, less supervised, or less factoid-centric two-hop (and multi-hop) reasoning benchmarks.

7. Applications and Extensions

Two-hop QA underpins:

  • Open-domain QA with compositional queries (e.g., HotpotQA)
  • Causal reasoning conforming to algorithmic protocols (e.g., PC-SubQ and Corr2Cause (Sgouritsa et al., 2024))
  • Fact verification requiring structured decomposition and evidence chaining (e.g., HiSS and LIAR/RAWFC (Zhang et al., 2023))
  • Scientific, legal, and medical question answering, where veracity and multi-granular evidence synthesis are critical

Extensions encompass deeper multi-hop settings, integration in real-world fact-checking pipelines, and generalization to domains requiring explicit, auditable reasoning chains.


References:

  • Successive Prompting for Decomposing Complex Questions (Dua et al., 2022)
  • Towards LLM-based Fact Verification on News Claims with a Hierarchical Step-by-Step Prompting Method (Zhang et al., 2023)
  • Prompting Strategies for Enabling LLMs to Infer Causation from Correlation (Sgouritsa et al., 2024)