Two-Hop Question Answering
- Two-hop QA is a complex reasoning paradigm that decomposes a query into two sequential subquestions, enabling enhanced compositional reasoning and interpretability.
- Successive prompting, hierarchical decomposition, and algorithmic strategies drive performance improvements, with methods like SUBQ and PC-SubQ significantly boosting F1 scores.
- Modular architectures and independent module training, combined with synthetic data generation, support robust fact verification and causal reasoning across various domains.
Two-hop question answering (QA) characterizes a form of complex reasoning wherein an answer is produced only after sequentially addressing at least two dependent subquestions over a structured or unstructured context. Unlike single-hop QA, where a single evidence fragment suffices, two-hop QA necessitates intermediate reasoning steps, each answer potentially informing the next subquery. This paradigm is gaining importance due to its alignment with natural human inquiry and its requirements for compositional reasoning, factual verification, and interpretability in LLM outputs.
1. Formalization and Notation
Two-hop QA can be described as a special case of complex multi-step reasoning. Formally, for a given context passage $p$ and a complex question $Q$, the process decomposes $Q$ into a sequence of simple subquestions $q_1, \dots, q_k$. Each subquestion $q_i$ yields an answer $a_i$, so the interaction chain is $(q_1, a_1), (q_2, a_2), \dots, (q_k, a_k)$, and the final answer is returned once the final subquestion is answered or a termination signal (e.g., "EOQ") is emitted (Dua et al., 2022).
For two-hop QA specifically, $k = 2$ and the interaction chain is $(q_1, a_1), (q_2, a_2)$, with $a_2$ serving as the final answer. This framework generalizes to $k$-hop reasoning, but two-hop is foundational in compositional QA benchmarks and prompting strategies.
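The chain structure above can be made concrete with a minimal sketch (the `Hop` type and the example chain are illustrative toy data, not from Dua et al., 2022):

```python
from dataclasses import dataclass

@dataclass
class Hop:
    """One (subquestion, answer) step in the interaction chain."""
    question: str
    answer: str

def final_answer(chain: list) -> str:
    """The answer to the complex question is the answer of the last hop."""
    assert chain, "a chain needs at least one resolved subquestion"
    return chain[-1].answer

# A two-hop chain (k = 2): the first answer feeds the second subquestion.
chain = [
    Hop("Who directed Inception?", "Christopher Nolan"),
    Hop("When was Christopher Nolan born?", "1970"),
]
```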
2. Methodological Approaches
2.1 Successive Prompting ("SUBQ")
The Successive Prompting (SUBQ) paradigm iteratively decomposes $Q$ into subquestions using an LM, alternating between question decomposition (QD) and question-answering (QA) steps. At each iteration $i$:
- QD step: $q_i = \mathrm{LM}\big(D_{\mathrm{QD}}, p, Q, (q_1, a_1), \dots, (q_{i-1}, a_{i-1})\big)$,
where $D_{\mathrm{QD}}$ is a set of in-context exemplars for decomposition.
- QA step: $a_i = \mathrm{LM}\big(D_{\mathrm{QA}}, p, q_i\big)$,
where $D_{\mathrm{QA}}$ is a set of QA demonstrations (Dua et al., 2022).
This alternation continues until the model completes the reasoning chain. In practical pipelines (e.g., HiSS in fact-checking (Zhang et al., 2023) or PC-SubQ for causal reasoning (Sgouritsa et al., 2024)), this alternation is implemented via sequential prompting, careful management of demonstration selection, and explicit control tokens to signal the decomposition or answering operations.
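The QD/QA alternation can be sketched as a loop. Here `toy_lm` is a deterministic stub standing in for a real language-model call, and the flat prompt layout is a simplification of the actual SUBQ prompts:

```python
def successive_prompting(lm, passage, question, max_hops=4):
    """Alternate QD and QA calls until the model emits the EOQ token.

    `lm` is any callable mapping a prompt string to a completion.
    """
    chain = []  # accumulated (subquestion, answer) pairs
    for _ in range(max_hops):
        # QD step: ask the model for the next simple subquestion.
        subq = lm(f"QD: {passage} | {question} | {chain}")
        if subq == "EOQ":  # termination signal
            break
        # QA step: answer the subquestion in isolation; only the answer
        # (no rationale) is propagated back into the chain.
        ans = lm(f"QA: {passage} | {subq}")
        chain.append((subq, ans))
    return (chain[-1][1] if chain else None), chain

def toy_lm(prompt):
    """Deterministic stub that walks a fixed two-hop example."""
    if prompt.startswith("QD:"):
        if "Nolan" not in prompt:
            return "Who directed Inception?"
        if "1970" not in prompt:
            return "When was the director born?"
        return "EOQ"
    return "Christopher Nolan" if "directed" in prompt else "1970"
```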
2.2 Hierarchical and Algorithmic Decomposition
Hierarchical step-by-step (HiSS) methods extend SUBQ logic by organizing subquestion generation into explicit hierarchies, as in multi-granularity fact-checking (claim → subclaims → subquestions) (Zhang et al., 2023). For algorithmic domains (e.g., causal discovery), the process is mapped to the steps of a formal algorithm (e.g., the PC algorithm), where each logical step becomes one subquestion in a fixed prompt chain (PC-SubQ) (Sgouritsa et al., 2024). This deep structuring enforces reproducible, interpretable subquestion chains and enables robust performance on algorithmically constrained reasoning.
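A minimal sketch of an algorithm-mapped chain in the PC-SubQ style, assuming a fixed list of step templates whose wording is illustrative (the actual PC-SubQ prompts differ):

```python
# Each entry mirrors one logical step of the PC algorithm; the answer to
# each step is fed forward into the next template.
STEPS = [
    "Which variable pairs in '{premise}' are stated as correlated?",
    "Given correlations {prev}, which edges survive independence tests?",
    "Given skeleton {prev}, how are the remaining edges oriented?",
]

def run_fixed_chain(lm, premise):
    """Walk the fixed subquestion chain, feeding each answer forward."""
    prev = ""
    for template in STEPS:
        prev = lm(template.format(premise=premise, prev=prev))
    return prev  # answer to the last step = final verdict

# Deterministic stub standing in for a real LM call.
calls = []
def stub_lm(prompt):
    calls.append(prompt)
    return f"step-{len(calls)}-answer"
```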
3. Data Indexing and Example Selection
Modern two-hop QA frameworks rely on vector-based retrieval over demonstration indices to select effective few-shot prompts:
- FAISS-style indices are maintained for both decomposition and answering tasks: $I_{\mathrm{QD}}$ stores (embedding of $[Q, \text{partial chain}]$ → next $q_i$) pairs; $I_{\mathrm{QA}}$ stores (embedding of simple $q$ → $a$) pairs (Dua et al., 2022).
- At each step, the top-$k$ nearest neighbors are retrieved and concatenated into the prompt with task-specific control tokens, aligning the LM with the current reasoning step.
- Example complexity is dynamically balanced to avoid context-window overflows and confusion, typically constraining decomposition to 2–4 partial steps and QA demonstrations to single operations over small item sets.
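A toy sketch of demonstration retrieval, substituting a character-trigram bag embedding and brute-force cosine search for a real sentence encoder plus FAISS index (the index contents are invented for illustration):

```python
import numpy as np

def embed(text, dim=64):
    """Toy deterministic bag-of-character-trigrams embedding; a real
    pipeline would use a sentence encoder with a FAISS index on top."""
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[sum(map(ord, text[i:i + 3])) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query, index, k=2):
    """Return the k demonstrations nearest to the query embedding."""
    q = embed(query)
    scored = [(float(q @ embed(key)), demo) for key, demo in index]
    scored.sort(key=lambda s: -s[0])  # highest cosine similarity first
    return [demo for _, demo in scored[:k]]

# Hypothetical QA demonstration index: (key text, formatted exemplar).
qa_index = [
    ("How many field goals were kicked?", "Q: How many field goals were kicked? A: 3"),
    ("What is the capital of France?", "Q: What is the capital of France? A: Paris"),
    ("Who threw the longest pass?", "Q: Who threw the longest pass? A: Brady"),
]
```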
4. Model Training, Fine-Tuning, and Synthetic Data
The decoupling of decomposition from answering allows for independent module training:
- QD and QA are trained as separate functions: $\big(p, Q, (q_1, a_1), \dots, (q_{i-1}, a_{i-1})\big) \mapsto q_i$ and $(p, q_i) \mapsto a_i$, respectively. End-to-end $(p, Q, a)$ triples are unnecessary.
- Synthetic datasets are generated by defining atomic operations (e.g., COUNT, SUM, DIFF, FILTER) and composing them into higher-order multi-hop questions; this supports scalable bootstrapping of both QD and QA modules and injection of symbolic or specialized heads for complex subquestion types (Dua et al., 2022).
- Contrastive estimation and dynamic re-sampling ensure balanced and effective fine-tuning, boosting F1 on end-task datasets.
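A minimal sketch of synthetic composition from atomic operations (SUM, COUNT, DIFF); the item table and question templates are invented for illustration:

```python
import random

# Toy item table; composing two atomic operations over it yields a
# synthetic two-hop question with a gold subquestion chain attached.
ITEMS = {"touchdown drives": [3, 7, 12], "field goal attempts": [21, 45]}

def make_two_hop(rng):
    """Compose SUM and CONT into one two-hop question, then DIFF them."""
    a, b = rng.sample(sorted(ITEMS), 2)
    q1, a1 = f"What is the sum of the {a}?", sum(ITEMS[a])   # hop 1: SUM
    q2, a2 = f"How many {b} are there?", len(ITEMS[b])       # hop 2: COUNT
    return {
        "question": f"What is the sum of the {a} minus the number of {b}?",
        "chain": [(q1, a1), (q2, a2)],
        "answer": a1 - a2,                                   # final: DIFF
    }
```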
5. Performance, Evaluation, and Comparative Results
Two-hop QA methods are evaluated using macro-averaged metrics such as precision, recall, and F1 over standard benchmarks (DROP, Corr2Cause, RAWFC, LIAR):
| Setting (Model/Prompt) | F1 (DROP, two-hop) | F1 (Corr2Cause, causal) | F1 (RAWFC, fact-check) |
|---|---|---|---|
| Baseline (per dataset, as noted) | 24.9 (few-shot, no rationale) (Dua et al., 2022) | 0.30 (few-shot + CoT) (Sgouritsa et al., 2024) | 52.0 (CofCED) (Zhang et al., 2023) |
| Chain-of-Thought (CoT) | 27.6 | — | — |
| SUBQ w/ symbolic calc | 31.9 | — | — |
| SUBQ + fine-tuned modules | 51.3 | — | — |
| PC-SubQ (alg. decomposition) | — | 0.64 (PaLM 2 L) | — |
| HiSS (hierarchical SUBQ) | — | — | 53.9 |
These results show that explicit subquestion-based (SUBQ) methods, including successive prompting, hierarchical decomposition (HiSS), and algorithm-mapped PC-SubQ, outperform baseline chain-of-thought and standard prompting methods, often by sizable F1 margins. In PC-SubQ, robustness to variable refactoring and query paraphrasing is demonstrated (Sgouritsa et al., 2024), and in HiSS, hallucination and omission rates are reduced to 5% and 13%, respectively (Zhang et al., 2023).
6. Implementation Best Practices and Interpretability
Key principles for robust two-hop QA implementations include:
- Explicit control tokens (e.g., "QD:", "QA:", "EOQ") to delimit decomposition and answering stages.
- Separation of modules: symbolic QA heads or algorithmic solvers can be seamlessly injected for subquestions unsuited for free-form language modeling (COUNT, SUM, DIFF, logic).
- Iterative, modular chaining: the subquestion-answer record is fully auditable for transparency, error diagnosis, and downstream interpretability (Zhang et al., 2023).
- Search augmentation: information retrieval is dynamically invoked when model confidence is low on a subquestion (HiSS paradigm).
- Chain-of-thought isolation: only propagate answers (not rationales) to conserve context space and minimize error cascade.
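Given those conventions, the auditable subquestion-answer record can be recovered from a transcript with a small parser (a sketch assuming the token spellings listed above):

```python
def parse_transcript(lines):
    """Recover the (subquestion, answer) record from a control-token
    transcript: 'QD: ' opens a hop, 'QA: ' closes it, 'EOQ' terminates."""
    chain, pending = [], None
    for line in lines:
        if line.startswith("QD: "):
            pending = line[4:]
        elif line.startswith("QA: ") and pending is not None:
            chain.append((pending, line[4:]))
            pending = None
        elif line.strip() == "EOQ":
            break
    return chain
```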
A plausible implication is that these modular, explicit strategies are likely necessary as models are scaled to harder, less supervised, or less factoid-centric two-hop (and multi-hop) reasoning benchmarks.
7. Applications and Extensions
Two-hop QA underpins:
- Open-domain QA with compositional queries (e.g., HotpotQA)
- Causal reasoning conforming to algorithmic protocols (e.g., PC-SubQ and Corr2Cause (Sgouritsa et al., 2024))
- Fact verification requiring structured decomposition and evidence chaining (e.g., HiSS and LIAR/RAWFC (Zhang et al., 2023))
- Scientific, legal, and medical question answering, where veracity and multi-granular evidence synthesis are critical
Extensions encompass deeper multi-hop settings, integration in real-world fact-checking pipelines, and generalization to domains requiring explicit, auditable reasoning chains.
References:
- Successive Prompting for Decomposing Complex Questions (Dua et al., 2022)
- Towards LLM-based Fact Verification on News Claims with a Hierarchical Step-by-Step Prompting Method (Zhang et al., 2023)
- Prompting Strategies for Enabling LLMs to Infer Causation from Correlation (Sgouritsa et al., 2024)