Two-Hop Question Answering
- Two-hop QA is a complex reasoning paradigm that decomposes a query into two sequential subquestions, enabling enhanced compositional reasoning and interpretability.
- Successive prompting, hierarchical decomposition, and algorithmic strategies drive performance improvements, with methods like SUBQ and PC-SubQ significantly boosting F1 scores.
- Modular architectures and independent module training, combined with synthetic data generation, support robust fact verification and causal reasoning across various domains.
Two-hop question answering (QA) characterizes a form of complex reasoning wherein an answer is produced only after sequentially addressing at least two dependent subquestions over a structured or unstructured context. Unlike single-hop QA, where a single evidence fragment suffices, two-hop QA necessitates intermediate reasoning steps, each answer potentially informing the next subquery. This paradigm is gaining importance due to its alignment with natural human inquiry and its requirements for compositional reasoning, factual verification, and interpretability in LLM outputs.
1. Formalization and Notation
Two-hop QA can be described as a special case of complex multi-step reasoning. Formally, for a given context passage $p$ and a complex question $Q$, the process decomposes $Q$ into a sequence of simple subquestions $q_1, \dots, q_k$. Each subquestion $q_i$ yields an answer $a_i$, so the interaction chain is $(q_1, a_1), (q_2, a_2), \dots, (q_k, a_k)$, and the final answer is returned once the final subquestion is answered or a termination signal (e.g., "EOQ") is emitted (Dua et al., 2022).
For two-hop QA specifically, $k = 2$ and the interaction chain is $(q_1, a_1), (q_2, a_2)$, with $a_2$ serving as the final answer. This framework generalizes to $k$-hop reasoning, but two-hop is foundational in compositional QA benchmarks and prompting strategies.
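The chain structure above can be made concrete with a minimal sketch (the `Hop` type and the example chain are illustrative toy data, not from Dua et al., 2022):

```python
from dataclasses import dataclass

@dataclass
class Hop:
    """One (subquestion, answer) step in the interaction chain."""
    question: str
    answer: str

def final_answer(chain: list) -> str:
    """The answer to the complex question is the answer of the last hop."""
    assert chain, "a chain needs at least one resolved subquestion"
    return chain[-1].answer

# A two-hop chain (k = 2): the first answer feeds the second subquestion.
chain = [
    Hop("Who directed Inception?", "Christopher Nolan"),
    Hop("When was Christopher Nolan born?", "1970"),
]
```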
2. Methodological Approaches
2.1 Successive Prompting ("SUBQ")
The Successive Prompting (SUBQ) paradigm iteratively decomposes $Q$ into subquestions using an LM, alternating between question decomposition (QD) and question-answering (QA) steps. At each iteration $i$:
- QD step: $q_i = \mathrm{LM}\big(D_{\mathrm{QD}}, p, Q, (q_1, a_1), \dots, (q_{i-1}, a_{i-1})\big)$,
where $D_{\mathrm{QD}}$ is a set of in-context exemplars for decomposition.
- QA step: $a_i = \mathrm{LM}\big(D_{\mathrm{QA}}, p, q_i\big)$,
where $D_{\mathrm{QA}}$ is a set of QA demonstrations (Dua et al., 2022).
This alternation continues until the model completes the reasoning chain. In practical pipelines (e.g., HiSS in fact-checking (Zhang et al., 2023) or PC-SubQ for causal reasoning (Sgouritsa et al., 2024)), this alternation is implemented via sequential prompting, careful management of demonstration selection, and explicit control tokens to signal the decomposition or answering operations.
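The QD/QA alternation can be sketched as a loop. Here `toy_lm` is a deterministic stub standing in for a real language-model call, and the flat prompt layout is a simplification of the actual SUBQ prompts:

```python
def successive_prompting(lm, passage, question, max_hops=4):
    """Alternate QD and QA calls until the model emits the EOQ token.

    `lm` is any callable mapping a prompt string to a completion.
    """
    chain = []  # accumulated (subquestion, answer) pairs
    for _ in range(max_hops):
        # QD step: ask the model for the next simple subquestion.
        subq = lm(f"QD: {passage} | {question} | {chain}")
        if subq == "EOQ":  # termination signal
            break
        # QA step: answer the subquestion in isolation; only the answer
        # (no rationale) is propagated back into the chain.
        ans = lm(f"QA: {passage} | {subq}")
        chain.append((subq, ans))
    return (chain[-1][1] if chain else None), chain

def toy_lm(prompt):
    """Deterministic stub that walks a fixed two-hop example."""
    if prompt.startswith("QD:"):
        if "Nolan" not in prompt:
            return "Who directed Inception?"
        if "1970" not in prompt:
            return "When was the director born?"
        return "EOQ"
    return "Christopher Nolan" if "directed" in prompt else "1970"
```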
2.2 Hierarchical and Algorithmic Decomposition
Hierarchical step-by-step (HiSS) methods extend SUBQ logic by organizing subquestion generation into explicit hierarchies, as in multi-granularity fact-checking (claim → subclaims → subquestions) (Zhang et al., 2023). For algorithmic domains (e.g., causal discovery), the process is mapped to the steps of a formal algorithm (e.g., the PC algorithm), where each logical step becomes one subquestion in a fixed prompt chain (PC-SubQ) (Sgouritsa et al., 2024). This deep structuring enforces reproducible, interpretable subquestion chains and enables robust performance on algorithmically constrained reasoning.
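A minimal sketch of an algorithm-mapped chain in the PC-SubQ style, assuming a fixed list of step templates whose wording is illustrative (the actual PC-SubQ prompts differ):

```python
# Each entry mirrors one logical step of the PC algorithm; the answer to
# each step is fed forward into the next template.
STEPS = [
    "Which variable pairs in '{premise}' are stated as correlated?",
    "Given correlations {prev}, which edges survive independence tests?",
    "Given skeleton {prev}, how are the remaining edges oriented?",
]

def run_fixed_chain(lm, premise):
    """Walk the fixed subquestion chain, feeding each answer forward."""
    prev = ""
    for template in STEPS:
        prev = lm(template.format(premise=premise, prev=prev))
    return prev  # answer to the last step = final verdict

# Deterministic stub standing in for a real LM call.
calls = []
def stub_lm(prompt):
    calls.append(prompt)
    return f"step-{len(calls)}-answer"
```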
3. Data Indexing and Example Selection
Modern two-hop QA frameworks rely on vector-based retrieval over demonstration indices to select effective few-shot prompts:
- FAISS-style indices are maintained for both decomposition and answering tasks: $I_{\mathrm{QD}}$ stores (embedding of $[Q, \text{partial chain}]$ → next $q_i$) pairs; $I_{\mathrm{QA}}$ stores (embedding of simple $q$ → $a$) pairs (Dua et al., 2022).
- At each step, the top-$k$ nearest neighbors are retrieved and concatenated into the prompt with task-specific control tokens, aligning the LM with the current reasoning step.
- Example complexity is dynamically balanced to avoid context-window overflows and confusion, typically constraining decomposition to 2–4 partial steps and QA demonstrations to single operations over small item sets.
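A toy sketch of demonstration retrieval, substituting a character-trigram bag embedding and brute-force cosine search for a real sentence encoder plus FAISS index (the index contents are invented for illustration):

```python
import numpy as np

def embed(text, dim=64):
    """Toy deterministic bag-of-character-trigrams embedding; a real
    pipeline would use a sentence encoder with a FAISS index on top."""
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[sum(map(ord, text[i:i + 3])) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query, index, k=2):
    """Return the k demonstrations nearest to the query embedding."""
    q = embed(query)
    scored = [(float(q @ embed(key)), demo) for key, demo in index]
    scored.sort(key=lambda s: -s[0])  # highest cosine similarity first
    return [demo for _, demo in scored[:k]]

# Hypothetical QA demonstration index: (key text, formatted exemplar).
qa_index = [
    ("How many field goals were kicked?", "Q: How many field goals were kicked? A: 3"),
    ("What is the capital of France?", "Q: What is the capital of France? A: Paris"),
    ("Who threw the longest pass?", "Q: Who threw the longest pass? A: Brady"),
]
```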
4. Model Training, Fine-Tuning, and Synthetic Data
The decoupling of decomposition from answering allows for independent module training:
- QD and QA are trained as separate functions: $\big(p, Q, (q_1, a_1), \dots, (q_{i-1}, a_{i-1})\big) \mapsto q_i$ and $(p, q_i) \mapsto a_i$, respectively. End-to-end $(p, Q, a)$ triples are unnecessary.
- Synthetic datasets are generated by defining atomic operations (e.g., COUNT, SUM, DIFF, FILTER) and composing them into higher-order multi-hop questions; this supports scalable bootstrapping of both QD and QA modules and injection of symbolic or specialized heads for complex subquestion types (Dua et al., 2022).
- Contrastive estimation and dynamic re-sampling ensure balanced and effective fine-tuning, boosting F1 on end-task datasets.
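A minimal sketch of synthetic composition from atomic operations (SUM, COUNT, DIFF); the item table and question templates are invented for illustration:

```python
import random

# Toy item table; composing two atomic operations over it yields a
# synthetic two-hop question with a gold subquestion chain attached.
ITEMS = {"touchdown drives": [3, 7, 12], "field goal attempts": [21, 45]}

def make_two_hop(rng):
    """Compose SUM and CONT into one two-hop question, then DIFF them."""
    a, b = rng.sample(sorted(ITEMS), 2)
    q1, a1 = f"What is the sum of the {a}?", sum(ITEMS[a])   # hop 1: SUM
    q2, a2 = f"How many {b} are there?", len(ITEMS[b])       # hop 2: COUNT
    return {
        "question": f"What is the sum of the {a} minus the number of {b}?",
        "chain": [(q1, a1), (q2, a2)],
        "answer": a1 - a2,                                   # final: DIFF
    }
```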
5. Performance, Evaluation, and Comparative Results
Two-hop QA methods are evaluated using macro-averaged metrics such as precision, recall, and F1 over standard benchmarks (DROP, Corr2Cause, RAWFC, LIAR):
| Setting (Model/Prompt) | F1 (DROP, two-hop) | F1 (Corr2Cause, causal) | F1 (RAWFC, fact-check) |
|---|---|---|---|
| Baseline (per dataset, as noted) | 24.9 (few-shot, no rationale) (Dua et al., 2022) | 0.30 (few-shot + CoT) (Sgouritsa et al., 2024) | 52.0 (CofCED) (Zhang et al., 2023) |
| Chain-of-Thought (CoT) | 27.6 | — | — |
| SUBQ w/ symbolic calc | 31.9 | — | — |
| SUBQ + fine-tuned modules | 51.3 | — | — |
| PC-SubQ (alg. decomposition) | — | 0.64 (PaLM 2 L) | — |
| HiSS (hierarchical SUBQ) | — | — | 53.9 |
These results show that explicit subquestion-based (SUBQ) methods, including successive prompting, hierarchical decomposition (HiSS), and algorithm-mapped PC-SubQ, outperform baseline chain-of-thought and standard prompting methods, often by sizable F1 margins. In PC-SubQ, robustness to variable refactoring and query paraphrasing is demonstrated (Sgouritsa et al., 2024), and in HiSS, hallucination and omission rates are reduced to 5% and 13%, respectively (Zhang et al., 2023).
6. Implementation Best Practices and Interpretability
Key principles for robust two-hop QA implementations include:
- Explicit control tokens (e.g., "QD:", "QA:", "EOQ") to delimit decomposition and answering stages.
- Separation of modules: symbolic QA heads or algorithmic solvers can be seamlessly injected for subquestions unsuited for free-form language modeling (COUNT, SUM, DIFF, logic).
- Iterative, modular chaining: the subquestion-answer record is fully auditable for transparency, error diagnosis, and downstream interpretability (Zhang et al., 2023).
- Search augmentation: information retrieval is dynamically invoked when model confidence is low on a subquestion (HiSS paradigm).
- Chain-of-thought isolation: only propagate answers (not rationales) to conserve context space and minimize error cascade.
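Given those conventions, the auditable subquestion-answer record can be recovered from a transcript with a small parser (a sketch assuming the token spellings listed above):

```python
def parse_transcript(lines):
    """Recover the (subquestion, answer) record from a control-token
    transcript: 'QD: ' opens a hop, 'QA: ' closes it, 'EOQ' terminates."""
    chain, pending = [], None
    for line in lines:
        if line.startswith("QD: "):
            pending = line[4:]
        elif line.startswith("QA: ") and pending is not None:
            chain.append((pending, line[4:]))
            pending = None
        elif line.strip() == "EOQ":
            break
    return chain
```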
A plausible implication is that these modular, explicit strategies are likely necessary as models are scaled to harder, less supervised, or less factoid-centric two-hop (and multi-hop) reasoning benchmarks.
7. Applications and Extensions
Two-hop QA underpins:
- Open-domain QA with compositional queries (e.g., HotpotQA)
- Causal reasoning conforming to algorithmic protocols (e.g., PC-SubQ and Corr2Cause (Sgouritsa et al., 2024))
- Fact verification requiring structured decomposition and evidence chaining (e.g., HiSS and LIAR/RAWFC (Zhang et al., 2023))
- Scientific, legal, and medical question answering, where veracity and multi-granular evidence synthesis are critical
Extensions encompass deeper multi-hop settings, integration in real-world fact-checking pipelines, and generalization to domains requiring explicit, auditable reasoning chains.
References:
- Successive Prompting for Decomposing Complex Questions (Dua et al., 2022)
- Towards LLM-based Fact Verification on News Claims with a Hierarchical Step-by-Step Prompting Method (Zhang et al., 2023)
- Prompting Strategies for Enabling LLMs to Infer Causation from Correlation (Sgouritsa et al., 2024)