MuSiQue Dataset Overview
- MuSiQue is a multihop question answering dataset that enforces strict connected reasoning using directed acyclic graphs to chain query dependencies.
- It employs a bottom-up construction pipeline with rigorous filtering and hard distractor retrieval, ensuring questions require genuine evidence aggregation.
- The dataset's two variants, MuSiQue-Ans and MuSiQue-Full, challenge models with answer extraction and context sufficiency detection, raising empirical difficulty.
MuSiQue is a multihop question answering (QA) dataset expressly constructed to require connected, compositional reasoning over multiple textual passages. Distinct from prior multihop QA benchmarks often susceptible to shortcut exploitation, MuSiQue employs a strict formal condition on its question generation pipeline to ensure that every reasoning hop in a multihop question cannot be bypassed or "cheated," thereby compelling genuine aggregation of evidence from multiple contexts. The dataset exists in two principal variants: MuSiQue-Ans, comprising approximately 25,000 answerable 2–4 hop questions with paragraph-level support annotations, and MuSiQue-Full, a contrastive extension with both answerable and minimally altered unanswerable counterparts totaling roughly 50,000 questions. The design, construction, and evaluation frameworks of MuSiQue establish new standards for empirical difficulty and shortcut-resistance in connected multihop QA (Trivedi et al., 2021).
1. Formalism: Reasoning Graphs and the MuSiQue Condition
Multihop reading comprehension in MuSiQue is formally modeled as a directed acyclic graph (DAG) associated with each n-hop question Q. Each node v_i corresponds to a single-hop question q_i with gold answer a_i, while each directed edge (v_i → v_j) encodes that q_j critically depends on the output a_i of q_i. Two single-hop question–answer pairs (q1, a1) and (q2, a2) are composable into a 2-hop question if and only if:
- a1 is a named entity,
- q2 mentions a1 (requiring transfer of knowledge),
- the source paragraphs of q1 and q2 are distinct.
This compositional approach generalizes to n-hop structures by chaining such pairs into larger DAGs (chains, trees, etc.).
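The three composability checks above can be sketched as a small predicate. This is a hypothetical sketch: `is_named_entity` stands in for the paper's NER/Wikification machinery, and plain substring membership stands in for mention detection.

```python
def is_named_entity(text, entities):
    # Assumption: entity recognition reduced to membership in a known set;
    # the actual pipeline uses SpaCy NER and a Wikification model.
    return text in entities

def composable(q1, a1, p1, q2, a2, p2, entities):
    """(q1, a1) and (q2, a2) compose into a 2-hop question iff:
    1. a1 is a named entity,
    2. q2 mentions a1 (the bridge entity), and
    3. the two source paragraphs are distinct."""
    return (
        is_named_entity(a1, entities)
        and a1 in q2
        and p1 != p2
    )

# Toy example composing "Where was the director of Inception born?"
ok = composable(
    "Who directed Inception?", "Christopher Nolan", "para_inception",
    "Where was Christopher Nolan born?", "London", "para_nolan",
    entities={"Christopher Nolan", "London"},
)
```

Condition 3 is what forces evidence aggregation across paragraphs: reusing the same paragraph would let a single-passage reader answer both hops.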
The core of MuSiQue is the MuSiQue condition, which enforces connected reasoning:
- For every edge (q_i → q_j), a strong pretrained QA model must not be able to answer q_j without access to a_i (the output of q_i); that is, the model must fail on q_j with all mentions of a_i masked.
- For each node q_i, QA_model(q_i, ∅) ≠ a_i, where ∅ is the empty context.
Equation (2) in the source formalizes these constraints to block both contextual and context-independent shortcuts.
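A plausible reconstruction of these constraints in display form (notation assumed here, not quoted from the source: M is the strong QA model, C_j the context supplied with q_j, and q_j \ a_i the tail question with all mentions of the bridge answer masked):

```latex
\forall\, (q_i \to q_j) \in E:\quad
  M\!\left(q_j \setminus a_i,\; C_j\right) \neq a_j
\quad\text{and}\quad
\forall\, v_i \in V:\quad
  M\!\left(q_i,\; \emptyset\right) \neq a_i
```

The first conjunct blocks contextual shortcuts (answering a hop without its bridge), and the second blocks context-independent shortcuts (answering from the question text alone).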
2. Bottom-Up Dataset Construction Pipeline
The MuSiQue dataset is constructed via a rigorously staged, bottom-up pipeline emphasizing connectedness and difficulty:
- Single-Hop Sanity Filtering: The pipeline starts from approximately 500,000 question–paragraph–answer triples aggregated from SQuAD, Natural Questions, MLQA, T-REx, and Zero-Shot RE. Question–answer pairs with annotation errors, weak QA-model agreement (F1 of 0), or inappropriate context length (<20 or >300 words) are excluded.
- Composable Pair Identification: Distinct pairs (q1, a1, p1) and (q2, a2, p2) are filtered such that a1 is a named entity appearing in q2, answer leakage between the two pairs is excluded, and p1 ≠ p2. Entity coreference relies on SpaCy NER tags, Wikipedia first-hit alignment, and a Wikification model.
- Disconnection Filtering: Using a Longformer QA model under 5-fold cross-validation, pairs are discarded if head questions are solvable in the absence of context (F1 ≥ 0.5), or if tail questions, with the bridge answer masked, are answerable with F1 > 0.25 or with perfect paragraph support.
- Building n-Hop DAGs: Valid 2-hop pairs serve as building blocks for assembling up to 4-hop DAGs across six canonical graph shapes under token and occurrence constraints (max 15 tokens for 2–3 hops, 20 tokens for 4 hops; limits on question/entity reuse).
- Train–Test Leakage Minimization: Overlap (by question, answer, or paragraph) is minimized via greedy selection of the training set, followed by a further split into development/test partitions while maintaining hop diversity.
- Hard Distractor Retrieval: Each n-hop question's context incorporates the gold paragraphs and 20 'positive distractors,' chosen by BM25 retrieval with bridge entities masked. Restricting distractor sources to the curated set (vs. open Wikipedia) compounds the difficulty of distinguishing evidence.
- Crowdsourced Question Composition: Annotators are shown each assembled DAG, corresponding subquestions, paragraphs, and entity bridges, and tasked to rewrite these into single coherent n-hop questions. Seventeen annotators are selected through a qualification round.
- Unanswerable Contrast Generation: For MuSiQue-Full, each answerable question is paired with an unanswerable version created by removing the paragraph containing one subquestion’s answer, enforcing the need to recognize context sufficiency.
The result of these phases is a dataset explicitly crafted to demand and measure connected, multihop reasoning.
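As an illustration of the hard-distractor retrieval step (not the authors' code), the sketch below masks the bridge entity from the question and ranks candidate paragraphs with a minimal BM25 implementation. The `k1` and `b` values are common defaults and the tokenizer is deliberately simplistic.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude whitespace/word tokenizer; the real pipeline is more careful.
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with Okapi BM25."""
    docs = [tokenize(d) for d in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    q_tokens = tokenize(query)
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in q_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def retrieve_distractors(question, bridge_entity, corpus, k=2):
    # Mask the bridge entity first, so distractors are lexically close to
    # the question *without* containing the connecting evidence.
    masked = question.replace(bridge_entity, "")
    scores = bm25_scores(masked, corpus)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Masking before retrieval is the key design choice: it yields distractors that superficially match the question's surface form, so models cannot identify gold paragraphs by lexical overlap alone.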
3. Dataset Variants: MuSiQue-Ans and MuSiQue-Full
MuSiQue exists in two principal forms:
- MuSiQue-Ans: Approximately 25,000 answerable multihop questions, each annotated with the supporting paragraphs at the hop level.
- MuSiQue-Full: Each question in MuSiQue-Ans is paired with an unanswerable variant created by randomly excising a supporting context for a given hop. Models are required to (i) determine answerability (the ‘sufficiency’ label S) and (ii) produce the correct answer and paragraph supports when sufficient information exists.
Contrast pairs in MuSiQue-Full differ only by the removal of minimal context and thus directly assess a model’s ability to distinguish sufficient vs. insufficient information and to avoid overfitting to spurious or artifact-based cues.
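A minimal sketch of this contrast generation, assuming a simple dict schema (`question`, `paragraphs`, and `hop_answers` are illustrative field names, not the released dataset format):

```python
import random

def make_unanswerable(example, rng=random):
    """Produce the unanswerable contrast pair for one MuSiQue-Ans example
    by deleting the paragraph(s) containing one randomly chosen hop's answer."""
    hop_answer = rng.choice(example["hop_answers"])
    reduced = [p for p in example["paragraphs"] if hop_answer not in p]
    return {
        "question": example["question"],     # question text is unchanged
        "paragraphs": reduced,               # minimally altered context
        "answerable": False,                 # the sufficiency label S
    }
```

Because only the context changes, any model that answers both members of a contrast pair identically must be ignoring evidence sufficiency, which is exactly the failure mode MuSiQue-Full is built to expose.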
4. Enforcing Connected Reasoning and Pseudocode
The enforcement of the MuSiQue condition is central to dataset integrity. The implementation, as outlined for the 2-hop case and generalizable to n-hop graphs, is as follows:
```
for each candidate 2-hop pair (q1, a1, p1) → (q2, a2, p2):
    # 1) head dependency: ensure q1 not solvable without context
    if QA_model.predict(q1, context=∅) == a1:
        continue  # discard disconnected head
    # 2) tail dependency: mask bridge answer a1 from q2
    q2_masked = mask_mentions(q2, a1)
    C2 = p1 + p2 + distractors(q2_masked)
    if QA_model.predict(q2_masked, context=C2) == a2:
        continue  # discard disconnected tail
    keep_edge((q1 → q2, a2))
```
This approach is applied within a 5-fold cross-validation regime, ensuring each candidate composition is genuinely connected: no hop within a DAG is answerable without the requisite input from its predecessors or without context altogether. This constraint is rechecked at each level of the compositional pipeline.
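The cross-validation regime can be sketched as follows. Here `train` and `score` are placeholders for Longformer fine-tuning and F1 scoring, and the 0.25 threshold mirrors the tail-question cutoff described above; the fold-splitting logic is the point of the sketch.

```python
def kfold_filter(candidates, train, score, k=5, threshold=0.25):
    """Score each candidate with a model trained on the other k-1 folds,
    so the shortcut detector never evaluates data it was trained on."""
    folds = [candidates[i::k] for i in range(k)]   # round-robin partition
    kept = []
    for i, held_out in enumerate(folds):
        train_set = [c for j, f in enumerate(folds) if j != i for c in f]
        model = train(train_set)
        # Keep only candidates the "cheating" model fails on:
        # a low shortcut F1 means the composition is genuinely connected.
        kept.extend(c for c in held_out if score(model, c) <= threshold)
    return kept
```

Training the filter out-of-fold matters: a model scored on its own training questions would memorize them and over-aggressively flag connected compositions as cheatable.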
5. Empirical Difficulty and Comparative Analysis
MuSiQue sets a new bar for multihop QA difficulty as evidenced by substantial human–machine performance gaps:
- MuSiQue-Ans: On 125 held-out questions, human annotators achieve answer F1 = 78.0 (upper bound: 88.6), support F1 = 93.9 (upper bound: 97.3). The best step-execution model EX(SA) records answer F1 = 49.8, support F1 = 79.2, with single-paragraph baselines dropping to 32.0 F1—showing that isolated contexts lack sufficient information.
- MuSiQue-Full: The same model’s joint answer+sufficiency (An+Sf) score is 32.2 and support+sufficiency (Sp+Sf) 44.3, whereas humans approach An+Sf ≈ 78 and Sp+Sf ≈ 94.
Comparatively, similar tests on 20,000 samples from HotpotQA and 2WikiMultihopQA reveal human–model gaps of 10 points or less, with artifact-only models often exceeding 60 F1. MuSiQue’s DiRe (disconnected-reasoning) analysis demonstrates further resistance to shortcut exploitation: answer F1 plunges from ≈68 on HotpotQA to 37.8 on MuSiQue-Ans.
Empirical findings indicate that MuSiQue’s questions are substantially less amenable to strategies that exploit dataset artifacts or disconnected reasoning, marking it as a stringent challenge benchmark.
6. Properties, Impact, and Research Implications
MuSiQue’s design enforces that multihop QA models are truly compositional and context-integrating, as opposed to relying on spurious correlations or isolated fact retrieval. The formal graph-based composition and bottom-up curation ensure fine-grained control over hop and entity variety, support for up to four reasoning hops, and diverse graph shapes. The explicit contrastive design in MuSiQue-Full mandates upward progress in model capability not only in answer extraction but also context sufficiency detection.
A plausible implication is that progress on MuSiQue is likely to correlate with advances in compositional reasoning, explicit evidence aggregation, and robustness to distractor contexts—core goals in machine reading comprehension and complex QA research. The dataset is positioned to guide the development of architectures and training protocols that emphasize stepwise, explainable reasoning and resist shortcut exploitation (Trivedi et al., 2021).