MuSiQue Dataset Overview

Updated 17 January 2026
  • MuSiQue is a multihop question answering dataset that enforces strict connected reasoning using directed acyclic graphs to chain query dependencies.
  • It employs a bottom-up construction pipeline with rigorous filtering and hard distractor retrieval, ensuring questions require genuine evidence aggregation.
  • The dataset's two variants, MuSiQue-Ans and MuSiQue-Full, challenge models with answer extraction and context sufficiency detection, raising empirical difficulty.

MuSiQue is a multihop question answering (QA) dataset expressly constructed to require connected, compositional reasoning over multiple textual passages. Distinct from prior multihop QA benchmarks often susceptible to shortcut exploitation, MuSiQue employs a strict formal condition on its question generation pipeline to ensure that every reasoning hop in a multihop question cannot be bypassed or "cheated," thereby compelling genuine aggregation of evidence from multiple contexts. The dataset exists in two principal variants: MuSiQue-Ans, comprising approximately 25,000 answerable 2–4 hop questions with paragraph-level support annotations, and MuSiQue-Full, a contrastive extension with both answerable and minimally altered unanswerable counterparts totaling roughly 50,000 questions. The design, construction, and evaluation frameworks of MuSiQue establish new standards for empirical difficulty and shortcut-resistance in connected multihop QA (Trivedi et al., 2021).

1. Formalism: Reasoning Graphs and the MuSiQue Condition

Multihop reading comprehension in MuSiQue is formally modeled as a directed acyclic graph (DAG) $G_n = (V, E)$ associated with each $n$-hop question $Q$. Each node $q_i \in V$ corresponds to a single-hop question with gold answer $a_i$, while each directed edge $(q_j \to q_i) \in E$ encodes that $q_i$ critically depends on the output $a_j$ of $q_j$. Two single-hop question–answer pairs $(q_1, a_1)$ and $(q_2, a_2)$ are composable into a 2-hop question if and only if:

  • $a_1$ is a named entity,
  • $q_2$ mentions $a_1$ (so answering $q_2$ requires knowing $a_1$),
  • the source paragraphs of $q_1$ and $q_2$ are distinct.

This compositional approach generalizes to $k$-hop structures by chaining such pairs into larger DAGs (chains, trees, etc.).
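The composability test for a 2-hop pair can be sketched in a few lines. `SingleHopQA` and `is_named_entity` are illustrative stand-ins, not the pipeline's actual code; the real system resolves entities with SpaCy NER and Wikification rather than the crude proxy used here.

```python
from dataclasses import dataclass

@dataclass
class SingleHopQA:
    question: str
    answer: str
    paragraph_id: str

def is_named_entity(text: str) -> bool:
    """Placeholder check; the real pipeline uses SpaCy NER plus Wikification."""
    return text.istitle()  # crude proxy: title-cased span

def composable(head: SingleHopQA, tail: SingleHopQA) -> bool:
    """True if (head, tail) can form a connected 2-hop question."""
    return (
        is_named_entity(head.answer)                 # a1 is a named entity
        and head.answer in tail.question             # q2 mentions a1
        and head.paragraph_id != tail.paragraph_id   # distinct source paragraphs
    )
```

For example, "Who founded SpaceX?" → "Elon Musk" composes with "Where was Elon Musk born?" because the bridge answer appears as an entity mention in the second question.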

The core of MuSiQue is the MuSiQue condition, which enforces connected reasoning:

  • For every edge $(q_j \to q_i) \in E$, a strong pretrained QA model $M$ must not be able to answer $q_i$ without access to $a_j$ (the output of $q_j$): $M(q_i[-a_j]; C) \neq a_i$, where $q_i[-a_j]$ denotes $q_i$ with mentions of $a_j$ masked.
  • For each node $q_i \in V$, $M(q_i; \varphi) \neq a_i$, where $\varphi$ denotes the empty context.

Equation (2) in the source formalizes these constraints to block both contextual and context-independent shortcuts.

2. Bottom-Up Dataset Construction Pipeline

The MuSiQue dataset is constructed via a rigorously staged, bottom-up pipeline emphasizing connectedness and difficulty:

  1. Single-Hop Sanity Filtering: Starts from approximately 500,000 question–paragraph–answer triples aggregated from SQuAD, Natural Questions, MLQA, T-REx, and Zero-Shot RE. Question–answer pairs with annotation errors, no QA-model agreement (F1 = 0), or unsuitable context length (<20 or >300 words) are excluded.
  2. Composable Pair Identification: Distinct pairs $(q_1, a_1, p_1)$ and $(q_2, a_2, p_2)$ are retained only if $a_1$ is a named entity appearing in $q_2$, $a_2$ does not appear in $q_1$, and $p_1 \neq p_2$. Entity coreference relies on SpaCy NER tags, Wikipedia first-hit alignment, and a Wikification model.
  3. Disconnection Filtering: Using a Longformer QA model under 5-fold cross-validation, a pair is discarded if either its head question $q_1$ is solvable without context ($M(q_1; \varphi)$ reaching F1 ≥ 0.5) or its masked tail question $q_2[-a_1]$ is answerable with F1 > 0.25 or with perfect paragraph support.
  4. Building $k$-Hop DAGs: Valid 2-hop pairs serve as building blocks for assembling up to 4-hop DAGs across six canonical graph shapes under token and occurrence constraints (max 15 tokens for 2–3 hop questions, 20 tokens for 4-hop; limits on question/entity reuse).
  5. Train–Test Leakage Minimization: Overlap (by question, answer, or paragraph) is minimized via greedy selection for training, followed by a further split into development/test partitions while maintaining hop diversity.
  6. Hard Distractor Retrieval: Each $n$-hop question's context $C$ incorporates the $n$ gold paragraphs and 20 distractors, chosen by BM25 retrieval with bridge entities masked. Restricting distractor sources to the curated set (vs. open Wikipedia) compounds the difficulty of distinguishing evidence.
  7. Crowdsourced Question Composition: Annotators are shown each assembled DAG, its subquestions, paragraphs, and entity bridges, and rewrite them into a single coherent $n$-hop question. Seventeen annotators were selected through a qualification round.
  8. Unanswerable Contrast Generation: For MuSiQue-Full, each answerable question is paired with an unanswerable version created by removing the paragraph containing one subquestion's answer, enforcing the need to recognize context sufficiency.

The result of these phases is a dataset explicitly crafted to demand and measure connected, multihop reasoning.
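Step 6 of the pipeline can be illustrated with a minimal sketch: mask the bridge entities out of the composed question, then rank a curated paragraph pool by BM25 score. The tiny BM25 implementation and all function names below are hypothetical stand-ins for the retriever actually used, assuming whitespace tokenization.

```python
import math
from collections import Counter

def bm25_rank(query_tokens, docs, k1=1.5, b=0.75):
    """Rank documents (token lists) by Okapi BM25 score against the query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    def score(doc):
        tf = Counter(doc)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        return s
    return sorted(range(N), key=lambda i: score(docs[i]), reverse=True)

def hard_distractors(question, bridge_entities, pool, n=20):
    """Mask bridge answers out of the question, then take the top-n paragraphs."""
    masked = question
    for ent in bridge_entities:
        masked = masked.replace(ent, "")  # hide the bridge entity from retrieval
    query = masked.lower().split()
    docs = [p.lower().split() for p in pool]
    return [pool[i] for i in bm25_rank(query, docs)[:n]]
```

Masking the bridge entities keeps the retriever from simply surfacing the gold evidence chain, so the retrieved paragraphs are topically close yet genuinely distracting.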

3. Dataset Variants: MuSiQue-Ans and MuSiQue-Full

MuSiQue exists in two principal forms:

  • MuSiQue-Ans: Approximately 25,000 answerable multihop questions, each annotated with the supporting paragraphs at the hop level.
  • MuSiQue-Full: Each question in MuSiQue-Ans is paired with an unanswerable variant created by randomly excising a supporting context for a given hop. Models are required to (i) determine answerability (the ‘sufficiency’ label S) and (ii) produce the correct answer and paragraph supports when sufficient information exists.

Contrast pairs in MuSiQue-Full differ only by the removal of minimal context and thus directly assess a model’s ability to distinguish sufficient vs. insufficient information and to avoid overfitting to spurious or artifact-based cues.
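The contrast construction reduces to deleting one gold paragraph and flipping the sufficiency label. A minimal sketch under an assumed record schema (the field names here are illustrative, not the dataset's actual format):

```python
import random

def make_unanswerable(example, rng=random):
    """Return a contrast instance whose context lacks one supporting paragraph."""
    victim = rng.choice(example["supporting_ids"])  # pick one hop's gold paragraph
    return {
        "question": example["question"],
        "paragraphs": [p for p in example["paragraphs"] if p["id"] != victim],
        "answerable": False,  # the sufficiency label S
        "answer": None,       # no answer should be produced
    }
```

Because only one paragraph changes between the pair, a model that answers both instances identically is demonstrably ignoring part of the evidence chain.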

4. Enforcing Connected Reasoning and Pseudocode

The enforcement of the MuSiQue condition is central to dataset integrity. The implementation, as outlined for the 2-hop case and generalizable to $k$-hop graphs, is as follows:

```python
for (q1, a1, p1), (q2, a2, p2) in candidate_pairs:
    # 1) Head check: q1 must not be solvable without any context.
    if qa_model.predict(q1, context=None) == a1:
        continue  # context-independent shortcut: discard
    # 2) Tail check: mask the bridge answer a1 out of q2; the masked
    #    question must not be answerable even with full context.
    q2_masked = mask_mentions(q2, a1)
    context = [p1, p2] + distractors(q2_masked)
    if qa_model.predict(q2_masked, context=context) == a2:
        continue  # disconnected tail: discard
    keep_edge(q1, q2)
```

This approach is applied within a 5-fold cross-validation regime, ensuring each candidate composition is genuinely connected: no hop within a DAG is answerable without the requisite input from its predecessors or without context altogether. This constraint is rechecked at each level of the compositional pipeline.
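Schematically, the cross-validated filter partitions the candidates into folds and scores each fold with a model fit on the others, so no candidate is judged by a model that saw it during training. `train_and_score` is a hypothetical callable standing in for training a Longformer QA model on a fold's complement.

```python
def five_fold_filter(candidates, train_and_score, k=5):
    """Keep only candidates the held-out model cannot answer well."""
    folds = [candidates[i::k] for i in range(k)]  # round-robin partition
    kept = []
    for i, held_out in enumerate(folds):
        train = [c for j, f in enumerate(folds) if j != i for c in f]
        model = train_and_score(train)  # returns a scoring callable
        # Discard candidates the model answers too well (cf. step 3's thresholds).
        kept += [c for c in held_out if model(c) < 0.25]
    return kept
```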

5. Empirical Difficulty and Comparative Analysis

MuSiQue sets a new bar for multihop QA difficulty as evidenced by substantial human–machine performance gaps:

  • MuSiQue-Ans: On 125 held-out questions, human annotators achieve answer F1 = 78.0 (upper bound: 88.6), support F1 = 93.9 (upper bound: 97.3). The best step-execution model EX(SA) records answer F1 = 49.8, support F1 = 79.2, with single-paragraph baselines dropping to 32.0 F1—showing that isolated contexts lack sufficient information.
  • MuSiQue-Full: The same model’s joint answer+sufficiency (An+Sf) score is 32.2 and support+sufficiency (Sp+Sf) 44.3, whereas humans approach An+Sf ≈ 78 and Sp+Sf ≈ 94.

Comparatively, similar tests on 20,000 samples from HotpotQA and 2WikiMultihopQA reveal human–model gaps of 10 points or less, with artifact-only models often exceeding 60 F1. MuSiQue’s DiRe (disconnected-reasoning) analysis demonstrates further resistance to shortcut exploitation: answer F1 plunges from ≈68 on HotpotQA to 37.8 on MuSiQue-Ans.

Empirical findings indicate that MuSiQue’s questions are substantially less amenable to strategies that exploit dataset artifacts or disconnected reasoning, marking it as a stringent challenge benchmark.
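For reference, the answer scores quoted above are token-level F1, the standard reading-comprehension metric: the harmonic mean of precision and recall over bag-of-words overlap between the predicted and gold answers. A minimal version (the official evaluation scripts additionally normalize articles and punctuation):

```python
from collections import Counter

def answer_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer string."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```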

6. Properties, Impact, and Research Implications

MuSiQue’s design enforces that multihop QA models are truly compositional and context-integrating, as opposed to relying on spurious correlations or isolated fact retrieval. The formal graph-based composition and bottom-up curation ensure fine-grained control over hop and entity variety, support for up to four reasoning hops, and diverse graph shapes. The explicit contrastive design in MuSiQue-Full mandates upward progress in model capability not only in answer extraction but also context sufficiency detection.

A plausible implication is that progress on MuSiQue is likely to correlate with advances in compositional reasoning, explicit evidence aggregation, and robustness to distractor contexts—core goals in machine reading comprehension and complex QA research. The dataset is positioned to guide the development of architectures and training protocols that emphasize stepwise, explainable reasoning and resist shortcut exploitation (Trivedi et al., 2021).

References

  • Trivedi, H., Balasubramanian, N., Khot, T., and Sabharwal, A. (2021). "MuSiQue: Multihop Questions via Single-hop Question Composition." Transactions of the Association for Computational Linguistics, 2022.
