Roundtrip Consistency Filtering

Updated 19 January 2026
  • The paper introduces roundtrip consistency filtering, a high-precision method to verify that a generated question retrieves the exact answer span from its context, boosting QA performance.
  • It employs a three-model pipeline based on BERT for answer extraction, question generation, and conditional answer extraction, using a strict exact span-match criterion.
  • Empirical results on SQuAD2 and NQ show that this filtering method significantly improves EM and F1 scores by reducing label noise and enhancing data quality.

Roundtrip consistency filtering is a high-precision method for generating synthetic question answering (QA) corpora by verifying mutual agreement between answer extraction and question generation models on extractive QA tasks. The central requirement is that, given a context, a model-selected answer span, and a question generated from that span, an independent answer extraction model must recover the identical span when fed the generated question together with the original context. This filter curates synthetic QA triples, retaining only those that pass the roundtrip requirement, thereby improving the utility of such data for pretraining and fine-tuning QA systems (Alberti et al., 2019).

1. Formal Definition and Core Principle

Given a context C, an extractive answer span A (i.e., a contiguous segment of tokens in C), and a generated question Q, roundtrip consistency requires that a conditional answer extraction model, when presented with (Q, C), outputs exactly the original span A. Formally,

A' = \arg\max_{a' \subseteq C} p(a' \mid C, Q; \theta_{QA}),

and the triple (C, Q, A) is called roundtrip-consistent iff A' = A. This strict identity criterion forms the basis of the filtering mechanism, ensuring that only high-quality, recoverable (context, question, answer) tuples are retained.

2. Mathematical Framework

Three probability models underpin roundtrip filtering:

  • p_1(a \mid C; \theta_{AE}): unconditional answer extraction model
  • p_2(q \mid C, A; \theta_{QG}): question generation model
  • p_3(a' \mid C, Q; \theta_{QA}): conditional answer extraction (QA) model

A "soft" consistency score can be defined as s(C, Q, A) = p_3(A \mid C, Q; \theta_{QA}), representing the probability the QA model assigns to the original answer. In practice, however, a "hard" criterion is used, accepting only those triples for which \arg\max_{a'} p_3(a' \mid C, Q; \theta_{QA}) = A (span-string match). No explicit probability threshold is set.
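The hard and soft criteria can be sketched as follows; `qa_predict` and `qa_score` are hypothetical stand-ins for the conditional QA model p_3 (real implementations would be finetuned BERT models), with a toy predictor included for illustration:

```python
# Hard vs. soft roundtrip-consistency checks (sketch). `qa_predict` and
# `qa_score` stand in for the conditional QA model p3.

def hard_consistent(context, question, answer, qa_predict):
    """Hard criterion: accept iff the QA model's argmax span exactly
    string-matches the original answer span."""
    return qa_predict(context, question) == answer

def soft_score(context, question, answer, qa_score):
    """Soft criterion: the probability p3(A | C, Q) assigned to A."""
    return qa_score(context, question, answer)

# Toy stand-in QA model: always returns the first capitalized token.
def toy_predict(context, question):
    for tok in context.split():
        if tok[0].isupper():
            return tok
    return ""

ok = hard_consistent("Paris is the capital of France.",
                     "What is the capital of France?",
                     "Paris", toy_predict)
```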

3. Filtering Algorithm and Implementation

The roundtrip filtering pipeline is applied to an unlabeled corpus, with key steps as follows:

Initialize S_synth = {}
For each context C in corpus:
    1. Extract top-K spans {A^(1), ..., A^(K)}: top-K of p1(a | C) subject to |a| ≤ L_A
    2. Sample A uniformly from the top-K spans
    3. Generate Q: greedy decoding for up to L_Q tokens under p2(· | C, A)
    4. Predict A': argmax_{a', |a'| ≤ L_A} p3(a' | C, Q)
    5. If A' == A: add (C, Q, A) to S_synth
Output S_synth
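The loop above can be sketched in Python; the three model calls (`extract_top_k`, `generate_question`, `qa_predict`) are hypothetical stand-ins for p_1, p_2, and p_3, which in the paper are finetuned BERT models:

```python
# Minimal sketch of the roundtrip filtering loop over an unlabeled corpus.
import random

def roundtrip_filter(corpus, extract_top_k, generate_question, qa_predict,
                     k=10, seed=0):
    rng = random.Random(seed)
    s_synth = []
    for context in corpus:
        spans = extract_top_k(context, k)      # top-K answer candidates under p1
        if not spans:
            continue
        answer = rng.choice(spans)             # sample A uniformly from top-K
        question = generate_question(context, answer)  # greedy decoding under p2
        recovered = qa_predict(context, question)      # argmax span under p3
        if recovered == answer:                # exact span-string match
            s_synth.append((context, question, answer))
    return s_synth
```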

Key hyperparameters:

Parameter | Value | Notes
Max context length | 512 tokens | Contexts are windowed/truncated or padded
Top-K answer spans | K = 10 | A sampled uniformly from the top K
Max answer span length | L_A = 32 wordpieces | Span candidates capped at L_A
Max question length | L_Q, fixed (e.g. 20–30 tokens) | Questions padded/truncated
Training data | SQuAD2, Natural Questions (extractive) | All models finetuned on extractive data
Filtering criterion | Exact span-string match | No soft threshold
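These settings can be collected as a single config fragment (a sketch; the L_Q value is only bounded in the source, so the 24 here is an assumed illustrative value):

```python
# Roundtrip-filtering hyperparameters, as reported above.
FILTER_CONFIG = {
    "max_context_len": 512,       # tokens; contexts windowed or padded
    "top_k_spans": 10,            # K; answer A sampled uniformly from top-K
    "max_answer_len": 32,         # L_A, in wordpieces
    "max_question_len": 24,       # L_Q; only a fixed bound (20-30) is given,
                                  # so this value is an assumption
    "criterion": "exact_span_match",  # hard filter; no soft threshold
}
```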

4. Model Architectures and Training Details

Roundtrip consistency filtering employs three main models, all based on the BERT architecture (Devlin et al., 2019):

  • Answer Extraction (p_1): BERT's final hidden layer embeddings at the start and end token indices s and e of a candidate span a are concatenated and passed through a multi-layer perceptron f_J(a, C). Candidate spans of up to 32 wordpieces are scored, yielding

p_1(a \mid C) \propto \exp(f_J(a, C))

  • Question Generation (p_2): Two approaches are used:
    • Encoder-only BERT finetuned as a left-to-right language model, with special token-type IDs marking the answer span in context.
    • Sequence-to-sequence Transformer (encoder + decoder), initialized from BERT, pretrained with next-sentence prediction, and finetuned on gold (C, A) → Q pairs.
  • Conditional Answer Extraction (p_3): Follows the design of Alberti et al. (2019), with BERT encoding the concatenation [Q; C] and affine start/end scoring over each token:

p_3(a \mid C, Q) \propto \exp(AFF_I(\text{start})) \cdot \exp(AFF_I(\text{end}))

Greedy left-to-right generation is used for questions; argmax extraction is used for all answer predictions. Finetuning is performed on extractive SQuAD2 and NQ using batch size 128, learning rate 2 × 10^{-5}, and one epoch per synthetic dataset.
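The start/end span scoring of p_3 can be illustrated with a toy softmax over candidate spans (a sketch under the assumption that start/end logits are given; real logits would come from BERT over [Q; C]):

```python
# A span's score is the sum of its start and end logits; p3(a | C, Q) is a
# softmax over all candidate spans of length at most max_len.
import math

def span_probs(start_logits, end_logits, max_len=32):
    scores = {}
    n = len(start_logits)
    for s in range(n):
        for e in range(s, min(s + max_len, n)):   # span (s, e), length <= max_len
            scores[(s, e)] = start_logits[s] + end_logits[e]
    z = sum(math.exp(v) for v in scores.values())
    return {span: math.exp(v) / z for span, v in scores.items()}

probs = span_probs([2.0, 0.1, 0.0], [0.0, 1.5, 0.2])
best = max(probs, key=probs.get)   # argmax span, as used by the hard filter
```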

5. Quantitative Performance Gains

Empirical results demonstrate that roundtrip consistency filtering materially increases the effectiveness of synthetic data for boosting downstream QA performance, as summarized:

Dataset | Baseline (EM/F1) | + Synthetic (RT, EM/F1) | ΔEM | ΔF1
SQuAD2 (dev) | 78.7 / 81.9 | 80.1 / 82.8 (3M RT) | +1.4 | +0.9
SQuAD2 (dev) | 78.7 / 81.9 | 81.2 / 84.0 (4M RT) | +2.5 | +2.1
NQ (short ans.) | 52.7 | 55.1 (4M RT) | +2.4 | —
NQ (long ans.) | 64.7 | 65.9 (4M RT) | +1.2 | —

Manual inspection showed a 39% correctness rate among RT-passing triples, versus only 16% for those that failed RT. Omitting roundtrip filtering resulted in a 0.5–1.0 point smaller EM gain. These gains correspond to a roughly 50% reduction in the gap to single-human performance on NQ short-answer F1, and SQuAD2 performance approaches within 0.1–0.4 points of human reference.

6. Limitations and Potential Enhancements

Roundtrip filtering enforces strict span equality, discarding over 50% of generated (C, Q, A) pairs. This hard filter admits only extractive consistency and does not capture valid paraphrases of the answer span. Alternative approaches could explore soft thresholds on p_3(A \mid C, Q) or margin constraints in the filtering criterion.
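Such a soft-threshold variant could be sketched as follows (this is an illustration of the suggested alternative, not a method from the paper; `qa_score` is a hypothetical scorer returning p_3(A | C, Q)):

```python
# Soft roundtrip filter: keep triples whose roundtrip probability
# p3(A | C, Q) meets a threshold tau, instead of an exact argmax match.
def soft_filter(triples, qa_score, tau=0.5):
    return [(c, q, a) for (c, q, a) in triples if qa_score(c, q, a) >= tau]
```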

All components in the pipeline are BERT-based; incorporating a unified joint question–answer generator, as suggested by Lewis et al. (2018), may increase diversity and coverage. Theoretical results indicate that encouraging a minimum roundtrip log-likelihood on unlabeled data could reduce sample complexity, suggesting avenues for future work in objective design (Alberti et al., 2019).

7. Context and Significance

Roundtrip consistency filtering is conceptually simple and practically effective for synthesizing QA data, providing systematic gains in both exact match and F1 scores on major QA benchmarks. By ensuring that an answer used in question generation is recoverable via a dedicated QA model, roundtrip filtering raises the likelihood that synthetic examples will benefit downstream extractive QA training. Its effectiveness is attributed to its selective retention of only those examples with tightly coupled question-answer relationships, thus mitigating label noise and boosting model precision (Alberti et al., 2019).
