Roundtrip Consistency Filtering
- The paper introduces roundtrip consistency filtering, a high-precision method to verify that a generated question retrieves the exact answer span from its context, boosting QA performance.
- It employs a three-model pipeline based on BERT for answer extraction, question generation, and conditional answer extraction, using a strict exact span-match criterion.
- Empirical results on SQuAD2 and NQ show that this filtering method significantly improves EM and F1 scores by reducing label noise and enhancing data quality.
Roundtrip consistency filtering is a high-precision method for generating synthetic question answering (QA) corpora by verifying the mutual agreement between answer extraction and question generation models in the context of extractive QA tasks. The central requirement is that, given a context, a model-selected answer span, and a generated question, an independent answer extraction model must recover the identical answer span when fed the generated question together with the original context. This filter is used to curate synthetic QA triples, retaining only those that pass this roundtrip requirement, thereby improving the utility of such data for pretraining and fine-tuning QA systems (Alberti et al., 2019).
1. Formal Definition and Core Principle
Given a context $C$, an extractive answer span $A$ (i.e., a contiguous segment of tokens in $C$), and a generated question $Q$, roundtrip consistency requires that a conditional answer extraction model, when presented with $(C, Q)$, outputs exactly the original span $A$. Formally,

$$A' = \operatorname*{argmax}_{a'} \, p_3(a' \mid C, Q),$$

and the triple $(C, Q, A)$ is called roundtrip-consistent iff $A' = A$. This strict identity criterion forms the basis of the filtering mechanism, ensuring that only high-quality, recoverable (context, question, answer) tuples are retained.
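The identity check can be sketched as a small predicate. The extractor callable below is a toy stand-in for illustration only, not the paper's BERT model:

```python
# Sketch of the hard roundtrip-consistency check: accept a (C, Q, A) triple
# iff the conditional extractor recovers A exactly (strict string identity).
def is_roundtrip_consistent(context, answer, question, answer_model):
    predicted = answer_model(context, question)  # A' under the QA model
    return predicted == answer                   # strict span-string match

# Toy extractor (hypothetical): always returns the context's last word.
def toy_extractor(context, question):
    return context.split()[-1]

print(is_roundtrip_consistent("Paris is the capital of France", "France",
                              "What country is Paris the capital of?",
                              toy_extractor))  # True
```

In the real pipeline, `answer_model` would be the conditional answer extraction model described in Section 4.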
2. Mathematical Framework
Three probability models underpin roundtrip filtering:
- $p_1(a \mid C)$: unconditional answer extraction model
- $p_2(q \mid C, a)$: question generation model
- $p_3(a' \mid C, q)$: conditional answer extraction (QA) model

A "soft" consistency score can be defined as $p_3(A \mid C, Q)$, representing the probability the QA model assigns to the original answer. In practice, however, a "hard" criterion is used, accepting only those triples for which $A' = A$ (span-string match). No explicit probability threshold is set.
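The soft-versus-hard distinction can be illustrated with precomputed span log-scores; the score dictionary and its values here are purely hypothetical:

```python
import math

# "Soft" score: the probability the QA model assigns to the original span,
# i.e. a softmax over candidate-span log-scores evaluated at that span.
def soft_consistency(span_log_scores, original_span):
    z = sum(math.exp(s) for s in span_log_scores.values())
    return math.exp(span_log_scores[original_span]) / z

# "Hard" criterion (the one actually used): accept iff the argmax span
# equals the original span, with no probability threshold.
def hard_consistency(span_log_scores, original_span):
    best = max(span_log_scores, key=span_log_scores.get)
    return best == original_span

scores = {"1858": 2.0, "Lourdes": 0.5, "France": -1.0}  # toy log-scores
print(round(soft_consistency(scores, "1858"), 3))  # 0.786
print(hard_consistency(scores, "1858"))            # True
```

A triple can pass the hard check even when its soft score is well below 1, which is one motivation for the threshold-based variants discussed in Section 6.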
3. Filtering Algorithm and Implementation
The roundtrip filtering pipeline is applied to an unlabeled corpus, with key steps as follows:
```
Initialize S_synth = {}
For each context C in corpus:
    1. Get top-K spans {A^(1), ..., A^(K)}: top-K argmax_{a, |a| ≤ L_A} p1(a | C)
    2. Sample A uniformly from the top-K spans
    3. Generate Q: greedy decoding for up to L_Q tokens under p2(. | C, A)
    4. Predict A': argmax_{a', |a'| ≤ L_A} p3(a' | C, Q)
    5. If A' == A: add (C, Q, A) to S_synth
Output S_synth
```
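The loop above can be made concrete with the three models replaced by trivial stand-ins; all function names and stub logic here are illustrative, not the paper's implementation:

```python
import random

def propose_spans(context, k=10):
    # Stand-in for the top-K argmax over p1(a | C): individual tokens.
    tokens = context.split()
    return tokens[:min(k, len(tokens))]

def generate_question(context, answer):
    # Stand-in for greedy decoding under p2(q | C, A).
    return f"Which token is '{answer}'?"

def extract_answer(context, question):
    # Stand-in for the argmax over p3(a' | C, Q).
    return question.split("'")[1]

def roundtrip_filter(corpus, seed=0):
    rng = random.Random(seed)
    synth = []
    for context in corpus:
        answer = rng.choice(propose_spans(context))  # sample A from top-K
        question = generate_question(context, answer)
        if extract_answer(context, question) == answer:  # A' == A
            synth.append((context, question, answer))
    return synth

triples = roundtrip_filter(["the quick brown fox", "jumps over the lazy dog"])
print(len(triples))  # 2: the toy stubs are self-consistent, so all pass
```

With real models, a large fraction of candidates fails the `A' == A` check (over half, per Section 6), which is precisely the filtering effect.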
Key hyperparameters:
| Parameter | Value | Notes |
|---|---|---|
| Max context length | 512 tokens (truncated/padded) | Contexts are windowed or padded |
| Top-K answer spans | 10 | Chosen uniformly |
| Max answer span length | 32 wordpieces | |
| Max question length | Fixed, e.g. 20–30 tokens ($L_Q$) | Padded/truncated |
| Pretraining data | SQuAD2, Natural Questions (extractive) | All models finetuned on extractive data |
| Filtering criterion | Exact span-string match | No soft threshold |
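These hyperparameters can be gathered into one configuration object. Field names are illustrative, and the maximum question length is pinned to the upper end of the quoted range purely as an assumption:

```python
from dataclasses import dataclass

# Config mirroring the hyperparameter table above (illustrative names).
@dataclass(frozen=True)
class RoundtripConfig:
    max_context_len: int = 512      # tokens, windowed/truncated/padded
    top_k_spans: int = 10           # candidate answer spans per context
    max_answer_len: int = 32        # wordpieces (L_A)
    max_question_len: int = 30      # assumed upper bound of the 20-30 range
    exact_match_only: bool = True   # hard criterion; no soft threshold

cfg = RoundtripConfig()
print(cfg.top_k_spans, cfg.exact_match_only)  # 10 True
```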
4. Model Architectures and Training Details
Roundtrip consistency filtering employs three main models, all based on the BERT architecture (Devlin et al., 2019):
- Answer Extraction ($p_1$): BERT's final hidden layer embeddings for token indices $s$ and $e$ (start/end of a candidate span $a$) are concatenated and passed through a multi-layer perceptron $\mathrm{MLP}$. Candidate spans of up to 32 wordpieces are scored, yielding
$$p_1(a \mid C) \propto \exp\big(\mathrm{MLP}([h_s; h_e])\big),$$
where $h_s, h_e$ are the BERT output vectors at the span boundaries.
- Question Generation ($p_2$): Two approaches are used:
  - Encoder-only BERT finetuned as a left-to-right language model, with special token-type IDs marking the answer span in the context.
  - Sequence-to-sequence Transformer (encoder + decoder), initialized from BERT, pretrained with next-sentence prediction, and finetuned on gold (context, answer) → question pairs.
- Conditional Answer Extraction ($p_3$): Follows the design of Alberti et al. (2019), with BERT encoding the concatenation $[Q; C]$ and affine start/end scoring over each token:
$$p_3(a' \mid C, Q) \propto \exp\big(w_{\text{start}}^\top h_{s'} + w_{\text{end}}^\top h_{e'}\big),$$
where $h_{s'}, h_{e'}$ are the encodings of the predicted span's start and end tokens.
Greedy left-to-right generation is used for questions; argmax extraction is used for all answer predictions. Finetuning is performed on extractive SQuAD2 and NQ with batch size 128, a fixed learning rate, and one epoch per synthetic dataset.
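The additive start/end span scoring used for answer extraction can be sketched as follows, with a tiny hand-built hidden-state matrix standing in for BERT's output (all names and values are illustrative):

```python
import numpy as np

# Affine start/end span scoring: each token gets a start logit and an end
# logit from its contextual embedding; a span (i, j) scores start[i] + end[j].
def best_span(H, w_start, w_end, max_span_len=32):
    start = H @ w_start            # (T,) start logits
    end = H @ w_end                # (T,) end logits
    best_score, best = -np.inf, None
    T = len(start)
    for i in range(T):
        for j in range(i, min(i + max_span_len, T)):  # enforce |a| ≤ L_A
            score = start[i] + end[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

# 3 "tokens" with hidden size 2, chosen so token 0 wins the start score
# and token 2 wins the end score.
H = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
span = best_span(H, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(span)  # (0, 2)
```

Argmax over this score grid corresponds to the $\operatorname{argmax}$ in the formal definition; the real models do the same over BERT embeddings of 512-token contexts.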
5. Quantitative Performance Gains
Empirical results demonstrate that roundtrip consistency filtering materially increases the effectiveness of synthetic data for boosting downstream QA performance, as summarized:
| Dataset | Baseline (EM/F1) | + Synthetic (RT, EM/F1) | ∆EM | ∆F1 |
|---|---|---|---|---|
| SQuAD2 (Dev) | 78.7 / 81.9 | 80.1 / 82.8 (3M RT) | +1.4 | +0.9 |
| SQuAD2 (Dev) | 78.7 / 81.9 | 81.2 / 84.0 (4M RT) | +2.5 | +2.1 |
| NQ (short ans.) | 52.7 | 55.1 (4M RT) | +2.4 | |
| NQ (long ans.) | 64.7 | 65.9 (4M RT) | +1.2 | |
Manual inspection showed a 39% correctness rate among triples that passed the roundtrip check, versus only 16% for those that failed it. Omitting roundtrip filtering reduced the EM gain by 0.5–1.0 points. These gains correspond to a roughly 50% reduction in the gap to single-human performance on NQ short-answer F1, and SQuAD2 performance comes within 0.1–0.4 points of the human reference.
6. Limitations and Potential Enhancements
Roundtrip filtering enforces strict span equality, discarding over 50% of generated (C, Q, A) pairs. This hard filter admits only extractive consistency and does not capture valid paraphrases of the answer span. Alternative approaches could explore soft thresholds on $p_3(A \mid C, Q)$ or margin constraints in the filtering criterion.
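A soft-threshold variant of the filter might look like the sketch below; `span_prob` and the threshold `tau` are hypothetical stand-ins for the model probability p3(A | C, Q) and a tuned cutoff:

```python
# Soft alternative to the hard A' == A criterion: keep triples whose
# assigned probability clears a threshold tau instead of requiring the
# argmax span to match exactly.
def soft_filter(triples, span_prob, tau=0.5):
    return [t for t in triples if span_prob(*t) >= tau]

# Toy probability model (hypothetical): full confidence iff the answer
# string literally appears in the context.
def toy_prob(context, question, answer):
    return 1.0 if answer in context else 0.0

kept = soft_filter([("Paris is in France", "Where is Paris?", "France"),
                    ("Paris is in France", "Where is Paris?", "Germany")],
                   toy_prob)
print(len(kept))  # 1
```

Unlike the hard criterion, this variant introduces a tunable precision/recall trade-off through `tau`.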
All components in the pipeline are BERT-based; incorporating a unified joint question–answer generator, as suggested by Lewis et al. (2018), may increase diversity and coverage. Theoretical results indicate that encouraging a minimum roundtrip log-likelihood on unlabeled data could reduce sample complexity, suggesting avenues for future work in objective design (Alberti et al., 2019).
7. Context and Significance
Roundtrip consistency filtering is conceptually simple and practically effective for synthesizing QA data, providing systematic gains in both exact match and F1 scores on major QA benchmarks. By ensuring that an answer used in question generation is recoverable via a dedicated QA model, roundtrip filtering raises the likelihood that synthetic examples will benefit downstream extractive QA training. Its effectiveness is attributed to its selective retention of only those examples with tightly coupled question-answer relationships, thus mitigating label noise and boosting model precision (Alberti et al., 2019).