Roundtrip Consistency Filtering
- The paper introduces roundtrip consistency filtering, a high-precision method to verify that a generated question retrieves the exact answer span from its context, boosting QA performance.
- It employs a three-model pipeline based on BERT for answer extraction, question generation, and conditional answer extraction, using a strict exact span-match criterion.
- Empirical results on SQuAD2 and NQ show that this filtering method significantly improves EM and F1 scores by reducing label noise and enhancing data quality.
Roundtrip consistency filtering is a high-precision method for generating synthetic question answering (QA) corpora by verifying the mutual agreement between answer extraction and question generation models in the context of extractive QA tasks. The central requirement is that, given a context, a model-selected answer span, and a generated question, an independent answer extraction model must recover the identical answer span when fed the generated question together with the original context. This filter is used to curate synthetic QA triples, retaining only those that pass this roundtrip requirement, thereby improving the utility of such data for pretraining and fine-tuning QA systems (Alberti et al., 2019).
1. Formal Definition and Core Principle
Given a context $C$, an extractive answer span $A$ (i.e., a contiguous segment of tokens in $C$), and a generated question $Q$, roundtrip consistency requires that a conditional answer extraction model, when presented with $(C, Q)$, outputs exactly the original span $A$. Formally,

$$A' = \operatorname*{argmax}_{a'} \, p_3(a' \mid C, Q),$$

and the triple $(C, Q, A)$ is called roundtrip-consistent iff $A' = A$. This strict identity criterion forms the basis of the filtering mechanism, ensuring that only high-quality, recoverable (context, question, answer) tuples are retained.
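The identity check can be sketched as a small predicate. The extractor callable below is a toy stand-in for illustration only, not the paper's BERT model:

```python
# Sketch of the hard roundtrip-consistency check: accept a (C, Q, A) triple
# iff the conditional extractor recovers A exactly (strict string identity).
def is_roundtrip_consistent(context, answer, question, answer_model):
    predicted = answer_model(context, question)  # A' under the QA model
    return predicted == answer                   # strict span-string match

# Toy extractor (hypothetical): always returns the context's last word.
def toy_extractor(context, question):
    return context.split()[-1]

print(is_roundtrip_consistent("Paris is the capital of France", "France",
                              "What country is Paris the capital of?",
                              toy_extractor))  # True
```

In the real pipeline, `answer_model` would be the conditional answer extraction model described in Section 4.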
2. Mathematical Framework
Three probability models underpin roundtrip filtering:
- $p_1(a \mid C)$: unconditional answer extraction model
- $p_2(q \mid C, a)$: question generation model
- $p_3(a' \mid C, q)$: conditional answer extraction (QA) model

A "soft" consistency score can be defined as $p_3(A \mid C, Q)$, representing the probability the QA model assigns to the original answer. In practice, however, a "hard" criterion is used, accepting only those triples for which $A' = A$ (span-string match). No explicit probability threshold is set.
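The soft-versus-hard distinction can be illustrated with precomputed span log-scores; the score dictionary and its values here are purely hypothetical:

```python
import math

# "Soft" score: the probability the QA model assigns to the original span,
# i.e. a softmax over candidate-span log-scores evaluated at that span.
def soft_consistency(span_log_scores, original_span):
    z = sum(math.exp(s) for s in span_log_scores.values())
    return math.exp(span_log_scores[original_span]) / z

# "Hard" criterion (the one actually used): accept iff the argmax span
# equals the original span, with no probability threshold.
def hard_consistency(span_log_scores, original_span):
    best = max(span_log_scores, key=span_log_scores.get)
    return best == original_span

scores = {"1858": 2.0, "Lourdes": 0.5, "France": -1.0}  # toy log-scores
print(round(soft_consistency(scores, "1858"), 3))  # 0.786
print(hard_consistency(scores, "1858"))            # True
```

A triple can pass the hard check even when its soft score is well below 1, which is one motivation for the threshold-based variants discussed in Section 6.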
3. Filtering Algorithm and Implementation
The roundtrip filtering pipeline is applied to an unlabeled corpus, with key steps as follows:
```
Initialize S_synth = {}
For each context C in corpus:
    1. Get top-K spans {A^(1), ..., A^(K)}: top-K argmax_{a, |a| ≤ L_A} p1(a | C)
    2. Sample A uniformly from the top-K spans
    3. Generate Q: greedy decoding for up to L_Q tokens under p2(. | C, A)
    4. Predict A': argmax_{a', |a'| ≤ L_A} p3(a' | C, Q)
    5. If A' == A: add (C, Q, A) to S_synth
Output S_synth
```
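The loop above can be made concrete with the three models replaced by trivial stand-ins; all function names and stub logic here are illustrative, not the paper's implementation:

```python
import random

def propose_spans(context, k=10):
    # Stand-in for the top-K argmax over p1(a | C): individual tokens.
    tokens = context.split()
    return tokens[:min(k, len(tokens))]

def generate_question(context, answer):
    # Stand-in for greedy decoding under p2(q | C, A).
    return f"Which token is '{answer}'?"

def extract_answer(context, question):
    # Stand-in for the argmax over p3(a' | C, Q).
    return question.split("'")[1]

def roundtrip_filter(corpus, seed=0):
    rng = random.Random(seed)
    synth = []
    for context in corpus:
        answer = rng.choice(propose_spans(context))  # sample A from top-K
        question = generate_question(context, answer)
        if extract_answer(context, question) == answer:  # A' == A
            synth.append((context, question, answer))
    return synth

triples = roundtrip_filter(["the quick brown fox", "jumps over the lazy dog"])
print(len(triples))  # 2: the toy stubs are self-consistent, so all pass
```

With real models, a large fraction of candidates fails the `A' == A` check (over half, per Section 6), which is precisely the filtering effect.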
Key hyperparameters:
| Parameter | Value | Notes |
|---|---|---|
| Max context length | 512 tokens (truncated/padded) | Contexts are windowed or padded |
| Top-K answer spans | 10 | Chosen uniformly |
| Max answer span length | 32 wordpieces | |
| Max question length | Fixed, e.g. 20–30 tokens ($L_Q$) | Padded/truncated |
| Pretraining data | SQuAD2, Natural Questions (extractive) | All models finetuned on extractive data |
| Filtering criterion | Exact span-string match | No soft threshold |
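These hyperparameters can be gathered into one configuration object. Field names are illustrative, and the maximum question length is pinned to the upper end of the quoted range purely as an assumption:

```python
from dataclasses import dataclass

# Config mirroring the hyperparameter table above (illustrative names).
@dataclass(frozen=True)
class RoundtripConfig:
    max_context_len: int = 512      # tokens, windowed/truncated/padded
    top_k_spans: int = 10           # candidate answer spans per context
    max_answer_len: int = 32        # wordpieces (L_A)
    max_question_len: int = 30      # assumed upper bound of the 20-30 range
    exact_match_only: bool = True   # hard criterion; no soft threshold

cfg = RoundtripConfig()
print(cfg.top_k_spans, cfg.exact_match_only)  # 10 True
```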
4. Model Architectures and Training Details
Roundtrip consistency filtering employs three main models, all based on the BERT architecture (Devlin et al., 2019):
- Answer Extraction ($p_1$): BERT's final hidden layer embeddings for token indices $s$ and $e$ (start/end of a candidate span $a$) are concatenated and passed through a multi-layer perceptron $\mathrm{MLP}$. Candidate spans of up to 32 wordpieces are scored, yielding
$$p_1(a \mid C) \propto \exp\big(\mathrm{MLP}([h_s; h_e])\big),$$
where $h_s, h_e$ are the BERT output vectors at the span boundaries.
- Question Generation ($p_2$): Two approaches are used:
  - Encoder-only BERT finetuned as a left-to-right language model, with special token-type IDs marking the answer span in the context.
  - Sequence-to-sequence Transformer (encoder + decoder), initialized from BERT, pretrained with next-sentence prediction, and finetuned on gold (context, answer) → question pairs.
- Conditional Answer Extraction ($p_3$): Follows the design of Alberti et al. (2019), with BERT encoding the concatenation $[Q; C]$ and affine start/end scoring over each token:
$$p_3(a' \mid C, Q) \propto \exp\big(w_{\text{start}}^\top h_{s'} + w_{\text{end}}^\top h_{e'}\big),$$
where $h_{s'}, h_{e'}$ are the encodings of the predicted span's start and end tokens.
Greedy left-to-right generation is used for questions; argmax extraction is used for all answer predictions. Finetuning is performed on extractive SQuAD2 and NQ with batch size 128, a fixed learning rate, and one epoch per synthetic dataset.
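The additive start/end span scoring used for answer extraction can be sketched as follows, with a tiny hand-built hidden-state matrix standing in for BERT's output (all names and values are illustrative):

```python
import numpy as np

# Affine start/end span scoring: each token gets a start logit and an end
# logit from its contextual embedding; a span (i, j) scores start[i] + end[j].
def best_span(H, w_start, w_end, max_span_len=32):
    start = H @ w_start            # (T,) start logits
    end = H @ w_end                # (T,) end logits
    best_score, best = -np.inf, None
    T = len(start)
    for i in range(T):
        for j in range(i, min(i + max_span_len, T)):  # enforce |a| ≤ L_A
            score = start[i] + end[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

# 3 "tokens" with hidden size 2, chosen so token 0 wins the start score
# and token 2 wins the end score.
H = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
span = best_span(H, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(span)  # (0, 2)
```

Argmax over this score grid corresponds to the $\operatorname{argmax}$ in the formal definition; the real models do the same over BERT embeddings of 512-token contexts.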
5. Quantitative Performance Gains
Empirical results demonstrate that roundtrip consistency filtering materially increases the effectiveness of synthetic data for boosting downstream QA performance, as summarized:
| Dataset | Baseline (EM/F1) | + Synthetic (RT, EM/F1) | ∆EM | ∆F1 |
|---|---|---|---|---|
| SQuAD2 (Dev) | 78.7 / 81.9 | 80.1 / 82.8 (3M RT) | +1.4 | +0.9 |
| SQuAD2 (Dev) | 78.7 / 81.9 | 81.2 / 84.0 (4M RT) | +2.5 | +2.1 |
| NQ (short ans.) | 52.7 | 55.1 (4M RT) | +2.4 | |
| NQ (long ans.) | 64.7 | 65.9 (4M RT) | +1.2 | |
Manual inspection showed a 39% correctness rate among triples that passed the roundtrip check, versus only 16% for those that failed it. Omitting roundtrip filtering reduced the EM gain by 0.5–1.0 points. These gains correspond to a roughly 50% reduction in the gap to single-human performance on NQ short-answer F1, and SQuAD2 performance comes within 0.1–0.4 points of the human reference.
6. Limitations and Potential Enhancements
Roundtrip filtering enforces strict span equality, discarding over 50% of generated (C, Q, A) pairs. This hard filter admits only extractive consistency and does not capture valid paraphrases of the answer span. Alternative approaches could explore soft thresholds on $p_3(A \mid C, Q)$ or margin constraints in the filtering criterion.
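A soft-threshold variant of the filter might look like the sketch below; `span_prob` and the threshold `tau` are hypothetical stand-ins for the model probability p3(A | C, Q) and a tuned cutoff:

```python
# Soft alternative to the hard A' == A criterion: keep triples whose
# assigned probability clears a threshold tau instead of requiring the
# argmax span to match exactly.
def soft_filter(triples, span_prob, tau=0.5):
    return [t for t in triples if span_prob(*t) >= tau]

# Toy probability model (hypothetical): full confidence iff the answer
# string literally appears in the context.
def toy_prob(context, question, answer):
    return 1.0 if answer in context else 0.0

kept = soft_filter([("Paris is in France", "Where is Paris?", "France"),
                    ("Paris is in France", "Where is Paris?", "Germany")],
                   toy_prob)
print(len(kept))  # 1
```

Unlike the hard criterion, this variant introduces a tunable precision/recall trade-off through `tau`.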
All components in the pipeline are BERT-based; incorporating a unified joint question–answer generator, as suggested by Lewis et al. (2018), may increase diversity and coverage. Theoretical results indicate that encouraging a minimum roundtrip log-likelihood on unlabeled data could reduce sample complexity, suggesting avenues for future work in objective design (Alberti et al., 2019).
7. Context and Significance
Roundtrip consistency filtering is conceptually simple and practically effective for synthesizing QA data, providing systematic gains in both exact match and F1 scores on major QA benchmarks. By ensuring that an answer used in question generation is recoverable via a dedicated QA model, roundtrip filtering raises the likelihood that synthetic examples will benefit downstream extractive QA training. Its effectiveness is attributed to its selective retention of only those examples with tightly coupled question-answer relationships, thus mitigating label noise and boosting model precision (Alberti et al., 2019).