WikiSplit Dataset: Sentence Rewriting Corpus
- WikiSplit is a large-scale, naturally occurring split-and-rephrase corpus derived from Wikipedia edit histories that provides over one million sentence split instances.
- The dataset is constructed using heuristic extraction with trigram overlap and BLEU filtering, ensuring naturalness while inherently containing label noise.
- Refinements in WikiSplit++—using NLI filtering and sentence-order reversal—yield improved semantic entailment and reduced copying in neural sequence-to-sequence models.
WikiSplit is a large-scale, naturally occurring split-and-rephrase corpus constructed from Wikipedia edit histories. It provides over one million sentence split instances for training and evaluating neural sequence-to-sequence models in NLP, particularly for the task of decomposing complex sentences into sets of simpler sentences with equivalent semantics. WikiSplit exhibits substantial gains in corpus size and lexical coverage relative to prior datasets such as WebSplit, and its extraction protocol ensures diversity and naturalness by mining actual Wikipedia editorial revisions rather than synthetic templates. However, the corpus natively contains label noise due to automated mining and semantic mismatch, driving subsequent refinement efforts such as WikiSplit++ that aim to improve faithfulness and utility.
1. Data Construction and Extraction Protocol
WikiSplit is extracted automatically from Wikipedia edit histories, leveraging the revision trails that capture changes from one page snapshot to the next. The extraction pipeline segments both revisions into sentences and identifies cases where an editor replaced a single complex sentence C with two simple sentences (S1, S2) in the newer revision. Heuristic matching employs trigram overlap criteria to ensure contiguous coverage:
- The opening trigram of C must appear at the start of S1, and the closing trigram of C at the end of S2.
- The trigram suffix of S1 must differ from that of C, to avoid trivial self-joins in which S1 is simply a copy of C.
- A minimum BLEU threshold ensures surface similarity between C and each candidate simple sentence.
Filtering removes instances with excessive token repetitions, anomalous token length, and profanity/vandalism. For each complex sentence, only the split pair with the highest summed BLEU is retained. All extracted split instances in WikiSplit are binary, meaning each complex input C is split into exactly two simple sentences (Botha et al., 2018). A sketch of the matching heuristic appears below.
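The following is a minimal sketch of this matching heuristic under simplifying assumptions: whitespace tokenization, NLTK's sentence-level BLEU, and an illustrative threshold value. The exact thresholds and implementation details of Botha et al. (2018) are not reproduced here.

```python
# Sketch of the WikiSplit candidate-split heuristic (assumptions: whitespace
# tokenization, NLTK sentence BLEU, illustrative threshold value).
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

BLEU_THRESHOLD = 0.3  # illustrative value, not the published threshold
smooth = SmoothingFunction().method1

def trigrams(tokens):
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def is_candidate_split(complex_sent, simple_1, simple_2):
    c, s1, s2 = (s.split() for s in (complex_sent, simple_1, simple_2))
    if min(len(c), len(s1), len(s2)) < 3:
        return False
    # C and S1 must share the opening trigram; C and S2 the closing trigram.
    if trigrams(c)[0] != trigrams(s1)[0] or trigrams(c)[-1] != trigrams(s2)[-1]:
        return False
    # Reject trivial self-joins where S1 is simply a copy of C.
    if trigrams(s1)[-1] == trigrams(c)[-1]:
        return False
    # Require surface similarity between C and each candidate simple sentence.
    bleu_s1 = sentence_bleu([c], s1, smoothing_function=smooth)
    bleu_s2 = sentence_bleu([c], s2, smoothing_function=smooth)
    return min(bleu_s1, bleu_s2) >= BLEU_THRESHOLD

def split_score(complex_sent, simple_1, simple_2):
    """Summed BLEU used to keep only the best split pair per complex sentence."""
    c = complex_sent.split()
    return sum(
        sentence_bleu([c], s.split(), smoothing_function=smooth)
        for s in (simple_1, simple_2)
    )
```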
2. Corpus Statistics and Dataset Organization
WikiSplit achieves an order-of-magnitude scale expansion over previous benchmarks. Its statistics are as follows:
| Partition | # Instances |
|---|---|
| train | 795,585 |
| dev | 99,448 |
| test | 99,448 |
| overall | 994,481 |
Token-level statistics indicate that the mean length of a complex sentence C is approximately 33.1 tokens, while simple sentences average 11.0 tokens each. The number of distinct splits and the vocabulary size are much higher than in the synthetic WebSplit dataset, with roughly 60× more examples and 90× more distinct lexical items. Examples display naturalistic phrasing and topic diversity reflective of Wikipedia content.
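The following is a small sketch of how such token-length statistics can be reproduced. The Hugging Face Hub identifier "wiki_split" and the field names used here are assumptions; adjust them to the copy of the corpus you are working with.

```python
# Sketch: compute mean token lengths of complex and simple sentences.
# Dataset identifier and field names are assumptions, not guaranteed.
from statistics import mean
from datasets import load_dataset

ds = load_dataset("wiki_split", split="train")  # assumed Hub identifier

complex_lens = [len(row["complex_sentence"].split()) for row in ds]
simple_lens = [
    len(row[field].split())
    for row in ds
    for field in ("simple_sentence_1", "simple_sentence_2")
]

print(f"mean complex-sentence length: {mean(complex_lens):.1f} tokens")
print(f"mean simple-sentence length:  {mean(simple_lens):.1f} tokens")
```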
3. Noisy Labels and Dataset Limitations
Despite rigorous surface-form heuristics, WikiSplit contains significant label noise:
- Hallucination (extrinsic): Some simple sentences introduce unverifiable facts or paraphrases not entailed by C, including erroneous dates, extraneous modifiers, or additional clauses.
- Under-splitting: Models trained on WikiSplit may learn to copy the input verbatim, minimizing loss without performing a substantive split, since split points are signalled only by sparse punctuation cues and are not explicitly rewarded by the training objective.
- Unsupported/Missing Facts: Analysis confirms that approximately 32–40% of instances do not strictly satisfy C ⊨ Si for every simple sentence Si, meaning semantic entailment is not guaranteed (Tsukagoshi et al., 2024).
This results in model training artifacts and limits deployment in faithfulness-critical tasks.
4. Data Refinement via WikiSplit++
WikiSplit++ introduces a two-step refinement of the original corpus:
a. NLI Filtering: Each pair (C, Si) is passed through a DeBERTa-v2 XXL natural language inference (NLI) classifier fine-tuned on MNLI. An instance is retained iff, for every simple sentence Si, entailment is the most probable label:
P_entail(C, Si) > max(P_neutral(C, Si), P_contradiction(C, Si)),
where P_entail, P_neutral, and P_contradiction are the predicted entailment, neutral, and contradiction probabilities, respectively.
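A minimal sketch of this filter is shown below, assuming an MNLI-style classifier from the Hugging Face Hub; the checkpoint name is illustrative and not necessarily the exact model used for WikiSplit++.

```python
# Sketch of the NLI filter: keep an instance only if entailment is the
# top label for every (C, S_i) pair. Checkpoint name is an assumption.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-v2-xxlarge-mnli"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()
label2id = {name.lower(): idx for name, idx in model.config.label2id.items()}

def complex_entails_all(complex_sent: str, simple_sents: list[str]) -> bool:
    """Return True iff entailment beats neutral and contradiction for every S_i."""
    for simple in simple_sents:
        inputs = tokenizer(complex_sent, simple, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
        p_entail = probs[label2id["entailment"]]
        p_other = max(probs[label2id["neutral"]], probs[label2id["contradiction"]])
        if p_entail <= p_other:
            return False
    return True
```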
b. Sentence-Order Reversing: To counteract copying and encourage genuine split learning, the gold reference order of the simple sentences is reversed from (S1, S2) to (S2, S1) after NLI filtering. Sentence boundaries are detected using PySBD prior to tokenization; a minimal sketch follows.
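The sketch below applies the reversal to the target side of a training instance, using PySBD for sentence boundary detection as described above; the function name is illustrative.

```python
# Sketch of the sentence-order reversal step for the training target.
import pysbd

segmenter = pysbd.Segmenter(language="en", clean=False)

def reverse_simple_sentences(target_text: str) -> str:
    """Segment the target into simple sentences and reverse their order."""
    sentences = segmenter.segment(target_text)
    return " ".join(s.strip() for s in reversed(sentences))

# e.g. a gold target "S1. S2." becomes "S2. S1." before tokenization
```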
The resulting WikiSplit++ corpus contains 630,433 high-quality split instances, having removed ≈36.6% of originals due to failed entailment (Tsukagoshi et al., 2024).
5. Empirical Evaluation and Impact
Training sequence-to-sequence models (e.g., T5-small) on the WikiSplit++ corpus yields marked improvements in semantic faithfulness and reduction of hallucination. Key metrics from HSplit benchmark evaluation demonstrate:
| Metric | WikiSplit | WikiSplit++ | Δ |
|---|---|---|---|
| BLEU | 87.95 | 88.06 | +0.11 |
| BERTScore | 96.65 | 96.57 | –0.08 |
| BLEURT | 82.06 | 81.71 | –0.35 |
| SARI | 57.17 | 56.79 | –0.38 |
| Entailment ratio | 95.49 | 98.02 | +2.53 |
| FKGL | 8.63 | 8.59 | –0.04 |
| # Sentences | 1.98 | 2.00 | +0.02 |
| Copy (%) | 2.48 | 0.72 | –1.76 |
The most pronounced gain is in the entailment ratio (+2.5 pp), directly tied to reduced hallucination. Copying of the input during inference drops sharply. Standard surface metrics (BLEU, SARI) remain largely unaffected, demonstrating that n-gram overlap is a poor proxy for semantic faithfulness (Tsukagoshi et al., 2024).
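For concreteness, the following is a small sketch of how the two output-level diagnostics in the table (# Sentences and Copy %) can be computed over model predictions, assuming "copy" means an exact string match between input and output, which may differ slightly from the paper's operational definition.

```python
# Sketch: average sentence count and copy rate over model predictions.
# "Copy" is approximated here as an exact string match (assumption).
import pysbd

segmenter = pysbd.Segmenter(language="en", clean=False)

def split_diagnostics(sources: list[str], predictions: list[str]) -> dict:
    n_sentences = [len(segmenter.segment(pred)) for pred in predictions]
    n_copies = sum(src.strip() == pred.strip() for src, pred in zip(sources, predictions))
    return {
        "avg_sentences": sum(n_sentences) / len(predictions),
        "copy_pct": 100.0 * n_copies / len(predictions),
    }
```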
Comparative analyses confirm WikiSplit++-trained models outperform rule-based and zero-/few-shot GPT-3 baselines on split count and entailment. This suggests that data refinement, independent of model architecture, is a key determinant of downstream Split-and-Rephrase quality.
6. Recommendations and Usage Guidelines
Researchers are advised to:
- Employ NLI-based filtering when assembling Wikipedia-mined split-and-rephrase corpora to eliminate non-entailed splits.
- Randomize or systematically permute the order of target simple sentences in training to discourage trivial copying.
- Treat WikiSplit++ as a preferred source for high-quality, large-scale supervision in Split-and-Rephrase, reserving synthetic datasets like WebSplit for controlled evaluation.
The extraction pipeline is language-agnostic and can, in principle, be exported to other language versions of Wikipedia, subject to domain bias considerations (Botha et al., 2018). This suggests further expansion and adaptation opportunities for the split-and-rephrase paradigm.
7. Limitations and Future Directions
WikiSplit and WikiSplit++ both exhibit domain bias favoring Wikipedia topics (biographies, places), and the original WikiSplit is confined to binary splitting (two simple sentences per complex input). Despite refinement, residual errors may persist due to classifier limitations and dependency on automatic segmenters.
A plausible implication is that faithfulness-critical NLP pipelines will benefit from integrating robust, semantics-aware losses or supplementary manual curation. As standard overlap metrics fail to capture semantic correctness, entailment-based scoring should be prioritized for future benchmarking and model selection.
WikiSplit remains the canonical large-scale benchmark for split-and-rephrase tasks and constitutes a foundational corpus for empirically-driven research in syntactic simplification and sentence decomposition (Botha et al., 2018, Tsukagoshi et al., 2024).