WikiSplit Dataset: Sentence Rewriting Corpus

Updated 8 January 2026
  • WikiSplit is a large-scale, naturally occurring split-and-rephrase corpus derived from Wikipedia edit histories that provides over one million sentence split instances.
  • The dataset is constructed using heuristic extraction with trigram overlap and BLEU filtering, ensuring naturalness while inherently containing label noise.
  • Refinements in WikiSplit++—using NLI filtering and sentence-order reversal—yield improved semantic entailment and reduced copying in neural sequence-to-sequence models.

WikiSplit is a large-scale, naturally occurring split-and-rephrase corpus constructed from Wikipedia edit histories. It provides over one million sentence-split instances for training and evaluating neural sequence-to-sequence models on the task of decomposing a complex sentence into a set of simpler sentences with equivalent semantics. WikiSplit offers substantial gains in corpus size and lexical coverage over prior datasets such as WebSplit, and its extraction protocol favors diversity and naturalness by mining actual Wikipedia editorial revisions rather than synthetic templates. However, because the corpus is mined automatically, it inherently contains label noise and semantic mismatch, motivating subsequent refinement efforts such as WikiSplit++ that aim to improve faithfulness and utility.

1. Data Construction and Extraction Protocol

WikiSplit is extracted automatically from Wikipedia edit histories, leveraging revision trails that capture changes from one snapshot t to the next, t+1. The extraction pipeline segments both revisions into sentences and identifies cases where an editor replaces a single complex sentence C with two or more simple sentences S_1, …, S_n. Heuristic matching employs trigram-overlap criteria to ensure contiguous coverage:

  • The first three trigrams of C must appear in S_1, and the last three in S_n.
  • The trigram suffix of S_1 must differ from that of S_n to avoid trivial self-joins.
  • A minimum BLEU threshold of δ = 0.2 ensures surface similarity between C and the candidate split.

Filtering removes instances with excessive token repetition, anomalous token length, and profanity/vandalism. For each complex sentence, only the split pair with the highest summed BLEU is retained. All extracted split instances in WikiSplit are binary (n = 2), meaning each complex input is split into exactly two simple sentences (Botha et al., 2018).
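The heuristic checks above can be sketched as follows. This is a simplified illustration, not the released pipeline: tokenization is plain whitespace, and a uniform-precision BLEU with brevity penalty stands in for the exact metric; all function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(ref, hyp, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty (a stand-in for the exact metric)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_counts, hyp_counts = Counter(ngrams(ref, n)), Counter(ngrams(hyp, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        log_precisions.append(
            math.log(max(overlap, 1e-9) / max(1, sum(hyp_counts.values()))))
    brevity = math.exp(min(0.0, 1.0 - len(ref) / len(hyp)))
    return brevity * math.exp(sum(log_precisions) / max_n)

def passes_heuristics(complex_sent, s1, s2, delta=0.2):
    """Apply WikiSplit-style trigram and BLEU filters to a (C, S1, S2) triple."""
    c, t1, t2 = (s.split() for s in (complex_sent, s1, s2))
    c_tri, s1_tri, s2_tri = ngrams(c, 3), ngrams(t1, 3), ngrams(t2, 3)
    if not all(t in s1_tri for t in c_tri[:3]):   # C's first 3 trigrams in S1
        return False
    if not all(t in s2_tri for t in c_tri[-3:]):  # C's last 3 trigrams in S2
        return False
    if s1_tri[-1] == s2_tri[-1]:                  # suffix trigrams must differ
        return False
    return simple_bleu(c, t1 + t2) >= delta       # surface-similarity floor

C = "John was born in Paris and he later moved to London ."
S1 = "John was born in Paris ."
S2 = "He later moved to London ."
print(passes_heuristics(C, S1, S2))  # True for this toy example
```

In the real pipeline these checks run over every adjacent revision pair in the edit history; the toy triple above passes all three structural checks and clears the δ = 0.2 similarity floor.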

2. Corpus Statistics and Dataset Organization

WikiSplit achieves an order-of-magnitude scale expansion over previous benchmarks. Its statistics are as follows:

| Partition | # Instances |
|-----------|-------------|
| train     | 795,585     |
| dev       | 99,448      |
| test      | 99,448      |
| overall   | 994,481     |

Token-level statistics indicate that the mean length of C is approximately 33.1 tokens, while simple sentences average 11.0 tokens each. WikiSplit contains roughly 60× more examples and 90× more distinct lexical items than the synthetic WebSplit dataset, and its examples display naturalistic phrasing and the topical diversity of Wikipedia content.

3. Noisy Labels and Dataset Limitations

Despite rigorous surface-form heuristics, WikiSplit contains significant label noise:

  • Hallucination (extrinsic): Some simple sentences S_i introduce unverifiable facts or paraphrases not entailed by C, including erroneous dates, extraneous modifiers, or additional clauses.
  • Under-splitting: Models trained on WikiSplit may learn to copy the input C verbatim, since copying keeps the loss low without performing any substantive split.
  • Unsupported/Missing Facts: Analysis confirms that approximately 32–40% of instances do not satisfy C ⊨ S_i for every i, meaning semantic entailment is not guaranteed (Tsukagoshi et al., 2024).

This results in model training artifacts and limits deployment in faithfulness-critical tasks.

4. Data Refinement via WikiSplit++

WikiSplit++ introduces a two-step refinement of the original corpus:

a. NLI Filtering. Each (C, S_i) pair is passed through a DeBERTa-v2 XXL natural language inference (NLI) classifier fine-tuned on MNLI. An instance is retained iff, for all i:

P_ent > max(P_neu, P_con)

where P_ent, P_neu, and P_con are the predicted entailment, neutral, and contradiction probabilities, respectively.
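The retention rule can be expressed in a few lines. The stub probability dictionaries below merely stand in for the NLI classifier's output; the data layout and function names are assumptions for illustration.

```python
def entailed(probs):
    """Keep a (C, S_i) pair iff the entailment probability
    dominates both neutral and contradiction."""
    return probs["ent"] > max(probs["neu"], probs["con"])

def keep_instance(pair_probs):
    """An instance survives only if every (C, S_i) pair is entailed."""
    return all(entailed(p) for p in pair_probs)

# Stub predictions standing in for the NLI classifier's softmax output.
good = [{"ent": 0.91, "neu": 0.06, "con": 0.03},
        {"ent": 0.85, "neu": 0.10, "con": 0.05}]
bad  = [{"ent": 0.91, "neu": 0.06, "con": 0.03},
        {"ent": 0.30, "neu": 0.55, "con": 0.15}]
print(keep_instance(good), keep_instance(bad))  # True False
```

Note the conjunction over i: a single non-entailed simple sentence is enough to discard the whole instance, which is why the filter removes over a third of the corpus.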

b. Sentence-Order Reversing. To counteract copying and encourage genuine split learning, the gold-reference order of the simple sentences (S_1, …, S_n) is reversed to (S_n, …, S_1) after NLI filtering. Sentence boundaries are detected with PySBD prior to tokenization.
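The reversal step itself is trivial once sentence boundaries are known. The sketch below uses a naive regex segmenter purely as a stand-in for PySBD:

```python
import re

def naive_sentence_split(text):
    """Toy sentence segmenter (the actual pipeline uses PySBD;
    this regex split on terminal punctuation is only a stand-in)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def reversed_target(simple_side):
    """Reverse the order of the gold simple sentences, as in WikiSplit++."""
    return " ".join(reversed(naive_sentence_split(simple_side)))

tgt = "John was born in Paris. He later moved to London."
print(reversed_target(tgt))  # "He later moved to London. John was born in Paris."
```

Because a verbatim copy of the input can no longer match the reversed reference, the decoder is pushed to actually produce two distinct sentences.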

The resulting WikiSplit++ corpus contains 630,433 high-quality split instances, having removed ≈36.6% of originals due to failed entailment (Tsukagoshi et al., 2024).

5. Empirical Evaluation and Impact

Training sequence-to-sequence models (e.g., T5-small) on the WikiSplit++ corpus yields marked improvements in semantic faithfulness and reduction of hallucination. Key metrics from HSplit benchmark evaluation demonstrate:

| Metric           | WikiSplit | WikiSplit++ | Δ     |
|------------------|-----------|-------------|-------|
| BLEU             | 87.95     | 88.06       | +0.11 |
| BERTScore        | 96.65     | 96.57       | −0.08 |
| BLEURT           | 82.06     | 81.71       | −0.35 |
| SARI             | 57.17     | 56.79       | −0.38 |
| Entailment ratio | 95.49     | 98.02       | +2.53 |
| FKGL             | 8.63      | 8.59        | −0.04 |
| # Sentences      | 1.98      | 2.00        | +0.02 |
| Copy (%)         | 2.48      | 0.72        | −1.76 |

The most pronounced gain is in the entailment ratio (+2.5 pp), directly tied to reduced hallucination. Copying of the input during inference drops sharply. Standard surface metrics (BLEU, SARI) remain largely unaffected, demonstrating that n-gram overlap is a poor proxy for semantic faithfulness (Tsukagoshi et al., 2024).
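The copy metric in the table is straightforward to reproduce. A minimal sketch, assuming the metric counts outputs that are verbatim copies of the input modulo whitespace (the exact normalization used in the evaluation is an assumption here):

```python
def copy_rate(inputs, outputs):
    """Percentage of model outputs that are verbatim copies of the
    input, after collapsing runs of whitespace."""
    copies = sum(1 for c, y in zip(inputs, outputs)
                 if " ".join(c.split()) == " ".join(y.split()))
    return 100.0 * copies / len(inputs)

inputs  = ["A and B.", "C because D.", "E but F."]
outputs = ["A and B.", "C. This is because D.", "E. But F."]
print(copy_rate(inputs, outputs))  # one of three outputs is a verbatim copy
```

Paired with an off-the-shelf NLI model for the entailment ratio, a check like this gives a cheap faithfulness-oriented evaluation that surface metrics such as BLEU do not provide.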

Comparative analyses confirm WikiSplit++-trained models outperform rule-based and zero-/few-shot GPT-3 baselines on split count and entailment. This suggests that data refinement, independent of model architecture, is a key determinant of downstream Split-and-Rephrase quality.

6. Recommendations and Usage Guidelines

Researchers are advised to:

  • Employ NLI-based filtering when assembling Wikipedia-mined split-and-rephrase corpora to eliminate non-entailed splits.
  • Randomize or systematically permute the order of target simple sentences in training to discourage trivial copying.
  • Treat WikiSplit++ as a preferred source for high-quality, large-scale supervision in Split-and-Rephrase, reserving synthetic datasets like WebSplit for controlled evaluation.

The extraction pipeline is language-agnostic and can, in principle, be ported to other language editions of Wikipedia, subject to domain-bias considerations (Botha et al., 2018). This suggests further expansion and adaptation opportunities for the split-and-rephrase paradigm.

7. Limitations and Future Directions

WikiSplit and WikiSplit++ both exhibit domain bias favoring Wikipedia topics (biographies, places), and the original WikiSplit is confined to binary splitting (two simple sentences per complex input). Despite refinement, residual errors may persist due to classifier limitations and dependency on automatic segmenters.

A plausible implication is that robust, semantics-aware training losses or supplementary manual curation would be valuable for deployment in faithfulness-critical NLP pipelines. Since standard overlap metrics fail to capture semantic correctness, entailment-based scoring should be prioritized for future benchmarking and model selection.

WikiSplit remains the canonical large-scale benchmark for split-and-rephrase tasks and constitutes a foundational corpus for empirically-driven research in syntactic simplification and sentence decomposition (Botha et al., 2018, Tsukagoshi et al., 2024).
