
FinerWeb-10BT Corpus: High-Quality LLM Data

Updated 15 February 2026
  • FinerWeb-10BT is a 10 billion-token English web corpus curated using advanced LLM-driven line-level filtering techniques.
  • It employs methods like deduplication, fastText language filtering, and GPT-4o mini annotations to remove low-utility and noisy fragments.
  • Empirical results show that using FinerWeb-10BT leads to up to 32% faster convergence and improved accuracy in downstream LLM training.

FinerWeb-10BT is a 10 billion-token English-language web dataset derived from FineWeb that uses LLM-driven line-level filtering to excise low-utility and noisy textual fragments. Designed to improve both data efficiency and downstream model quality, it refines and annotates a subset of FineWeb, itself a 15 trillion-token web-crawl corpus curated through advanced crawling, deduplication, and heuristic filtering (Henriksson et al., 13 Jan 2025; Penedo et al., 2024).

1. Origin and Underlying Data

FinerWeb-10BT is constructed as a random 10 billion-token subsample of the FineWeb corpus. FineWeb, detailed by Penedo et al., is compiled from 96 Common Crawl snapshots (2013–2024), with text extracted from WARC files using trafilatura v1.x and subjected to rigorous preprocessing. Core steps include:

  • Blocklisting of adult/pornographic domains via the UT1 “Capitole” blacklist
  • English-language filtering using fastText (score ≥ 0.65)
  • MassiveText-style heuristics targeting text quality and repetition
  • MinHash deduplication at the document level on 5-gram English tokens (H=112, B=14, h=8)
  • C4-inspired and custom heuristics, such as discarding documents with lines shorter than 30 characters or excessive duplicated text
  • PII anonymization for email addresses and IPs
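The C4-inspired line-length heuristic above can be sketched in a few lines of Python. This is an illustrative simplification, not FineWeb's actual implementation; the function name and interface are ours, and only the 30-character threshold comes from the text:

```python
def heuristic_line_filter(document: str, min_line_chars: int = 30) -> str:
    """Drop lines shorter than min_line_chars, a C4-style heuristic.

    Sketch only: FineWeb's real pipeline combines rules like this with
    MassiveText-style repetition checks and document-level filters.
    """
    kept = [line for line in document.splitlines()
            if len(line.strip()) >= min_line_chars]
    return "\n".join(kept)
```

A coarse rule like this is exactly what Section 2's LLM-based annotation is meant to improve on: it discards short but valid lines (headings, list items) while keeping long boilerplate.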

The resulting FineWeb dataset, after deduplication and filtering, totals 15 trillion GPT-2 tokens; the FineWeb-10BT subset is a uniform random sample of entire documents (≈15 million), preserving the full corpus's proportions of content, domain, and snapshot coverage (Penedo et al., 2024).

2. LLM-Based Line-Level Quality Annotation

FinerWeb-10BT introduces a departure from conventional document- or line-level heuristics by employing a small LLM, GPT-4o mini, for granular line-level quality annotation:

  • A 20,000-document sample (328,472 lines) was segmented into batches of up to 15 consecutive lines; lines exceeding 200 characters were further split at sentence boundaries.
  • Each line was labeled by GPT-4o mini as either “Clean” or by generating a descriptive low-quality tag (e.g., “HTML tag”, “copyright notice”, “phone number”, “programming code”).
  • This dynamic, evolving label list grew as new artifacts were detected, with periodic shuffling to prevent categorization bias.
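The segmentation step above (15-line batches, 200-character splits) can be sketched as follows. The 15 and 200 limits are from the text; the naive regex sentence splitter and the function names are our illustration, not the authors' code:

```python
import re

def split_long_line(line: str, max_chars: int = 200) -> list[str]:
    """Split an over-long line at sentence boundaries (naive regex)."""
    if len(line) <= max_chars:
        return [line]
    sentences = re.split(r"(?<=[.!?])\s+", line)
    parts, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            parts.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        parts.append(current)
    return parts

def batch_lines(lines: list[str], batch_size: int = 15) -> list[list[str]]:
    """Group consecutive line units into batches for a single LLM prompt."""
    units = [p for line in lines for p in split_long_line(line)]
    return [units[i:i + batch_size] for i in range(0, len(units), batch_size)]
```

Each batch would then be sent to GPT-4o mini with the current label list, keeping per-request context small while preserving local line order.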

After annotation, 547 distinct low-quality tags were identified, with 83% (274,343) of lines marked “Clean.” Rare or misassigned tags were further audited and, where appropriate, collapsed back into “Clean” or grouped under higher-level categories following iterative manual correction and OpenAI o1-preview clustering. Human validation over a 50-document sample (726 lines) yielded an average Cohen’s κ of 0.70 across classes (0.73 binary Clean/Non-Clean), reflecting moderate-to-strong annotator agreement (Henriksson et al., 13 Jan 2025).
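The inter-annotator agreement statistic used here, Cohen's κ, corrects raw agreement for chance. A minimal implementation for two label sequences (illustrative; the labels below are made up, not the study's data):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two annotators beyond chance.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected from each annotator's label
    frequencies alone.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c]
              for c in set(labels_a) | set(labels_b)) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

Values of 0.70 (multi-class) and 0.73 (binary), as reported, are conventionally read as substantial agreement.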

3. Quality Taxonomy and Category Distribution

Consolidation of the rich label set yielded a nine-way quality taxonomy:

Category                                Lines     % of Sample
Clean                                   283,267   86.24%
Formatting / Style Errors                13,150    4.00%
Bibliographical / Citation References     8,768    2.67%
Promotional / Spam Content                7,339    2.23%
Contact / Identification Information      3,898    1.19%
Navigation / Interface Elements           3,327    1.01%
Technical Specifications / Metadata       3,298    1.00%
Legal / Administrative Content            2,992    0.91%
Offensive / Inappropriate Content         2,433    0.74%

Example mappings include copyright notices and bank disclaimers as “Legal/Administrative,” press releases as “Promotional/Spam,” and navigation bars as “Navigation/Interface.” This taxonomy underpins the fine-grained filtering strategy central to FinerWeb-10BT’s quality improvements (Henriksson et al., 13 Jan 2025).

4. Scaling Line-Level Filtering Across the Corpus

To extend fine-grained annotation from the ≈0.3M-line sample to the entire 10B-token corpus, a DeBERTa-v3-base classifier was trained and deployed:

  • Model architecture: DeBERTa-v3-base with a linear classification head; each line treated independently.
  • Objective: Cross-entropy loss with label smoothing (ε = 0.1).
  • Micro F1 on held-out test: 0.81; Clean class F1: 0.90 (Precision 0.88, Recall 0.91).
  • Comparative models (DeBERTa-v3-large, Stella-en-400M-v5, XLM-RoBERTa-base) yielded similar performance.

Confusions were skewed towards false positives for “Clean,” minimizing the risk of discarding high-quality content. To enable scalable and tunable filtering, predicted class probabilities p_k were collapsed to a calibrated Clean score P_cal(y = Clean | x) using Platt scaling:

P_cal(y = Clean | x) = [1 + exp(A · logit_clean + B)]^(-1)

where parameters A and B were learned on held-out classifier outputs.
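The calibration formula is a one-liner to apply. A minimal sketch, with purely illustrative values of A and B (in practice they are fit on held-out logits, and A is typically negative so that higher Clean logits map to higher calibrated scores):

```python
import math

def platt_clean_score(logit_clean: float, A: float, B: float) -> float:
    """Calibrated probability that a line is Clean, via Platt scaling:
    P_cal = 1 / (1 + exp(A * logit_clean + B)).
    """
    return 1.0 / (1.0 + math.exp(A * logit_clean + B))
```

With A = -1 and B = 0 this reduces to the ordinary sigmoid of the Clean logit; fitting A and B reshapes that curve to match empirical accuracy on held-out data.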

Thresholds were set at 0.50 (removing 8% of lines, yielding ≈9.2B tokens) and 0.90 (removing 25%, ≈7.5B tokens), enabling flexible trade-offs between dataset size and stringency. Each document in the filtered data is annotated with a per-line quality_score list (Henriksson et al., 13 Jan 2025).
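Given the per-line quality_score list described above, applying a threshold is straightforward. The function name and interface here are our illustration of that mechanism, not the released tooling:

```python
def filter_document(lines: list[str], quality_scores: list[float],
                    threshold: float = 0.50) -> str:
    """Keep only lines whose calibrated Clean score meets the threshold.

    Sketch of per-line filtering with a tunable cutoff (e.g. 0.50 or
    0.90, as in the two released FinerWeb-10BT variants).
    """
    kept = [line for line, score in zip(lines, quality_scores)
            if score >= threshold]
    return "\n".join(kept)
```

Because the scores ship with the corpus, users can re-filter at any threshold without rerunning the classifier.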

5. Downstream LLM Training and Evaluation

The impact of LLM-driven filtering was empirically validated through downstream GPT-2 (124M) pretraining and evaluation on HellaSwag:

  • Models were trained from scratch on: (1) original FineWeb-10BT, (2) FinerWeb-10BT-0.50, and (3) FinerWeb-10BT-0.90.
  • All runs used 4×A100 GPUs, 18,994 steps (single epoch), and were repeated five times to reduce variance.
  • Filtered datasets enabled models to reach the original’s HellaSwag peak accuracy approximately 6,000 steps (32%) sooner, saving ≈1 hour 45 minutes of wall-clock time per run.
  • Both filtered sets outperformed unfiltered data by ≈0.10 points in final HellaSwag accuracy.
  • The more aggressive 0.90 threshold provided a slight efficiency and accuracy gain, indicating potential further improvements at even higher selectivity (Henriksson et al., 13 Jan 2025).

6. Methodological Significance and Practical Implications

FinerWeb-10BT operationalizes the paradigm of LLM-in-the-loop data curation, demonstrating several key advances:

  • Traditional document- and line-level heuristics (e.g., length thresholds) are coarse and risk excising valuable text or retaining noisier boilerplate.
  • LLM annotation generates hundreds of detailed “low quality” descriptors, which can be merged into a compact, interpretable taxonomy.
  • A compact encoder classifier can transfer these judgments efficiently to billions of lines, producing per-line quality scores suitable for calibration and thresholding.
  • By retaining only lines most likely to be genuinely natural, fluent English—as opposed to metadata, spam, or interface detritus—filtered data yields both “denser” corpora and faster, more efficient LLM training. Models not only generalize better but do so with reduced data, compute, and associated environmental cost.

The FinerWeb-10BT corpus, together with code and per-line quality metadata, supports open, reproducible research into scalable LLM pretraining workflows, particularly where data efficiency and reduced energy consumption are becoming pressing constraints (Henriksson et al., 13 Jan 2025).

7. Relationship to FineWeb and Other Datasets

FinerWeb-10BT inherits its foundational data and filtering pipeline from FineWeb. FineWeb itself implements more elaborate and transparent curation, deduplication, and heuristic filtering than prior open web-scale corpora, outperforming RefinedWeb, C4, The Pile, and others on LLM downstream tasks. The FinerWeb-10BT approach, by focusing on in-depth line-level semantics rather than purely structural or heuristic signals, advances the state of dataset refinement, complementing existing ablation-based and large-sample curation frameworks (Penedo et al., 2024).

A plausible implication is that as dataset construction approaches the saturation regime in both scale and surface-level filtering, sub-document LLM-based annotation and filtering may yield further non-trivial gains in data quality and training efficiency for general-domain LLMs. FinerWeb-10BT provides an empirically validated foundation for this claim.
