
Common Pile v0.1 Open Text Corpus

Updated 24 January 2026
  • Common Pile v0.1 is a large-scale, openly licensed English text corpus aggregated from 30 rigorously verified sources across 10 domains.
  • It is meticulously filtered, deduplicated, and processed into 1 trillion tokens to meet stringent quality, legal, and ethical standards.
  • The dataset enables robust LLM pretraining, with 7B models achieving competitive NLP and code generation benchmarks.

Common Pile v0.1 is an openly licensed, large-scale English-language text corpus developed to support the pretraining of LLMs while addressing the longstanding intellectual-property and ethical challenges of prior datasets. Comprising approximately 8 TB of filtered, deduplicated, high-quality text spanning 30 sources and 10 distinct domains, Common Pile v0.1 is explicitly constructed to include only public domain and openly licensed content as recognized under the Open Knowledge Definition. The dataset’s breadth covers scientific literature, code, government publications, books, encyclopedic content, technical documentation, online discussions, legal and parliamentary corpora, educational materials, and curated audio transcripts. The dataset is validated by training and releasing two 7B-parameter LLMs (Comma v0.1-1T and Comma v0.1-2T), which achieve competitive performance on standard NLP benchmarks relative to models trained on noncompliant or unverified data (Kandpal et al., 5 Jun 2025).

1. Composition and Source Selection

The Common Pile v0.1 corpus aggregates 30 rigorously verified sources mapped to 10 thematic domains: Scientific, Online Discussion, Government Data/Legal Texts, Book (Public Domain), Open Educational Resources, Wikis, Code, Audio Transcripts, Web Text (“CCCC”), and Curated Tasks. Each source was retained only if per-document licenses conform to the Open Knowledge Definition, such as CC BY, CC BY-SA, CC0, MIT, Apache 2.0, BSD, Open Parliament License, and unequivocal public domain releases. Ambiguous or unverified sources—including synthetic (LLM-generated) datasets and OpenAlex—were excluded (Kandpal et al., 5 Jun 2025).

The table below summarizes representative sources and statistics, drawn directly from the dataset’s report:

| Domain | Example Source | Documents (&#124;Dᵢ&#124;) | Size (Sᵢ, GB) | Tokens (Tᵢ, B) |
|------------------------|-------------------------------|------------|---------|-------|
| Scientific Texts | peS2o | 6,117,280 | 182.6 | 273.9 |
| | PubMed | 3,829,689 | 147.1 | 36.8 |
| | arXiv full, abstracts | 2,808,727 | 21.9 | 32.9 |
| Online Discussion | StackExchange | 30,987,814 | 89.7 | 134.6 |
| | GitHub Issues (FOSS) | 23,358,580 | 40.4 | 60.6 |
| | Ubuntu IRC | 234,982 | 5.3 | 7.9 |
| Government/Legal | USGPO | 2,148,548 | 36.1 | 2.3 |
| | USPTO patents | 17,030,231 | 661.1 | 41.3 |
| Book (Public Domain) | BHL, Pre-1929, LoC, Gutenberg | 15,420,717 | 210.2 | 21.0 |
| Open Ed Resources | DOAB, PressBooks, OERCommons | 503,696 | 13.2 | 24.4 |
| Wikis | Wikimedia, Wikiteam | 43,243,381 | 71.1 | 99.8 |
| Code | Stack V2, Python PEPs | 69,589,262 | 2,599.0 | 130.0 |
| Audio Transcripts | CC-BY YouTube (Whisper) | 998,104 | 18.6 | 4.7 |
| Web Text (“CCCC”) | CCCC, Foodista, News | 7,045,856 | 58.4 | 87.6 |
| Curated Tasks | DPI Supervised | 3,508,518 | 3.5 | 5.1 |

The aggregate dataset size after filtering is approximately 1,838 GB, corresponding to 1 trillion tokens for the published “Comma” LLM runs (Kandpal et al., 5 Jun 2025).

2. Licensing and Provenance Verification

Every document in Common Pile v0.1 carries explicit license metadata at the document level. The selection protocol mandates that each included source satisfy the Open Knowledge Definition, so that the corpus can be redistributed, remixed, and used for both research and downstream deployment within its licensing constraints. Mitigation of “license laundering” is performed via several mechanisms:

  • Direct acquisition from rights-holders or from authoritative dumps (e.g., arXiv’s S3, PubMed Central FTP).
  • Exclusion of any source if provenance, downstream license, or upstream rights could not be fully verified.
  • Manual audit of high-volume domains—for example, verifying rights for Common Crawl‐filtered content.
  • Preservation of license information throughout downstream pipelines.

This approach aims to address issues of IP infringement and ambiguous usage rights that have historically complicated the use of major LLM datasets (Kandpal et al., 5 Jun 2025).
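The per-document license gating described above can be sketched as an allow-list check. This is an illustrative sketch only: the field names (`license`, `provenance`) and license identifiers are hypothetical, not the dataset’s actual schema.

```python
# Hypothetical per-document license gate; field names and license strings
# are illustrative, not the dataset's actual metadata schema.
ALLOWED_LICENSES = {
    "CC-BY", "CC-BY-SA", "CC0", "MIT", "Apache-2.0", "BSD-3-Clause",
    "Open-Parliament-Licence", "Public-Domain",
}

def is_admissible(doc: dict) -> bool:
    """Keep a document only if its license is on the allow-list and its
    provenance field is populated (unverifiable sources are excluded)."""
    return (
        doc.get("license") in ALLOWED_LICENSES
        and bool(doc.get("provenance"))
    )

docs = [
    {"text": "...", "license": "CC-BY", "provenance": "arxiv-s3"},
    {"text": "...", "license": "unknown", "provenance": ""},
]
kept = [d for d in docs if is_admissible(d)]
```

Crucially, the check requires both conditions: an open license string alone is not enough if provenance cannot be verified, mirroring the exclusion rule above.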

3. Data Collection, Processing, and Quality Control

The data pipeline is modular and source-specific, with the following global structure:

a) Scraping and Parsing:

  • PDFs are parsed to structured XML with Grobid (e.g., for peS2o, DOAB); arXiv LaTeX sources are converted to HTML via LaTeXML.
  • XML/HTML converted to plaintext using Trafilatura, Pandoc, Marker, Resiliparse, or custom regular expressions.
  • Markdown processed with PyMarkdown.
  • Custom pipelines applied for code, books, and legal documents.

b) Language and Content Filtering:

  • Only English-language documents are retained (FastText language-ID probability > 0.5).
  • Document length thresholds are source-specific (e.g., word count ≥ 100).
  • DataComp-LM quality score q > 10⁻⁴ on web sources.
  • Unigram log-likelihood ℓ ≥ −20 for specific sources (e.g., peS2o, GT, LoC, USPTO).
  • Toxicity filtering via a FastText classifier (score < 0.1).
  • Regex-based redaction and normalization for personally identifiable information (PII).
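The thresholds above can be combined into a single document predicate. In this sketch, the scores are assumed to be precomputed upstream (FastText language ID, DataComp-LM quality, unigram log-likelihood, toxicity); the field names are hypothetical.

```python
# Illustrative filter combining the thresholds listed above; score fields
# are assumed precomputed upstream and their names are hypothetical.
def passes_filters(doc: dict, min_words: int = 100) -> bool:
    return (
        doc["p_english"] > 0.5                     # FastText language-ID probability
        and len(doc["text"].split()) >= min_words  # source-specific length floor
        and doc["quality"] > 1e-4                  # DataComp-LM quality score
        and doc["loglik"] >= -20                   # unigram log-likelihood
        and doc["toxicity"] < 0.1                  # FastText toxicity score
    )

doc = {"text": "word " * 150, "p_english": 0.97,
       "quality": 0.02, "loglik": -7.3, "toxicity": 0.01}
```

In the real pipeline the thresholds are source-specific, so a per-source configuration would replace the hard-coded defaults shown here.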

c) Boilerplate and Deduplication:

  • Source-specific regex to clean repetitive boilerplate (e.g., legal disclaimers, headers, footers).
  • Global fuzzy deduplication using a Bloom filter on n-gram (n = 20) overlaps exceeding 90%.
  • Explicit deduplication of near-duplicates and removal of documents that are synthetic or machine-generated where identified.
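The Bloom-filter deduplication step can be sketched as follows. This is a minimal stdlib illustration of the technique, not the pipeline’s actual implementation: the filter size, hash count, and streaming structure are all assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter (k hashes over a fixed bit array); the
    parameters here are illustrative, not the pipeline's settings."""
    def __init__(self, m: int = 1 << 20, k: int = 5):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def ngrams(text: str, n: int = 20):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))]

def is_fuzzy_duplicate(text: str, seen: BloomFilter, threshold: float = 0.9) -> bool:
    """Flag a document whose 20-gram overlap with previously seen text
    exceeds the threshold, then register its n-grams in the filter."""
    grams = ngrams(text)
    overlap = sum(g in seen for g in grams) / len(grams)
    for g in grams:
        seen.add(g)
    return overlap > threshold
```

Because Bloom filters admit false positives but never false negatives, a small fraction of unique documents may be over-flagged; at corpus scale this trade-off buys constant memory per shard.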

This process was designed to maximize coverage, maintain data integrity, and systematically exclude low-quality, toxic, or redundant content (Kandpal et al., 5 Jun 2025).

4. Tokenization and Model Preparation

Tokenization is implemented using a custom byte-pair encoding (BPE) scheme, trained on a 600 GB English-language sample drawn from the filtered Common Pile corpus:

  • Vocabulary size V = 64,000 tokens.
  • Implementation: Hugging Face “tokenizers” library with ByteLevel preprocessing; regex compatible with Llama 3.2 splitting rules.
  • No Unicode normalization beyond whitespace homogenization and code-point decomposition.

For LLM pretraining, documents are concatenated and split into chunks of length L = 4096 tokens, each bracketed by BOS/EOS tokens as appropriate for Transformer architectures (Kandpal et al., 5 Jun 2025).
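The concatenate-and-chunk step can be sketched as follows. This is a hedged illustration of sequence packing under the stated L = 4096, assuming EOS separates documents in the stream and BOS opens every chunk; the actual packing details may differ.

```python
# Sketch of pretraining sequence packing: documents are concatenated with
# EOS separators, then split into fixed-length chunks opened with BOS.
# Token IDs below are stand-ins for real tokenizer output.
BOS, EOS, L = 1, 2, 4096

def pack_documents(docs: list, seq_len: int = L) -> list:
    stream = []
    for token_ids in docs:
        stream.extend(token_ids)
        stream.append(EOS)  # document boundary marker
    body = seq_len - 1      # leave room for the BOS prepended to each chunk
    return [[BOS] + stream[i:i + body] for i in range(0, len(stream), body)]

chunks = pack_documents([[5] * 6000, [7] * 3000])
```

Packing across document boundaries keeps every training sequence exactly full except the last, which is the usual efficiency argument for this layout.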

5. Model Validation and Benchmark Results

To validate the dataset, two 7B-parameter decoder-only Transformer models, Comma v0.1-1T (1T tokens) and Comma v0.1-2T (2T tokens), are trained. The Comma family follows the Llama-7B reference architecture (32 layers, width 4096, 32 heads):

  • Optimization: AdamW, weight decay 0.2, dropout 0.1, staged learning rate schedule with cosine annealing and “cool-down” on a high-quality subset.
  • Training mixture weights w_i are set from per-source validation LM loss, yielding sampling probabilities p_i = w_i / Σ_j w_j.
  • No synthetic (LLM-generated) content is included.
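The normalization of mixture weights into sampling probabilities can be sketched directly. The inverse-validation-loss heuristic and the loss values below are illustrative assumptions; the paper sets its actual weights from per-source validation loss through its own tuning procedure.

```python
# Hedged sketch of mixture weighting: per-source weights (here an
# illustrative inverse-validation-loss heuristic, not the paper's actual
# procedure) are normalized into sampling probabilities p_i = w_i / sum_j w_j.
val_loss = {"scientific": 2.1, "code": 1.8, "wiki": 2.4}  # hypothetical values

weights = {src: 1.0 / loss for src, loss in val_loss.items()}
total = sum(weights.values())
probs = {src: w / total for src, w in weights.items()}
```

Whatever heuristic produces the raw weights, the normalization guarantees the probabilities sum to one, so the data loader can sample sources directly from them.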

Evaluation is performed on standard NLP and code benchmarks (ARC-Challenge, MMLU, HellaSwag, HumanEval, among others):

| Model | MMLU | ARC-C | HellaSwag | HumanEval |
|---------------|------|-------|-----------|-----------|
| Llama 1 7B | 34.8 | 44.5 | 76.2 | 19.9 |
| MPT-7B | 30.2 | 46.5 | 77.6 | 27.3 |
| StableLM-7B | 45.2 | 50.8 | 75.6 | 23.1 |
| Comma v0.1-1T | 42.4 | 52.8 | 62.6 | 36.5 |

Comma v0.1-1T outperforms compute-matched open LLMs on most knowledge-focused tasks and surpasses prior work on code generation (HumanEval) (Kandpal et al., 5 Jun 2025). Comma v0.1-2T remains competitive with other 2T-token 7B models (e.g., Llama 2 7B, OLMo Twin 2T).

6. Limitations, Ethics, and Best Practices

Common Pile v0.1 is subject to inherent limitations:

  • Domain and source imbalance: a large fraction of tokens is contributed by scientific, code, and Wiki-style text; certain genres may have comparatively reduced representation.
  • Noise, artifacts, and imperfect filtering are present due to the dataset’s scale; source-level cleaning attempts to ameliorate, but not eliminate, such issues.
  • Residual near-duplicates and possible minor PII leakage may remain, despite systematic deduplication and redaction.
  • No non-English content: the dataset is monolingual.
  • Downstream users are advised to utilize per-document license metadata, respect all attribution requirements, and avoid secondary redistribution in contravention of original license terms.

The curation methodology explicitly rejects synthetic data and ambiguous sources and recommends verification of licensing at origin, preservation of license metadata, and clear attribution (Kandpal et al., 5 Jun 2025).

7. Availability and Reproducibility

Both the raw and filtered Common Pile v0.1 datasets, along with the full preprocessing pipelines and data mixture specifications, are publicly available.

A user can reconstruct the full pipeline using the released scripts, ensuring the reproducibility of results and transparent processing at every stage. This supports robust LLM benchmarking and further research into the effects of fully open and license-compliant training data (Kandpal et al., 5 Jun 2025).
