Common Pile v0.1 Open Text Corpus
- Common Pile v0.1 is a large-scale, openly licensed English text corpus aggregated from 30 rigorously verified sources across 10 domains.
- It is meticulously filtered, deduplicated, and processed into 1 trillion tokens to meet stringent quality, legal, and ethical standards.
- The dataset enables robust LLM pretraining, with 7B models achieving competitive NLP and code generation benchmarks.
Common Pile v0.1 is an openly licensed, large-scale English-language text corpus developed to support the pretraining of LLMs while addressing longstanding intellectual property and ethical challenges inherent in prior datasets. Comprising approximately 8 TB of filtered, deduplicated, high-quality text spanning 30 sources and 10 distinct domains, Common Pile v0.1 is explicitly constructed to include only public domain and openly licensed content recognized under the Open Knowledge Definition. The dataset’s breadth covers scientific literature, code, government publications, books, encyclopedic content, technical documentation, online discussions, legal and parliamentary corpora, educational materials, and curated audio transcripts. Common Pile v0.1 is validated by training and releasing two 7B-parameter LLMs (Comma v0.1-1T and Comma v0.1-2T), which reach competitive performance benchmarks on standard NLP evaluations against models trained on noncompliant or unverified data (Kandpal et al., 5 Jun 2025).
1. Composition and Source Selection
The Common Pile v0.1 corpus aggregates 30 rigorously verified sources mapped to 10 thematic domains: Scientific, Online Discussion, Government Data/Legal Texts, Book (Public Domain), Open Educational Resources, Wikis, Code, Audio Transcripts, Web Text (“CCCC”), and Curated Tasks. Each source was retained only if per-document licenses conform to the Open Knowledge Definition, such as CC BY, CC BY-SA, CC0, MIT, Apache 2.0, BSD, Open Parliament License, and unequivocal public domain releases. Ambiguous or unverified sources—including synthetic (LLM-generated) datasets and OpenAlex—were excluded (Kandpal et al., 5 Jun 2025).
The table below summarizes representative sources and statistics, drawn directly from the dataset’s report:
| Domain | Example Source | Documents (|Dᵢ|) | Size (Sᵢ, GB) | Tokens (Tᵢ, B) |
|---|---|---|---|---|
| Scientific Texts | peS2o | 6,117,280 | 182.6 | 273.9 |
| | PubMed | 3,829,689 | 147.1 | 36.8 |
| | arXiv (full text, abstracts) | 2,808,727 | 21.9 | 32.9 |
| Online Discussion | StackExchange | 30,987,814 | 89.7 | 134.6 |
| | GitHub Issues (FOSS) | 23,358,580 | 40.4 | 60.6 |
| | Ubuntu IRC | 234,982 | 5.3 | 7.9 |
| Government/Legal | USGPO | 2,148,548 | 36.1 | 2.3 |
| | USPTO patents | 17,030,231 | 661.1 | 41.3 |
| Book (Public Domain) | BHL, Pre-1929, LoC, Gutenberg | 15,420,717 | 210.2 | 21.0 |
| Open Ed Resources | DOAB, PressBooks, OERCommons | 503,696 | 13.2 | 24.4 |
| Wikis | Wikimedia, Wikiteam | 43,243,381 | 71.1 | 99.8 |
| Code | Stack V2, Python PEPs | 69,589,262 | 2,599.0 | 130.0 |
| Audio Transcripts | CC-BY YouTube (Whisper) | 998,104 | 18.6 | 4.7 |
| Web Text (“CCCC”) | CCCC, Foodista, News | 7,045,856 | 58.4 | 87.6 |
| Curated Tasks | DPI Supervised | 3,508,518 | 3.5 | 5.1 |
The aggregate dataset size after filtering is approximately 1,838 GB, corresponding to 1 trillion tokens for the published “Comma” LLM runs (Kandpal et al., 5 Jun 2025).
2. Licensing and Legal Compliance
Every document in Common Pile v0.1 is furnished with explicit license metadata at the document level. The selection protocol mandates each included source to satisfy the Open Knowledge Definition, such that the corpus can be redistributed, remixed, and used for both research and downstream deployment within the licensing constraints. Mitigation of “license laundering” is performed via several mechanisms:
- Direct acquisition from rights-holders or from authoritative dumps (e.g., arXiv’s S3, PubMed Central FTP).
- Exclusion of any source if provenance, downstream license, or upstream rights could not be fully verified.
- Manual audit of high-volume domains—for example, verifying rights for Common Crawl‐filtered content.
- Preservation of license information throughout downstream pipelines.
This approach aims to address issues of IP infringement and ambiguous usage rights that have historically complicated the use of major LLM datasets (Kandpal et al., 5 Jun 2025).
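The document-level license gating described above can be sketched as a simple allowlist check. This is an illustrative sketch, not the project's actual code: the `license` field name and the exact license identifiers are assumptions standing in for the corpus's real metadata schema.

```python
# Minimal sketch of per-document license gating, assuming each record
# carries a "license" metadata field; identifiers mirror the licenses
# named in the text but are illustrative, not the pipeline's own strings.
ALLOWED_LICENSES = {
    "CC0", "CC-BY", "CC-BY-SA", "MIT", "Apache-2.0", "BSD-3-Clause",
    "Open-Parliament-Licence", "Public-Domain",
}

def is_openly_licensed(doc: dict) -> bool:
    """Keep a document only if its license metadata is on the allowlist."""
    return doc.get("license") in ALLOWED_LICENSES

corpus = [
    {"text": "...", "license": "CC-BY"},
    {"text": "...", "license": "Proprietary"},   # dropped: not verifiable
    {"text": "...", "license": "CC0"},
]
kept = [d for d in corpus if is_openly_licensed(d)]
```

Unknown or ambiguous license strings fail the check by default, which matches the corpus's policy of excluding anything that cannot be fully verified.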
3. Data Collection, Processing, and Quality Control
The data pipeline is modular and source-specific, with the following global structure:
a) Scraping and Parsing:
- PDFs are converted to structured XML using Grobid (e.g., for peS2o, DOAB); arXiv LaTeX sources are converted to HTML via LaTeXML.
- XML/HTML is converted to plaintext using Trafilatura, Pandoc, Marker, Resiliparse, or custom regular expressions.
- Markdown processed with PyMarkdown.
- Custom pipelines applied for code, books, and legal documents.
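The HTML-to-plaintext step can be illustrated with Python's standard-library `html.parser` as a minimal stand-in for the tools named above (Trafilatura, Resiliparse, etc.); the real extractors are far more sophisticated about boilerplate and document structure.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = "<html><body><h1>Title</h1><script>var x=1;</script><p>Body text.</p></body></html>"
print(html_to_text(page))  # Title Body text.
```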
b) Language and Content Filtering:
- Only English-language documents are retained, using a FastText language-identification probability threshold.
- Minimum document-length thresholds are applied on a per-source basis.
- A DataComp-LM quality score filters web sources.
- Unigram log-likelihood filtering is applied to specific sources (e.g., peS2o, GT, LoC, USPTO).
- Toxicity filtering is performed via FastText classifiers.
- Regex-based redaction and normalization handle personally identifiable information (PII).
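The filtering bullets above can be combined into a single gate per document. The sketch below is illustrative only: the classifier scores are passed in as plain floats standing in for FastText outputs, and the thresholds (0.5 cutoffs, 50-word minimum) are placeholder assumptions, not the published values.

```python
import re

# One of several PII patterns; real pipelines cover phone numbers, etc.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MIN_WORDS = 50  # placeholder; the real cutoffs are source-specific

def redact_pii(text: str) -> str:
    """Regex-based redaction of email addresses."""
    return EMAIL_RE.sub("<EMAIL>", text)

def passes_filters(text: str, english_prob: float, toxicity_score: float,
                   p_en: float = 0.5, tox_max: float = 0.5) -> bool:
    """Combine language-ID, length, and toxicity checks.
    english_prob / toxicity_score stand in for FastText classifier
    outputs; the 0.5 thresholds are assumptions."""
    return (english_prob >= p_en
            and len(text.split()) >= MIN_WORDS
            and toxicity_score <= tox_max)

doc = "Contact me at alice@example.com " + "word " * 60
clean = redact_pii(doc)
print(passes_filters(clean, english_prob=0.98, toxicity_score=0.01))
```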
c) Boilerplate and Deduplication:
- Source-specific regex to clean repetitive boilerplate (e.g., legal disclaimers, headers, footers).
- Global fuzzy deduplication using a Bloom filter, removing documents whose n-gram overlap with previously seen text exceeds 90%.
- Explicit deduplication of near-duplicates and removal of documents that are synthetic or machine-generated where identified.
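The Bloom-filter deduplication step can be sketched as follows. This is a toy-scale illustration under stated assumptions: the filter here is tiny, n = 5 is an arbitrary choice (the report's exact n is not restated above), and only the 90% overlap threshold comes from the text.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; production dedup uses far larger bit arrays."""
    def __init__(self, size=1 << 20, hashes=4):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item: str):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def ngrams(text: str, n=5):
    toks = text.split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def is_near_duplicate(text: str, seen: BloomFilter, threshold=0.9) -> bool:
    """Flag a document whose n-gram overlap with already-seen text
    exceeds the threshold, then record its n-grams for later checks."""
    grams = ngrams(text)
    if not grams:
        return False
    overlap = sum(g in seen for g in grams) / len(grams)
    for g in grams:
        seen.add(g)
    return overlap > threshold
```

Because membership checks are probabilistic, Bloom filters trade a small false-positive rate for constant memory, which is what makes corpus-scale fuzzy deduplication tractable in a single pass.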
This process was designed to maximize coverage, maintain data integrity, and systematically exclude low-quality, toxic, or redundant content (Kandpal et al., 5 Jun 2025).
4. Tokenization and Model Preparation
Tokenization is implemented using a custom byte-pair encoding (BPE) scheme, trained on a 600 GB English-language sample drawn from the filtered Common Pile corpus:
- Fixed BPE vocabulary size (value specified in the dataset report).
- Implementation: Hugging Face “tokenizers” library with ByteLevel preprocessing; regex compatible with Llama 3.2 splitting rules.
- No Unicode normalization beyond whitespace homogenization and code-point decomposition.
For LLM pretraining, documents are concatenated and split into fixed-length token chunks, each bracketed by BOS/EOS tokens as appropriate for Transformer architectures (Kandpal et al., 5 Jun 2025).
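The concatenate-and-chunk step can be sketched as below. The BOS/EOS token IDs and the sequence length are illustrative placeholders; the report's exact sequence length is not restated above.

```python
BOS, EOS = 1, 2  # illustrative special-token IDs

def pack_documents(docs: list[list[int]], seq_len: int) -> list[list[int]]:
    """Concatenate tokenized documents, bracketing each with BOS/EOS,
    then split the resulting stream into fixed-length training chunks."""
    stream = []
    for tokens in docs:
        stream.append(BOS)
        stream.extend(tokens)
        stream.append(EOS)
    # Emit only full-length chunks; a trailing remainder is dropped here.
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]

chunks = pack_documents([[10, 11, 12], [20, 21]], seq_len=4)
print(chunks)  # [[1, 10, 11, 12], [2, 1, 20, 21]]
```

Packing documents back-to-back this way avoids padding waste, at the cost of chunks that may begin or end mid-document.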
5. Model Validation and Benchmark Results
To validate the dataset, two 7B-parameter decoder-only Transformer models, Comma v0.1-1T (1T tokens) and Comma v0.1-2T (2T tokens), are trained. The Comma family follows the Llama-7B reference architecture (32 layers, width 4096, 32 heads):
- Optimization: AdamW, weight decay 0.2, dropout 0.1, staged learning rate schedule with cosine annealing and “cool-down” on a high-quality subset.
- Training mixture weights are set from per-source validation LM loss, with per-source sampling probabilities constrained so that no single source dominates the mixture.
- No synthetic (LLM-generated) content is included.
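One way to turn per-source validation losses into capped sampling probabilities is sketched below. The inverse-loss scoring and the 0.5 cap are assumptions for illustration, not the report's actual recipe; only the idea of deriving weights from validation loss under a constraint comes from the text.

```python
def mixture_weights(val_losses: dict[str, float],
                    cap: float = 0.5) -> dict[str, float]:
    """Heuristic sketch: upweight sources with lower validation LM loss,
    then cap and renormalize so no source dominates the mixture."""
    scores = {s: 1.0 / loss for s, loss in val_losses.items()}
    total = sum(scores.values())
    probs = {s: v / total for s, v in scores.items()}
    # Iteratively clip at the cap and redistribute the excess mass
    # proportionally among the uncapped sources.
    for _ in range(len(probs)):
        excess = sum(max(0.0, p - cap) for p in probs.values())
        if excess < 1e-12:
            break
        free = {s: p for s, p in probs.items() if p < cap - 1e-12}
        if not free:
            break
        free_total = sum(free.values())
        probs = {s: (cap if p >= cap - 1e-12 else p + excess * p / free_total)
                 for s, p in probs.items()}
    return probs

w = mixture_weights({"code": 1.0, "science": 2.0, "wiki": 4.0})
```

With these example losses, "code" (lowest loss) receives the largest share, clipped at the cap, while the remaining probability mass is spread over the other sources.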
Evaluation is performed on standard NLP and code benchmarks (ARC-Challenge, MMLU, HellaSwag, HumanEval, among others):
| Model | MMLU | ARC-C | HellaSwag | HumanEval |
|---|---|---|---|---|
| Llama 1 7B | 34.8 | 44.5 | 76.2 | 19.9 |
| MPT-7B | 30.2 | 46.5 | 77.6 | 27.3 |
| StableLM-7B | 45.2 | 50.8 | 75.6 | 23.1 |
| Comma v0.1-1T | 42.4 | 52.8 | 62.6 | 36.5 |
Comma v0.1-1T outperforms compute-matched open LLMs on most knowledge-focused tasks and surpasses prior work on code generation (HumanEval) (Kandpal et al., 5 Jun 2025). Comma v0.1-2T remains competitive with other 2T-token 7B models (e.g., Llama 2 7B, OLMo Twin 2T).
6. Limitations, Ethics, and Best Practices
Common Pile v0.1 is subject to inherent limitations:
- Domain and source imbalance: a large fraction of tokens is contributed by scientific, code, and Wiki-style text; certain genres may have comparatively reduced representation.
- Noise, artifacts, and imperfect filtering are present due to the dataset’s scale; source-level cleaning attempts to ameliorate, but not eliminate, such issues.
- Residual near-duplicates and possible minor PII leakage may remain, despite systematic deduplication and redaction.
- No non-English content: the dataset is monolingual.
- Downstream users are advised to utilize per-document license metadata, respect all attribution requirements, and avoid secondary redistribution in contravention of original license terms.
The curation methodology explicitly rejects synthetic data and ambiguous sources and recommends verification of licensing at origin, preservation of license metadata, and clear attribution (Kandpal et al., 5 Jun 2025).
7. Availability and Reproducibility
Both raw and filtered Common Pile v0.1 datasets, along with full preprocessing pipelines and data mixture specifications, are publicly available:
- Data: https://huggingface.co/collections/common-pile/common-pile-v01-raw-data-…
- Code repository: https://github.com/r-three/common-pile
- Model artifacts (Comma v0.1 models): https://huggingface.co/collections/common-pile/comma-v01-artifacts-…
A user can reconstruct the full pipeline using the released scripts, ensuring the reproducibility of results and transparent processing at every stage. This supports robust LLM benchmarking and further research into the effects of fully open and license-compliant training data (Kandpal et al., 5 Jun 2025).