33GB Urdu Corpus for NLP
- 33GB Urdu Corpus is a comprehensive monolingual dataset for Urdu that compiles diverse web, OCR, and curated content for advanced NLP modeling.
- It employs advanced cleaning, normalization, and deduplication techniques to ensure high linguistic fidelity and reduce redundancy.
- Custom BPE tokenization with a 32K vocabulary optimizes model efficiency, reducing sequence lengths by up to 29% compared to multilingual tokenizers.
The 33GB Urdu Corpus is a comprehensive monolingual text resource assembled to support large-scale NLP and language modeling for Urdu, a language spoken by approximately 230 million people worldwide. Constructed as part of the UrduLM initiative, it addresses key deficits in existing multilingual and monolingual datasets by providing curated, diverse, and rigorously processed Urdu text suitable for pre-training transformer-based models in low-resource environments (Ali et al., 25 Jan 2026).
1. Corpus Sources and Data Composition
The corpus aggregates Urdu-language textual data from a multifaceted set of sources to ensure coverage of both formal and informal registers, literary genres, and contemporary web content. The proportional breakdown by source and volume is as follows:
| Source | % of Corpus | Volume (GB) |
|---|---|---|
| Common Crawl Dump 1 | 34.2 | 11.3 |
| Common Crawl Dump 2 | 24.5 | 8.1 |
| Machine-translated FineWeb | 16.7 | 5.5 |
| News Websites | 10.0 | 3.3 |
| UrduHack, FineWeb2 (open corpora) | 8.8 | 2.9 |
| OCR’d Books | 3.9 | 1.3 |
| Blogs and Forums | 1.8 | 0.6 |
Textual content includes web-scraped pages (including Wikipedia, poetry, and educational blogs—all subsumed under Common Crawl or Blogs), machine-translated web data, formal news articles, digitized books via OCR, and multiple open-access datasets. This diverse mixture is intended to minimize domain bias and reflect contemporary and literary Urdu usage (Ali et al., 25 Jan 2026).
2. Cleaning, Normalization, and Language Verification
The processing pipeline applies sequential rule-based and statistical cleaning steps to enforce data quality and linguistic fidelity:
- Regex-based noise removal: Eliminates artifacts such as URLs (`http[s]?://\S+`), email addresses (`\S+@\S+`), phone numbers (`\b\d{7,}\b`), Latin script (`[A-Za-z]+`, if flagged), stray digits (`\d+`), and all HTML/XML tags.
- Digit conversion: Converts ASCII numerals (Unicode U+0030–U+0039) to Urdu numerals (U+06F0–U+06F9).
- Character normalization: Uses an augmented UrduHack correction mapping to unify visually confusable codepoints (notably ARABIC LETTER YEH variants) and harmonize orthography.
- Word-spacing correction: Applies UrduHack rules to correct common errors in token boundaries caused by incorrect space insertion or omission.
- Unicode and symbol cleanup: Collapses whitespace runs (`\s+` to `' '`), removes invisible Unicode (U+00A0, U+200B), prunes repeated punctuation (e.g., `؟؟؟؟` to `؟`), and discards empty brackets/parentheses.
- Language filtering: Employs a language identification model optimized for speed to remove any non-Urdu segments remaining after other cleaning operations.
This multi-phase normalization ensures token uniformity, typographic reliability, and high-purity Urdu content, essential for stable downstream modeling performance (Ali et al., 25 Jan 2026).
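The regex and digit-conversion steps above can be sketched in a few lines of Python. The specific replacement choices and the ordering of steps here are illustrative, not the project's exact pipeline:

```python
import re

# ASCII numerals (U+0030-U+0039) -> Urdu numerals (U+06F0-U+06F9)
ASCII_TO_URDU = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")

def clean(text: str) -> str:
    text = re.sub(r"http[s]?://\S+", " ", text)               # URLs
    text = re.sub(r"\S+@\S+", " ", text)                      # email addresses
    text = re.sub(r"\b\d{7,}\b", " ", text)                   # phone numbers
    text = re.sub(r"<[^>]+>", " ", text)                      # HTML/XML tags
    text = text.translate(ASCII_TO_URDU)                      # digit conversion
    text = text.replace("\u00a0", " ").replace("\u200b", "")  # invisible Unicode
    text = re.sub(r"([؟،۔!])\1+", r"\1", text)                # repeated punctuation
    text = re.sub(r"\s+", " ", text).strip()                  # collapse whitespace
    return text
```

In practice the character-normalization and word-spacing steps would be applied via the UrduHack mappings between the tag-removal and language-filtering stages.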
3. Deduplication and Redundancy Control
Redundancy is addressed using a multi-stage deduplication strategy:
- Shingling: Documents are decomposed into overlapping k-grams (shingles), and a MinHash signature of fixed size is computed for each.
- Locality-Sensitive Hashing (LSH): Used to cluster candidate near-duplicate documents into buckets.
- Jaccard Similarity: For each candidate pair (A, B) in a bucket, calculate J(A, B) = |A ∩ B| / |A ∪ B| over the shingle sets; retain only the most representative document in pairs whose similarity exceeds the deduplication threshold.
- Manual Validation: A random subsample is manually checked to ensure low false-positive/false-negative rates in duplicate removal.
The result is a corpus with significantly reduced information redundancy, promoting statistical diversity and advanced LLM generalization (Ali et al., 25 Jan 2026).
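The shingling, MinHash, and LSH-banding stages can be illustrated in pure Python. All parameters here (shingle size 5, 128 permutations, 32 bands) are placeholders, not the values used for the corpus:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Character k-grams; the shingle size k = 5 is illustrative."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def _h(s: str, seed: int) -> int:
    """Seeded 64-bit hash via blake2b's salt parameter."""
    d = hashlib.blake2b(s.encode(), digest_size=8,
                        salt=seed.to_bytes(4, "little")).digest()
    return int.from_bytes(d, "big")

def minhash(sh: set, num_perm: int = 128) -> list:
    """Fixed-size MinHash signature: one minimum per seeded hash function."""
    return [min(_h(s, seed) for s in sh) for seed in range(num_perm)]

def lsh_bands(sig: list, bands: int = 32) -> list:
    """Split the signature into bands; documents sharing any band tuple
    land in the same bucket and become near-duplicate candidates."""
    rows = len(sig) // bands
    return [tuple(sig[i * rows:(i + 1) * rows]) for i in range(bands)]

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, computed only for bucketed candidate pairs."""
    return len(a & b) / len(a | b)
```

The point of the LSH stage is that the exact Jaccard computation is run only within buckets, avoiding a quadratic all-pairs comparison over ~13 million documents.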
4. Corpus Statistics and Structural Overview
The finalized corpus exhibits the following aggregate statistics:
- Total UTF-8 text size: 33 GB
- Document count: approximately 13 million
- Total tokens: approximately 5.5 billion
- Average document length: approximately 5.5 B tokens / 13 M documents ≈ 423 tokens
- Final vocabulary (after BPE): 32,000 subword types
All data is formatted as a tabular CSV, with per-document fields:
`id` (UUID), `text` (normalized Urdu paragraph), `source`, `category`, and `length` (post-BPE token count).
- A parallel `metadata.json` provides sha256 checksums, original URLs when applicable, and timestamps for traceability.
This structure facilitates dataset streaming, traceable evaluation, and targeted sub-corpus extraction (Ali et al., 25 Jan 2026).
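A minimal streaming reader for the documented CSV schema might look as follows; the field names mirror the list above, and standard CSV quoting is assumed:

```python
import csv

def stream_documents(path: str):
    """Yield per-document records from urdu_corpus.csv; `length` is the
    post-BPE token count, cast to int for filtering and statistics."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row["length"] = int(row["length"])
            yield row
```

Targeted sub-corpus extraction then reduces to filtering the stream on `source` or `category` without loading the full 33 GB into memory.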
5. Custom Byte-Pair Encoding Tokenization
The corpus is paired with a script-aware custom BPE tokenizer optimized for the Perso-Arabic block (U+0600–U+06FF), Urdu numerals, and Urdu-specific punctuation. Three vocabulary sizes were explored: 10K, 20K, and 32K, with the number of BPE merges for each set to approximately the vocabulary size (up to ∼32,000) minus the alphabet cardinality.
Tokenizer performance on a 500-word paragraph:
| Tokenizer | Fertility (tok/word) | Avg. Tokens | Overhead Reduction |
|---|---|---|---|
| GPT-4 o200k | 1.566 | 783 | 0% (baseline) |
| UrduLM-10k | 1.206 | 603 | 23% |
| UrduLM-32k | 1.110 | 555 | 29% |
UrduLM’s 32K-vocab tokenizer was adopted as standard, yielding a 20–30% sequence-length reduction versus multilingual alternatives and an embedding table of 32,000 × 768 parameters (24.6M). This suggests substantial gains in both computational efficiency and model fit for Urdu-centric tasks (Ali et al., 25 Jan 2026).
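The table's arithmetic can be checked directly: fertility is tokens emitted per word of the 500-word benchmark paragraph, and overhead reduction is measured against the GPT-4 o200k baseline:

```python
def fertility(tokens: int, words: int = 500) -> float:
    """Tokens emitted per source word for the benchmark paragraph."""
    return tokens / words

def overhead_reduction(baseline_tokens: int, tokens: int) -> float:
    """Fractional sequence-length reduction relative to the baseline tokenizer."""
    return 1 - tokens / baseline_tokens
```

For example, UrduLM-32k's 555 tokens against the baseline's 783 gives 1 − 555/783 ≈ 29%, matching the table.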
6. Directory Layout, File Formats, and Data Access
The corpus is organized with research workflow reproducibility as a priority, as follows:
- `raw/`: Untouched input collections (CC dumps, OCR, raw HTML).
- `cleaned/`: Source-wise normalized `.txt` files.
- `deduped/`: Fully deduplicated variant.
- `final/`: Machine-readable CSV (`urdu_corpus.csv`) and metadata; contains all fields and per-paragraph `<|EOT|>` segmentation for model pretraining.
- `tokenizer/`: BPE model files (`vocab_32k.model`, `vocab_32k.vocab`) compatible with open-source libraries such as SentencePiece.
Pretraining and further research workflows can directly consume the CSV, with all pipeline artifacts (tokenizer, corpus splits, and benchmarks) openly provided for scrutiny and extension (Ali et al., 25 Jan 2026).
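A small helper can verify the documented layout before launching a pretraining run. The directory and file names come straight from the listing above; any deeper per-source file structure is not specified and so is not checked:

```python
from pathlib import Path

# Directories and known files from the documented corpus layout.
EXPECTED = {
    "raw": [],
    "cleaned": [],
    "deduped": [],
    "final": ["urdu_corpus.csv"],
    "tokenizer": ["vocab_32k.model", "vocab_32k.vocab"],
}

def missing_paths(root: str) -> list:
    """Return every documented path absent under `root` (empty list = complete)."""
    missing = []
    for d, files in EXPECTED.items():
        base = Path(root) / d
        if not base.is_dir():
            missing.append(str(base))
        missing.extend(str(base / f) for f in files if not (base / f).is_file())
    return missing
```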
7. Licensing Terms and Usage Restrictions
Data and infrastructure licensing are as follows:
- FineWeb2 content: ODC-BY 1.0 (Open Data Commons Attribution)
- UrduHack toolkit: MIT License
- Processing scripts: Apache 2.0
- OCR and translation components: Governed by Google Cloud APIs’ research-only license
- All web scraping observes robots.txt and original site TOS; distribution is restricted to openly licensed or public domain sources
- Overall corpus: Strictly for research and educational purposes; commercial redistribution, especially of book scans, is prohibited
These conditions reflect the layered attribution and TOS responsibilities inherent to a composite resource sourcing from both open data and third-party APIs (Ali et al., 25 Jan 2026). A plausible implication is that reproducing the corpus beyond research settings would require careful re-audit of licensing at the source and code level.
Careful adherence to the documented source distribution, cleaning and deduplication protocols, tokenizer parameters, file structure, and usage constraints enables reproducibility, extension, or targeted retokenization for future Urdu NLP research (Ali et al., 25 Jan 2026).