33GB Urdu Corpus for NLP
- 33GB Urdu Corpus is a comprehensive monolingual dataset for Urdu that compiles diverse web, OCR, and curated content for advanced NLP modeling.
- It employs advanced cleaning, normalization, and deduplication techniques to ensure high linguistic fidelity and reduce redundancy.
- Custom BPE tokenization with a 32K vocabulary optimizes model efficiency, reducing sequence lengths by up to 29% compared to multilingual tokenizers.
The 33GB Urdu Corpus is a comprehensive monolingual text resource assembled to support large-scale NLP and language modeling for Urdu, a language spoken by approximately 230 million people worldwide. Constructed as part of the UrduLM initiative, it addresses key deficits in existing multilingual and monolingual datasets by providing curated, diverse, and rigorously processed Urdu text suitable for pre-training transformer-based models in low-resource environments (Ali et al., 25 Jan 2026).
1. Corpus Sources and Data Composition
The corpus aggregates Urdu-language textual data from a multifaceted set of sources to ensure coverage of both formal and informal registers, literary genres, and contemporary web content. The proportional breakdown by source and volume is as follows:
| Source | % of Corpus | Volume (GB) |
|---|---|---|
| Common Crawl Dump 1 | 34.2 | 11.3 |
| Common Crawl Dump 2 | 24.5 | 8.1 |
| Machine-translated FineWeb | 16.7 | 5.5 |
| News Websites | 10.0 | 3.3 |
| UrduHack, FineWeb2 (open corpora) | 8.8 | 2.9 |
| OCR’d Books | 3.9 | 1.3 |
| Blogs and Forums | 1.8 | 0.6 |
Textual content includes web-scraped pages (including Wikipedia, poetry, and educational blogs—all subsumed under Common Crawl or Blogs), machine-translated web data, formal news articles, digitized books via OCR, and multiple open-access datasets. This diverse mixture is intended to minimize domain bias and reflect contemporary and literary Urdu usage (Ali et al., 25 Jan 2026).
2. Cleaning, Normalization, and Language Verification
The processing pipeline applies sequential rule-based and statistical cleaning steps to enforce data quality and linguistic fidelity:
- Regex-based noise removal: Eliminates artifacts such as URLs (`http[s]?://\S+`), email addresses (`\S+@\S+`), phone numbers (`\b\d{7,}\b`), Latin script (`[A-Za-z]+`, if flagged), stray digits (`\d+`), and all HTML/XML tags.
- Digit conversion: Converts ASCII numerals (Unicode U+0030–U+0039) to Urdu numerals (U+06F0–U+06F9).
- Character normalization: Uses an augmented UrduHack correction mapping to unify visually confusable codepoints (notably ARABIC LETTER YEH variants) and harmonize orthography.
- Word-spacing correction: Applies UrduHack rules to correct common errors in token boundaries caused by incorrect space insertion or omission.
- Unicode and symbol cleanup: Collapses whitespace runs (`\s+` to `' '`), removes invisible Unicode (U+00A0, U+200B), prunes repeated punctuation (e.g., `؟؟؟؟` to `؟`), and discards empty brackets/parentheses.
- Language filtering: Employs a language identification model optimized for speed to remove any non-Urdu segments remaining after other cleaning operations.
This multi-phase normalization ensures token uniformity, typographic reliability, and high-purity Urdu content, essential for stable downstream modeling performance (Ali et al., 25 Jan 2026).
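The regex and digit-conversion steps above can be sketched in a few lines of Python. The specific replacement choices and the ordering of steps here are illustrative, not the project's exact pipeline:

```python
import re

# ASCII numerals (U+0030-U+0039) -> Urdu numerals (U+06F0-U+06F9)
ASCII_TO_URDU = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")

def clean(text: str) -> str:
    text = re.sub(r"http[s]?://\S+", " ", text)               # URLs
    text = re.sub(r"\S+@\S+", " ", text)                      # email addresses
    text = re.sub(r"\b\d{7,}\b", " ", text)                   # phone numbers
    text = re.sub(r"<[^>]+>", " ", text)                      # HTML/XML tags
    text = text.translate(ASCII_TO_URDU)                      # digit conversion
    text = text.replace("\u00a0", " ").replace("\u200b", "")  # invisible Unicode
    text = re.sub(r"([؟،۔!])\1+", r"\1", text)                # repeated punctuation
    text = re.sub(r"\s+", " ", text).strip()                  # collapse whitespace
    return text
```

In practice the character-normalization and word-spacing steps would be applied via the UrduHack mappings between the tag-removal and language-filtering stages.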
3. Deduplication and Redundancy Control
Redundancy is addressed using a multi-stage deduplication strategy:
- Shingling: Documents are decomposed into overlapping k-grams (shingles), and a MinHash signature of fixed size is computed for each.
- Locality-Sensitive Hashing (LSH): Used to cluster candidate near-duplicate documents into buckets.
- Jaccard Similarity: For each candidate pair (A, B) in a bucket, calculate J(A, B) = |A ∩ B| / |A ∪ B| over the shingle sets; retain only the most representative document in pairs whose similarity exceeds the deduplication threshold.
- Manual Validation: A random subsample is manually checked to ensure low false-positive/false-negative rates in duplicate removal.
The result is a corpus with significantly reduced information redundancy, promoting statistical diversity and advanced LLM generalization (Ali et al., 25 Jan 2026).
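The shingling, MinHash, and LSH-banding stages can be illustrated in pure Python. All parameters here (shingle size 5, 128 permutations, 32 bands) are placeholders, not the values used for the corpus:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Character k-grams; the shingle size k = 5 is illustrative."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def _h(s: str, seed: int) -> int:
    """Seeded 64-bit hash via blake2b's salt parameter."""
    d = hashlib.blake2b(s.encode(), digest_size=8,
                        salt=seed.to_bytes(4, "little")).digest()
    return int.from_bytes(d, "big")

def minhash(sh: set, num_perm: int = 128) -> list:
    """Fixed-size MinHash signature: one minimum per seeded hash function."""
    return [min(_h(s, seed) for s in sh) for seed in range(num_perm)]

def lsh_bands(sig: list, bands: int = 32) -> list:
    """Split the signature into bands; documents sharing any band tuple
    land in the same bucket and become near-duplicate candidates."""
    rows = len(sig) // bands
    return [tuple(sig[i * rows:(i + 1) * rows]) for i in range(bands)]

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, computed only for bucketed candidate pairs."""
    return len(a & b) / len(a | b)
```

The point of the LSH stage is that the exact Jaccard computation is run only within buckets, avoiding a quadratic all-pairs comparison over ~13 million documents.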
4. Corpus Statistics and Structural Overview
The finalized corpus exhibits the following aggregate statistics:
- Total UTF-8 text size: 33 GB
- Document count: approximately 13 million
- Total tokens: approximately 5.5 billion
- Average document length: approximately 5.5 B tokens / 13 M documents ≈ 423 tokens
- Final vocabulary (after BPE): 32,000 subword types
All data is formatted as a tabular CSV, with per-document fields:
`id` (UUID), `text` (normalized Urdu paragraph), `source`, `category`, and `length` (post-BPE token count).
- A parallel `metadata.json` provides sha256 checksums, original URLs when applicable, and timestamps for traceability.
This structure facilitates dataset streaming, traceable evaluation, and targeted sub-corpus extraction (Ali et al., 25 Jan 2026).
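A minimal streaming reader for the documented CSV schema might look as follows; the field names mirror the list above, and standard CSV quoting is assumed:

```python
import csv

def stream_documents(path: str):
    """Yield per-document records from urdu_corpus.csv; `length` is the
    post-BPE token count, cast to int for filtering and statistics."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row["length"] = int(row["length"])
            yield row
```

Targeted sub-corpus extraction then reduces to filtering the stream on `source` or `category` without loading the full 33 GB into memory.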
5. Custom Byte-Pair Encoding Tokenization
The corpus is paired with a script-aware custom BPE tokenizer optimized for the Perso-Arabic block (U+0600–U+06FF), Urdu numerals, and Urdu-specific punctuation. Three vocabulary sizes were explored: 10K, 20K, and 32K, with the number of BPE merges for each set to approximately the vocabulary size (up to ∼32,000) minus the alphabet cardinality.
Tokenizer performance on a 500-word paragraph:
| Tokenizer | Fertility (tok/word) | Avg. Tokens | Overhead Reduction |
|---|---|---|---|
| GPT-4 o200k | 1.566 | 783 | 0% (baseline) |
| UrduLM-10k | 1.206 | 603 | 23% |
| UrduLM-32k | 1.110 | 555 | 29% |
UrduLM’s 32K-vocab tokenizer was adopted as standard, yielding a 20–30% sequence-length reduction versus multilingual alternatives and an embedding table of 32,000 × 768 parameters (24.6M). This suggests substantial gains in both computational efficiency and model fit for Urdu-centric tasks (Ali et al., 25 Jan 2026).
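The table's arithmetic can be checked directly: fertility is tokens emitted per word of the 500-word benchmark paragraph, and overhead reduction is measured against the GPT-4 o200k baseline:

```python
def fertility(tokens: int, words: int = 500) -> float:
    """Tokens emitted per source word for the benchmark paragraph."""
    return tokens / words

def overhead_reduction(baseline_tokens: int, tokens: int) -> float:
    """Fractional sequence-length reduction relative to the baseline tokenizer."""
    return 1 - tokens / baseline_tokens
```

For example, UrduLM-32k's 555 tokens against the baseline's 783 gives 1 − 555/783 ≈ 29%, matching the table.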
6. Directory Layout, File Formats, and Data Access
The corpus is organized with research workflow reproducibility as a priority, as follows:
- `raw/`: Untouched input collections (CC dumps, OCR, raw HTML).
- `cleaned/`: Source-wise normalized `.txt` files.
- `deduped/`: Fully deduplicated variant.
- `final/`: Machine-readable CSV (`urdu_corpus.csv`) and metadata; contains all fields and per-paragraph `<|EOT|>` segmentation for model pretraining.
- `tokenizer/`: BPE model files (`vocab_32k.model`, `vocab_32k.vocab`) compatible with open-source libraries such as SentencePiece.
Pretraining and further research workflows can directly consume the CSV, with all pipeline artifacts (tokenizer, corpus splits, and benchmarks) openly provided for scrutiny and extension (Ali et al., 25 Jan 2026).
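A small helper can verify the documented layout before launching a pretraining run. The directory and file names come straight from the listing above; any deeper per-source file structure is not specified and so is not checked:

```python
from pathlib import Path

# Directories and known files from the documented corpus layout.
EXPECTED = {
    "raw": [],
    "cleaned": [],
    "deduped": [],
    "final": ["urdu_corpus.csv"],
    "tokenizer": ["vocab_32k.model", "vocab_32k.vocab"],
}

def missing_paths(root: str) -> list:
    """Return every documented path absent under `root` (empty list = complete)."""
    missing = []
    for d, files in EXPECTED.items():
        base = Path(root) / d
        if not base.is_dir():
            missing.append(str(base))
        missing.extend(str(base / f) for f in files if not (base / f).is_file())
    return missing
```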
7. Licensing Terms and Usage Restrictions
Data and infrastructure licensing are as follows:
- FineWeb2 content: ODC-BY 1.0 (Open Data Commons Attribution)
- UrduHack toolkit: MIT License
- Processing scripts: Apache 2.0
- OCR and translation components: Governed by Google Cloud APIs’ research-only license
- All web scraping observes robots.txt and original site TOS; distribution is restricted to openly licensed or public domain sources
- Overall corpus: Strictly for research and educational purposes; commercial redistribution, especially of book scans, is prohibited
These conditions reflect the layered attribution and TOS responsibilities inherent to a composite resource sourcing from both open data and third-party APIs (Ali et al., 25 Jan 2026). A plausible implication is that reproducing the corpus beyond research settings would require careful re-audit of licensing at the source and code level.
Careful adherence to the documented source distribution, cleaning and deduplication protocols, tokenizer parameters, file structure, and usage constraints enables reproducibility, extension, or targeted retokenization for future Urdu NLP research (Ali et al., 25 Jan 2026).