
Kashmiri OCR Dataset Pipeline

Updated 24 January 2026
  • The Kashmiri OCR dataset is a collection of synthetic image-text pairs generated by a multi-stage pipeline using diverse fonts and backgrounds.
  • It employs extensive data augmentation, including rotation, blur, and noise, to mimic real-world document degradations for robust OCR training.
  • Models trained on the dataset reach high word-level accuracy, and the dataset stays script-pure while covering all Kashmiri-specific codepoints, making it valuable for low-resource applications.

A Kashmiri OCR dataset is a collection of labeled data designed to support the training and evaluation of Optical Character Recognition (OCR) systems for the Kashmiri language. Kashmiri, with approximately 7 million speakers and written primarily in a complex Perso-Arabic script rich in diacritics, has historically been unsupported by major OCR frameworks due to a lack of large, annotated datasets. Recent advances in synthetic data generation have addressed this bottleneck by enabling the automatic construction of large-scale, high-diversity Kashmiri OCR datasets using programmatic pipelines (Malik et al., 22 Jan 2026).

1. Motivations and Challenges in Kashmiri OCR Data

Kashmiri presents typographic and linguistic hurdles for OCR: its Perso-Arabic script includes unique codepoints (covering 38 base, 8 extended, 18 diacritics and punctuation marks, and 10 numerals), complex ligaturing, and a high degree of intra-script variation. Manual dataset construction is prohibitively expensive because it requires expert transcription of printed or handwritten sources at the word or character level. The task is further complicated by the lack of readily available ground-truth corpora and by inconsistent support for Kashmiri-specific glyphs in mainstream typefaces and text segmenters. These constraints motivate automated, scalable methods for generating diverse, script-pure training data (Malik et al., 22 Jan 2026).

2. Synthetic Dataset Generation Pipeline

The production of a Kashmiri OCR dataset via systems like SynthOCR-Gen is organized into a multi-stage pipeline, formalized as a composition of pure functions: $D = \pi_\mathrm{out} \circ \phi_\mathrm{aug} \circ \rho_\mathrm{render} \circ \psi_\mathrm{valid} \circ \sigma_\mathrm{seg}(C)$, where $C$ is the input Unicode text corpus and $D$ is the output set of image–text pairs (Malik et al., 22 Jan 2026). The five stages are:

  1. Text Segmentation ($\sigma_\mathrm{seg}$): Text is split into segments at the character, word, n-gram, sentence, or line level; word-level segmentation predominates for Kashmiri OCR. UAX-29 grapheme cluster rules and Kashmiri-specific delimiters are applied.
  2. Unicode Normalization & Script Enforcement ($\psi_\mathrm{valid}$): Segments are mapped to NFC form and codepoints are filtered against accepted Kashmiri script ranges ($U_\mathrm{Arabic} \cup U_\mathrm{Common}$), allowing only well-formed, script-pure samples that retain all Kashmiri diacritics.
  3. Multi-font Rendering ($\rho_\mathrm{render}$): Each segment is rendered as an RGB image using a probabilistically sampled font and font size. Rendering handles right-to-left alignment and ensures typographic diversity (covering Noto Naskh Arabic, Gulmarg Nastaleeq, Scheherazade New, etc.).
  4. Data Augmentation ($\phi_\mathrm{aug}$): A configurable set of 25+ augmentations is stochastically applied per sample, simulating document degradations (blur, noise, rotation, JPEG compression, resolution downsampling, background change).
  5. Packaging ($\pi_\mathrm{out}$): Outputs are bundled in formats compatible with modern OCR pipelines (CRNN, TrOCR, CSV, HuggingFace CSV/JSONL), with optional train/val split and reproducible PRNG seeding.

This pipeline generates large, high-diversity image-text datasets suitable for both supervised model training and benchmarking.
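
The composition above can be sketched in a few lines of Python. This is only an illustration of the functional structure, not SynthOCR-Gen's actual code: the function names mirror the symbols in the formula, and the rendering and augmentation stages are hypothetical placeholders that return simple records instead of images. Segmentation is simplified to whitespace splitting rather than full UAX-29 rules.

```python
from functools import reduce
import unicodedata


def compose(*stages):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: reduce(lambda acc, stage: stage(acc), reversed(stages), x)


def sigma_seg(corpus):
    # Word-level segmentation; whitespace splitting is a simplification.
    return corpus.split()


def psi_valid(segments):
    # NFC-normalize each segment and drop empty ones (script filter omitted here).
    return [unicodedata.normalize("NFC", s) for s in segments if s]


def rho_render(segments):
    # Placeholder for rendering: a dict stands in for the RGB image.
    return [({"width": 256, "height": 64, "transforms": []}, s) for s in segments]


def phi_aug(pairs):
    # Placeholder for augmentation: records a no-op transform.
    return [({**img, "transforms": ["identity"]}, s) for img, s in pairs]


def pi_out(pairs):
    # Package each pair as a JSONL-style record.
    return [{"image": img, "text": s} for img, s in pairs]


# D = pi_out ∘ phi_aug ∘ rho_render ∘ psi_valid ∘ sigma_seg (C)
build_dataset = compose(pi_out, phi_aug, rho_render, psi_valid, sigma_seg)

sample = "کٲشُر زَبان"
records = build_dataset(sample)
```

Because each stage takes the previous stage's output as its only input, individual stages can be swapped or tested in isolation, which is the practical benefit of the compositional formulation.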

| Component | Value/Range | Notes |
| --- | --- | --- |
| Corpus | KS-LIT-3M (3.1M words, 2.85M chars, 31,562 sentences) | Script-pure Perso-Arabic with diacritics |
| Segmentation Mode | Word | Avg. 6.8 chars/sample |
| Number of Samples | 600,000 (90% train, 10% val) | Randomized, reproducible splits |
| Fonts | Noto Naskh Arabic (40%), Gulmarg Nastaleeq (35%), Scheherazade New (25%) | Inverse-CDF sampling |
| Image Size | 256 × 64 px | |
| Backgrounds | White (30%), Aged (25%), Book-page (20%), Newspaper (15%), Parchment (10%) | Simulates real-world layouts |
| Augmentation Rate | 0.7 | Avg. 2.3 transforms/sample, up to 4 |
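
The inverse-CDF font sampling listed in the table can be illustrated as follows. This is a minimal sketch, assuming the font weights from the table; the helper name `make_sampler` is hypothetical and is not taken from SynthOCR-Gen. A uniform draw is mapped onto the cumulative weight intervals [0, 0.40), [0.40, 0.75), [0.75, 1.0).

```python
import bisect
import itertools
import random
from collections import Counter

# Font weights as reported in the table.
FONT_WEIGHTS = [
    ("Noto Naskh Arabic", 0.40),
    ("Gulmarg Nastaleeq", 0.35),
    ("Scheherazade New", 0.25),
]


def make_sampler(weights):
    """Return a function that draws a name by inverting the cumulative distribution."""
    names = [name for name, _ in weights]
    cdf = list(itertools.accumulate(w for _, w in weights))
    total = cdf[-1]

    def sample(rng):
        u = rng.random() * total            # uniform draw in [0, total)
        idx = bisect.bisect_right(cdf, u)   # first interval whose upper bound exceeds u
        return names[min(idx, len(names) - 1)]

    return sample


sample_font = make_sampler(FONT_WEIGHTS)
rng = random.Random(0)  # fixed seed so the draws are reproducible
counts = Counter(sample_font(rng) for _ in range(20000))
```

The same sampler could be reused for the background distribution by passing a different weight list.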

3. Core Data Augmentation Techniques

Data augmentation is critical to closing the domain gap between synthetic and real document images. The Kashmiri OCR dataset employs a large pool of stochastic transforms, formally parameterized as:

  • Rotation: $R_\theta: \theta \sim \mathrm{Uniform}(-10^\circ, 10^\circ)$
  • Skew (Affine): $S = \begin{bmatrix} 1 & s_x & 0 \\ s_y & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},\ s_x \sim \mathrm{U}(-0.2, 0.2)$
  • Blur: Gaussian with $\sigma \sim \mathrm{U}(0.5, 2.0)$ px
  • Motion Blur: Kernel size $k \in [3, 7]$ px, angle $\alpha \sim \mathrm{U}(0, 2\pi)$
  • Noise: Gaussian $\eta \sim \mathcal{N}(0, \sigma_n^2),\ \sigma_n \in [5, 25]$
  • Salt & Pepper: Probability $p_\mathrm{sp} \in [0.01, 0.05]$
  • JPEG Artifacts: Compression $q \sim \mathrm{U}(30, 70)$
  • Resolution Downsampling: $r \sim \mathrm{U}(0.3, 0.7)$

Empirical augmentation frequencies for the Kashmiri dataset were: rotation (40.1%), brightness variance (35.1%), Gaussian blur (30.2%), Gaussian noise (26.1%), and others. This distribution was calibrated to induce degradations matching those observed in real-world scanned pages (Malik et al., 22 Jan 2026).
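
A minimal sketch of how such stochastic transforms might be sampled and applied is given below, using only the Python standard library and a grayscale image stored as a list of rows. Several details are assumptions made for illustration: uniform sampling within the ranges that the source gives only as intervals (kernel size, noise level, salt-and-pepper probability), integer JPEG quality, and the interpretation of the 0.7 augmentation rate as a per-sample gate. Only two of the transforms are implemented; the function names are hypothetical.

```python
import math
import random


def sample_params(rng):
    """Draw one parameter set from the ranges listed above (uniform where unspecified)."""
    return {
        "rotation_deg": rng.uniform(-10.0, 10.0),
        "skew_sx": rng.uniform(-0.2, 0.2),
        "gaussian_blur_sigma": rng.uniform(0.5, 2.0),
        "motion_blur_kernel": rng.randint(3, 7),
        "motion_blur_angle": rng.uniform(0.0, 2 * math.pi),
        "gaussian_noise_sigma": rng.uniform(5.0, 25.0),
        "salt_pepper_p": rng.uniform(0.01, 0.05),
        "jpeg_quality": rng.randint(30, 70),
        "downsample_ratio": rng.uniform(0.3, 0.7),
    }


def add_gaussian_noise(img, sigma, rng):
    # Add N(0, sigma^2) noise per pixel and clip to the 8-bit range.
    return [[min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in row]
            for row in img]


def add_salt_pepper(img, p, rng):
    # With probability p a pixel is replaced: half pepper (0), half salt (255).
    out = []
    for row in img:
        new_row = []
        for px in row:
            u = rng.random()
            new_row.append(0 if u < p / 2 else 255 if u < p else px)
        out.append(new_row)
    return out


def augment(img, rng, rate=0.7, max_transforms=4):
    """Leave the sample clean with probability 1 - rate, else apply 1..k transforms."""
    if rng.random() >= rate:
        return img, []
    p = sample_params(rng)
    pool = {
        "gaussian_noise": lambda im: add_gaussian_noise(im, p["gaussian_noise_sigma"], rng),
        "salt_pepper": lambda im: add_salt_pepper(im, p["salt_pepper_p"], rng),
    }
    k = rng.randint(1, min(max_transforms, len(pool)))
    applied = rng.sample(sorted(pool), k)
    for name in applied:
        img = pool[name](img)
    return img, applied


rng = random.Random(42)
clean = [[255] * 256 for _ in range(64)]  # blank 256 x 64 px canvas
noisy, applied = augment(clean, rng)
```

Passing a seeded `random.Random` instance through every function keeps the whole augmentation chain reproducible, which matches the pipeline's emphasis on PRNG seeding.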

4. Dataset Characteristics and Coverage

The resulting 600,000-sample Kashmiri dataset achieves:

  • Character coverage: All 85 Kashmiri-relevant codepoints (including diacritics and numerals).
  • Segmentation: 100% word-level, with an average of 6.8 characters per sample and 89,743 unique word types.
  • Script purity: Unicode normalization and codepoint filtering ensure each sample strictly adheres to script boundaries and preserves all Kashmiri-specific linguistic features.
  • Font and background diversity: Synthetic data samples are rendered across multiple widely used and authentic fonts, with randomized backgrounds simulating document scenarios from clean, to aged, to newspaper print (Malik et al., 22 Jan 2026).
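
The script-purity and coverage checks described above can be approximated as follows. This is a sketch under stated assumptions: Python's standard `unicodedata` module does not expose the Unicode Script property, so the Arabic script set is approximated by Unicode block ranges and the Common set by a small hand-picked whitelist. The names `validate`, `coverage`, and `COMMON_ALLOWED` are hypothetical, and the actual accepted ranges used by SynthOCR-Gen may differ.

```python
import unicodedata

# Block ranges used as a stand-in for the Arabic script property (assumption).
ARABIC_BLOCKS = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
)
# Tiny stand-in for the Common script set (assumption).
COMMON_ALLOWED = set(" .,:;!?()-")


def in_arabic(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_BLOCKS)


def validate(segment):
    """Return the NFC-normalized segment if it is script-pure, else None."""
    norm = unicodedata.normalize("NFC", segment)
    if norm and all(in_arabic(ch) or ch in COMMON_ALLOWED for ch in norm):
        return norm
    return None


def coverage(samples, required):
    """Fraction of required codepoints that occur in at least one sample."""
    seen = {ch for s in samples for ch in s}
    return len(required & seen) / len(required)


word = "\u06a9\u0672\u0634\u064f\u0631"   # a Kashmiri word with a diacritic
kept = validate(word)                      # script-pure, so it is retained
dropped = validate("\u06a9a")              # mixed with Latin, so it is rejected
composed = validate("\u0627\u0653")        # alef + maddah composes under NFC
```

Normalizing before filtering matters: decomposed and precomposed spellings of the same character would otherwise produce distinct labels for visually identical images.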

5. Applications and Empirical Findings

Kashmiri OCR datasets, generated using comprehensive pipelines such as SynthOCR-Gen, have enabled multiple downstream applications:

  • Supervised OCR model training: CRNNs and TrOCR trained on these data achieve word-level accuracy exceeding 95% and character error rate (CER) below 3% on held-out real scan sets, following minimal fine-tuning. This demonstrates the sufficiency of synthetic data for high performance in low-resource settings (Malik et al., 22 Jan 2026).
  • Script-agnostic extensibility: The pipeline generalizes to other Perso-Arabic scripts or codepoint sets through configuration; only a domain-specific font and normalization configuration is necessary for adaptation.
  • Augmentation strategies: Empirical results favor a 70:30 ratio of augmented to clean data, with moderate transform depths (2–4 per sample) and a minimum of three diversified fonts to boost generalizability.
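
For reference, the two metrics quoted above are commonly computed as in the sketch below. The source does not spell out its exact definitions, so this assumes the usual conventions: word-level accuracy as the fraction of exact matches, and CER as total Levenshtein edit distance divided by total reference length, both measured over Unicode codepoints after any normalization.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = cur
    return prev[-1]


def cer(references, predictions):
    """Character error rate: total edits divided by total reference characters."""
    edits = sum(levenshtein(r, p) for r, p in zip(references, predictions))
    chars = sum(len(r) for r in references)
    return edits / chars


def word_accuracy(references, predictions):
    """Fraction of samples whose prediction matches the reference exactly."""
    correct = sum(r == p for r, p in zip(references, predictions))
    return correct / len(references)


refs = ["abcd", "efgh"]
preds = ["abxd", "efgh"]
example_cer = cer(refs, preds)            # 1 edit over 8 reference characters
example_acc = word_accuracy(refs, preds)  # 1 of 2 words correct
```

Because diacritics are separate codepoints in Perso-Arabic text, a dropped diacritic counts as one character error and also makes the whole word count as incorrect under exact-match accuracy.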

6. Configuration, Reproducibility, and Distribution

SynthOCR-Gen and similar toolkits expose dataset generation through configuration files or APIs. Key configuration elements include segmentation mode, font distribution, augmentation settings (probabilities and parameter ranges), background setup, output format selection, and PRNG seeding for reproducibility. Output formats include those compatible with major OCR research toolkits (e.g., CRNN, TrOCR, HuggingFace datasets). The Kashmiri dataset and associated tools are publicly released under the MIT license, advancing accessibility for the research community (Malik et al., 22 Jan 2026).
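
As an illustration of what such a configuration might hold, the sketch below collects the reported Kashmiri settings in a Python dataclass and derives a seeded train/val split from it. The field names, the seed value, and the `split_indices` helper are hypothetical and do not reflect SynthOCR-Gen's actual configuration schema; only the numeric values come from the dataset description.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GeneratorConfig:
    # Field names are illustrative, not the tool's real schema.
    segmentation_mode: str = "word"
    num_samples: int = 600_000
    val_fraction: float = 0.10
    image_size: tuple = (256, 64)
    augmentation_rate: float = 0.7
    max_transforms: int = 4
    output_format: str = "jsonl"
    seed: int = 1234  # arbitrary example seed
    fonts: dict = field(default_factory=lambda: {
        "Noto Naskh Arabic": 0.40,
        "Gulmarg Nastaleeq": 0.35,
        "Scheherazade New": 0.25,
    })
    backgrounds: dict = field(default_factory=lambda: {
        "white": 0.30, "aged": 0.25, "book-page": 0.20,
        "newspaper": 0.15, "parchment": 0.10,
    })


def split_indices(cfg):
    """Shuffle sample indices with the configured seed and cut off the val share."""
    idx = list(range(cfg.num_samples))
    random.Random(cfg.seed).shuffle(idx)
    n_val = round(cfg.num_samples * cfg.val_fraction)
    return idx[n_val:], idx[:n_val]


cfg = GeneratorConfig(num_samples=1000)  # small run for demonstration
train_idx, val_idx = split_indices(cfg)
```

Deriving the split from the same seed stored in the configuration means that regenerating the dataset from the config file alone reproduces the identical train and validation partitions.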

7. Limitations and Prospects

The current focus of the Kashmiri OCR dataset is on printed text; handwritten synthesis is a prospective extension, potentially implementable via neural handwriting-generation models or stroke simulation. Multi-script support (e.g., Devanagari for Kashmiri), full-page layouts, and active-learning-based mining of real data are identified as feasible future directions. The absence of real document scans in initial training is partly mitigated by augmentation, but improved domain adaptation remains an open challenge (Malik et al., 22 Jan 2026).


References:

  • Malik et al., "SynthOCR-Gen: A Synthetic OCR Dataset Generator for Low-Resource Languages: Breaking the Data Barrier" (22 Jan 2026)
  • Guan et al., "Advancing Post-OCR Correction: A Comparative Study of Synthetic Data" (2024)
  • Sun et al., "OmniPrint: A Configurable Printed Character Synthesizer" (2022)
