
Thai-Focused Training Dataset

Updated 28 January 2026
  • A Thai-focused training dataset is a specialized corpus curated to develop and fine-tune vision-language models for accurate OCR and layout extraction in Thai documents.
  • It employs a four-stage pipeline—traditional OCR, VLM-based restructuring, automated quality control, and human verification—to address linguistic and structural challenges.
  • The dataset achieves state-of-the-art performance on diverse document types, improving extractive accuracy across Thai books, financial reports, and handwritten forms.

A Thai-focused training dataset refers to a supervised corpus specifically engineered for the development and fine-tuning of machine learning models—particularly vision-language models (VLMs)—optimized for the extraction of structured information, text transcription, and layout reconstruction from Thai (and English) documents. This class of datasets is distinguished by rigorous multi-stage data construction pipelines that address the unique linguistic and structural complexities of Thai documents, including script complexity, the lack of explicit word boundaries, and the prevalence of semi-structured and visually heterogeneous real-world content. The Typhoon OCR series exemplifies this approach, achieving state-of-the-art results on diverse Thai document categories by leveraging such carefully curated data resources (Nonesung et al., 21 Jan 2026).

1. Rationale and Challenges in Thai Document Modeling

The development of a Thai-focused training dataset is motivated by the linguistic and structural challenges inherent to Thai documents. Unlike high-resource languages predominantly supported by existing VLMs, Thai employs a non-Latin script without explicit word boundaries and is frequently encountered in documents with highly variable or unstructured layouts. These properties limit the generalization capacity of open-source models pretrained predominantly on English or related scripts and necessitate bespoke training data engineering. Documents such as financial reports, government forms, books, infographics, and handwritten forms present diverse structural regularities and noise profiles that must be faithfully represented in the training corpus to support robust downstream extraction (Nonesung et al., 21 Jan 2026).

2. Multi-Stage Data Construction Pipeline

The Typhoon OCR framework employs a four-stage, hybrid automatic and semi-automatic data construction pipeline, with a dedicated structure mode tailored for complex layouts. The process is as follows:

  1. Traditional OCR: Extraction of raw character- and word-level transcription from high-quality scans using open-source OCR engines (e.g., PaddleOCR, Tesseract).
  2. VLM-Based Restructuring: Augmentation of OCR outputs by utilizing open-source VLMs with structured prompts to hierarchically group lines, tables, and sections into semantically rich representations (Markdown/HTML).
  3. Automated Quality Control (QC): Application of agent-based checking procedures to identify and filter out samples with missing, duplicated, misordered, or misaligned text blocks.
  4. Human Verification: Manual annotation and validation for high-importance samples to excise irreparable errors.

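The automated QC stage (step 3) can be illustrated with a minimal sketch. This is a hypothetical stand-in for the paper's agent-based checking, assuming each sample has been restructured into a list of (reading-order index, text) blocks; the function name and data shape are illustrative, not from the source.

```python
def qc_check(blocks):
    """Flag samples with missing, duplicated, or misordered text blocks.

    `blocks` is a hypothetical list of (order_index, text) pairs standing
    in for the pipeline's restructured output; the real agent-based QC
    described in the paper is more involved.
    """
    issues = []
    texts = [t for _, t in blocks]
    if any(not t.strip() for t in texts):
        issues.append("missing")      # empty text block detected
    if len(set(texts)) < len(texts):
        issues.append("duplicated")   # repeated block content
    order = [i for i, _ in blocks]
    if order != sorted(order):
        issues.append("misordered")   # reading order violated
    return issues
```

Samples for which `qc_check` returns any issue would be filtered out or routed to the human-verification stage.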
This pipeline delivers a corpus whose composition, in the Typhoon OCR V1 instance, includes infographics (45.6%), synthetic documents (8.3%), financial reports (7.2%), books (5.6%), handwriting (5.5%), scanned book pages (6.2%), bills and invoices (8.7%), and a long-tail of forms and certificates (13.0%), summing to 77,029 documents. Typhoon OCR V1.5 extends this to 155,403 samples, with significant contributions from DocLayNet, synthetic data with LaTeX/chart renderings, and Thai-translated VQA ("The Cauldron") data (Nonesung et al., 21 Jan 2026).

3. Dataset Composition and Modalities

The Thai-focused training datasets integrate a heterogeneous array of document sources and rendering modalities to ensure coverage across visually and semantically diverse use cases. In Typhoon OCR V1.5, 53.7% of samples are retained from V1 (real Thai/English documents), 6.4% derive from DocLayNet v1.2 with detailed layout annotation, 2.2% are Thai-translated VQA data to preserve general multimodal grounding, and 37.6% are synthetic documents incorporating LaTeX formulas, chart images (ChartCap, SEA-VL Crawling), Thai-vocabulary renderings, and Augraphy-based augmentations.
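The V1.5 source mixture above can be expressed as sampling weights. The sketch below is illustrative only (the paper does not publish its sampling code); the fractions are taken from the text, and the sampler simply draws source labels in proportion to them.

```python
import random

# V1.5 source mixture (fractions of the 155,403 samples, from the text).
MIXTURE = {
    "v1_real_docs": 0.537,    # real Thai/English documents retained from V1
    "doclaynet_v1.2": 0.064,  # DocLayNet with detailed layout annotation
    "thai_vqa": 0.022,        # Thai-translated VQA ("The Cauldron")
    "synthetic": 0.376,       # LaTeX, charts, Thai renderings, Augraphy
}

def sample_sources(n, seed=0):
    """Draw n training-source labels in proportion to the mixture."""
    rng = random.Random(seed)
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=n)
```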

The evolution from a mode-selection system (default versus structure) with PDF anchor metadata toward a unified, "visual-only" mode in V1.5 simplifies model training and deployment while maintaining layout fidelity. Elimination of mode selection and PDF metadata dependencies is empirically shown to have no measurable negative impact on extractive accuracy (Nonesung et al., 21 Jan 2026).

4. Supervised Objectives and Layout Consistency

Training is conducted via supervised minimization of the standard autoregressive cross-entropy objective over document image and ground-truth token sequence pairs:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^{T} \log p_\theta\bigl(y_t \mid y_{1:t-1},\, \mathrm{Enc}(X)\bigr)$$

where $X$ denotes the document image and $Y = [y_1, \ldots, y_T]$ the target token sequence.
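The objective can be sketched in a few lines. The toy function below assumes the model's per-step softmax outputs are given as explicit probability distributions (already conditioned on the prefix and the encoded image); it is a numerical illustration, not the training code.

```python
import math

def token_cross_entropy(probs, targets):
    """Autoregressive cross-entropy: -sum_t log p(y_t | y_<t, Enc(X)).

    `probs` is a list of per-step probability distributions over the
    vocabulary and `targets` the ground-truth token ids -- a toy
    stand-in for the model's softmax outputs.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, targets))
```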

The VLM backbone is responsible not only for transcription but also for joint prediction of region segmentation via interleaved layout markers (e.g., <region id=…>, Markdown headings, <table>, <figure>). Post-processing the generated token stream yields bounding-box predictions for content regions. Layout-quality control employs metrics such as intersection-over-union (IoU) between predicted and ground-truth bounding boxes, as well as a global layout loss:

$$\mathcal{L}_{\mathrm{layout}} = \sum_{r \in \mathrm{regions}} \bigl(1 - \mathrm{IoU}(r_{\mathrm{pred}},\, r_{\mathrm{gt}})\bigr)$$

Automated and human-in-the-loop QC incorporate these calculations to enforce document-level structural integrity (Nonesung et al., 21 Jan 2026).
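The IoU metric and the layout loss are standard quantities and can be computed directly. A minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) form, assuming predicted and ground-truth regions are already matched one-to-one:

```python
def iou(a, b):
    """Intersection-over-union for axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def layout_loss(pred, gt):
    """Global layout loss: sum over regions of (1 - IoU(r_pred, r_gt))."""
    return sum(1.0 - iou(p, g) for p, g in zip(pred, gt))
```

A perfect prediction yields a loss of zero; each fully missed region contributes 1 to the sum.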

5. Performance Evaluation and Benchmarks

Empirical performance of the models trained on the Thai-focused datasets is assessed using BLEU-n (n-gram precision), ROUGE-L (longest common subsequence), and normalized character error rate (CER). Comparative results on held-out Thai documents demonstrate the superiority of Typhoon OCR V1 and V1.5 over proprietary frontier models such as GPT-4o, Gemini 2.5, and GPT-5, especially on structured or semantically regular categories:

| Category | Gemini 2.5 Pro | GPT-5 | Typhoon OCR V1 7B | Typhoon OCR V1.5 2B |
|---|---|---|---|---|
| Thai Books | 0.512 / 0.676 / 0.334 | 0.710 / 0.922 / 0.084 | 0.708 / 0.871 / 0.136 | 0.746 / 0.949 / 0.053 |
| Thai Gov. Forms | 0.797 / 0.894 / 0.096 | 0.569 / 0.706 / 0.267 | 0.849 / 0.942 / 0.065 | 0.870 / 0.967 / 0.035 |
| Thai Fin. Reports | 0.657 / 0.757 / 0.256 | 0.457 / 0.603 / 0.356 | 0.849 / 0.933 / 0.082 | 0.819 / 0.910 / 0.079 |
| Infographics | 0.465 / 0.677 / 0.380 | 0.297 / 0.481 / 0.561 | 0.246 / 0.373 / 0.671 | 0.408 / 0.527 / 0.544 |
| Handwritten Forms | 0.594 / 0.739 / 0.327 | 0.368 / 0.514 / 0.533 | 0.321 / 0.454 / 0.556 | 0.522 / 0.645 / 0.416 |
| Others | 0.603 / 0.716 / 0.342 | 0.352 / 0.482 / 0.540 | 0.376 / 0.541 / 0.480 | 0.499 / 0.645 / 0.377 |
| Average | 0.605 / 0.743 / 0.289 | 0.459 / 0.618 / 0.390 | 0.558 / 0.686 / 0.332 | 0.644 / 0.774 / 0.251 |

(Each cell reports BLEU / ROUGE-L / CER; higher is better for BLEU and ROUGE-L, lower for CER.) (Nonesung et al., 21 Jan 2026).
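Of the three metrics, CER is the simplest to reproduce: it is the character-level edit (Levenshtein) distance normalized by the reference length. A minimal sketch, not the paper's evaluation code:

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein distance / len(reference)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # DP row for the empty reference prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / m if m else 0.0
```

This works unchanged on Thai strings, since Python strings are sequences of Unicode code points.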

Typhoon OCR V1.5 (2B parameters) matches or exceeds the 7B-parameter predecessor and consistently outperforms proprietary benchmarks on structured document categories. On visually heterogeneous data (infographics, handwriting), the performance gap narrows but CER remains higher than the best proprietary systems. Quantization-aware training and standardized preprocessing (fixed image width 1,800 px) further enhance deployment efficiency with negligible loss in BLEU, supporting int8 (or lower) inference and simplifying integration into containerized or edge deployments.
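The fixed-width preprocessing mentioned above amounts to scaling every page image to 1,800 px wide while preserving its aspect ratio. A minimal sketch of the dimension computation (the function name is illustrative; the actual image resampling would be done with an imaging library):

```python
def resize_dims(width, height, target_width=1800):
    """Return (new_width, new_height) for a fixed-width resize,
    preserving aspect ratio, as in the V1.5 standardized preprocessing."""
    scale = target_width / width
    return target_width, round(height * scale)
```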

6. Significance and Future Research Directions

The Thai-focused training dataset paradigm, exemplified by Typhoon OCR, substantiates the efficacy of pairing open VLM backbones with highly targeted, rigorously curated corpus design—including synthetic augmentation, hierarchical restructuring, and multimodal annotation. These advances enable accurate OCR and document layout modeling for underrepresented scripts at a fraction of the computational resource requirements of prevailing proprietary alternatives. Potential future directions include expansion to further Southeast Asian scripts, enhanced augmentation pipelines for extreme handwriting and low-quality scans, and comprehensive exploration of self-supervised and active learning for ongoing corpus enrichment (Nonesung et al., 21 Jan 2026).

