Nemotron-VLM-Dataset-v2 Overview
- Nemotron-VLM-Dataset-v2 is a large-scale, multi-source benchmark enabling evaluation of visual language models for document parsing and OCR.
- The dataset includes 22M high-resolution document images from synthetic and human-annotated sources across diverse languages.
- It provides detailed annotations with block-level semantic classes, reading order, and bounding-box coordinates for structured data extraction.
Nemotron-VLM-Dataset-v2 is a large-scale, multi-source, document understanding benchmark released as part of the “NVIDIA Nemotron-Parse 1.1” project, designed to facilitate the development and evaluation of advanced document parsing, OCR, and structured data extraction models. Publicly distributed as a subset of the training data for Nemotron-Parse 1.1, the dataset emphasizes diversity in language, document layout, annotation granularity, and semantic class coverage. Its construction, scale, and annotation schema render it a valuable foundation for research on vision-language models (VLMs) and document intelligence systems (Chumachenko et al., 25 Nov 2025).
1. Dataset Composition and Modalities
Nemotron-VLM-Dataset-v2 comprises approximately 22 million high-resolution document images and pages, sourced from both synthetic and human-annotated corpora. The dataset targets multi-modal document analysis tasks, supporting the extraction of structured semantics from complex layouts. It includes content in multiple languages, reflects varied document forms, and contains annotations at block level for bounding boxes, semantic classes, and reading order (total order per page).
The following primary modalities and annotation types are present:
- Scanned Documents: High-resolution PDF renders from arXiv papers, Common Crawl, and Wikipedia.
- Synthetic Renders: HTML-to-LaTeX compiled pages, synthetic table images, and artificial multi-language text blocks.
- Public Benchmarks: Datasets such as PubTables-1M, FinTabNet, SynthTabNet, DocLayNet, and TabRecSet.
- Annotation Layers: Each text block is annotated with a bounding-box, a semantic class (e.g., Title, Formula, Table, etc.), and its sequential position via reading-order labels.
- Multilinguality: Documents and OCR content in English, Chinese, Japanese, Korean, German, French, Italian, Spanish, Dutch, Portuguese, Latin, and Greek.
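The block-level annotation layers above can be modeled as a simple record type. The following is an illustrative sketch; the field names are assumptions, not the dataset's actual serialization format:

```python
from dataclasses import dataclass

# Hypothetical block record; field names are illustrative,
# not the dataset's actual on-disk serialization.
@dataclass
class Block:
    bbox: tuple[int, int, int, int]  # (x1, y1, x2, y2) on the normalized grid
    semantic_class: str              # e.g. "Title", "Formula", "Table"
    reading_order: int               # position in the page's total order
    text: str                        # OCR / transcription content

# Toy page with two annotated blocks.
page = [
    Block((40, 120, 984, 400), "Text", 1, "The dominant approach to ..."),
    Block((40, 32, 984, 96), "Title", 0, "A Sample Document Title"),
]

# Sorting blocks by reading order reconstructs the page's text flow.
reading_flow = [b.text for b in sorted(page, key=lambda b: b.reading_order)]
```

Keeping reading order as an explicit per-block integer (rather than relying on list position) makes the total ordering robust to any on-disk shuffling of blocks.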
2. Breakdown of Sources and Annotations
The dataset aggregates contributions from multiple published sources and synthetic generators. The publicly released training set is organized as shown below:
| Source | # Samples | Annotation Types | Languages |
|---|---|---|---|
| Multilingual arXiv (NVpdftex) | 8.3 M | Boxes + Structured Layout + Classes | En, Zh, De, Es, Fr, It, Ja |
| Multilingual Wikipedia OCR | 9.5 M | Boxes + Structured Layout + Classes | En, Fr, De, Es, It, Nl, Pt, Ja, Ko, Zh |
| Multilingual Synthetic OCR | 3.5 M | Boxes + Plain+Structured + Classes | En, Zh, Ja, Ko, Latin, Greek |
| PubTables-1M | 585 K | Boxes + Table Structure + Classes | En |
| SynthTabNet | 480 K | Boxes + Table Structure + Classes | En |
| Common Crawl, human-labeled | 255 K | Boxes + Plain+Structured + Classes | Various |
| FinTabNet | 91.5 K | Boxes + Table Structure + Classes | En |
| DocLayNet (expanded) | 56 K | Boxes + Mixed Plain/Structured + Classes | En |
| TabRecSet | 38.2 K | Boxes + Table Structure + Classes | En, Zh |
| Synthetic Tables | 26 K | Boxes + Table Structure + Classes | En |
No official validation or test splits are included; users are encouraged to partition held-out evaluation subsets as appropriate for specific experimental protocols.
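Since no official splits ship with the dataset, one reproducible way to carve out a held-out subset (a sketch, not an official protocol) is to hash a stable per-sample identifier, so the split is deterministic across runs and machines:

```python
import hashlib

def split_of(sample_id: str, held_out_pct: float = 1.0) -> str:
    """Deterministically assign a sample to 'train' or 'heldout'.

    Hashing a stable identifier (e.g. an image filename) keeps the
    partition reproducible without storing an explicit split file.
    """
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "heldout" if bucket < held_out_pct * 100 else "train"

# Hypothetical sample identifiers; ~1% land in the held-out set.
ids = [f"arxiv_page_{i:07d}.png" for i in range(10_000)]
heldout = [s for s in ids if split_of(s) == "heldout"]
```

Because assignment depends only on the identifier, adding or removing other samples never moves an existing sample between splits.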
3. Annotation Schema and Block-Level Semantics
Annotations follow a standardized schema enabling fine-grained structural and semantic supervision:
- Semantic Class Labels: Each block is annotated with one of the following classes: Page-Header, Title, Section-Header, Text (Paragraph), List-Item, Formula, Table, Picture, Caption, Footnote, Page-Footer.
- Bounding-Box Specification: Coordinates are normalized to a 1024×1280 reference grid, with each text span specified by its top-left corner (x_u, y_v) and bottom-right corner (x_{u'}, y_{v'}), where 0 ≤ x ≤ 1024 and 0 ≤ y ≤ 1280.
- Reading Order: A total ordering of blocks per page is provided to facilitate training of layout-aware models.
- Data Format: The model emits output sequences in the canonical form `<x_u> <y_v> TEXT <x_{u'}> <y_{v'}> <class_NAME>`. This format enables precise spatial and semantic alignment between model predictions and target annotations.
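Sequences in this canonical form can be decoded back into structured blocks with a short regular expression. This is an illustrative sketch: the concrete token spelling (`<x_40>`, `<class_Title>`, etc.) is an assumption based on the schematic format above, not a documented wire format:

```python
import re

# One block: <x_..> <y_..> TEXT <x_..> <y_..> <class_..>
# Token spelling is assumed from the schematic format, not documented.
BLOCK_RE = re.compile(
    r"<x_(\d+)>\s*<y_(\d+)>\s*(.*?)\s*<x_(\d+)>\s*<y_(\d+)>\s*<class_([A-Za-z-]+)>",
    re.DOTALL,
)

def parse_blocks(seq: str) -> list[dict]:
    """Decode a model output sequence into (bbox, text, class) records."""
    return [
        {
            "bbox": (int(x1), int(y1), int(x2), int(y2)),
            "text": text,
            "class": cls,
        }
        for x1, y1, text, x2, y2, cls in BLOCK_RE.findall(seq)
    ]

seq = "<x_40> <y_32>A Sample Title<x_984> <y_96> <class_Title>"
blocks = parse_blocks(seq)
```

The non-greedy `(.*?)` stops the text capture at the next coordinate token, so multiple blocks concatenated in one sequence parse cleanly.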
4. Preprocessing and Technical Features
Image and token preprocessing reflect the requirements of high-fidelity document analysis:
- Resolution: Images are preserved at high resolution; pages produced by the NVpdftex LaTeX-to-PDF pipeline are commonly 1648×2048 px. No explicit resizing or binarization is applied beyond the generation pipeline itself.
- Tokenization: Text is tokenized using the mBART subword vocabulary (approximately 250,000 word-pieces).
- Compact Representation: Bounding-boxes use relative coordinates for efficiency and compatibility across source page dimensions.
This preprocessing ensures consistency across heterogeneous data sources and supports robust model training and evaluation.
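The relative-coordinate convention can be sketched as a mapping from source-page pixels onto the 1024×1280 reference grid. This is an illustrative helper under those stated dimensions, not the dataset's released tooling:

```python
def to_grid(bbox, page_w, page_h, grid_w=1024, grid_h=1280):
    """Map a pixel-space box (x1, y1, x2, y2) onto the normalized
    1024x1280 reference grid, independent of source page size."""
    x1, y1, x2, y2 = bbox
    return (
        round(x1 * grid_w / page_w),
        round(y1 * grid_h / page_h),
        round(x2 * grid_w / page_w),
        round(y2 * grid_h / page_h),
    )

# A box covering a full 1648x2048 NVpdftex page maps to the full grid.
full = to_grid((0, 0, 1648, 2048), 1648, 2048)
```

Because every source page maps to the same grid, boxes from arXiv renders, Wikipedia pages, and synthetic tables share one coordinate vocabulary in the output sequence.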
5. Licensing, Distribution, and Access
Nemotron-VLM-Dataset-v2 is distributed under NVIDIA’s standard open-source license. The complete dataset, model weights, and related resources are available on Huggingface:
- Dataset: https://huggingface.co/datasets/nvidia/Nemotron-VLM-Dataset-v2
- Model Weights: https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1 and https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1-TC
Users are requested to cite Karmanov et al., “NVIDIA Nemotron-Parse 1.1” (2025) when using this dataset (Chumachenko et al., 25 Nov 2025).
6. Practical Considerations and Usage Recommendations
Best practices for leveraging Nemotron-VLM-Dataset-v2 include:
- Joint Training: Models should be jointly trained across heterogeneous sources, mapping each annotation schema to a uniform prompt interface.
- Prompting Strategy: Employ the maximal-information prompt (MIP): `<output_markdown><predict_bbox><predict_classes>`.
- Generalization: Omit explicit 1D positional embeddings in the decoder (“NoPE”) to promote generalization to arbitrarily long output sequences.
- Training Efficiency: Use multi-token prediction to accelerate convergence and improve inference accuracy.
- Evaluation: The creators recommend users define their own held-out sets due to absence of standardized splits.
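For reading-order evaluation on a self-defined held-out set, one simple starting point (an illustrative metric choice, not one prescribed by the dataset authors) is pairwise order agreement between predicted and gold block sequences, a Kendall-tau-style score:

```python
from itertools import combinations

def pairwise_order_accuracy(pred_order, gold_order):
    """Fraction of block pairs whose relative order the prediction
    preserves (1.0 = perfect reading-order agreement)."""
    pos = {block: i for i, block in enumerate(pred_order)}
    pairs = list(combinations(gold_order, 2))
    agree = sum(1 for a, b in pairs if pos[a] < pos[b])
    return agree / len(pairs)

# Toy example: one adjacent pair swapped out of six total pairs.
gold = ["title", "para1", "para2", "footnote"]
pred = ["title", "para2", "para1", "footnote"]
score = pairwise_order_accuracy(pred, gold)
```

Pairwise agreement degrades gracefully with local swaps, unlike exact-match accuracy, which scores any imperfect ordering as zero.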
The dataset has supported the development of Nemotron-Parse 1.1, which demonstrates strong accuracy on internal and public benchmarks for OCR, reading-order prediction, and structured table parsing, highlighting its effectiveness as a resource for advanced document understanding research (Chumachenko et al., 25 Nov 2025).