Nemotron-VLM-Dataset-v2 Overview

Updated 21 January 2026
  • Nemotron-VLM-Dataset-v2 is a large-scale, multi-source benchmark enabling evaluation of vision-language models for document parsing and OCR.
  • The dataset includes 22M high-resolution document images from synthetic and human-annotated sources across diverse languages.
  • It provides detailed annotations with block-level semantic classes, reading order, and bounding-box coordinates for structured data extraction.

Nemotron-VLM-Dataset-v2 is a large-scale, multi-source document understanding benchmark released as part of the “NVIDIA Nemotron-Parse 1.1” project, designed to facilitate the development and evaluation of advanced document parsing, OCR, and structured data extraction models. Publicly distributed as a subset of the training data for Nemotron-Parse 1.1, the dataset emphasizes diversity in language, document layout, annotation granularity, and semantic class coverage. Its construction, scale, and annotation schema make it a valuable foundation for research on vision-language models (VLMs) and document intelligence systems (Chumachenko et al., 25 Nov 2025).

1. Dataset Composition and Modalities

Nemotron-VLM-Dataset-v2 comprises approximately 22 million high-resolution document images and pages, sourced from both synthetic and human-annotated corpora. The dataset targets multi-modal document analysis tasks, supporting the extraction of structured semantics from complex layouts. It includes content in multiple languages, reflects varied document forms, and contains annotations at block level for bounding boxes, semantic classes, and reading order (total order per page).

The following primary modalities and annotation types are present:

  • Scanned Documents: High-resolution PDF renders from arXiv papers, Common Crawl, and Wikipedia.
  • Synthetic Renders: HTML-to-LaTeX compiled pages, synthetic table images, and artificial multi-language text blocks.
  • Public Benchmarks: Datasets such as PubTables-1M, FinTabNet, SynthTabNet, DocLayNet, and TabRecSet.
  • Annotation Layers: Each text block is annotated with a bounding-box, a semantic class (e.g., Title, Formula, Table, etc.), and its sequential position via reading-order labels.
  • Multilinguality: Documents and OCR content in English, Chinese, Japanese, Korean, German, French, Italian, Spanish, Dutch, Portuguese, Latin, and Greek.
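The annotation layers above can be mirrored in a small record type. This is a hedged sketch: the `BlockAnnotation` class and field names are illustrative, not the dataset's actual on-disk schema.

```python
from dataclasses import dataclass

# Illustrative record mirroring the annotation layers described above:
# a bounding box, a semantic class, and a reading-order index per text block.
@dataclass
class BlockAnnotation:
    bbox: tuple          # (x0, y0, x1, y1) in relative coordinates
    semantic_class: str  # e.g. "Title", "Formula", "Table"
    reading_order: int   # position in the page's total order
    text: str            # OCR / transcription content

page = [
    BlockAnnotation((0.1, 0.12, 0.9, 0.40), "Text", 1, "Body paragraph..."),
    BlockAnnotation((0.1, 0.05, 0.9, 0.10), "Title", 0, "Document title"),
]

# Recover the page's total reading order by sorting on the order index.
ordered = sorted(page, key=lambda b: b.reading_order)
```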

2. Breakdown of Sources and Annotations

The dataset aggregates contributions from multiple published sources and synthetic generators. The publicly released training set is organized as shown below:

| Source | # Samples | Annotation Types | Languages |
|---|---|---|---|
| Multilingual arXiv (NVpdftex) | 8.3 M | Boxes + Structured Layout + Classes | En, Zh, De, Es, Fr, It, Ja |
| Multilingual Wikipedia OCR | 9.5 M | Boxes + Structured Layout + Classes | En, Fr, De, Es, It, Nl, Pt, Ja, Ko, Zh |
| Multilingual Synthetic OCR | 3.5 M | Boxes + Plain+Structured + Classes | En, Zh, Ja, Ko, Latin, Greek |
| PubTables-1M | 585 K | Boxes + Table Structure + Classes | En |
| SynthTabNet | 480 K | Boxes + Table Structure + Classes | En |
| Common Crawl, human-labeled | 255 K | Boxes + Plain+Structured + Classes | Various |
| FinTabNet | 91.5 K | Boxes + Table Structure + Classes | En |
| DocLayNet (expanded) | 56 K | Boxes + Mixed Plain/Structured + Classes | En |
| TabRecSet | 38.2 K | Boxes + Table Structure + Classes | En, Zh |
| Synthetic Tables | 26 K | Boxes + Table Structure + Classes | En |

No official validation or test splits are included; users are encouraged to partition held-out evaluation subsets as appropriate for specific experimental protocols.
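Since no official splits ship with the dataset, a deterministic held-out split can be carved from the training samples. A minimal sketch, assuming samples are addressed by opaque IDs (the `make_split` helper is illustrative, not an official utility):

```python
import random

def make_split(sample_ids, val_frac=0.05, seed=0):
    """Deterministically partition sample IDs into (train, val).

    The dataset is released train-only, so users define their own
    held-out evaluation subset; seeding keeps the split reproducible.
    """
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    n_val = max(1, int(len(ids) * val_frac))
    return ids[n_val:], ids[:n_val]

train_ids, val_ids = make_split(range(1000), val_frac=0.05)
```

Because the split is keyed on IDs rather than array positions, the same partition can be reapplied across preprocessing runs.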

3. Annotation Schema and Block-Level Semantics

Annotations follow a standardized schema enabling fine-grained structural and semantic supervision:

  • Semantic Class Labels: Each block is annotated with one of the following classes: Page-Header, Title, Section-Header, Text (Paragraph), List-Item, Formula, Table, Picture, Caption, Footnote, Page-Footer.
  • Bounding-Box Specification: Coordinates are normalized to a 1024×1280 reference grid, with the top-left and bottom-right corners of each text span specified as $(x_{rel}, y_{rel})$, where $x_{rel} = x_{pixel}/1024$ and $y_{rel} = y_{pixel}/1280$.
  • Reading Order: A total ordering of blocks per page is provided to facilitate training of layout-aware models.
  • Data Format: The model emits output sequences in the canonical form:

    ```
    <x_u> <y_v> TEXT <x_{u'}> <y_{v'}> <class_NAME>
    ```

    This format enables precise spatial and semantic alignment between model predictions and target annotations.
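A sequence in this canonical form can be parsed back into structured blocks with a simple pattern match. This is a sketch under assumptions: the exact token spellings are set by the Nemotron-Parse tokenizer, and the regex below only illustrates the `<x_u> <y_v> TEXT <x_u'> <y_v'> <class_NAME>` shape described above.

```python
import re

# Illustrative pattern for sequences like:
#   <x_10> <y_20> Hello world <x_500> <y_60> <class_Title>
# Class names may contain hyphens (e.g. "Page-Header", "List-Item").
PATTERN = re.compile(
    r"<x_(\d+)>\s*<y_(\d+)>\s*(.*?)\s*<x_(\d+)>\s*<y_(\d+)>\s*<class_([A-Za-z-]+)>",
    re.DOTALL,
)

def parse_blocks(sequence: str):
    """Extract (bbox, text, class) triples from a model output sequence."""
    blocks = []
    for x0, y0, text, x1, y1, cls in PATTERN.findall(sequence):
        blocks.append({
            "bbox": (int(x0), int(y0), int(x1), int(y1)),
            "text": text,
            "class": cls,
        })
    return blocks

out = parse_blocks("<x_10> <y_20> Hello world <x_500> <y_60> <class_Title>")
```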

4. Preprocessing and Technical Features

Image and token preprocessing reflect the requirements of high-fidelity document analysis:

  • Resolution: All images are preserved at the high resolution produced by the NVpdftex LaTeX-to-PDF pipeline (commonly 1648×2048 px per page). No explicit resizing or binarization is applied beyond what the generation pipeline itself performs.
  • Tokenization: Text is tokenized using the mBART subword vocabulary (approximately 250,000 word-pieces).
  • Compact Representation: Bounding-boxes use relative coordinates for efficiency and compatibility across source page dimensions.

This preprocessing ensures consistency across heterogeneous data sources and supports robust model training and evaluation.
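Mapping page-pixel boxes onto the 1024×1280 reference grid is a simple rescale. A minimal sketch, assuming bounding boxes arrive as pixel-space corner pairs (the `to_grid` helper is hypothetical, not part of the dataset tooling):

```python
def to_grid(bbox_px, page_w, page_h, grid_w=1024, grid_h=1280):
    """Rescale pixel-space corners onto the 1024x1280 reference grid,
    making boxes comparable across heterogeneous source page sizes."""
    x0, y0, x1, y1 = bbox_px
    sx, sy = grid_w / page_w, grid_h / page_h
    return (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))

# A box on a 1648x2048 px NVpdftex render, mapped onto the grid:
grid_box = to_grid((824, 1024, 1648, 2048), 1648, 2048)  # (512, 640, 1024, 1280)
```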

5. Licensing, Distribution, and Access

Nemotron-VLM-Dataset-v2 is distributed under NVIDIA’s standard open-source license. The complete dataset, model weights, and related resources are available on Hugging Face.

Users are requested to cite Karmanov et al., “NVIDIA Nemotron-Parse 1.1,” 2025 when using this dataset (Chumachenko et al., 25 Nov 2025).

6. Practical Considerations and Usage Recommendations

Best practices for leveraging Nemotron-VLM-Dataset-v2 include:

  • Joint Training: Models should be jointly trained across heterogeneous sources, mapping each annotation schema to a uniform prompt interface.
  • Prompting Strategy: Employ the maximal-information prompt (MIP): <output_markdown><predict_bbox><predict_classes>.
  • Generalization: Omit explicit 1D positional embeddings in the decoder (“NoPE”), promoting generalization to arbitrarily long output sequences.
  • Training Efficiency: Use multi-token prediction to accelerate convergence and improve inference accuracy.
  • Evaluation: The creators recommend users define their own held-out sets due to absence of standardized splits.

The dataset has supported the development of Nemotron-Parse 1.1, which demonstrates strong accuracy on internal and public benchmarks for OCR, reading-order prediction, and structured table parsing, highlighting its effectiveness as a resource for advanced document understanding research (Chumachenko et al., 25 Nov 2025).
