NVIDIA Nemotron Parse 1.1: Advanced VLT OCR
- NVIDIA Nemotron Parse 1.1 is a lightweight Vision-Language Transformer model for end-to-end document parsing and OCR, extracting formatted text, tables, and multi-modal elements.
- It employs an encoder–decoder architecture with a high-capacity vision encoder and a 10-layer mBART decoder, totaling 885M parameters for processing complex documents.
- Innovations include advanced table parsing using LaTeX, multi-modal extraction from charts and diagrams, and token compression to boost throughput by up to 20%.
NVIDIA Nemotron Parse 1.1 is a lightweight end-to-end Vision-Language Transformer (VLT) model tailored for advanced document parsing and OCR. As a successor to Nemotron-Parse 1.0 (Eclair), version 1.1 expands the capabilities to structured extraction of formatted text, tables (in LaTeX), and multi-modal text recovery from embedded charts, diagrams, and pictures, while maintaining high throughput suitable for both research and large-scale deployment. Nemotron-Parse 1.1 operates with an encoder-decoder architecture comprising 885M parameters, rapidly processing complex documents while supporting longer output sequence lengths and providing bounding boxes and semantic class labels for extracted text (Chumachenko et al., 25 Nov 2025).
1. System Architecture and Parameterization
Nemotron-Parse 1.1 is structured as a classical encoder–decoder Transformer:
- Vision Encoder: RADIO v2.5 (ViT-H/16 backbone, 657M parameters) encodes the input image into dense feature tokens.
- Convolutional Neck: A horizontal kernel compresses sequence length to 3,200 tokens (for a 1,648 × 2,048 page), with one summary token added (final ≈3,201 encoder tokens).
- Language Decoder: A 10-layer mBART (256M parameters, tied weights) predicts output sequence tokens, omitting positional embeddings to enable support for output sequences well beyond 1,000 tokens.
- No Positional Embeddings in Decoder: The omission of decoder positional embeddings avoids interpolation artifacts for long sequences by relying solely on causal masking.
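To make the last point concrete, here is a minimal NumPy sketch of causal self-attention with no positional embeddings: ordering information enters only through the causal mask, as in the decoder described above. The single-head, projection-free form is an illustrative simplification, not the actual mBART layer.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head causal self-attention over a sequence x of shape (T, d).

    No positional embeddings are added; each position can only attend to
    itself and earlier positions via the causal mask. Illustrative toy:
    x is used directly as queries/keys/values (no learned projections).
    """
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                      # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above diagonal = future
    scores[mask] = -np.inf                             # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

x = np.random.default_rng(0).standard_normal((5, 8))
out, w = causal_self_attention(x)
# Upper triangle of w is exactly zero: no token sees the future.
```

Because nothing here encodes absolute position, the same layer applies unchanged to sequences of any length, which is the property that lets the decoder generalize well beyond 1,000 output tokens.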
Parameter breakdown:
| Module | Parameter Count | Remarks |
|---|---|---|
| Vision Encoder | 657M | RADIO v2.5 (ViT-H/16) |
| Neck | Negligible | Convolutional, non-transformer |
| Decoder | 256M | 10-layer mBART, tied weights |
| Total | 885,766,720 | |
A variant, Nemotron-Parse-1.1-TC ("token-compressed," Editor's term), additionally applies a pixel-shuffle to the neck output, reducing the vision token count from 3,200 to 833 by compressing each spatial dimension.
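A pixel-shuffle trades spatial resolution for channel depth without discarding information. The sketch below shows the operation on a small NumPy token grid; the 2×2 folding factor and grid size are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def pixel_shuffle_tokens(feats, r=2):
    """Fold r x r spatial blocks of vision tokens into the channel dimension.

    feats: (H, W, C) grid of tokens -> (H//r, W//r, C*r*r), cutting the
    token count by r**2 while preserving every feature value. Illustrative
    of the TC variant's neck compression (3,200 -> 833 tokens in the model).
    """
    H, W, C = feats.shape
    assert H % r == 0 and W % r == 0
    x = feats.reshape(H // r, r, W // r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)        # group each r x r block together
    return x.reshape(H // r, W // r, C * r * r)

grid = np.arange(8 * 8 * 4, dtype=np.float32).reshape(8, 8, 4)
compressed = pixel_shuffle_tokens(grid)   # 64 tokens -> 16 tokens, 4x fewer
```

The decoder then cross-attends over far fewer (but wider) tokens, which is where the throughput gain comes from.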
2. Innovations Over Nemotron-Parse 1.0
Nemotron-Parse 1.1 introduces several enhancements relative to version 1.0:
- Data Expansion: Leveraged expanded multilingual dense-text data sources (NVpdftex synthetic, Common Crawl, Wikipedia OCR) for broader OCR training.
- Structured Formatting: Advanced parsing generates Markdown and LaTeX, with tables reproduced in LaTeX and inline math handled either via LaTeX or super/sub-script encoding.
- Improved Table Parsing: End-to-end generation of LaTeX-based table structures, achieving strong TEDS (Tree-Edit Distance-based Similarity) and S-TEDS metrics.
- Multi-modal Extraction: Extended training with VQA-style crops and datasets (DocLayNet) to support text extraction from images, charts, and diagrams, not just scanned text.
- Extended Output Range: Elimination of decoder positional embeddings enables coherent sequence generation for outputs far exceeding 1,000 tokens.
- Inference Speed-up: Multi-token prediction heads trained for parallel next-token generation accelerate inference by decoding tokens in groups, reducing per-token latency.
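The multi-token idea in the last bullet can be sketched as follows: several output heads read the same decoder hidden state, with head i proposing the token at offset i, so one forward pass emits a group of tokens. All weights and sizes below are toy assumptions; real multi-token heads also condition on previously predicted embeddings (see Section 3).

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, k = 16, 50, 3        # hidden size, vocab size, group size (all illustrative)

# One standard LM head plus (k - 1) auxiliary heads.
heads = [rng.standard_normal((d, vocab)) for _ in range(k)]

def predict_group(h):
    """Predict k tokens from a single decoder hidden state h.

    Head i proposes the token at offset i, so one decoder pass yields a
    group of k tokens instead of one -- the source of the per-token
    latency reduction described above.
    """
    return [int(np.argmax(h @ W)) for W in heads]

h = rng.standard_normal(d)
group = predict_group(h)       # k token ids from a single pass
```

Decoding in groups of k amortizes one decoder forward pass over k emitted tokens, cutting per-token latency roughly k-fold when the auxiliary predictions are accepted.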
3. Training Methodologies and Objective Functions
Nemotron-Parse 1.1 is trained end-to-end with a composite loss integrating both text generation and bounding box regression:
- Cross-Entropy Loss for Text Generation:

  $$\mathcal{L}_{\text{CE}} = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t}, \mathbf{x}, \mathbf{c}\right)$$

  where $\mathbf{c}$ are prompt tokens controlling output markup/class prediction.
- Bounding Box Regression:

  $$\mathcal{L}_{\text{bbox}} = \left\lVert \hat{\mathbf{b}} - \mathbf{b} \right\rVert_1$$

  with $\mathbf{b} \in [0,1]^4$ as normalized coordinates.
- Total Loss Function:

  $$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{bbox}}$$
- Multi-token Prediction: Auxiliary heads combine the last ground-truth embedding with previously predicted hidden-state embeddings, enabling group prediction during both training and inference.
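A minimal NumPy sketch of the composite objective, combining token cross-entropy with an L1 penalty on normalized boxes. The weighting `lam` and all tensor shapes are assumptions for illustration; the paper's exact loss weighting is not reproduced here.

```python
import numpy as np

def composite_loss(logits, targets, bbox_pred, bbox_true, lam=1.0):
    """Toy composite loss: mean token cross-entropy + lam * mean L1 box error.

    logits: (T, V) decoder outputs; targets: (T,) token ids;
    bbox_*: [0, 1]-normalized box coordinates. `lam` is an assumed weight.
    """
    # Numerically stable log-softmax, then pick out the target tokens.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(targets)), targets].mean()
    # L1 regression on the normalized box coordinates.
    l1 = np.abs(bbox_pred - bbox_true).mean()
    return ce + lam * l1

logits = np.random.default_rng(1).standard_normal((6, 10))
targets = np.array([1, 2, 3, 0, 5, 9])
box = np.array([0.1, 0.2, 0.8, 0.9])
loss = composite_loss(logits, targets, box, box)   # bbox term vanishes here
```

With perfectly predicted boxes the loss reduces to the cross-entropy term alone; any box error adds a linear penalty scaled by `lam`.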
4. Benchmark Performance and Comparative Results
Nemotron-Parse 1.1 demonstrates competitive or state-of-the-art performance across several public and internal benchmarks. Key comparative benchmark results include:
| Model-Method | Mask Out | WER ↓ | F1 ↑ |
|---|---|---|---|
| Kosmos-2.5 (ocr-mode) | + | 0.195 | 0.937 |
| Kosmos-2.5 (md-mode) | + | 0.249 | 0.890 |
| GOT (ocr-mode) | + | 0.302 | 0.818 |
| Nemotron-Parse-MIP | - | 0.109 | 0.958 |
| Nemotron-Parse-MIP | + | 0.102 | 0.957 |
| Nemotron-Parse-TC-MIP | - | 0.111 | 0.953 |
| Nemotron-Parse-TC-MIP | + | 0.121 | 0.949 |
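For reference, the WER column above is word-level Levenshtein distance normalized by reference length. A self-contained implementation (standard dynamic programming, not the benchmark's exact scoring script):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # prev[j] = edit distance between the first i-1 ref words and h[:j].
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rw != hw)))    # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)

word_error_rate("the quick brown fox", "the quick brown dog")  # 0.25
```

So a WER of 0.102 means roughly one word-level error per ten reference words, and lower is better, as the ↓ arrow in the header indicates.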
On the GOT benchmark:
| Extractor | OCR/F1 Score ↑ | Text-Only RO/Edit Dist. ↓ | Text-Only RO/METEOR ↑ | Text-Only RO/BLEU ↑ |
|---|---|---|---|---|
| Pdfium | 0.0036 | 0.9993 | 0.0083 | 0.0000 |
| Docling | 0.6744 | 0.4300 | 0.6331 | 0.4651 |
| Nemotron-Parse-1.1 | 0.9785 | 0.014 | 0.9858 | 0.9623 |
| Nemotron-Parse-1.1-TC | 0.9755 | 0.014 | 0.9838 | 0.9582 |
On OmniDocBench 1.0, Nemotron-Parse 1.1 exhibits table, formula, and reading-order scores that outperform all other end-to-end models under a 4,000 vision-token constraint. PubTabNet and RD-TableBench benchmarks show TEDS > 80 and table similarity ≈ 86.
5. Token-Compressed Variant (Nemotron-Parse-1.1-TC) and Throughput
The token-compressed (TC) variant utilizes a pixel-shuffle operation to decrease vision encoder output to 833 tokens. This modification yields a 20% throughput gain (from ~3,800 tokens/sec to 4,500 tokens/sec or ~4–5 pages/sec at 1,000 tokens/page) with negligible impact on main accuracy metrics (F1 drop ≤0.005 on GOT; ≤0.003 on OmniDocBench). Notably, reading-order consistency slightly improves due to integrated float-element ordering.
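The quoted throughput figures can be checked with back-of-the-envelope arithmetic; the 1,000 tokens/page figure is the per-page output length assumed in the text above.

```python
# Throughput figures quoted for the TC variant vs. the base model.
base_tps, tc_tps = 3_800, 4_500        # decoded tokens per second
tokens_per_page = 1_000                # assumed output tokens per page

speedup = tc_tps / base_tps - 1        # relative throughput gain (~0.18)
pages_per_sec = tc_tps / tokens_per_page
```

This works out to roughly an 18% token-rate gain (consistent with the "up to 20%" figure) and about 4.5 pages/sec at the assumed page length.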
6. Released Artifacts, Integrations, and Deployment
NVIDIA provides both model weights (fp32/bf16) and an optimized NIM inference container:
- Model Weights (HuggingFace):
- Nemotron-Parse-v1.1: https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1
- Nemotron-Parse-v1.1-TC: https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1-TC
- NIM Container: Optimized for production scaling—https://build.nvidia.com/nvidia/nemotron-parse
- Datasets: The Nemotron-VLM-Dataset-v2, including NVpdftex synthetic, Common Crawl, DocLayNet, and PubTabNet.
Deployment scenarios encompass batch scientific paper processing, interactive document understanding, and edge/cloud-edge orchestration due to the favorable tradeoff between resource usage and throughput in the TC variant. For in-domain adaptation, continued supervised fine-tuning with customized prompt tokens (JSON, HTML, semantic classes) is supported. Knowledge distillation enables further parameter budget reduction for resource-constrained use cases.
Integration with LLM back-ends enables complex downstream tasks such as semantic question answering or table querying, promoting document understanding workflows from pixels to structured and semantically annotated textual representations (Chumachenko et al., 25 Nov 2025).