NVIDIA Nemotron Parse 1.1: Advanced VLT OCR
- NVIDIA Nemotron Parse 1.1 is a lightweight Vision-Language Transformer model for end-to-end document parsing and OCR, extracting formatted text, tables, and multi-modal elements.
- It employs an encoder–decoder architecture with a high-capacity vision encoder and a 10-layer mBART decoder, totaling 885M parameters for processing complex documents.
- Innovations include advanced table parsing using LaTeX, multi-modal extraction from charts and diagrams, and token compression to boost throughput by up to 20%.
NVIDIA Nemotron Parse 1.1 is a lightweight end-to-end Vision-Language Transformer (VLT) model tailored for advanced document parsing and OCR. As a successor to Nemotron-Parse 1.0 (Eclair), version 1.1 expands the capabilities to structured extraction of formatted text, tables (in LaTeX), and multi-modal text recovery from embedded charts, diagrams, and pictures, while maintaining high throughput suitable for both research and large-scale deployment. Nemotron-Parse 1.1 operates with an encoder-decoder architecture comprising 885M parameters, rapidly processing complex documents while supporting longer output sequence lengths and providing bounding boxes and semantic class labels for extracted text (Chumachenko et al., 25 Nov 2025).
1. System Architecture and Parameterization
Nemotron-Parse 1.1 is structured as a classical encoder–decoder Transformer:
- Vision Encoder: RADIO v2.5 (ViT-H/16 backbone, 657M parameters) encodes the input image into dense feature tokens.
- Convolutional Neck: A horizontal kernel compresses sequence length to 3,200 tokens (for a 1,648 × 2,048 page), with one summary token added (final ≈3,201 encoder tokens).
- Language Decoder: A 10-layer mBART (256M parameters, tied weights) predicts output sequence tokens, omitting positional embeddings to enable support for output sequences well beyond 1,000 tokens.
- No Positional Embeddings in Decoder: The omission of decoder positional embeddings avoids interpolation artifacts for long sequences by relying solely on causal masking.
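To make the last point concrete, here is a minimal NumPy sketch of causal self-attention with no positional embeddings: ordering information enters only through the causal mask, as in the decoder described above. The single-head, projection-free form is an illustrative simplification, not the actual mBART layer.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head causal self-attention over a sequence x of shape (T, d).

    No positional embeddings are added; each position can only attend to
    itself and earlier positions via the causal mask. Illustrative toy:
    x is used directly as queries/keys/values (no learned projections).
    """
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                      # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above diagonal = future
    scores[mask] = -np.inf                             # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

x = np.random.default_rng(0).standard_normal((5, 8))
out, w = causal_self_attention(x)
# Upper triangle of w is exactly zero: no token sees the future.
```

Because nothing here encodes absolute position, the same layer applies unchanged to sequences of any length, which is the property that lets the decoder generalize well beyond 1,000 output tokens.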
Parameter breakdown:
| Module | Parameter Count | Remarks |
|---|---|---|
| Vision Encoder | 657M | RADIO v2.5 (ViT-H/16) |
| Neck | Negligible | Convolutional, non-transformer |
| Decoder | 256M | 10-layer mBART, tied weights |
| Total | 885,766,720 | |
A variant, Nemotron-Parse-1.1-TC ("token-compressed," Editor's term), additionally applies a pixel-shuffle to the neck output, reducing the vision token count from 3,200 to 833 by compressing each spatial dimension.
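A pixel-shuffle trades spatial resolution for channel depth without discarding information. The sketch below shows the operation on a small NumPy token grid; the 2×2 folding factor and grid size are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def pixel_shuffle_tokens(feats, r=2):
    """Fold r x r spatial blocks of vision tokens into the channel dimension.

    feats: (H, W, C) grid of tokens -> (H//r, W//r, C*r*r), cutting the
    token count by r**2 while preserving every feature value. Illustrative
    of the TC variant's neck compression (3,200 -> 833 tokens in the model).
    """
    H, W, C = feats.shape
    assert H % r == 0 and W % r == 0
    x = feats.reshape(H // r, r, W // r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)        # group each r x r block together
    return x.reshape(H // r, W // r, C * r * r)

grid = np.arange(8 * 8 * 4, dtype=np.float32).reshape(8, 8, 4)
compressed = pixel_shuffle_tokens(grid)   # 64 tokens -> 16 tokens, 4x fewer
```

The decoder then cross-attends over far fewer (but wider) tokens, which is where the throughput gain comes from.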
2. Innovations Over Nemotron-Parse 1.0
Nemotron-Parse 1.1 introduces several enhancements relative to version 1.0:
- Data Expansion: Leveraged expanded multilingual dense-text data sources (NVpdftex synthetic, Common Crawl, Wikipedia OCR) for broader OCR training.
- Structured Formatting: Advanced parsing generates Markdown and LaTeX, with tables reproduced in LaTeX and inline math handled either via LaTeX or super/sub-script encoding.
- Improved Table Parsing: End-to-end generation of LaTeX-based table structures, achieving strong TEDS (Tree-Edit Distance-based Similarity) and S-TEDS metrics.
- Multi-modal Extraction: Extended training with VQA-style crops and datasets (DocLayNet) to support text extraction from images, charts, and diagrams, not just scanned text.
- Extended Output Range: Elimination of decoder positional embeddings enables coherent sequence generation for outputs far exceeding 1,000 tokens.
- Inference Speed-up: Multi-token prediction heads trained for parallel next-token generation accelerate inference by decoding tokens in groups, reducing per-token latency.
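The multi-token idea in the last bullet can be sketched as follows: several output heads read the same decoder hidden state, with head i proposing the token at offset i, so one forward pass emits a group of tokens. All weights and sizes below are toy assumptions; real multi-token heads also condition on previously predicted embeddings (see Section 3).

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, k = 16, 50, 3        # hidden size, vocab size, group size (all illustrative)

# One standard LM head plus (k - 1) auxiliary heads.
heads = [rng.standard_normal((d, vocab)) for _ in range(k)]

def predict_group(h):
    """Predict k tokens from a single decoder hidden state h.

    Head i proposes the token at offset i, so one decoder pass yields a
    group of k tokens instead of one -- the source of the per-token
    latency reduction described above.
    """
    return [int(np.argmax(h @ W)) for W in heads]

h = rng.standard_normal(d)
group = predict_group(h)       # k token ids from a single pass
```

Decoding in groups of k amortizes one decoder forward pass over k emitted tokens, cutting per-token latency roughly k-fold when the auxiliary predictions are accepted.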
3. Training Methodologies and Objective Functions
Nemotron-Parse 1.1 is trained end-to-end with a composite loss integrating both text generation and bounding box regression:
- Cross-Entropy Loss for Text Generation:

  $$\mathcal{L}_{\text{CE}} = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t}, \mathbf{x}, \mathbf{c}\right)$$

  where $\mathbf{c}$ are prompt tokens controlling output markup/class prediction.
- Bounding Box Regression:

  $$\mathcal{L}_{\text{bbox}} = \left\lVert \hat{\mathbf{b}} - \mathbf{b} \right\rVert_1$$

  with $\mathbf{b} \in [0,1]^4$ as normalized coordinates.
- Total Loss Function:

  $$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{bbox}}$$
- Multi-token Prediction: Auxiliary heads combine the last ground-truth embedding with previously predicted hidden-state embeddings, enabling group prediction during both training and inference.
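A minimal NumPy sketch of the composite objective, combining token cross-entropy with an L1 penalty on normalized boxes. The weighting `lam` and all tensor shapes are assumptions for illustration; the paper's exact loss weighting is not reproduced here.

```python
import numpy as np

def composite_loss(logits, targets, bbox_pred, bbox_true, lam=1.0):
    """Toy composite loss: mean token cross-entropy + lam * mean L1 box error.

    logits: (T, V) decoder outputs; targets: (T,) token ids;
    bbox_*: [0, 1]-normalized box coordinates. `lam` is an assumed weight.
    """
    # Numerically stable log-softmax, then pick out the target tokens.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(targets)), targets].mean()
    # L1 regression on the normalized box coordinates.
    l1 = np.abs(bbox_pred - bbox_true).mean()
    return ce + lam * l1

logits = np.random.default_rng(1).standard_normal((6, 10))
targets = np.array([1, 2, 3, 0, 5, 9])
box = np.array([0.1, 0.2, 0.8, 0.9])
loss = composite_loss(logits, targets, box, box)   # bbox term vanishes here
```

With perfectly predicted boxes the loss reduces to the cross-entropy term alone; any box error adds a linear penalty scaled by `lam`.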
4. Benchmark Performance and Comparative Results
Nemotron-Parse 1.1 demonstrates competitive or state-of-the-art performance across several public and internal benchmarks. Key comparative benchmark results include:
| Model-Method | Mask Out | WER ↓ | F1 ↑ |
|---|---|---|---|
| Kosmos-2.5 (ocr-mode) | + | 0.195 | 0.937 |
| Kosmos-2.5 (md-mode) | + | 0.249 | 0.890 |
| GOT (ocr-mode) | + | 0.302 | 0.818 |
| Nemotron-Parse-MIP | - | 0.109 | 0.958 |
| Nemotron-Parse-MIP | + | 0.102 | 0.957 |
| Nemotron-Parse-TC-MIP | - | 0.111 | 0.953 |
| Nemotron-Parse-TC-MIP | + | 0.121 | 0.949 |
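For reference, the WER column above is word-level Levenshtein distance normalized by reference length. A self-contained implementation (standard dynamic programming, not the benchmark's exact scoring script):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # prev[j] = edit distance between the first i-1 ref words and h[:j].
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rw != hw)))    # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)

word_error_rate("the quick brown fox", "the quick brown dog")  # 0.25
```

So a WER of 0.102 means roughly one word-level error per ten reference words, and lower is better, as the ↓ arrow in the header indicates.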
On the GOT benchmark:
| Extractor | OCR/F1 Score ↑ | Text-Only RO/Edit Dist. ↓ | Text-Only RO/METEOR ↑ | Text-Only RO/BLEU ↑ |
|---|---|---|---|---|
| Pdfium | 0.0036 | 0.9993 | 0.0083 | 0.0000 |
| Docling | 0.6744 | 0.4300 | 0.6331 | 0.4651 |
| Nemotron-Parse-1.1 | 0.9785 | 0.014 | 0.9858 | 0.9623 |
| Nemotron-Parse-1.1-TC | 0.9755 | 0.014 | 0.9838 | 0.9582 |
On OmniDocBench 1.0, Nemotron-Parse 1.1 exhibits table, formula, and reading-order scores that outperform all other end-to-end models under a 4,000 vision-token constraint. PubTabNet and RD-TableBench benchmarks show TEDS > 80 and table similarity ≈ 86.
5. Token-Compressed Variant (Nemotron-Parse-1.1-TC) and Throughput
The token-compressed (TC) variant utilizes a pixel-shuffle operation to decrease vision encoder output to 833 tokens. This modification yields a 20% throughput gain (from ~3,800 tokens/sec to 4,500 tokens/sec or ~4–5 pages/sec at 1,000 tokens/page) with negligible impact on main accuracy metrics (F1 drop ≤0.005 on GOT; ≤0.003 on OmniDocBench). Notably, reading-order consistency slightly improves due to integrated float-element ordering.
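The quoted throughput figures can be checked with back-of-the-envelope arithmetic; the 1,000 tokens/page figure is the per-page output length assumed in the text above.

```python
# Throughput figures quoted for the TC variant vs. the base model.
base_tps, tc_tps = 3_800, 4_500        # decoded tokens per second
tokens_per_page = 1_000                # assumed output tokens per page

speedup = tc_tps / base_tps - 1        # relative throughput gain (~0.18)
pages_per_sec = tc_tps / tokens_per_page
```

This works out to roughly an 18% token-rate gain (consistent with the "up to 20%" figure) and about 4.5 pages/sec at the assumed page length.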
6. Released Artifacts, Integrations, and Deployment
NVIDIA provides both model weights (fp32/bf16) and an optimized NIM inference container:
- Model Weights (HuggingFace):
- Nemotron-Parse-v1.1: https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1
- Nemotron-Parse-v1.1-TC: https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1-TC
- NIM Container: Optimized for production scaling—https://build.nvidia.com/nvidia/nemotron-parse
- Datasets: The Nemotron-VLM-Dataset-v2, including NVpdftex synthetic, Common Crawl, DocLayNet, and PubTabNet.
Deployment scenarios encompass batch scientific paper processing, interactive document understanding, and edge/cloud-edge orchestration due to the favorable tradeoff between resource usage and throughput in the TC variant. For in-domain adaptation, continued supervised fine-tuning with customized prompt tokens (JSON, HTML, semantic classes) is supported. Knowledge distillation enables further parameter budget reduction for resource-constrained use cases.
Integration with LLM back-ends enables complex downstream tasks such as semantic question answering or table querying, promoting document understanding workflows from pixels to structured and semantically annotated textual representations (Chumachenko et al., 25 Nov 2025).