Nemotron-Parse-TC: Token Compression for OCR
- Nemotron-Parse-TC is a lightweight document parsing and OCR model that applies aggressive token compression to reduce vision encoder tokens by 93%.
- It leverages a ViT-H/16 backbone with an auxiliary multi-token head to achieve a ~20% throughput increase while maintaining near-original accuracy.
- Optimized with NVIDIA Inference Microservices (NIM), the model balances enhanced processing speed with minimal degradation in key OCR and structured-extraction metrics.
Nemotron-Parse-TC (Token-Compression variant) is a high-throughput, lightweight document parsing and optical character recognition (OCR) model derived from NVIDIA's Nemotron-Parse-1.1. It focuses on efficient extraction and structured understanding of visually dense documents, including complex tables, markdown-formatted text, and semantically rich layouts. Distinguished by aggressive token compression in its vision encoder, Nemotron-Parse-TC achieves approximately 20% greater throughput with negligible degradation in accuracy compared to the original full-length model. Nemotron-Parse-TC is distributed with optimized NIM (NVIDIA Inference Microservices) containers, model weights, and partial training data as part of the Nemotron-VLM-v2 dataset (Chumachenko et al., 25 Nov 2025).
1. Model Architecture and Token-Compression Modifications
Nemotron-Parse-1.1 and the TC variant both utilize an encoder–decoder architecture comprising 885 million parameters. The vision encoder is a ViT-H/16 backbone (RADIOv2.5, 657M parameters) that maps an input image to a sequence of patch tokens. These features are subsampled by a horizontal convolutional "neck" (a strided convolution along the width), reducing the native patch sequence to 3200 tokens per document.
The language decoder is a 10-layer, 256M-parameter, tied-weights mBART transformer emitting a mixture of text, bounding-box tags, and semantic-class tags *without* explicit 1D positional encodings ("NoPE"), relying only on the causal mask for sequential order. An auxiliary multi-token head enables up to k tokens to be predicted in parallel per step under the same cross-entropy loss.
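The multi-token head can be pictured as k independent output projections applied to the same decoder state. The sketch below is a minimal illustration with made-up dimensions; the actual head layout, value of k, and weight shapes are assumptions, not the published implementation.

```python
import numpy as np

def multi_token_heads(hidden, weights):
    """Illustrative multi-token prediction: k independent linear heads
    map one decoder hidden state to k next-token distributions, so up
    to k tokens can be emitted per decoding step.
    (Hypothetical shapes; the real head layout is not specified here.)"""
    # hidden: (d_model,), weights: (k, vocab, d_model)
    logits = weights @ hidden              # (k, vocab)
    return logits.argmax(axis=-1)          # (k,) greedily chosen token ids

rng = np.random.default_rng(0)
d_model, vocab, k = 16, 32, 4
hidden = rng.standard_normal(d_model)
W = rng.standard_normal((k, vocab, d_model))
tokens = multi_token_heads(hidden, W)
print(tokens.shape)  # one decoding step yields k token ids at once
```

In the real model these parallel predictions share the same cross-entropy objective as single-token decoding, so the head adds speed without a separate training loss.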
Nemotron-Parse-1.1-TC introduces a pixel-shuffle layer after the neck that collapses each small spatial block of vision features into a single channel-stacked token, downsampling the vision token sequence from 3200 to 833 tokens (roughly a 4× reduction). All subsequent attention and cross-attention layers consume only the 833-token sequence.
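The compression step is a standard space-to-depth (pixel-shuffle) rearrangement. A minimal sketch, assuming a 2×2 block size for illustration only (the model's exact block geometry is not reproduced here):

```python
import numpy as np

def pixel_shuffle_compress(feats, r=2):
    """Collapse each r x r spatial block of vision features into a single
    token by stacking the block along the channel axis (space-to-depth).
    Token count drops by r**2; channel width grows by r**2."""
    H, W, C = feats.shape
    assert H % r == 0 and W % r == 0
    x = feats.reshape(H // r, r, W // r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/r, W/r, r, r, C)
    return x.reshape(H // r, W // r, r * r * C)  # (H/r, W/r, r*r*C)

feats = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
out = pixel_shuffle_compress(feats, r=2)
print(feats.shape, "->", out.shape)  # (8, 8, 3) -> (4, 4, 12): 64 -> 16 tokens
```

No information is discarded at this step; the spatial reduction is traded for wider per-token channels, and downstream layers learn to read the stacked features.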
2. Tokenization, Attention, and Computational Complexity
No changes are made to the text tokenizer or decoder vocabulary; bounding-box coordinates and semantic classes remain encoded as explicit tokens. Self- and cross-attention mechanisms are algorithmically identical, but the shorter sequence reduces transformer attention cost quadratically: replacing 3200 vision tokens with 833 scales encoder self-attention cost by (833/3200)² ≈ 0.068, i.e., a ~93% reduction. The observed end-to-end speedup (~20%) is smaller because decoder compute and I/O remain bottlenecks.
GPU memory usage for activations over the vision tokens is reduced in proportion to token count, to roughly 26% (833/3200) of the original. Token generation speed on an NVIDIA H100 (bf16), averaged over 10,000 pages at 1,000 tokens/page, is 4,500 tokens/sec for the TC variant versus 3,800 tokens/sec for the full-length model, corresponding to 4.5 vs. 3.8 pages/sec, respectively.
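These figures can be checked back-of-envelope from the token counts and reported rates:

```python
# Back-of-envelope check: attention cost scales quadratically with
# sequence length, activation memory roughly linearly.
n_full, n_tc = 3200, 833

attn_ratio = (n_tc / n_full) ** 2      # quadratic attention cost
act_ratio = n_tc / n_full              # linear activation memory

print(f"attention cost: {attn_ratio:.1%} of original "
      f"(~{1 - attn_ratio:.0%} reduction)")
print(f"activation memory: {act_ratio:.1%} of original")

# pages/sec from the reported token rates at 1,000 tokens/page
print(f"pages/sec: full {3800/1000:.1f}, TC {4500/1000:.1f} "
      f"(+{4500/3800 - 1:.0%} throughput)")
```

The arithmetic reproduces the ~93% attention-cost reduction, ~26% activation footprint, and the roughly 20% end-to-end throughput gain quoted above.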
3. Benchmark Performance and Trade-offs
Nemotron-Parse-TC displays comparable accuracy to the full model across core benchmarks, with minor degradation and occasional improvements in order-based metrics.
| Model Variant | WER ↓ | F1 ↑ | Vision Tokens | OCR F1 ↑ | RO Edit ↓ | RO BLEU ↑ | OmniDocBench edit ↓ | Table TEDS/S-TEDS |
|---|---|---|---|---|---|---|---|---|
| Nemotron-Parse-1.1 | 0.102 | 0.957 | 3200 | 0.9785 | 0.014 | 0.9623 | 0.131 | 86.2/79.9 |
| Nemotron-Parse-1.1-TC | 0.121 | 0.949 | 833 | 0.9755 | 0.014 | 0.9582 | 0.129 | 85.3/79.6 |
In table extraction tasks (RD-TableBench, PubTabNet, OmniDocBench), performance differences are sub-1%. Reading-order F1 can modestly improve with TC's block grouping. Trade-off curves indicate that for the speed gain, losses in word error rate (WER) or table structure metrics (TEDS/S-TEDS) are typically below 1%.
4. Structured Extraction: Bounding Boxes and Semantic Classes
Nemotron-Parse-1.1-TC enforces explicit representation of both spatial and semantic entities. Each detected box is output as four coordinate tags (x_min, y_min, x_max, y_max), with coordinates normalized to a fixed reference grid.
Semantic-class tags (e.g., Title, Text, Section-Header, List-Item, Formula, Table, Picture, Caption, Footnote) follow each bounding-box tuple. All character recognition, localization, and classification tasks are optimized under a single autoregressive cross-entropy objective, $\mathcal{L} = -\sum_t \log p_\theta(y_t \mid y_{<t}, \text{image})$, with the target sequence interleaving markdown content, spatial tags, and class identifiers.
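To make the interleaved output format concrete, the sketch below parses a hypothetical serialization into structured records. The tag spellings (`<box …>`, `<class …>`) and the sample string are illustrative assumptions, not the model's actual token vocabulary.

```python
import re

# Hypothetical serialization for illustration only: we assume each region
# is emitted as a box tag, a class tag, then the region's markdown content.
SAMPLE = ("<box 10 20 300 60><class Title># Quarterly Report"
          "<box 10 80 300 400><class Table>| Q1 | Q2 |")

PATTERN = re.compile(r"<box (\d+) (\d+) (\d+) (\d+)><class (\w[\w-]*)>([^<]*)")

def parse_regions(seq):
    """Split an interleaved output string into (bbox, class, content) records."""
    return [((int(x0), int(y0), int(x1), int(y1)), cls, text)
            for x0, y0, x1, y1, cls, text in PATTERN.findall(seq)]

for bbox, cls, text in parse_regions(SAMPLE):
    print(cls, bbox, repr(text))
```

Because boxes, classes, and text share one output stream, a single autoregressive decoder handles detection, classification, and transcription jointly, which is what allows all three tasks to be trained under one cross-entropy loss.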
5. Deployment and Ecosystem Integration
Nemotron-Parse-1.1-TC is distributed via an optimized NIM (NVIDIA Inference Microservices) container using TensorRT, supporting bf16/fp32 precision and vLLM integration. The executable graph encapsulates the encoder, neck, and decoder components, attention-kernel fusion, and the multi-token heads, with dynamic memory planning tailored to the reduced token count.
Deployment metrics on a single H100 GPU (bf16) are as follows:
- Throughput: 3.8 pages/sec (full) vs. 4.5 pages/sec (TC variant) at 1,000 tokens/page
- Memory footprint: activation memory over vision tokens reduced by ~74% (3200 → 833 tokens)
- FLOPs per token: unchanged; total FLOPs over the vision-token sequence reduced roughly in proportion to the ~74% token reduction, with attention cost down ~93%
Weights, training-data subsets, and inference containers are distributed via Hugging Face and the Nemotron-VLM-v2 dataset.
6. Context and Significance
Nemotron-Parse-1.1-TC demonstrates that aggressive token-compression in vision transformers enables substantial throughput improvements for document parsing and OCR, with negligible accuracy degradation. This suggests that, for visually dense but structurally regular documents, spatial grouping followed by token sequence reduction can maintain critical information flow for downstream transformers.
A plausible implication is that similar token-compression schemes may be applicable to large-scale VLMs in production environments where resource constraints, latency, and throughput are paramount and semantic fidelity must be preserved. The near-flat accuracy–throughput trade-off curves reinforce the viability of compression-first approaches for visually structured data parsing.
Nemotron-Parse-1.1-TC is positioned as a reference lightweight solution for high-speed, high-fidelity extraction of structured document semantics, with extensibility for further work in large-scale visual-language modeling and information extraction (Chumachenko et al., 25 Nov 2025).