Nemotron-Parse-TC: Token Compression for OCR

Updated 15 January 2026
  • Nemotron-Parse-TC is a lightweight document parsing and OCR model that applies aggressive token compression, shrinking the vision-token sequence roughly 4× (3200 → 833) and cutting encoder attention cost by ~93%.
  • It pairs a ViT-H/16 vision backbone with a decoder carrying an auxiliary multi-token head, achieving a ~20% throughput increase while maintaining near-original accuracy.
  • Packaged as an optimized NVIDIA NIM (NVIDIA Inference Microservices) container, the model balances higher processing speed against minimal degradation in key OCR and structured extraction metrics.

Nemotron-Parse-TC (Token-Compression variant) is a high-throughput, lightweight document parsing and optical character recognition (OCR) model derived from NVIDIA's Nemotron-Parse-1.1. It focuses on efficient extraction and structured understanding of visually dense documents, including complex tables, markdown-formatted text, and semantically rich layouts. Distinguished by aggressive token compression in its vision encoder, Nemotron-Parse-TC delivers approximately 20% greater throughput with negligible degradation in accuracy compared to the original full-length model. Nemotron-Parse-TC is distributed with optimized NIM (NVIDIA Inference Microservices) containers, model weights, and partial training data as part of the Nemotron-VLM-v2 dataset (Chumachenko et al., 25 Nov 2025).

1. Model Architecture and Token-Compression Modifications

Nemotron-Parse-1.1 and the TC variant both utilize an encoder–decoder architecture comprising 885 million parameters. The vision encoder $\mathcal{E}$ is a ViT-H/16 backbone (RADIOv2.5, 657M parameters) mapping an input image $\mathbf{I}\in\mathbb{R}^{3\times H\times W}$ to patch tokens $\mathbf{Z}\in\mathbb{R}^{N\times d}$. These features are subsampled by a horizontal convolutional "neck" $\mathcal{N}$ (kernel $1\times4$, stride $1\times4$), reducing the native patch sequence length $N \approx \frac{H\times W}{16^2}$ to roughly 3200 tokens for a $1648\times2048$ document.
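The token-count arithmetic above can be checked with a short sketch. Function names are illustrative, and exact counts assume no padding or internal resizing, which is why the result lands near, rather than exactly at, the quoted 3200:

```python
# Token counts for a ViT-H/16 encoder followed by a 1x4 convolutional neck,
# using the 1648x2048 document size quoted in the text.

def patch_tokens(h, w, patch=16):
    """Number of ViT patch tokens for an h x w image."""
    return (h // patch) * (w // patch)

def neck_tokens(n_patches, stride=4):
    """The horizontal neck (kernel 1x4, stride 1x4) merges 4 tokens into 1."""
    return n_patches // stride

n_patches = patch_tokens(1648, 2048)  # 103 * 128 = 13184 patch tokens
n_neck = neck_tokens(n_patches)       # 13184 // 4 = 3296, quoted as ~3200
print(n_patches, n_neck)
```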

The language decoder $\mathcal{D}$ is a 10-layer, 256M-parameter, tied-weights mBART transformer emitting a mixture of text, bounding-box tags, and semantic-class tags without explicit 1D positional encodings ("NoPE"), leveraging only the causal mask for sequential order. An auxiliary multi-token head enables up to $m$ tokens to be predicted in parallel per step under the same cross-entropy loss.
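A toy sketch of such a multi-token head, with one output projection per predicted position so that up to $m$ future tokens come from a single decoder hidden state. All sizes and names here are illustrative, not the model's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, m = 8, 50, 3  # toy dimensions; the real model is far larger

# One projection matrix per predicted position.
heads = [rng.normal(size=(d, vocab)) for _ in range(m)]

def predict_m_tokens(hidden):
    """Greedy multi-token decoding: head k proposes the token k steps ahead."""
    return [int(np.argmax(hidden @ W)) for W in heads]

tokens = predict_m_tokens(rng.normal(size=d))
print(tokens)  # m token ids emitted from one decoding step
```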

Nemotron-Parse-1.1-TC introduces a pixel-shuffle layer after the neck, collapsing neighboring blocks of vision features and downsampling the vision token sequence from 3200 to $L_{\text{TC}} = 833$ tokens, roughly a $4\times$ reduction. All subsequent attention and cross-attention layers consume only the 833-token sequence.
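The pixel-shuffle (space-to-depth) step can be sketched in NumPy, assuming a $2\times2$ block collapse consistent with the ~4× sequence reduction (3200 → 833); the 52×64 toy grid is chosen so the resulting count lands near 833:

```python
import numpy as np

def pixel_shuffle_tokens(z, r=2):
    """Collapse each r x r block of an (h, w, d) feature grid into one token
    of dimension r*r*d, shrinking the sequence length by a factor of r*r."""
    h, w, d = z.shape
    z = z.reshape(h // r, r, w // r, r, d)       # split into r x r blocks
    z = z.transpose(0, 2, 1, 3, 4)               # gather block members
    return z.reshape(h // r, w // r, r * r * d)  # stack along channels

z = np.zeros((52, 64, 16))   # toy grid: 3328 tokens, near the quoted 3200
out = pixel_shuffle_tokens(z)
print(out.shape)             # 26 x 32 = 832 tokens, each 4x wider
```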

2. Tokenization, Attention, and Computational Complexity

No changes are made to the text tokenizer or decoder vocabulary. Bounding-box coordinates and semantic classes remain encoded as explicit tokens, e.g., `<x_d>`, `<y_d>`. Self- and cross-attention mechanisms remain algorithmically identical; however, the reduction in sequence length $L$ results in a quadratic decrease in transformer attention compute:

$$\mathcal{O}(L^2 d + L d^2) \approx \mathcal{O}(L^2 d)$$

Replacing $L_{\text{orig}} = 3200$ with $L_{\text{TC}} = 833$ yields a reduction

$$\left(\frac{L_{\text{TC}}}{L_{\text{orig}}}\right)^2 = \left(\frac{833}{3200}\right)^2 \approx 0.068$$

i.e., a ~93% reduction in encoder attention cost. The observed end-to-end speedup is ~20%, limited by decoder and I/O bottlenecks.
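These ratios are easy to verify numerically; the hidden dimension $d$ used for the linear term below is an assumption (ViT-H scale), as the text does not state it:

```python
# Attention-cost ratio from shortening the vision-token sequence 3200 -> 833.
l_orig, l_tc = 3200, 833

ratio = (l_tc / l_orig) ** 2
print(f"quadratic attention ratio: {ratio:.3f}")  # ~0.068, i.e. ~93% saved

# Including the linear O(L d^2) term softens the saving somewhat;
# d = 1280 is an assumed ViT-H-scale width, not a figure from the text.
d = 1280
cost = lambda L: L**2 * d + L * d**2
print(f"with linear term: {cost(l_tc) / cost(l_orig):.3f}")
```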

GPU memory usage for activations in the vision encoder is reduced in proportion to token count, to ≈26% of the original. Token generation speed on an NVIDIA H100 (bf16), averaged over 10,000 pages (1,000 tokens/page), is 4,500 tokens/sec for the TC variant versus 3,800 tokens/sec for the full-length model, corresponding to ~5 vs. ~4 pages/sec, respectively.
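A back-of-envelope check of these throughput figures:

```python
# Throughput figures quoted above (H100, bf16, ~1,000 tokens/page).
tok_per_sec_full, tok_per_sec_tc = 3800, 4500
tokens_per_page = 1000

pages_full = tok_per_sec_full / tokens_per_page  # 3.8, quoted as ~4 pages/sec
pages_tc = tok_per_sec_tc / tokens_per_page      # 4.5, quoted as ~5 pages/sec
speedup = tok_per_sec_tc / tok_per_sec_full - 1  # ~0.18, i.e. the ~20% gain
print(pages_full, pages_tc, f"{speedup:.0%}")
```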

3. Benchmark Performance and Trade-offs

Nemotron-Parse-TC displays comparable accuracy to the full model across core benchmarks, with minor degradation and occasional improvements in order-based metrics.

| Model Variant | WER ↓ | F1 ↑ | Vision Tokens | OCR F1 ↑ | RO Edit ↓ | RO BLEU ↑ | OmniDocBench overall ↑ | Table TEDS/S-TEDS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nemotron-Parse-1.1 | 0.102 | 0.957 | 3200 | 0.9785 | 0.014 | 0.9623 | 0.131 | 86.2/79.9 |
| Nemotron-Parse-1.1-TC | 0.121 | 0.949 | 833 | 0.9755 | 0.014 | 0.9582 | 0.129 | 85.3/79.6 |

In table extraction tasks (RD-TableBench, PubTabNet, OmniDocBench), performance differences are sub-1%. Reading-order F1 can modestly improve with TC's block grouping. Trade-off curves indicate that for the ~20% speed gain, losses in table structure metrics (TEDS/S-TEDS) stay below one point, with word error rate (WER) showing the largest shift (0.102 → 0.121).
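The per-metric deltas can be computed directly from the benchmark table values (a convenience sketch; metric keys are abbreviated):

```python
# Absolute deltas between variants from the benchmark table.
# Lower is better for WER and RO Edit; higher is better for the rest.
full = {"WER": 0.102, "F1": 0.957, "OCR F1": 0.9785,
        "RO BLEU": 0.9623, "TEDS": 86.2, "S-TEDS": 79.9}
tc   = {"WER": 0.121, "F1": 0.949, "OCR F1": 0.9755,
        "RO BLEU": 0.9582, "TEDS": 85.3, "S-TEDS": 79.6}

deltas = {k: round(tc[k] - full[k], 4) for k in full}
print(deltas)  # e.g. TEDS drops 0.9 points for the ~20% speed gain
```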

4. Structured Extraction: Bounding Boxes and Semantic Classes

Nemotron-Parse-1.1-TC enforces explicit representation for both spatial and semantic entities. Each detected box is output as four tags wrapping its text:

`<x_{x1}><y_{y1}>` text `<x_{x2}><y_{y2}>` `<class_c>`

Coordinates are normalized to a $1024\times1280$ reference grid, e.g., `<x_0.1152><y_0.2586># NVIDIA Nemotron-Parse 1.1<x_0.8799><y_0.2797><class_Title>`.
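A minimal parser for this tag scheme, following the example above. The regex, the helper name, and the grid orientation (width 1024, height 1280) are assumptions for illustration, not the model's official post-processing:

```python
import re

# Named groups capture the two corner coordinates, the enclosed text,
# and the semantic class of each tagged span.
TAG = re.compile(
    r"<x_(?P<x1>[\d.]+)><y_(?P<y1>[\d.]+)>(?P<text>.*?)"
    r"<x_(?P<x2>[\d.]+)><y_(?P<y2>[\d.]+)><class_(?P<cls>[^>]+)>"
)

def parse_boxes(s, grid=(1024, 1280)):
    """Yield (text, class, box in reference-grid coordinates) per span."""
    gw, gh = grid
    for m in TAG.finditer(s):
        box = (float(m["x1"]) * gw, float(m["y1"]) * gh,
               float(m["x2"]) * gw, float(m["y2"]) * gh)
        yield m["text"].strip(), m["cls"], box

sample = ("<x_0.1152><y_0.2586># NVIDIA Nemotron-Parse 1.1"
          "<x_0.8799><y_0.2797><class_Title>")
text, cls, box = next(parse_boxes(sample))
print(text, cls, box)
```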

Semantic-class tags (e.g., Title, Text, Section-Header, List-Item, Formula, Table, Picture, Caption, Footnote) follow each bounding-box tuple. All character recognition, localization, and classification tasks are optimized under a single autoregressive cross-entropy objective:

$$\mathcal{L} = -\sum_{i=1}^{L} \log P\big(t_i \mid \mathcal{N}(\mathbf{Z}), t_{<i}\big)$$

with $t_1 \ldots t_L$ interleaving markdown content, spatial tags, and class identifiers.

5. Deployment and Ecosystem Integration

Nemotron-Parse-1.1-TC is distributed via an optimized NIM (NVIDIA Inference Microservices) container using TensorRT, supporting bf16/fp32 precision and vLLM integration. The executable graph encapsulates the encoder, neck, and decoder components, attention kernel fusion, and multi-token heads, with dynamic memory planning tailored to the reduced token count.

Deployment metrics on a single H100 GPU (bf16) are as follows:

  • Throughput: ~4 pages/sec (full) vs. ~5 pages/sec (TC variant) at 1,000 tokens/page
  • Memory footprint: encoder activation memory reduced by ≈74% (3200 → 833 tokens)
  • FLOPs per token: unchanged; attention FLOPs over the vision tokens scaled by $(833/3200)^2 \approx 0.068$

Weights, training data subsets, and inference containers are distributed via Hugging Face and the Nemotron-VLM-v2 dataset.

6. Context and Significance

Nemotron-Parse-1.1-TC demonstrates that aggressive token-compression in vision transformers enables substantial throughput improvements for document parsing and OCR, with negligible accuracy degradation. This suggests that, for visually dense but structurally regular documents, spatial grouping followed by token sequence reduction can maintain critical information flow for downstream transformers.

A plausible implication is that similar token-compression schemes may be applicable for large-scale VLMs in production environments where resource constraints, latency, and throughput are paramount, and semantic fidelity must be preserved. The negligible trade-off curves reinforce the viability of compression-first approaches for visually structured data parsing.

Nemotron-Parse-1.1-TC is positioned as a reference lightweight solution for high-speed, high-fidelity extraction of structured document semantics, with extensibility for further work in large-scale visual-language modeling and information extraction (Chumachenko et al., 25 Nov 2025).
