Nemotron-Parse-TC: Token Compression for OCR

Updated 15 January 2026
  • Nemotron-Parse-TC is a lightweight document parsing and OCR model that applies aggressive token compression, shrinking the vision-token sequence roughly 4× (3200 → 833) and cutting encoder attention cost by ~93%.
  • It pairs a ViT-H/16 vision backbone with a decoder carrying an auxiliary multi-token head, achieving a ~20% throughput increase while maintaining near-original accuracy.
  • Packaged as an optimized NVIDIA NIM (NVIDIA Inference Microservices) container, the model balances higher processing speed against minimal degradation in key OCR and structured extraction metrics.

Nemotron-Parse-TC (Token-Compression variant) is a high-throughput, lightweight document parsing and optical character recognition (OCR) model derived from NVIDIA's Nemotron-Parse-1.1. It focuses on efficient extraction and structured understanding of visually dense documents, including complex tables, markdown-formatted text, and semantically rich layouts. Distinguished by aggressive token compression in its vision encoder, Nemotron-Parse-TC delivers approximately 20% greater throughput with negligible degradation in accuracy compared to the original full-length model. Nemotron-Parse-TC is distributed with optimized NIM (NVIDIA Inference Microservices) containers, model weights, and partial training data as part of the Nemotron-VLM-v2 dataset (Chumachenko et al., 25 Nov 2025).

1. Model Architecture and Token-Compression Modifications

Nemotron-Parse-1.1 and the TC variant both utilize an encoder–decoder architecture comprising 885 million parameters. The vision encoder $\mathcal{E}$ is a ViT-H/16 backbone (RADIOv2.5, 657M parameters) mapping an input image $\mathbf{I}\in\mathbb{R}^{3\times H\times W}$ to patch tokens $\mathbf{Z}\in\mathbb{R}^{N\times d}$. These features are subsampled by a horizontal convolutional "neck" $\mathcal{N}$ (kernel $1\times4$, stride $1\times4$), reducing the native patch sequence length $N \approx \frac{H\times W}{16^2}$ to roughly 3200 tokens for a $1648\times2048$ document.
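The token-count arithmetic above can be checked with a short sketch. Function names are illustrative, and exact counts assume no padding or internal resizing, which is why the result lands near, rather than exactly at, the quoted 3200:

```python
# Token counts for a ViT-H/16 encoder followed by a 1x4 convolutional neck,
# using the 1648x2048 document size quoted in the text.

def patch_tokens(h, w, patch=16):
    """Number of ViT patch tokens for an h x w image."""
    return (h // patch) * (w // patch)

def neck_tokens(n_patches, stride=4):
    """The horizontal neck (kernel 1x4, stride 1x4) merges 4 tokens into 1."""
    return n_patches // stride

n_patches = patch_tokens(1648, 2048)  # 103 * 128 = 13184 patch tokens
n_neck = neck_tokens(n_patches)       # 13184 // 4 = 3296, quoted as ~3200
print(n_patches, n_neck)
```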

The language decoder $\mathcal{D}$ is a 10-layer, 256M-parameter, tied-weights mBART transformer emitting a mixture of text, bounding-box tags, and semantic-class tags without explicit 1D positional encodings ("NoPE"), leveraging only the causal mask for sequential order. An auxiliary multi-token head enables up to $m$ tokens to be predicted in parallel per step under the same cross-entropy loss.
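A toy sketch of such a multi-token head, with one output projection per predicted position so that up to $m$ future tokens come from a single decoder hidden state. All sizes and names here are illustrative, not the model's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, m = 8, 50, 3  # toy dimensions; the real model is far larger

# One projection matrix per predicted position.
heads = [rng.normal(size=(d, vocab)) for _ in range(m)]

def predict_m_tokens(hidden):
    """Greedy multi-token decoding: head k proposes the token k steps ahead."""
    return [int(np.argmax(hidden @ W)) for W in heads]

tokens = predict_m_tokens(rng.normal(size=d))
print(tokens)  # m token ids emitted from one decoding step
```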

Nemotron-Parse-1.1-TC introduces a pixel-shuffle layer after the neck, collapsing neighboring blocks of vision features and downsampling the vision token sequence from 3200 to $L_{\text{TC}} = 833$ tokens, roughly a $4\times$ reduction. All subsequent attention and cross-attention layers consume only the 833-token sequence.
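The pixel-shuffle (space-to-depth) step can be sketched in NumPy, assuming a $2\times2$ block collapse consistent with the ~4× sequence reduction (3200 → 833); the 52×64 toy grid is chosen so the resulting count lands near 833:

```python
import numpy as np

def pixel_shuffle_tokens(z, r=2):
    """Collapse each r x r block of an (h, w, d) feature grid into one token
    of dimension r*r*d, shrinking the sequence length by a factor of r*r."""
    h, w, d = z.shape
    z = z.reshape(h // r, r, w // r, r, d)       # split into r x r blocks
    z = z.transpose(0, 2, 1, 3, 4)               # gather block members
    return z.reshape(h // r, w // r, r * r * d)  # stack along channels

z = np.zeros((52, 64, 16))   # toy grid: 3328 tokens, near the quoted 3200
out = pixel_shuffle_tokens(z)
print(out.shape)             # 26 x 32 = 832 tokens, each 4x wider
```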

2. Tokenization, Attention, and Computational Complexity

No changes are made to the text tokenizer or decoder vocabulary. Bounding-box coordinates and semantic classes remain encoded as explicit tokens, e.g., `<x_d>`, `<y_d>`. Self- and cross-attention mechanisms remain algorithmically identical; however, the reduction in sequence length $L$ results in a quadratic decrease in transformer attention compute:

$$\mathcal{O}(L^2 d + L d^2) \approx \mathcal{O}(L^2 d)$$

Replacing $L_{\text{orig}} = 3200$ with $L_{\text{TC}} = 833$ yields a reduction

$$\left(\frac{L_{\text{TC}}}{L_{\text{orig}}}\right)^2 = \left(\frac{833}{3200}\right)^2 \approx 0.068$$

i.e., a ~93% reduction in encoder attention cost. The observed end-to-end speedup is ~20%, limited by decoder and I/O bottlenecks.
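These ratios are easy to verify numerically; the hidden dimension $d$ used for the linear term below is an assumption (ViT-H scale), as the text does not state it:

```python
# Attention-cost ratio from shortening the vision-token sequence 3200 -> 833.
l_orig, l_tc = 3200, 833

ratio = (l_tc / l_orig) ** 2
print(f"quadratic attention ratio: {ratio:.3f}")  # ~0.068, i.e. ~93% saved

# Including the linear O(L d^2) term softens the saving somewhat;
# d = 1280 is an assumed ViT-H-scale width, not a figure from the text.
d = 1280
cost = lambda L: L**2 * d + L * d**2
print(f"with linear term: {cost(l_tc) / cost(l_orig):.3f}")
```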

GPU memory usage for activations in the vision encoder is reduced in proportion to token count, to ≈26% of the original. Token generation speed on an NVIDIA H100 (bf16), averaged over 10,000 pages (1,000 tokens/page), is 4,500 tokens/sec for the TC variant versus 3,800 tokens/sec for the full-length model, corresponding to ~5 vs. ~4 pages/sec, respectively.
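A back-of-envelope check of these throughput figures:

```python
# Throughput figures quoted above (H100, bf16, ~1,000 tokens/page).
tok_per_sec_full, tok_per_sec_tc = 3800, 4500
tokens_per_page = 1000

pages_full = tok_per_sec_full / tokens_per_page  # 3.8, quoted as ~4 pages/sec
pages_tc = tok_per_sec_tc / tokens_per_page      # 4.5, quoted as ~5 pages/sec
speedup = tok_per_sec_tc / tok_per_sec_full - 1  # ~0.18, i.e. the ~20% gain
print(pages_full, pages_tc, f"{speedup:.0%}")
```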

3. Benchmark Performance and Trade-offs

Nemotron-Parse-TC displays comparable accuracy to the full model across core benchmarks, with minor degradation and occasional improvements in order-based metrics.

| Model Variant | WER ↓ | F1 ↑ | Vision Tokens | OCR F1 ↑ | RO Edit ↓ | RO BLEU ↑ | OmniDocBench overall ↑ | Table TEDS/S-TEDS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nemotron-Parse-1.1 | 0.102 | 0.957 | 3200 | 0.9785 | 0.014 | 0.9623 | 0.131 | 86.2/79.9 |
| Nemotron-Parse-1.1-TC | 0.121 | 0.949 | 833 | 0.9755 | 0.014 | 0.9582 | 0.129 | 85.3/79.6 |

In table extraction tasks (RD-TableBench, PubTabNet, OmniDocBench), performance differences are sub-1%. Reading-order F1 can modestly improve with TC's block grouping. Trade-off curves indicate that for the ~20% speed gain, losses in table structure metrics (TEDS/S-TEDS) stay below one point, with word error rate (WER) showing the largest shift (0.102 → 0.121).
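The per-metric deltas can be computed directly from the benchmark table values (a convenience sketch; metric keys are abbreviated):

```python
# Absolute deltas between variants from the benchmark table.
# Lower is better for WER and RO Edit; higher is better for the rest.
full = {"WER": 0.102, "F1": 0.957, "OCR F1": 0.9785,
        "RO BLEU": 0.9623, "TEDS": 86.2, "S-TEDS": 79.9}
tc   = {"WER": 0.121, "F1": 0.949, "OCR F1": 0.9755,
        "RO BLEU": 0.9582, "TEDS": 85.3, "S-TEDS": 79.6}

deltas = {k: round(tc[k] - full[k], 4) for k in full}
print(deltas)  # e.g. TEDS drops 0.9 points for the ~20% speed gain
```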

4. Structured Extraction: Bounding Boxes and Semantic Classes

Nemotron-Parse-1.1-TC enforces explicit representation for both spatial and semantic entities. Each detected box is output as four tags wrapping its text:

`<x_{x1}><y_{y1}>` text `<x_{x2}><y_{y2}>` `<class_c>`

Coordinates are normalized to a $1024\times1280$ reference grid, e.g., `<x_0.1152><y_0.2586># NVIDIA Nemotron-Parse 1.1<x_0.8799><y_0.2797><class_Title>`.
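A minimal parser for this tag scheme, following the example above. The regex, the helper name, and the grid orientation (width 1024, height 1280) are assumptions for illustration, not the model's official post-processing:

```python
import re

# Named groups capture the two corner coordinates, the enclosed text,
# and the semantic class of each tagged span.
TAG = re.compile(
    r"<x_(?P<x1>[\d.]+)><y_(?P<y1>[\d.]+)>(?P<text>.*?)"
    r"<x_(?P<x2>[\d.]+)><y_(?P<y2>[\d.]+)><class_(?P<cls>[^>]+)>"
)

def parse_boxes(s, grid=(1024, 1280)):
    """Yield (text, class, box in reference-grid coordinates) per span."""
    gw, gh = grid
    for m in TAG.finditer(s):
        box = (float(m["x1"]) * gw, float(m["y1"]) * gh,
               float(m["x2"]) * gw, float(m["y2"]) * gh)
        yield m["text"].strip(), m["cls"], box

sample = ("<x_0.1152><y_0.2586># NVIDIA Nemotron-Parse 1.1"
          "<x_0.8799><y_0.2797><class_Title>")
text, cls, box = next(parse_boxes(sample))
print(text, cls, box)
```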

Semantic-class tags (e.g., Title, Text, Section-Header, List-Item, Formula, Table, Picture, Caption, Footnote) follow each bounding-box tuple. All character recognition, localization, and classification tasks are optimized under a single autoregressive cross-entropy objective:

$$\mathcal{L} = -\sum_{i=1}^{L} \log P\big(t_i \mid \mathcal{N}(\mathbf{Z}), t_{<i}\big)$$

with $t_1 \ldots t_L$ interleaving markdown content, spatial tags, and class identifiers.

5. Deployment and Ecosystem Integration

Nemotron-Parse-1.1-TC is distributed via an optimized NIM (NVIDIA Inference Microservices) container using TensorRT, supporting bf16/fp32 precision and vLLM integration. The executable graph encapsulates the encoder, neck, and decoder components, attention kernel fusion, and multi-token heads, with dynamic memory planning tailored to the reduced token count.

Deployment metrics on a single H100 GPU (bf16) are as follows:

  • Throughput: ~4 pages/sec (full) vs. ~5 pages/sec (TC variant) at 1,000 tokens/page
  • Memory footprint: encoder activation memory reduced by ≈74% (3200 → 833 tokens)
  • FLOPs per token: unchanged; attention FLOPs over the vision tokens scaled by $(833/3200)^2 \approx 0.068$

Weights, training data subsets, and inference containers are distributed via Hugging Face and the Nemotron-VLM-v2 dataset.

6. Context and Significance

Nemotron-Parse-1.1-TC demonstrates that aggressive token-compression in vision transformers enables substantial throughput improvements for document parsing and OCR, with negligible accuracy degradation. This suggests that, for visually dense but structurally regular documents, spatial grouping followed by token sequence reduction can maintain critical information flow for downstream transformers.

A plausible implication is that similar token-compression schemes may be applicable for large-scale VLMs in production environments where resource constraints, latency, and throughput are paramount, and semantic fidelity must be preserved. The negligible trade-off curves reinforce the viability of compression-first approaches for visually structured data parsing.

Nemotron-Parse-1.1-TC is positioned as a reference lightweight solution for high-speed, high-fidelity extraction of structured document semantics, with extensibility for further work in large-scale visual-language modeling and information extraction (Chumachenko et al., 25 Nov 2025).
