- The paper introduces a neural encoder-decoder model that leverages a coarse-to-fine attention mechanism to accurately convert images into structured LaTeX markup.
- It employs a multi-layer CNN and row encoder to maintain spatial locality and sequential layout, significantly improving over traditional CTC-based OCR.
- Experimental results on the Im2Latex-100k dataset show that the model outperforms classical and CTC-based OCR baselines while reducing attention computation, with practical implications for mathematical OCR.
Image-to-Markup Generation with Coarse-to-Fine Attention
Introduction
The paper introduces a neural encoder-decoder framework that converts images into presentational markup using a scalable coarse-to-fine attention mechanism, with image-to-LaTeX generation as its primary application. Unlike traditional OCR systems that rely on CTC-based models, this attention-driven approach handles non-standard OCR tasks, particularly mathematical expressions, far more effectively.
Model Architecture
The proposed model integrates several neural components for effective image-to-markup conversion:
- Convolutional Neural Network (CNN): A multi-layer CNN extracts visual features from the input image, creating a feature grid without fully-connected layers to maintain spatial locality crucial for attention mechanisms.
- Row Encoder: An RNN encodes sequential layout information across each row of the feature grid, which is essential for capturing structured information beyond CTC's left-to-right constraints.
- Decoder with Attention Mechanism: The decoder RNN acts as a conditional language model over markup tokens, guided by a visual attention mechanism. The attention is refined with coarse-to-fine granularity through a two-layer attention structure, substantially reducing computation while maintaining accuracy.
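The decoder's standard (fine-level) attention step can be sketched as follows. This is a minimal NumPy illustration with toy dimensions; dot-product scoring is an assumption for brevity, whereas the paper uses a learned scoring function inside the RNN decoder.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(features, h):
    """One step of soft attention over the flattened feature grid.

    features: (H*W, D) cells from the CNN + row encoder
    h: (D,) current decoder hidden state
    """
    scores = features @ h       # dot-product scores (illustrative; the paper
                                # uses a learned scoring function)
    alpha = softmax(scores)     # attention weights over cells, sum to 1
    context = alpha @ features  # (D,) context vector fed to the decoder
    return context, alpha

# toy example: 6 grid cells with 4-dim features
rng = np.random.default_rng(0)
V = rng.normal(size=(6, 4))
h = rng.normal(size=4)
c, a = attend(V, h)
```

At each decoding step the context vector `c` is combined with the decoder state to predict the next markup token.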
Coarse-to-Fine Attention
The coarse-to-fine attention mechanism operates by initially focusing on a broader image region (coarse) and subsequently concentrating on finer details within that region:
- Hierarchical Attention: It first attends over a coarse grid to identify key regions before refining attention using a standard fine-level approach.
- Sparse Attention Variants: The model explores sparsemax and hard attention methods to reduce the computational load during attention layer lookup while maintaining performance.
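The two-level lookup can be sketched as below. This is an illustrative NumPy simplification: the coarse choice here is a greedy argmax, whereas the paper's hard-attention variant samples the region and trains with REINFORCE, and its sparsemax variant keeps a sparse soft distribution instead.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def coarse_to_fine_attend(features, h):
    """Two-level attention: hard coarse selection, then soft fine attention.

    features: (R, S, D) grid cells grouped into R coarse regions of S cells
    h: (D,) decoder hidden state
    Only the chosen region's S fine scores are computed, so per-step
    lookups drop from R*S to R + S.
    """
    coarse = features.mean(axis=1)         # (R, D) pooled region summaries
    r = int(np.argmax(coarse @ h))         # hard coarse choice (argmax here;
                                           # the paper samples via REINFORCE)
    fine_alpha = softmax(features[r] @ h)  # soft attention within region r
    context = fine_alpha @ features[r]
    return context, r, fine_alpha

# toy example: 5 regions of 4 cells each, 8-dim features
rng = np.random.default_rng(1)
F = rng.normal(size=(5, 4, 8))
h = rng.normal(size=8)
ctx, region, alpha = coarse_to_fine_attend(F, h)
```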
Implementation and Dataset
To facilitate model evaluation, a significant contribution of the paper is the introduction of the Im2Latex-100k dataset: roughly 100,000 real-world mathematical expressions in LaTeX, harvested from published articles and paired with their rendered images, providing a robust test-bed for image-to-markup techniques.
- Tokenization: Markup is tokenized into minimal meaningful LaTeX tokens to optimize model training efficiency.
- Synthetic Handwritten Data: For addressing handwritten expressions, a synthetic dataset is created by rendering expressions using handwritten symbol images, allowing pretraining and fine-tuning on real handwritten datasets.
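The tokenization step above can be sketched with a small regex-based tokenizer. The token inventory here is an illustrative assumption, not the paper's exact scheme, which normalizes the markup more thoroughly.

```python
import re

# minimal meaningful LaTeX tokens: multi-letter commands (\frac), escaped
# single characters (\{), structural symbols ({ } ^ _), and single characters
TOKEN_RE = re.compile(r"\\[a-zA-Z]+|\\.|[{}^_]|[^\s{}^_\\]")

def tokenize(markup: str):
    return TOKEN_RE.findall(markup)

tokens = tokenize(r"\frac{x^2}{y_1}")
# → ['\\frac', '{', 'x', '^', '2', '}', '{', 'y', '_', '1', '}']
```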
Experimental Results
Extensive experiments demonstrate the model's superiority compared to classical OCR systems (e.g., InftyReader) and CTC-based approaches. Key outcomes include:
- Exact Match Accuracy: The model achieves high exact-match accuracy on both real and synthetic datasets, substantially outperforming traditional systems.
- Efficiency Benefits: Coarse-to-fine attention variants exhibit reduced computational complexity with minimal accuracy trade-offs. Specifically, sparsemax and hard attention approaches effectively decrease the number of necessary fine attention computations.
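The efficiency claim can be illustrated with back-of-envelope arithmetic; the grid and partition sizes below are illustrative, not the paper's.

```python
# per-step attention lookup counts (sizes are illustrative)
cells = 20 * 50                 # standard fine grid: 1000 cells
regions, per_region = 40, 25    # coarse partition: 40 regions of 25 cells

standard = cells                       # soft attention scores every cell
coarse_to_fine = regions + per_region  # coarse pass + one region's fine pass
print(standard, coarse_to_fine)        # 1000 65
```

With hard coarse attention, the per-step cost falls from R*S to R + S, which is the source of the reported speedups.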
Conclusion and Implications
The paper presents a novel neural model architecture that offers a compelling solution for image-to-markup generation tasks. By leveraging coarse-to-fine attention, the model achieves a favorable balance between computational efficiency and accuracy, paving the way for practical applications in mathematical expression OCR and potentially other structured document parsing tasks. Future work may explore adapting the framework to other markup languages and documents, and integrating it with further neural inference mechanisms.
The research successfully demonstrates that data-driven approaches to structured text OCR can operate effectively without domain-specific engineering, expanding possibilities for automated document understanding in diverse contexts.