- The paper introduces a neural encoder-decoder model that leverages a coarse-to-fine attention mechanism to accurately convert images into structured LaTeX markup.
- It employs a multi-layer CNN and row encoder to maintain spatial locality and sequential layout, significantly improving over traditional CTC-based OCR.
- Experimental results on the Im2Latex-100k dataset show that the model outperforms classical and CTC-based OCR baselines while reducing attention computation, with practical implications for mathematical OCR.
Image-to-Markup Generation with Coarse-to-Fine Attention
Introduction
The paper introduces a neural encoder-decoder framework that converts images into presentational markup using a scalable coarse-to-fine attention mechanism, with image-to-LaTeX generation as its primary application. Unlike traditional OCR systems that rely on CTC-based models, this attention-driven approach handles non-standard OCR tasks, particularly mathematical expressions, far more effectively.
Model Architecture
The proposed model integrates several neural components for effective image-to-markup conversion:
- Convolutional Neural Network (CNN): A multi-layer CNN extracts visual features from the input image, creating a feature grid without fully-connected layers to maintain spatial locality crucial for attention mechanisms.
- Row Encoder: An RNN encodes sequential layout information across each row of the feature grid, which is essential for capturing structured information beyond CTC's left-to-right constraints.
- Decoder with Attention Mechanism: The decoder RNN acts as a conditional language model over markup tokens, guided by a visual attention mechanism. The attention is refined with coarse-to-fine granularity through a two-layer attention structure, substantially reducing computation while maintaining accuracy.
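The decoder's standard (fine-level) attention step can be sketched as follows. This is a minimal NumPy illustration with toy dimensions; dot-product scoring is an assumption for brevity, whereas the paper uses a learned scoring function inside the RNN decoder.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(features, h):
    """One step of soft attention over the flattened feature grid.

    features: (H*W, D) cells from the CNN + row encoder
    h: (D,) current decoder hidden state
    """
    scores = features @ h       # dot-product scores (illustrative; the paper
                                # uses a learned scoring function)
    alpha = softmax(scores)     # attention weights over cells, sum to 1
    context = alpha @ features  # (D,) context vector fed to the decoder
    return context, alpha

# toy example: 6 grid cells with 4-dim features
rng = np.random.default_rng(0)
V = rng.normal(size=(6, 4))
h = rng.normal(size=4)
c, a = attend(V, h)
```

At each decoding step the context vector `c` is combined with the decoder state to predict the next markup token.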
Coarse-to-Fine Attention
The coarse-to-fine attention mechanism operates by initially focusing on a broader image region (coarse) and subsequently concentrating on finer details within that region:
- Hierarchical Attention: It first attends over a coarse grid to identify key regions before refining attention using a standard fine-level approach.
- Sparse Attention Variants: The model explores sparsemax and hard attention methods to reduce the computational load during attention layer lookup while maintaining performance.
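The two-level lookup can be sketched as below. This is an illustrative NumPy simplification: the coarse choice here is a greedy argmax, whereas the paper's hard-attention variant samples the region and trains with REINFORCE, and its sparsemax variant keeps a sparse soft distribution instead.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def coarse_to_fine_attend(features, h):
    """Two-level attention: hard coarse selection, then soft fine attention.

    features: (R, S, D) grid cells grouped into R coarse regions of S cells
    h: (D,) decoder hidden state
    Only the chosen region's S fine scores are computed, so per-step
    lookups drop from R*S to R + S.
    """
    coarse = features.mean(axis=1)         # (R, D) pooled region summaries
    r = int(np.argmax(coarse @ h))         # hard coarse choice (argmax here;
                                           # the paper samples via REINFORCE)
    fine_alpha = softmax(features[r] @ h)  # soft attention within region r
    context = fine_alpha @ features[r]
    return context, r, fine_alpha

# toy example: 5 regions of 4 cells each, 8-dim features
rng = np.random.default_rng(1)
F = rng.normal(size=(5, 4, 8))
h = rng.normal(size=8)
ctx, region, alpha = coarse_to_fine_attend(F, h)
```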
Implementation and Dataset
To facilitate model evaluation, a significant contribution of the paper is the introduction of the Im2Latex-100k dataset: roughly 100,000 real-world mathematical expressions in LaTeX, harvested from published articles and paired with their rendered images, providing a robust test-bed for image-to-markup techniques.
- Tokenization: Markup is tokenized into minimal meaningful LaTeX tokens to optimize model training efficiency.
- Synthetic Handwritten Data: For addressing handwritten expressions, a synthetic dataset is created by rendering expressions using handwritten symbol images, allowing pretraining and fine-tuning on real handwritten datasets.
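The tokenization step above can be sketched with a small regex-based tokenizer. The token inventory here is an illustrative assumption, not the paper's exact scheme, which normalizes the markup more thoroughly.

```python
import re

# minimal meaningful LaTeX tokens: multi-letter commands (\frac), escaped
# single characters (\{), structural symbols ({ } ^ _), and single characters
TOKEN_RE = re.compile(r"\\[a-zA-Z]+|\\.|[{}^_]|[^\s{}^_\\]")

def tokenize(markup: str):
    return TOKEN_RE.findall(markup)

tokens = tokenize(r"\frac{x^2}{y_1}")
# → ['\\frac', '{', 'x', '^', '2', '}', '{', 'y', '_', '1', '}']
```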
Experimental Results
Extensive experiments demonstrate the model's superiority compared to classical OCR systems (e.g., InftyReader) and CTC-based approaches. Key outcomes include:
- Exact Match Accuracy: The model achieves high exact-match accuracy on both real and synthetic datasets, substantially outperforming traditional systems.
- Efficiency Benefits: Coarse-to-fine attention variants exhibit reduced computational complexity with minimal accuracy trade-offs. Specifically, sparsemax and hard attention approaches effectively decrease the number of necessary fine attention computations.
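The efficiency claim can be illustrated with back-of-envelope arithmetic; the grid and partition sizes below are illustrative, not the paper's.

```python
# per-step attention lookup counts (sizes are illustrative)
cells = 20 * 50                 # standard fine grid: 1000 cells
regions, per_region = 40, 25    # coarse partition: 40 regions of 25 cells

standard = cells                       # soft attention scores every cell
coarse_to_fine = regions + per_region  # coarse pass + one region's fine pass
print(standard, coarse_to_fine)        # 1000 65
```

With hard coarse attention, the per-step cost falls from R*S to R + S, which is the source of the reported speedups.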
Conclusion and Implications
The paper presents a novel neural model architecture that offers a compelling solution for image-to-markup generation tasks. By leveraging coarse-to-fine attention, the model achieves a favorable balance between computational efficiency and accuracy, paving the way for practical applications in mathematical expression OCR and potentially other structured document parsing tasks. Future work may explore adapting the framework to other markup languages and documents, and integrating it with further neural inference mechanisms.
The research successfully demonstrates that data-driven approaches to structured text OCR can operate effectively without domain-specific engineering, expanding possibilities for automated document understanding in diverse contexts.