LOCR: Location-Guided Transformer for Optical Character Recognition

Published 4 Mar 2024 in cs.CV, cs.AI, and cs.CL | (2403.02127v1)

Abstract: Academic documents are packed with text, equations, tables, and figures, requiring comprehensive understanding for accurate Optical Character Recognition (OCR). While end-to-end OCR methods offer improved accuracy over layout-based approaches, they often suffer from significant repetition issues, especially with complex layouts in Out-Of-Domain (OOD) documents. To tackle this issue, we propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression. We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables, and mathematical symbols. LOCR handles various formatting elements and generates content in Markdown. It outperforms all existing methods on our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR, and F-measure. LOCR also reduces repetition frequency from 4.4% of pages to 0.5% in the arXiv dataset, from 13.2% to 1.3% in OOD quantum physics documents, and from 8.1% to 1.8% in OOD marketing documents. Additionally, LOCR features an interactive OCR mode, facilitating the generation of complex documents through a few location prompts from a human.
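
Of the metrics named in the abstract, edit distance is the simplest to illustrate. The sketch below is not taken from the paper: it assumes a character-level Levenshtein distance normalized by the length of the longer string, which is one common convention for OCR evaluation. The function names `levenshtein` and `normalized_edit_distance` are hypothetical, and the abstract does not say which normalization the authors use.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute), via DP."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by the longer length (assumed convention).

    Returns 0.0 for identical strings and 1.0 when nothing matches.
    """
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 0.0  # two empty strings are treated as identical
    return levenshtein(pred, ref) / longest


if __name__ == "__main__":
    predicted = "The model outputs Markdwn text."
    reference = "The model outputs Markdown text."
    print(normalized_edit_distance(predicted, reference))
```

Under this convention, lower is better: a page-level score would be computed between the model's Markdown output and the ground-truth Markdown for that page.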

