
Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction

Published 4 Oct 2021 in cs.CL, cs.AI, and cs.LG | (2110.01661v5)

Abstract: Iterating with new and improved OCR solutions forces decisions about which candidates to target for reprocessing. This applies especially when the underlying data collection is large and diverse in terms of fonts, languages, periods of publication and, consequently, OCR quality. This article describes the efforts of the National Library of Luxembourg to support those targeting decisions, which are crucial for keeping computational overhead low and reducing the risk of quality degradation, while making OCR improvement more quantifiable. In particular, this work explains the library's methodology for quality assessment at the text block level. By extending this technique, it also presents a regression model that can take into account the enhancement potential of a new OCR engine. Both are promising approaches, especially for cultural institutions dealing with historical data of lower quality.
