Improving OCR Quality in 19th Century Historical Documents Using a Combined Machine Learning Based Approach

Published 15 Jan 2024 in cs.CV and cs.LG | (2401.07787v1)

Abstract: This paper addresses a major challenge to historical research on the 19th century. Large quantities of sources have become digitally available for the first time, while extraction techniques are lagging behind. Therefore, we researched ML models to recognise and extract complex data structures in a high-value historical primary source, the Schematismus. It records every single person in the Habsburg civil service above a certain hierarchical level between 1702 and 1918 and documents the genesis of the central administration over two centuries. Its complex and intricate structure as well as its enormous size have so far made any more comprehensive analysis of the administrative and social structure of the later Habsburg Empire on the basis of this source impossible. We pursued two central objectives. The primary one was improving OCR quality, for which we considered improved structure recognition to be essential; in the course of the work, it turned out that this also made the extraction of the data structure possible. We chose Faster R-CNN as the base of the ML architecture for structure recognition. In order to obtain the required amount of training data quickly and economically, we synthesised Hof- und Staatsschematismus-style data, which we used to train our model. The model was then fine-tuned with a smaller set of manually annotated historical source data. We then used Tesseract-OCR, which was further optimised for the style of our documents, to complete the combined structure extraction and OCR process. Results show a significant decrease in the two standard measures of OCR performance, WER and CER (where lower values are better). Combined structure detection and fine-tuned OCR improved CER by a remarkable 71.98 percent and WER by 52.49 percent.




Summary

The paper demonstrates that combining machine learning with optimized OCR significantly improves text extraction from complex 19th-century documents.
It employs Faster R-CNN for structure recognition, using synthetic data and fine-tuning with actual historical samples for robust training.
The approach achieves a 71.98% reduction in CER and a 52.49% reduction in WER, enabling more accurate analysis of historical data.

The paper "Improving OCR Quality in 19th Century Historical Documents Using a Combined Machine Learning Based Approach" tackles the challenge of extracting textual data from complex 19th-century historical documents, specifically the "Schematismus" of the Habsburg civil service. This source records every person in the civil service above a certain hierarchical level between 1702 and 1918. Its intricate structure and enormous size have so far prevented any comprehensive analysis based on it.

The authors focus on improving Optical Character Recognition (OCR) quality for these intricate documents, and they consider better structure recognition essential to that goal. In the course of the work, a second result emerged: the same structure recognition also made it possible to extract the data structure of the source.

Methodology

Machine Learning Approach: The research uses Faster R-CNN, an object-detection architecture, as the base model for recognising structure within the document pages. The authors consider improved structure recognition essential to improving OCR quality.
Data Synthesis for Training: To train the model efficiently, the authors synthesized data styled after the "Hof- und Staatsschematismus" to create a robust training dataset. This synthetic data helped in quickly generating the required amount of training material.
Fine-Tuning with Historical Data: After the initial training, the model underwent fine-tuning with a smaller set of manually annotated data from the actual historical documents, enhancing its accuracy and performance on real-world data.
Optimized OCR Process: Alongside the structure recognition, Tesseract-OCR was used and further optimized for the style of the documents. It completes the combined pipeline of structure extraction followed by OCR.
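The combined process described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Region` type, the `extract_structured_text` function, the `min_score` threshold, and the top-to-bottom ordering are all assumptions introduced here. The detector, cropping, and OCR steps are passed in as plain callables, so a Faster R-CNN model and a Tesseract wrapper could be plugged in without this sketch depending on any particular library API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# (left, top, right, bottom) in pixel coordinates; a hypothetical convention.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Region:
    """One detected structural element on a page (hypothetical type)."""
    label: str    # e.g. "heading" or "entry"; label names are assumptions
    box: Box
    score: float  # detector confidence in [0, 1]


def extract_structured_text(
    page: Any,
    detect_regions: Callable[[Any], List[Region]],
    crop: Callable[[Any, Box], Any],
    recognise_text: Callable[[Any], str],
    min_score: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[Dict[str, Any]]:
    """Detect regions, then run OCR on each region separately."""
    # Keep only detections the structure model is reasonably confident about.
    regions = [r for r in detect_regions(page) if r.score >= min_score]
    # Assumed reading order: top to bottom, then left to right.
    regions.sort(key=lambda r: (r.box[1], r.box[0]))
    results = []
    for region in regions:
        text = recognise_text(crop(page, region.box)).strip()
        results.append({"label": region.label, "box": region.box, "text": text})
    return results
```

Running OCR per detected region rather than on the whole page is what lets the recognised text keep its structural label, which is the link between better structure recognition and data extraction that the summary describes.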

Results

The approach led to significant improvements in two standard OCR performance metrics:

Character Error Rate (CER): Reduced by 71.98%, reflecting substantially fewer character recognition errors.
Word Error Rate (WER): Reduced by 52.49%, reflecting fewer misrecognised words.
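For readers unfamiliar with these metrics, the sketch below shows the standard definitions: CER and WER are the Levenshtein edit distance between the OCR output and a reference transcription, counted over characters or words respectively and divided by the reference length. The function names are chosen here for illustration. The summary does not describe the authors' exact evaluation procedure (for example, any text normalisation), and `relative_reduction` assumes the reported percentages are relative reductions against a baseline, which is an interpretation rather than something the summary states.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits per reference word (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which an error rate dropped relative to a baseline."""
    return (baseline - improved) / baseline * 100
```

Because one wrong character makes the whole word count as an error, WER is usually higher than CER on the same text, and the two can improve by different amounts.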

These improvements highlight the effectiveness of combining structure recognition with fine-tuned OCR when dealing with complex historical documents. The research is a step toward comprehensive analysis of the Habsburg Empire's administrative and social structure using records that were previously difficult to digitize accurately.
