Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Published 4 Jul 2024 in cs.CL and cs.DL | (2407.12838v2)
Abstract: This paper presents two significant contributions. First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that uses an LLM for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.