
GlotScript: A Resource and Tool for Low Resource Writing System Identification

Published 23 Sep 2023 in cs.CL (arXiv:2309.13320v2)

Abstract: We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns its script distribution where scripts are identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help clean multilingual corpora such as mC4 and OSCAR. Second, we analyze the tokenization of a number of LLMs such as GPT-4 using GlotScript and provide insights into the coverage of low resource scripts and languages by each LLM. We hope that GlotScript will become a useful resource for work on low resource languages in the NLP community. GlotScript-R and GlotScript-T are available at https://github.com/cisnlp/GlotScript.


Summary

  • The paper introduces GlotScript, a dual-component tool combining GlotScript-R for extensive writing system metadata and GlotScript-T for accurate script identification.
  • The paper employs a per-character analysis covering all 161 Unicode 15.0 scripts; script-based filtering yields average script accuracies of 0.947 for OSCAR and 0.917 for mC4.
  • The research enhances multilingual NLP by improving dataset quality and offering a reliable framework for script detection in low-resource language processing.

Overview of GlotScript: A Resource and Tool for Low Resource Writing System Identification

The paper presents the creation and application of GlotScript, an open resource and tool for identifying writing systems, particularly for low-resource languages. GlotScript comprises two components: GlotScript-R, a repository of attested writing systems for more than 7,000 languages, and GlotScript-T, a script identification tool that covers all 161 scripts defined in Unicode 15.0 and reports scripts by their ISO 15924 codes.

Key Contributions

  1. Development of GlotScript-R: This resource compiles information from several existing writing system resources into an organized metadata structure that offers detailed script information for a vast number of languages. The metadata employs dual categorization into MAIN and AUXILIARY to balance accuracy and descriptive richness.
  2. Creation of GlotScript-T: This tool performs accurate script identification using a per-character approach with full Unicode 15.0 coverage. Rather than returning a single label, it reports the distribution of scripts in an input text, enabling fine-grained analysis of mixed-script content.
  3. Use Case Demonstrations: The paper illustrates the utility of GlotScript in cleaning multilingual datasets extracted from sources like mC4 and OSCAR and examining the token coverage of LLMs such as GPT-4, Falcon, and Llama2.
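The per-character approach behind GlotScript-T can be sketched in a few lines. The sketch below is an illustration only, not the package's API: it derives a rough script label from the first word of each character's Unicode name (e.g. "LATIN", "CYRILLIC"), whereas GlotScript-T uses the full Unicode 15.0 script property tables and returns ISO 15924 codes.

```python
import unicodedata
from collections import Counter

def char_script(ch):
    """Heuristic script label: first word of the character's Unicode name
    (e.g. 'LATIN', 'CYRILLIC'). GlotScript-T instead consults the full
    Unicode 15.0 script tables and reports ISO 15924 codes."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character without a Unicode name
        return None

def script_distribution(text):
    """Normalized distribution of scripts over alphabetic characters."""
    counts = Counter(char_script(ch) for ch in text if ch.isalpha())
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()} if total else {}

print(script_distribution("hello мир"))  # mostly LATIN, some CYRILLIC
```

The real tool returns this kind of distribution over ISO 15924 codes, which is what makes both the corpus-cleaning and tokenizer-analysis use cases below straightforward.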

Numerical Results and Observations

In the application to multilingual corpora, the authors report accuracy improvements through script filtering: OSCAR achieves an average script accuracy of 0.947 and mC4 an average of 0.917. For the tokenization analysis, the authors map each model's tokenizer script coverage, finding uneven representation across scripts, with notable scarcity for languages written in non-Latin scripts.
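A tokenizer coverage analysis of this kind can be reproduced in spirit by labeling each vocabulary token with its dominant script and tallying the result. This is an illustrative sketch: the Unicode-name prefix is a rough script proxy (the paper's analysis uses GlotScript-T's full Unicode 15.0 coverage), and `toy_vocab` is a made-up stand-in for a real tokenizer's vocabulary.

```python
import unicodedata
from collections import Counter

def dominant_script(token):
    """Most frequent script among a token's alphabetic characters,
    using the Unicode-name prefix as a rough script label."""
    scripts = Counter()
    for ch in token:
        if ch.isalpha():
            try:
                scripts[unicodedata.name(ch).split()[0]] += 1
            except ValueError:
                pass
    return scripts.most_common(1)[0][0] if scripts else None

def vocab_script_coverage(vocab):
    """Count how many vocabulary tokens are dominated by each script."""
    return Counter(s for t in vocab if (s := dominant_script(t)))

# Toy vocabulary standing in for a real tokenizer's token list.
toy_vocab = ["hello", "##ing", "мир", "कम", "123", "!!"]
print(vocab_script_coverage(toy_vocab))
```

Run over an actual model vocabulary, a tally like this makes gaps visible immediately: scripts with few or no dedicated tokens force the tokenizer to fall back on byte-level pieces, which is the scarcity the paper observes for many non-Latin scripts.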

Practical and Theoretical Implications

Practically, GlotScript holds potential for improving data quality in multilingual language processing tasks, notably by identifying and excluding content that does not match expected language-script assignments, thereby enhancing the reliability of low-resource corpora.
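As a concrete illustration of this cleaning step, the sketch below keeps only lines whose characters predominantly belong to the script expected for the target language. The script test is a crude Unicode-name heuristic and the 0.5 threshold is an arbitrary choice, not the paper's; in practice GlotScript-T's ISO 15924 script distributions would replace both.

```python
import unicodedata

def main_script_ratio(text, expected):
    """Fraction of alphabetic characters whose Unicode-name prefix
    matches the expected script label (e.g. 'LATIN')."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters
               if unicodedata.name(ch, "UNKNOWN").split()[0] == expected)
    return hits / len(letters)

def filter_corpus(lines, expected, threshold=0.5):
    """Keep lines dominated by the expected script; drop the rest."""
    return [line for line in lines
            if main_script_ratio(line, expected) >= threshold]

sample = ["hello world", "привет мир", "mixed мир text"]
print(filter_corpus(sample, "LATIN"))
```

Filtering a corpus labeled as, say, English against the Latin script in this way discards the misrouted Cyrillic line while tolerating lines with incidental foreign-script material, which is the kind of language-script mismatch cleanup described above.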

Theoretically, the creation of a definitive script identification framework fills a notable gap in multilingual NLP. By providing a tool that operates with high accuracy on character-level script identification, GlotScript encourages further exploration into culturally and linguistically diverse datasets that require complex script recognition.

Speculations on Future Developments

In the evolving landscape of AI and NLP, tools like GlotScript help broaden the reach of LLMs across more linguistic varieties. As multilingual models continue to improve, their reliance on robust script identification resources will likely grow. Likewise, as the community engages further with rare and historical scripts, parallel refinement and expansion of resources like GlotScript-R could substantially improve the processing of smaller languages. Future work could integrate dynamic updates into the writing system metadata, reflecting usage trends across live, rare, and historical script categories to further enhance data utility and quality. As AI models take on larger roles in language-inclusive policies and technologies, leveraging such tools could help ensure better linguistic representation worldwide.
