
GlotScript: A Resource and Tool for Low Resource Writing System Identification

Published 23 Sep 2023 in cs.CL (arXiv:2309.13320v2)

Abstract: We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns its script distribution where scripts are identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help clean multilingual corpora such as mC4 and OSCAR. Second, we analyze the tokenization of a number of LLMs such as GPT-4 using GlotScript and provide insights into the coverage of low resource scripts and languages by each LLM. We hope that GlotScript will become a useful resource for work on low resource languages in the NLP community. GlotScript-R and GlotScript-T are available at https://github.com/cisnlp/GlotScript.


Summary

  • The paper introduces GlotScript, a dual-component tool combining GlotScript-R for extensive writing system metadata and GlotScript-T for accurate script identification.
  • The paper employs a per-character analysis covering all 161 Unicode 15.0 scripts; script-based filtering yields average script accuracies of 0.947 for OSCAR and 0.917 for mC4.
  • The research enhances multilingual NLP by improving dataset quality and offering a reliable framework for script detection in low-resource language processing.

Overview of GlotScript: A Resource and Tool for Low Resource Writing System Identification

The paper presents the creation and application of GlotScript, an open resource and tool for identifying writing systems, particularly for low-resource languages. GlotScript comprises two components: GlotScript-R, a repository of attested writing systems for more than 7,000 languages, and GlotScript-T, a script identification tool that covers all 161 scripts defined in Unicode 15.0 and reports scripts by their ISO 15924 codes.

Key Contributions

  1. Development of GlotScript-R: This resource compiles information from several existing writing system resources into an organized metadata structure that offers detailed script information for a vast number of languages. The metadata employs dual categorization into MAIN and AUXILIARY to balance accuracy and descriptive richness.
  2. Creation of GlotScript-T: This tool performs accurate script identification using a per-character approach with full Unicode 15.0 coverage. Rather than returning a single label, it reports the distribution of scripts in an input text, enabling fine-grained analysis of mixed-script content.
  3. Use Case Demonstrations: The paper illustrates the utility of GlotScript in cleaning multilingual datasets extracted from sources like mC4 and OSCAR and examining the token coverage of LLMs such as GPT-4, Falcon, and Llama2.
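The per-character approach behind GlotScript-T can be sketched in a few lines. The sketch below is an illustration only, not the package's API: it derives a rough script label from the first word of each character's Unicode name (e.g. "LATIN", "CYRILLIC"), whereas GlotScript-T uses the full Unicode 15.0 script property tables and returns ISO 15924 codes.

```python
import unicodedata
from collections import Counter

def char_script(ch):
    """Heuristic script label: first word of the character's Unicode name
    (e.g. 'LATIN', 'CYRILLIC'). GlotScript-T instead consults the full
    Unicode 15.0 script tables and reports ISO 15924 codes."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character without a Unicode name
        return None

def script_distribution(text):
    """Normalized distribution of scripts over alphabetic characters."""
    counts = Counter(char_script(ch) for ch in text if ch.isalpha())
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()} if total else {}

print(script_distribution("hello мир"))  # mostly LATIN, some CYRILLIC
```

The real tool returns this kind of distribution over ISO 15924 codes, which is what makes both the corpus-cleaning and tokenizer-analysis use cases below straightforward.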

Numerical Results and Observations

In the application to multilingual corpora, the authors report accuracy improvements through script filtering: OSCAR achieves an average script accuracy of 0.947 and mC4 an average of 0.917. For the tokenization analysis, the authors map each model's tokenizer script coverage, finding uneven representation across scripts, with notable scarcity for languages written in non-Latin scripts.
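A tokenizer coverage analysis of this kind can be reproduced in spirit by labeling each vocabulary token with its dominant script and tallying the result. This is an illustrative sketch: the Unicode-name prefix is a rough script proxy (the paper's analysis uses GlotScript-T's full Unicode 15.0 coverage), and `toy_vocab` is a made-up stand-in for a real tokenizer's vocabulary.

```python
import unicodedata
from collections import Counter

def dominant_script(token):
    """Most frequent script among a token's alphabetic characters,
    using the Unicode-name prefix as a rough script label."""
    scripts = Counter()
    for ch in token:
        if ch.isalpha():
            try:
                scripts[unicodedata.name(ch).split()[0]] += 1
            except ValueError:
                pass
    return scripts.most_common(1)[0][0] if scripts else None

def vocab_script_coverage(vocab):
    """Count how many vocabulary tokens are dominated by each script."""
    return Counter(s for t in vocab if (s := dominant_script(t)))

# Toy vocabulary standing in for a real tokenizer's token list.
toy_vocab = ["hello", "##ing", "мир", "कम", "123", "!!"]
print(vocab_script_coverage(toy_vocab))
```

Run over an actual model vocabulary, a tally like this makes gaps visible immediately: scripts with few or no dedicated tokens force the tokenizer to fall back on byte-level pieces, which is the scarcity the paper observes for many non-Latin scripts.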

Practical and Theoretical Implications

Practically, GlotScript holds potential for improving data quality in multilingual language processing tasks, notably by identifying and excluding content that does not match expected language-script assignments, thereby enhancing the reliability of low-resource corpora.
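As a concrete illustration of this cleaning step, the sketch below keeps only lines whose characters predominantly belong to the script expected for the target language. The script test is a crude Unicode-name heuristic and the 0.5 threshold is an arbitrary choice, not the paper's; in practice GlotScript-T's ISO 15924 script distributions would replace both.

```python
import unicodedata

def main_script_ratio(text, expected):
    """Fraction of alphabetic characters whose Unicode-name prefix
    matches the expected script label (e.g. 'LATIN')."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters
               if unicodedata.name(ch, "UNKNOWN").split()[0] == expected)
    return hits / len(letters)

def filter_corpus(lines, expected, threshold=0.5):
    """Keep lines dominated by the expected script; drop the rest."""
    return [line for line in lines
            if main_script_ratio(line, expected) >= threshold]

sample = ["hello world", "привет мир", "mixed мир text"]
print(filter_corpus(sample, "LATIN"))
```

Filtering a corpus labeled as, say, English against the Latin script in this way discards the misrouted Cyrillic line while tolerating lines with incidental foreign-script material, which is the kind of language-script mismatch cleanup described above.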

Theoretically, the creation of a definitive script identification framework fills a notable gap in multilingual NLP. By providing a tool that operates with high accuracy on character-level script identification, GlotScript encourages further exploration into culturally and linguistically diverse datasets that require complex script recognition.

Speculations on Future Developments

In the evolving landscape of AI and NLP, tools like GlotScript help broaden the reach of LLMs across more linguistic varieties. As multilingual models continue to improve, their reliance on robust script identification resources will likely grow. Likewise, as the community engages further with rare and historical scripts, parallel refinement and expansion of resources like GlotScript-R could substantially improve the processing of smaller languages. Future work could integrate dynamic updates into the writing system metadata, reflecting usage trends across live, rare, and historical script categories to further enhance data utility and quality. As AI models take on larger roles in language-inclusive policies and technologies, leveraging such tools could help ensure better linguistic representation worldwide.
