
SAHAAYAK 2023 -- the Multi Domain Bilingual Parallel Corpus of Sanskrit to Hindi for Machine Translation

Published 27 Jun 2023 in cs.CL | (2307.00021v1)

Abstract: This data article presents SAHAAYAK 2023, a large bilingual parallel corpus for the low-resource language pair Sanskrit-Hindi. The corpus contains a total of 1.5M Sanskrit-Hindi sentence pairs. To make the corpus broadly usable and balanced, it incorporates data from multiple domains, including news, daily conversation, politics, history, sports, and ancient Indian literature. A multifaceted approach was adopted to build a sizable multi-domain corpus for a low-resource language like Sanskrit. The development process ranged from creating a small hand-crafted dataset to applying a wide range of mining, cleaning, and verification techniques. Mining followed a three-fold process: mining from machine-readable sources, mining from non-machine-readable sources, and collation from existing corpora. After mining, a dedicated pipeline for normalization, alignment, and corpus cleaning was developed and applied to make the corpus ready for use with machine translation algorithms.
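The abstract does not describe the internals of the normalization and cleaning pipeline, so the following is only an illustrative sketch of what such a step could look like for Devanagari sentence pairs. The function names (`normalize`, `devanagari_ratio`, `clean_pairs`) and the thresholds (`min_ratio`, `max_len_ratio`) are assumptions made for this example and are not taken from the paper.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def devanagari_ratio(text: str) -> float:
    """Fraction of non-space characters in the Devanagari block (U+0900 to U+097F)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum("\u0900" <= ch <= "\u097F" for ch in chars) / len(chars)


def clean_pairs(pairs, min_ratio=0.5, max_len_ratio=3.0):
    """Filter (sanskrit, hindi) pairs with simple heuristics.

    Hypothetical rules, chosen for illustration only:
    - drop pairs where either side is empty after normalization
    - drop pairs where either side is mostly non-Devanagari
    - drop pairs whose character-length ratio exceeds max_len_ratio
    - drop exact duplicate pairs
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue
        if devanagari_ratio(src) < min_ratio or devanagari_ratio(tgt) < min_ratio:
            continue
        longer = max(len(src), len(tgt))
        shorter = min(len(src), len(tgt))
        if longer / shorter > max_len_ratio:
            continue
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


# Toy input: one valid pair, its duplicate, an empty source, and a Latin-script source.
sample = [
    ("रामः  वनं गच्छति।", "राम वन जाता है।"),
    ("रामः वनं गच्छति।", "राम वन जाता है।"),
    ("", "राम वन जाता है।"),
    ("hello world", "नमस्ते"),
]
cleaned = clean_pairs(sample)
```

A real pipeline at the 1.5M-pair scale would likely need sentence alignment and language identification beyond a script check, since Sanskrit and Hindi share the Devanagari script.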
