IndicSQuAD: A Comprehensive Multilingual Question Answering Dataset for Indic Languages

Published 6 May 2025 in cs.CL and cs.LG | (2505.03688v2)

Abstract: The rapid progress in question-answering (QA) systems has predominantly benefited high-resource languages, leaving Indic languages largely underrepresented despite their vast native speaker base. In this paper, we present IndicSQuAD, a comprehensive multilingual extractive QA dataset covering nine major Indic languages, systematically derived from the SQuAD dataset. Building on previous work with MahaSQuAD for Marathi, our approach adapts and extends translation techniques to maintain high linguistic fidelity and accurate answer-span alignment across diverse languages. IndicSQuAD comprises extensive training, validation, and test sets for each language, providing a robust foundation for model development. We evaluate baseline performance using language-specific monolingual BERT models and the multilingual MuRIL-BERT. The results indicate some challenges inherent in low-resource settings. Moreover, our experiments suggest potential directions for future work, including expanding to additional languages, developing domain-specific datasets, and incorporating multimodal data. The dataset and models are publicly shared at https://github.com/l3cube-pune/indic-nlp
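
The abstract mentions answer-span alignment but does not describe the procedure. As a rough illustration of the problem only, the following minimal sketch assumes the simplest possible strategy: after a context and its answer are translated independently, the answer is located in the translated context by exact substring match, and the example is dropped when no match exists. The names `QAExample`, `realign_answer`, and `build_example` are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAExample:
    """A SQuAD-style extractive QA record (hypothetical helper type)."""
    context: str
    question: str
    answer_text: str
    answer_start: int  # character (code point) offset of the answer in context


def realign_answer(context: str, answer_text: str) -> int:
    """Return the offset of answer_text in context, or -1 if it is absent.

    Assumption: exact substring matching. A real pipeline would likely need
    a fuzzier strategy, since translation can inflect or reorder the answer.
    """
    return context.find(answer_text)


def build_example(context: str, question: str, answer_text: str) -> Optional[QAExample]:
    """Build a record from translated fields, or None if the span is lost."""
    start = realign_answer(context, answer_text)
    if start < 0:
        return None  # answer no longer appears verbatim in the context
    return QAExample(context, question, answer_text, start)


if __name__ == "__main__":
    # Illustrative Hindi strings, not drawn from the dataset.
    ctx = "ताजमहल आगरा में स्थित है।"
    ex = build_example(ctx, "ताजमहल कहाँ स्थित है?", "आगरा")
    if ex is not None:
        end = ex.answer_start + len(ex.answer_text)
        print(ex.answer_start, ex.context[ex.answer_start:end])
```

Python string offsets count Unicode code points, so a span recovered this way slices the context correctly even for Devanagari text with combining vowel signs. Exact matching is only a baseline and will drop examples whose translated answer differs in inflection from its occurrence in the translated context.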

Summary

Overview of LuaLaTeX and XeLaTeX Template for *ACL Style Files

The paper "LuaLaTeX and XeLaTeX Template for *ACL Style Files" is a technical guide to using the *ACL style files with LuaLaTeX and XeLaTeX, two TeX-based typesetting systems. It serves as a reference for authors preparing papers that must follow *ACL formatting, which matters most in the computational linguistics community, where *ACL venues are central.

Typesetting with LuaLaTeX and XeLaTeX

LuaLaTeX and XeLaTeX are notable for their ability to handle complex scripts and fonts, which makes them well suited to documents that mix multiple languages and scripts. The paper includes examples of text in Hindi and Arabic that demonstrate correct rendering of different character sets. These examples show how the two systems support multilingual document production, an important capability for researchers who work with cross-linguistic data or write in more than one language.

Citation Standards

The template also demonstrates citation practice consistent with *ACL style requirements. Its reference section gives examples of several entry types, including books, articles, conference proceedings, and miscellaneous sources. Consistent, accurate bibliographic entries support proper attribution and make it easier for readers to trace related work.

Practical Applications

In practice, using LuaLaTeX or XeLaTeX with the *ACL style templates helps overcome limitations of traditional LaTeX systems, particularly in font handling and script compatibility. Researchers in computational linguistics can typeset complex scripts without manual intervention and still produce documents that meet *ACL formatting requirements.

Future Directions

Looking ahead, continued development of typesetting systems such as LuaLaTeX and XeLaTeX will likely make them more accessible to researchers new to TeX environments. As interdisciplinary research becomes more common, these systems may also evolve to support a wider range of linguistic and symbolic representations.

In conclusion, the paper serves as a practical guide for computational linguistics authors who need to meet *ACL publication standards while using LuaLaTeX or XeLaTeX. Following it supports consistent adherence to *ACL style and reliable presentation of multilingual content.
