
L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi

Published 11 Oct 2024 in cs.CL and cs.LG (arXiv:2410.09184v1)

Abstract: We present the MahaSUM dataset, a large-scale collection of diverse news articles in Marathi, designed to facilitate the training and evaluation of models for abstractive summarization tasks in Indic languages. The dataset, containing 25k samples, was created by scraping articles from a wide range of online news sources and manually verifying the abstract summaries. Additionally, we train an IndicBART model, a variant of the BART model tailored for Indic languages, using the MahaSUM dataset. We evaluate the performance of our trained models on the task of abstractive summarization and demonstrate their effectiveness in producing high-quality summaries in Marathi. Our work contributes to the advancement of natural language processing research in Indic languages and provides a valuable resource for future research in this area using state-of-the-art models. The dataset and models are shared publicly at https://github.com/l3cube-pune/MarathiNLP

Summary

  • The paper presents the novel MahaSUM dataset of 25K+ Marathi news articles with manually verified abstractive summaries for text summarization.
  • It adapts an IndicBART model with language-specific tokenization and embeddings to effectively manage the nuances of Marathi text.
  • Comparative evaluation reveals that the MahaSUM-trained model outperforms benchmarks, achieving ROUGE-1 of 0.2432 and ROUGE-2 of 0.1711.

Comprehensive Overview of "L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi"

The paper "L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi" presents significant advancements in the domain of NLP for Indic languages, specifically Marathi. Developed by researchers associated with L3Cube Labs and various academic institutes in India, this study makes three main contributions to the field: the construction of a novel dataset, the adaptation and application of a BART model variant for Marathi summarization, and a comparative analysis of these efforts against existing datasets and models.

Key Contributions

  1. MahaSUM Dataset: The paper introduces the MahaSUM dataset, a large-scale collection of 25,374 news articles sourced from prominent Marathi news platforms such as Lokmat and Loksatta. What distinguishes MahaSUM is its manual curation and verification process, which ensures the high quality of abstractive summaries. The dataset addresses the significant lack of resources for Marathi and other Indic languages, providing a foundational resource for advancing NLP research within these linguistic contexts.
  2. IndicBART Model: Another essential contribution of the study is the training of an IndicBART model on the MahaSUM dataset. IndicBART, a variant of BART optimized for Indic languages, handles the linguistic intricacies of Marathi text through language-specific tokenization and embeddings. Following IndicBART's design, text in the supported Indic languages is mapped to a common Devanagari script representation, which improves cross-language transfer.
  3. Comparative Evaluation: The paper compares IndicBART trained on the MahaSUM dataset against the same model trained on the Marathi subset of the pre-existing XL-Sum dataset. Performance is quantified with ROUGE metrics, and the MahaSUM-trained model outperforms the XL-Sum baseline, achieving ROUGE-1 of 0.2432 and ROUGE-2 of 0.1711.
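For readers unfamiliar with the metric: ROUGE-N is an n-gram-overlap score between a candidate summary and a reference. The paper presumably uses a standard ROUGE package, but the F1 variant can be sketched in a few lines of pure Python (toy English strings here; a real evaluation would use Marathi text):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of n-grams for a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """ROUGE-N F1: n-gram overlap between candidate and reference summaries."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())       # clipped n-gram matches
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ref = "the model produces fluent abstractive summaries"
cand = "the model produces abstractive summaries"
print(round(rouge_n(cand, ref, 1), 4))  # → 0.9091
print(round(rouge_n(cand, ref, 2), 4))  # → 0.6667
```

Scores reported in the paper (0.2432 / 0.1711) are averages of such per-pair scores over the whole test set.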

Implications and Future Directions

The introduction of MahaSUM not only enhances the availability of resources for Marathi but also sets a precedent for the development of similar datasets for other low-resource Indic languages. The data collection methodology, alongside the manual verification of summaries, highlights an approach that could be emulated in other linguistic contexts.
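The collect-then-verify pipeline described above can be sketched roughly as follows. The paper does not publish its scraping code, so the page layout, tag names, and class attributes below are entirely hypothetical; only the overall shape (fetch article pages, extract headline and summary, then pass candidates to manual verification) reflects the methodology described:

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Minimal extractor for a hypothetical news-article layout where the
    headline sits in <h1> and the summary in <p class="summary">."""
    def __init__(self):
        super().__init__()
        self.headline = ""
        self.summary = ""
        self._target = None  # which field the current text node belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._target = "headline"
        elif tag == "p" and ("class", "summary") in attrs:
            self._target = "summary"

    def handle_endtag(self, tag):
        self._target = None

    def handle_data(self, data):
        if self._target:
            setattr(self, self._target,
                    getattr(self, self._target) + data.strip())

# Toy page standing in for a scraped article; the real pipeline fetched
# pages from sources such as Lokmat and Loksatta, then manually verified
# each extracted summary before adding it to the dataset.
page = '<h1>मुख्य बातमी</h1><p class="summary">लेखाचा सारांश.</p>'
parser = ArticleParser()
parser.feed(page)
print(parser.headline, parser.summary)
```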

The study’s model adaptation and fine-tuning activities underscore the potential of leveraging sophisticated transformer models for low-resource languages. They emphasize the importance of tailoring such models to account for language-specific nuances, which can lead to significant improvements in task-specific performance metrics.
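One concrete example of such a language-specific nuance is Unicode handling of Devanagari: visually identical Marathi strings can be encoded as different codepoint sequences, so preprocessing pipelines typically apply Unicode normalization before tokenization. The paper does not spell out this step, so treat it as an illustrative assumption:

```python
import unicodedata

# Two encodings of the Devanagari letter "क़" (QA):
precomposed = "\u0958"        # single codepoint, DEVANAGARI LETTER QA
decomposed = "\u0915\u093C"   # KA + combining NUKTA, visually identical

print(precomposed == decomposed)  # False: raw codepoint sequences differ

# QA is a Unicode composition exclusion, so NFC maps both spellings to
# the same KA + NUKTA sequence, making downstream tokenization consistent.
nfc = unicodedata.normalize("NFC", precomposed)
print(nfc == decomposed)          # True
```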

Theoretical and Practical Impact

On a theoretical level, this research contributes to the growing body of knowledge aimed at extending NLP technologies to under-represented languages. It underscores the necessity of building comprehensive datasets and adapting state-of-the-art architectures to the complexities of linguistic diversity.

Practically, the applications of such a model are numerous and include improved text summarization for news and journalistic outlets, enhanced computational understanding within other Marathi text domains, and more efficient content retrieval systems. By substantially enriching the resources available for Marathi, L3Cube-MahaSum facilitates the design and development of a vast array of language technologies that could leverage this dataset for future advancements.

Conclusion

"L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi" makes crucial strides in addressing the disparity in NLP resources for Indic languages. By developing a robust dataset and adopting a fine-tuned BART model variant, the authors provide both practical tools and theoretical insights that will certainly guide future research efforts in this promising area of linguistics and computer science. The public availability of MahaSUM and associated models opens avenues for continued exploration and innovation, further expanding the reach of NLP capabilities across diverse linguistic landscapes.
