Automatic Text Summarization (ATS) for Research Documents in Sorani Kurdish

Published 20 Apr 2025 in cs.CL | (2504.14630v1)

Abstract: Extracting concise information from scientific documents aids learners, researchers, and practitioners. Automatic Text Summarization (ATS), a key NLP application, automates this process. While ATS methods exist for many languages, Kurdish remains underdeveloped due to limited resources. This study develops a dataset and LLM based on 231 scientific papers in Sorani Kurdish, collected from four academic departments in two universities in the Kurdistan Region of Iraq (KRI), averaging 26 pages per document. Using Sentence Weighting and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms, two experiments were conducted, differing in whether the conclusions were included. The average word count was 5,492.3 in the first experiment and 5,266.96 in the second. Results were evaluated manually and automatically using ROUGE-1, ROUGE-2, and ROUGE-L metrics, with the best accuracy reaching 19.58%. Six experts conducted manual evaluations using three criteria, with results varying by document. This research provides valuable resources for Kurdish NLP researchers to advance ATS and related fields.

Authors (2)

Summary

Overview of Automatic Text Summarization for Sorani Kurdish Research Documents

The paper develops and evaluates Automatic Text Summarization (ATS) techniques for Sorani Kurdish research documents. Because few resources and tools exist for Kurdish, and for Sorani in particular, the authors build a foundational dataset and apply existing NLP techniques to it.

Research Context and Motivation

Automatic Text Summarization is a core NLP application for distilling the key information from large volumes of text. ATS has advanced for widely resourced languages, but Kurdish remains understudied because of limited resources. Sorani Kurdish, spoken in parts of Iraq and Iran, is difficult to process automatically because linguistic and computational tools for it are scarce. ATS models for this language would therefore make scientific information more accessible and support academic engagement within the Kurdish research community.

Methodology and Experiments

The authors collected 231 research documents in Sorani Kurdish from four academic departments at two universities in the Kurdistan Region of Iraq (KRI), averaging 26 pages per document. Two algorithms were used for summarization: Sentence Weighting and Term Frequency-Inverse Document Frequency (TF-IDF). The study comprised two experiments, with average word counts of 5,492.3 and 5,266.96 respectively:

1. Inclusion of Conclusions: Research documents were summarized with their conclusion sections included.
2. Exclusion of Conclusions: The same documents were summarized without their conclusion sections, to assess how conclusions influence summarization quality.
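
The summary does not say how the authors combine Sentence Weighting with TF-IDF, so the following is only a minimal sketch of generic TF-IDF sentence scoring for extractive summarization, not the paper's implementation. The function names (`split_sentences`, `tokenize`, `tfidf_sentence_scores`, `summarize`), the naive sentence splitter, and the choice to treat each sentence as a "document" when computing IDF are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption): break after ., !, ? or the
    # Arabic-script question mark when followed by whitespace.
    parts = re.split(r"(?<=[.!?؟])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    # \w+ matches Unicode letters, so Arabic-script tokens are kept.
    # Caveat: a zero-width non-joiner inside a word is not matched by \w
    # and would split that word; real Sorani text needs normalization first.
    return re.findall(r"\w+", sentence.lower())


def tfidf_sentence_scores(sentences):
    # Each sentence plays the role of a document for the IDF statistics.
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        term_counts = Counter(tokens)
        # Score = sum over distinct terms of tf * idf, where
        # tf = count / sentence length and idf = ln(n / document frequency).
        score = sum(
            (count / len(tokens)) * math.log(n / doc_freq[term])
            for term, count in term_counts.items()
        )
        scores.append(score)
    return scores


def summarize(text, k):
    # Keep the k highest-scoring sentences, returned in original order.
    sentences = split_sentences(text)
    scores = tfidf_sentence_scores(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    return [sentences[i] for i in sorted(ranked[:k])]
```

Calling `summarize(document_text, k)` returns the `k` sentences whose terms are most distinctive within the document. Under this sketch, running the two experiments would amount to calling it once on the full text and once on the text with the conclusion section removed.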

The results indicated that including the conclusion sections did not significantly improve summary quality, as measured by ROUGE scores, a well-established family of metrics for evaluating text summarization. The best score reported was 19.58%, obtained with ROUGE-1 for the Social Science department in the second experiment.
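
To make the ROUGE figures concrete, here is a small sketch of how ROUGE-1, ROUGE-2, and ROUGE-L can be computed for one candidate summary against one reference. It assumes whitespace tokenization and reports F1; the summary does not state which ROUGE toolkit the authors used or whether they report precision, recall, or F1, so the numbers from this sketch are not directly comparable to the paper's. All function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    # Multiset of n-grams; empty when the text is shorter than n tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection gives clipped n-gram matches.
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    # Length of the longest common subsequence, one DP row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))
```

ROUGE-1 and ROUGE-2 count overlapping unigrams and bigrams, while ROUGE-L rewards the longest in-order word sequence shared by the two texts. For extractive summaries of long documents scored against human-written references, exact word overlap is a strict criterion, so modest percentages are not unusual.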

Evaluation and Results

Six experts also evaluated the generated summaries manually, using structured forms and three criteria, with a focus on content quality and grammatical accuracy. Results varied by document. Combining manual and automatic evaluation highlights how difficult automated summarization is in low-resource languages, and how the two kinds of evaluation differ.

Implications and Future Directions

This research contributes valuable resources and insights to the field of Kurdish language processing, particularly for Sorani Kurdish. From a practical standpoint, it aids Kurdish academics in accessing and synthesizing research more efficiently. Theoretically, it provides a basis for expanding ATS techniques to similar low-resource languages, promoting linguistic diversity in NLP research.

Future work could involve integrating more advanced machine learning models, such as neural networks and transformer-based architectures, which are known to improve abstractive summarization. Additionally, expanding the dataset and exploring cross-dialect applications within Kurdish could provide broader applicability and insight.

In summary, the study is an early step toward stronger ATS capabilities for Sorani Kurdish and lays groundwork for further work on closing the technology gap for low-resource languages.