
Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation

Published 7 Apr 2025 in cs.CL and cs.IR (arXiv:2504.08792v1)

Abstract: Named Entity Recognition (NER), a fundamental task in NLP, has advanced significantly for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences, and we evaluate it on four low-resource Pakistani languages: Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked LLMs, our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.

Authors (2)
