
Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks

Published 24 May 2024 in cs.CL and cs.AI | (2405.15453v2)

Abstract: LLMs pre-trained on multilingual data have revolutionized natural language processing research by shifting from language- and task-specific model pipelines to a single model adapted to a variety of tasks. However, the majority of existing multilingual NLP benchmarks for LLMs provide evaluation data in only a few languages with little linguistic diversity. In addition, these benchmarks lack quality assessment against the respective state-of-the-art models. This study presents an in-depth examination of 7 prominent LLMs: GPT-3.5-turbo, Llama 2-7B-Chat, Llama 3.1-8B, Bloomz 3B, Bloomz 7B1, Ministral-8B, and Whisper (large, medium, and small variants), across 17 tasks using 22 datasets and 13.8 hours of speech in a zero-shot setting, and compares and analyzes their performance against state-of-the-art (SOTA) models. Our experiments show that SOTA models currently outperform encoder-decoder models in the majority of Urdu NLP tasks under zero-shot settings. However, comparing Llama 3.1-8B with its predecessor Llama 2-7B-Chat, we can deduce that with improved language coverage, LLMs can surpass these SOTA models. Our results emphasize that models with fewer parameters but richer language-specific data, like Llama 3.1-8B, often outperform larger models with lower language diversity, such as GPT-3.5, in several tasks.
