Towards Comprehensive Vietnamese Retrieval-Augmented Generation and Large Language Models
Published 3 Mar 2024 in cs.CL (arXiv:2403.01616v2)
Abstract: This paper presents our contributions towards advancing the state of Vietnamese language understanding and generation through the development and dissemination of open datasets and pre-trained models for Vietnamese Retrieval-Augmented Generation (RAG) and LLMs.
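The abstract centers on Retrieval-Augmented Generation, i.e. retrieving relevant passages and prepending them to the prompt before generation. As a rough illustration of that pattern only, the toy sketch below uses bag-of-words cosine similarity in place of a real Vietnamese text encoder; the function names, the two-sentence corpus, and the prompt template are all invented for illustration and are not from the paper.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding': token counts (stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    """Return the top-k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Augment the query with retrieved context before calling an LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Hanoi is the capital of Vietnam.",
    "PhoBERT is a pre-trained language model for Vietnamese.",
]
query = "What is the capital of Vietnam?"
passages = retrieve(query, corpus, k=1)
prompt = build_prompt(query, passages)
```

In a real RAG system the `embed` step would be a dense encoder and retrieval would run over a vector index; the control flow, however, stays the same.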