
Code-Based English Models' Surprising Performance on Chinese QA Pair Extraction Task

Published 16 Jan 2024 in cs.CL and cs.AI | arXiv:2401.10286v3

Abstract: In previous studies, code-based models have consistently outperformed text-based models in reasoning-intensive scenarios. When generating our knowledge base for Retrieval-Augmented Generation (RAG), we observed that code-based models also perform exceptionally well on the Chinese QA pair extraction task. Furthermore, our experiments and the metrics we designed showed that code-based models containing a certain amount of Chinese data achieve even better performance. Additionally, the capabilities of code-based English models on specified Chinese tasks offer a distinct perspective for the discussion of the philosophical "Chinese Room" thought experiment.

References (28)
  1. On the Cross-lingual Transferability of Monolingual Representations // Annual Meeting of the Association for Computational Linguistics. 2019.
  2. Qwen Technical Report // ArXiv. 2023. abs/2309.16609.
  3. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism // ArXiv. 2024.
  4. Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision. 2023.
  5. ChatGLM2 Team. ChatGLM2-6B: An Open Bilingual Chat LLM. 2023.
  6. Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca // arXiv preprint arXiv:2304.08177. 2023.
  7. QLoRA: Efficient Finetuning of Quantized LLMs // ArXiv. 2023. abs/2305.14314.
  8. CodeBERT: A Pre-Trained Model for Programming and Natural Languages // ArXiv. 2020. abs/2002.08155.
  9. Fu Yao, Peng Hao, Khot Tushar. How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources // Yao Fu’s Notion. Dec 2022.
  10. Scaling Laws for Neural Language Models // ArXiv. 2020. abs/2001.08361.
  11. Lample Guillaume, Conneau Alexis. Cross-lingual Language Model Pretraining // ArXiv. 2019. abs/1901.07291.
  12. MLQA: Evaluating Cross-lingual Extractive Question Answering // ArXiv. 2019. abs/1910.07475.
  13. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks // ArXiv. 2020. abs/2005.11401.
  14. CLEVA: Chinese Language Models EVAluation Platform // ArXiv. 2023. abs/2308.04813.
  15. Lin Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries // Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, July 2004. 74–81.
  16. Moural Josef. John Searle: The Chinese Room Argument. 2003.
  17. RWKV: Reinventing RNNs for the Transformer Era // Conference on Empirical Methods in Natural Language Processing. 2023.
  18. Code Llama: Open foundation models for code // arXiv preprint arXiv:2308.12950. 2023.
  19. DRCD: a Chinese Machine Reading Comprehension Dataset // ArXiv. 2018. abs/1806.00920.
  20. A Hierarchical Encoding-Decoding Scheme for Abstractive Multi-document Summarization // Conference on Empirical Methods in Natural Language Processing. 2023.
  21. RoFormer: Enhanced Transformer with Rotary Position Embedding // ArXiv. 2021. abs/2104.09864.
  22. A Length-Extrapolatable Transformer // ArXiv. 2022. abs/2212.10554.
  23. Stanford Alpaca: An Instruction-following LLaMA model. 2023.
  24. Representing Numbers in NLP: a Survey and a Vision // ArXiv. 2021. abs/2103.13136.
  25. LLaMA: Open and efficient foundation language models // arXiv preprint arXiv:2302.13971. 2023.
  26. Probing Pretrained Language Models for Lexical Semantics // Conference on Empirical Methods in Natural Language Processing. 2020.
  27. Code4Struct: Code Generation for Few-Shot Event Structure Prediction // Annual Meeting of the Association for Computational Linguistics. 2022.
  28. Baichuan 2: Open Large-scale Language Models // ArXiv. 2023. abs/2309.10305.

Summary

  • The paper reveals that code-based English models significantly outperform traditional language models in extracting Chinese QA pairs.
  • The experiments utilized models with around 7 billion parameters, employing controlled fine-tuning and rigorous hyperparameter consistency.
  • The results highlight reduced hallucinations and enhanced QA quality, underscoring effective cross-domain transfer from code to natural language.

Introduction to LLM Applications

The application of LLMs has become a focal point for advances in NLP. As these models scale up, we see improvements in both performance and efficiency. Although LLMs are broadly categorized by domain (language models for linguistic tasks, code-based models for programming), an intriguing crossover is emerging. Here we examine a curious phenomenon: code-based LLMs trained on English datasets have demonstrated impressive capabilities on Chinese text generation tasks, a finding that challenges the established domain-specificity of pre-trained models.

Exploration of Monolingual and Cross-domain Transfers

Our investigation builds on existing research into transferring monolingual models across languages and domains. Such transfers typically require a model to bridge the structural and lexical differences between languages or data formats; transferring skills from code to text further demands that the model correlate concepts across the two domains. Two lines of work underpin this area: the Cross-lingual Language Model (XLM), which learns shared representations across languages, and CodeBERT, a model for understanding and generating programming languages that illustrates applying pre-trained models outside their native domain.

Methodological Framework

For our experiments, we aimed to generate high-quality question-answer pairs as data for RAG, drawn from a dataset of meticulously annotated open-source and private documents. We selected models with around 7 billion parameters for their balance of efficiency and generalization, including both Chinese-based and English-based code models. We primarily used pre-trained language models (PLMs) rather than supervised fine-tuned (SFT) models, to avoid interference from domain-specific SFT data. We kept hyperparameters consistent and followed a rigorous fine-tuning procedure to ensure controlled, reliable outputs.
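As a concrete illustration of such a pipeline, the sketch below builds a QA-extraction prompt for a document and parses the model's completion into structured pairs. The prompt template, the JSON output convention, and the parsing logic are our own assumptions for illustration; the paper does not publish its actual prompts.

```python
import json
import re

# Hypothetical prompt template: requesting QA pairs as a JSON list gives
# the output a code-like structure that code-based models handle well.
PROMPT_TEMPLATE = (
    "Extract question-answer pairs from the document below.\n"
    'Return ONLY a JSON list of objects with keys "question" and "answer".\n\n'
    "Document:\n{document}\n\nJSON:"
)

def build_prompt(document: str) -> str:
    return PROMPT_TEMPLATE.format(document=document)

def parse_qa_pairs(model_output: str) -> list[dict]:
    """Pull the first JSON list out of a raw model completion."""
    match = re.search(r"\[.*\]", model_output, re.DOTALL)
    if not match:
        return []
    try:
        pairs = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only well-formed entries.
    return [p for p in pairs
            if isinstance(p, dict) and {"question", "answer"} <= p.keys()]

# Example with a mocked completion (no model call):
fake_output = '[{"question": "什么是RAG?", "answer": "检索增强生成。"}]'
pairs = parse_qa_pairs(fake_output)
```

In practice `build_prompt` would feed a generation call to the chosen 7B model, and malformed completions would simply yield an empty list rather than corrupt the knowledge base.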

Surprising Experimental Results

The experimental results were both surprising and enlightening. Code-based LLMs outperformed other LLMs in knowledge-extraction quality, as assessed by a curated set of evaluation metrics. Interestingly, models less familiar with Chinese showed a lower tendency toward hallucination, which is crucial for tasks demanding faithful reproduction of the original content. Moreover, adding a moderate amount of Chinese knowledge improved the models' Chinese processing without introducing significant hallucination. However, an alternative fine-tuning approach using QLoRA did not reproduce the desired task-specific capabilities.
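One simple way to quantify hallucination in this setting is to measure how much of a generated answer never appears in the source document. The character-level metric below is only an illustrative proxy of our own devising, not the paper's designed metrics; Chinese text is compared character by character, so no tokenizer is needed.

```python
def hallucination_rate(answer: str, source: str) -> float:
    """Fraction of content characters in the answer that do not occur
    anywhere in the source document: a crude faithfulness proxy for
    extractive QA, where answers should be grounded in the source."""
    punctuation = "，。！？、：；,.!?:;"
    content = [ch for ch in answer
               if not ch.isspace() and ch not in punctuation]
    if not content:
        return 0.0  # an empty answer cannot hallucinate
    source_chars = set(source)
    novel = sum(1 for ch in content if ch not in source_chars)
    return novel / len(content)
```

A faithful extraction scores near 0.0, while an answer invented from whole cloth scores near 1.0; comparing average scores across models gives a rough ranking of their tendency to hallucinate.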

Implications for AI Development

These findings contribute to the ongoing discourse surrounding the capabilities of pre-trained models in novel contexts, suggesting task-relevant skills could overshadow language congruence when determining model applicability. The observations also stir the long-standing discussion on "ability sharing" across data forms in large-scale models and challenge the notion of needing every parameter to retain learned capabilities.

Anticipated Future Investigations

Looking ahead, we envision a controlled AI framework governed by a regulated Chinese manual, and we aim to further probe the use of code-based models for Chinese-related tasks. Our ambitions include handling garbled Chinese characters, enhancing OCR capabilities, and more broadly tapping the power of code-based models to rethink language processing at the interplay of language, structure, and domain.
