
Code-Based English Models' Surprising Performance on Chinese QA Pair Extraction Task

Published 16 Jan 2024 in cs.CL and cs.AI | arXiv:2401.10286v3

Abstract: In previous studies, code-based models have consistently outperformed text-based models in reasoning-intensive scenarios. When generating our knowledge base for Retrieval-Augmented Generation (RAG), we observed that code-based models also perform exceptionally well on the Chinese QA pair extraction task. Furthermore, our experiments and the metrics we designed showed that code-based models containing a certain amount of Chinese data achieve even better performance. Additionally, the capabilities of code-based English models on specified Chinese tasks offer a distinct perspective for the discussion of the philosophical "Chinese Room" thought experiment.

References (28)
  1. On the Cross-lingual Transferability of Monolingual Representations // Annual Meeting of the Association for Computational Linguistics. 2019.
  2. Qwen Technical Report // ArXiv. 2023. abs/2309.16609.
  3. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism // ArXiv. 2024.
  4. Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision. 2023.
  5. ChatGLM2 Team. ChatGLM2-6B: An Open Bilingual Chat LLM. 2023.
  6. Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca // arXiv preprint arXiv:2304.08177. 2023.
  7. QLoRA: Efficient Finetuning of Quantized LLMs // ArXiv. 2023. abs/2305.14314.
  8. CodeBERT: A Pre-Trained Model for Programming and Natural Languages // ArXiv. 2020. abs/2002.08155.
  9. Fu Yao, Peng Hao, Khot Tushar. How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources // Yao Fu’s Notion. Dec 2022.
  10. Scaling Laws for Neural Language Models // ArXiv. 2020. abs/2001.08361.
  11. Lample Guillaume, Conneau Alexis. Cross-lingual Language Model Pretraining // ArXiv. 2019. abs/1901.07291.
  12. MLQA: Evaluating Cross-lingual Extractive Question Answering // ArXiv. 2019. abs/1910.07475.
  13. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks // ArXiv. 2020. abs/2005.11401.
  14. CLEVA: Chinese Language Models EVAluation Platform // ArXiv. 2023. abs/2308.04813.
  15. Lin Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries // Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, July 2004. 74–81.
  16. Moural Josef. John Searle: The Chinese Room Argument. 2003.
  17. RWKV: Reinventing RNNs for the Transformer Era // Conference on Empirical Methods in Natural Language Processing. 2023.
  18. Code Llama: Open foundation models for code // arXiv preprint arXiv:2308.12950. 2023.
  19. DRCD: a Chinese Machine Reading Comprehension Dataset // ArXiv. 2018. abs/1806.00920.
  20. A Hierarchical Encoding-Decoding Scheme for Abstractive Multi-document Summarization // Conference on Empirical Methods in Natural Language Processing. 2023.
  21. RoFormer: Enhanced Transformer with Rotary Position Embedding // ArXiv. 2021. abs/2104.09864.
  22. A Length-Extrapolatable Transformer // ArXiv. 2022. abs/2212.10554.
  23. Stanford Alpaca: An Instruction-following LLaMA model. 2023.
  24. Representing Numbers in NLP: a Survey and a Vision // ArXiv. 2021. abs/2103.13136.
  25. LLaMA: Open and efficient foundation language models // arXiv preprint arXiv:2302.13971. 2023.
  26. Probing Pretrained Language Models for Lexical Semantics // Conference on Empirical Methods in Natural Language Processing. 2020.
  27. Code4Struct: Code Generation for Few-Shot Event Structure Prediction // Annual Meeting of the Association for Computational Linguistics. 2022.
  28. Baichuan 2: Open Large-scale Language Models // ArXiv. 2023. abs/2309.10305.

Summary

  • The paper reveals that code-based English models significantly outperform traditional language models in extracting Chinese QA pairs.
  • The experiments utilized models with around 7 billion parameters, employing controlled fine-tuning and rigorous hyperparameter consistency.
  • The results highlight reduced hallucinations and enhanced QA quality, underscoring effective cross-domain transfer from code to natural language.

Introduction to LLM Applications

The application of LLMs has become a focal point for advances in NLP. As these models scale up, we see improvements in both performance and efficiency. Although LLMs are broadly categorized by domain (language models for linguistic tasks, code-based models for programming), an intriguing crossover is emerging. Here we examine a curious phenomenon: code-based LLMs trained on English datasets have demonstrated impressive capabilities on Chinese text generation tasks, a finding that challenges the established domain-specificity of pre-trained models.

Exploration of Monolingual and Cross-domain Transfers

Our investigation builds on existing research into transferring monolingual models across languages and domains. Such transfers typically require a model to bridge the structural and lexical differences between languages or data formats; transferring skills from code to text further demands that the model correlate concepts across the two domains. Two lines of work underpin this area: the Cross-lingual Language Model (XLM), which learns shared representations across languages, and CodeBERT, a model for understanding and generating programming languages that illustrates applying pre-trained models outside their native domain.

Methodological Framework

For our experiments, we aimed to generate high-quality question-answer pairs as data for RAG, drawn from a dataset of meticulously annotated open-source and private documents. We selected models with around 7 billion parameters for their balance of efficiency and generalization, including both Chinese-based and English-based code models. We primarily used pre-trained language models (PLMs) rather than supervised fine-tuned (SFT) models, to avoid interference from domain-specific SFT data. We kept hyperparameters consistent and followed a rigorous fine-tuning procedure to ensure controlled, reliable outputs.
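As a concrete illustration of such a pipeline, the sketch below builds a QA-extraction prompt for a document and parses the model's completion into structured pairs. The prompt template, the JSON output convention, and the parsing logic are our own assumptions for illustration; the paper does not publish its actual prompts.

```python
import json
import re

# Hypothetical prompt template: requesting QA pairs as a JSON list gives
# the output a code-like structure that code-based models handle well.
PROMPT_TEMPLATE = (
    "Extract question-answer pairs from the document below.\n"
    'Return ONLY a JSON list of objects with keys "question" and "answer".\n\n'
    "Document:\n{document}\n\nJSON:"
)

def build_prompt(document: str) -> str:
    return PROMPT_TEMPLATE.format(document=document)

def parse_qa_pairs(model_output: str) -> list[dict]:
    """Pull the first JSON list out of a raw model completion."""
    match = re.search(r"\[.*\]", model_output, re.DOTALL)
    if not match:
        return []
    try:
        pairs = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only well-formed entries.
    return [p for p in pairs
            if isinstance(p, dict) and {"question", "answer"} <= p.keys()]

# Example with a mocked completion (no model call):
fake_output = '[{"question": "什么是RAG?", "answer": "检索增强生成。"}]'
pairs = parse_qa_pairs(fake_output)
```

In practice `build_prompt` would feed a generation call to the chosen 7B model, and malformed completions would simply yield an empty list rather than corrupt the knowledge base.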

Surprising Experimental Results

The experimental results were both surprising and enlightening. Code-based LLMs outperformed other LLMs in knowledge-extraction quality, as assessed by a curated set of evaluation metrics. Interestingly, models less familiar with Chinese showed a lower tendency toward hallucination, which is crucial for tasks demanding faithful reproduction of the original content. Moreover, adding a moderate amount of Chinese knowledge improved the models' Chinese processing without introducing significant hallucination. However, an alternative fine-tuning approach using QLoRA did not reproduce the desired task-specific capabilities.
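One simple way to quantify hallucination in this setting is to measure how much of a generated answer never appears in the source document. The character-level metric below is only an illustrative proxy of our own devising, not the paper's designed metrics; Chinese text is compared character by character, so no tokenizer is needed.

```python
def hallucination_rate(answer: str, source: str) -> float:
    """Fraction of content characters in the answer that do not occur
    anywhere in the source document: a crude faithfulness proxy for
    extractive QA, where answers should be grounded in the source."""
    punctuation = "，。！？、：；,.!?:;"
    content = [ch for ch in answer
               if not ch.isspace() and ch not in punctuation]
    if not content:
        return 0.0  # an empty answer cannot hallucinate
    source_chars = set(source)
    novel = sum(1 for ch in content if ch not in source_chars)
    return novel / len(content)
```

A faithful extraction scores near 0.0, while an answer invented from whole cloth scores near 1.0; comparing average scores across models gives a rough ranking of their tendency to hallucinate.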

Implications for AI Development

These findings contribute to the ongoing discourse surrounding the capabilities of pre-trained models in novel contexts, suggesting task-relevant skills could overshadow language congruence when determining model applicability. The observations also stir the long-standing discussion on "ability sharing" across data forms in large-scale models and challenge the notion of needing every parameter to retain learned capabilities.

Anticipated Future Investigations

Looking ahead, we envision a controlled AI framework governed by a regulated Chinese manual, and we aim to further probe the use of code-based models for Chinese-related tasks. Our ambitions include handling garbled Chinese characters, enhancing OCR capabilities, and more broadly tapping the power of code-based models to rethink language processing at the interplay of language, structure, and domain.
