
KnowCoder-X: Boosting Multilingual Information Extraction via Code

Published 7 Nov 2024 in cs.CL, cs.AI, and cs.LG | (2411.04794v3)

Abstract: Empirical evidence indicates that LLMs exhibit spontaneous cross-lingual alignment. However, although LLMs show promising cross-lingual alignment in Information Extraction (IE), a significant imbalance across languages persists, highlighting an underlying deficiency. To address this, we propose KnowCoder-X, a powerful code LLM with advanced cross-lingual and multilingual capabilities for universal IE. Firstly, it standardizes the representation of multilingual schemas using Python classes, ensuring a consistent ontology across different languages. Then, IE across languages is formulated as a unified code generation task. Secondly, we conduct IE cross-lingual alignment instruction tuning on the translated instance prediction task to enhance the model's cross-lingual transferability. During this phase, we also construct a high-quality and diverse bilingual IE parallel dataset with 257k samples, called ParallelNER, synthesized by our proposed robust three-stage pipeline, with manual annotation to ensure quality. Although without training in 29 unseen languages, KnowCoder-X surpasses ChatGPT by 30.17\% and SoTA by 20.03\%, thereby demonstrating superior cross-lingual IE capabilities. Comprehensive evaluations on 64 IE benchmarks in Chinese and English under various settings demonstrate that KnowCoder-X significantly enhances cross-lingual IE transfer through boosting the IE alignment. Our code and dataset are available at: https://github.com/ICT-GoKnow/KnowCoder

Summary

  • The paper introduces a novel code-based framework that treats information extraction as a code generation task to standardize schema representations across languages.
  • It employs a cross-lingual alignment phase using a bilingual dataset, achieving a 30.17% improvement over ChatGPT and 20.03% over SoTA systems.
  • Experimental results show that KnowCoder-X ranks in the top two on 40 of 42 supervised benchmarks, demonstrating robust performance on both NER and RE tasks.

KnowCoder-X: Enhancing Multilingual Information Extraction through Cross-Lingual Alignment

The paper "KnowCoder-X: Boosting Multilingual Information Extraction via Code" introduces a novel framework designed to enhance cross-lingual capabilities in multilingual information extraction (IE). Recognizing the persistent performance imbalance across languages in existing LLMs, the authors propose KnowCoder-X, a code-based LLM framework that uses cross-lingual alignment to improve information extraction performance.

Core Contributions

KnowCoder-X introduces two principal strategies aimed at improving cross-lingual alignment in information extraction:

  1. Unified Code Generation Framework:
    • KnowCoder-X treats information extraction as a code generation task, using Python classes to standardize schema representations across languages. Because code is language-agnostic, the same class definitions serve as a uniform template for typologically diverse languages, ensuring a consistent ontology and semantic representation across language boundaries.
  2. Cross-Lingual Alignment Phase:
    • The alignment phase trains the model on a translated instance prediction task, guided by ParallelNER, a 257k-sample bilingual parallel dataset synthesized by a three-stage LLM-based pipeline with contextual translation and rephrasing, then manually verified for quality. This phase is crucial for generalizing extraction capabilities to unseen languages, as evidenced by substantial improvements over existing models such as ChatGPT and state-of-the-art (SoTA) systems.
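The code-as-schema formulation can be sketched as follows. This is a minimal illustration, not the paper's actual schema: the class names, fields, and the `results` convention are assumptions made for this example; the model is assumed to emit Python instantiation code that is then parsed back into structured extractions.

```python
# Hedged sketch of IE-as-code-generation: a shared Python ontology,
# with model output parsed as instantiation code. Names are illustrative.
from dataclasses import dataclass


@dataclass
class Entity:
    """Base class: one ontology shared by every input language."""
    mention: str


@dataclass
class Person(Entity):
    pass


@dataclass
class Location(Entity):
    pass


def parse_extraction(code: str) -> list:
    """Execute model-generated instantiation code and collect entities."""
    scope = {"Entity": Entity, "Person": Person, "Location": Location}
    exec(code, scope)  # assumes the model binds its output to `results`
    return scope.get("results", [])


# The model emits the same Python surface form regardless of input language:
en_output = 'results = [Person(mention="Marie Curie"), Location(mention="Paris")]'
zh_output = 'results = [Person(mention="居里夫人"), Location(mention="巴黎")]'

en_entities = parse_extraction(en_output)
zh_entities = parse_extraction(zh_output)
```

Because both outputs instantiate the same classes, English and Chinese extractions align on an identical ontology, which is what makes the representation consistent across languages.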
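The translated instance prediction task described above can be sketched as a training-example builder: given a source sentence with its annotations and a translation, the model is asked to predict the corresponding annotations on the target side. The prompt wording and field names here are assumptions for illustration, not the paper's exact format.

```python
# Hedged sketch of a translated-instance-prediction training example.
# Prompt template and field names are illustrative assumptions.
def build_alignment_example(src_sentence, src_entities, tgt_sentence, tgt_entities):
    """Pair an annotated source sentence with its translation; the model
    must predict the corresponding annotations on the target side."""
    prompt = (
        f"Source: {src_sentence}\n"
        f"Source entities: {src_entities}\n"
        f"Target: {tgt_sentence}\n"
        "Target entities:"
    )
    completion = str(tgt_entities)
    return {"prompt": prompt, "completion": completion}


example = build_alignment_example(
    "Marie Curie was born in Warsaw.",
    [("Person", "Marie Curie"), ("Location", "Warsaw")],
    "居里夫人出生于华沙。",
    [("Person", "居里夫人"), ("Location", "华沙")],
)
```

Training on such parallel pairs forces the model to map annotations across the language boundary, which is the alignment signal the paper attributes its cross-lingual transfer gains to.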

Experimental Insights

The experimental validation of KnowCoder-X demonstrates its ability to handle 64 IE benchmarks spanning multiple languages and settings. The model achieved remarkable gains in cross-lingual settings, outperforming ChatGPT by 30.17% and SoTA by 20.03% on 29 unseen languages. Additionally, in supervised evaluations, KnowCoder-X consistently ranked within the top two results on 40 out of 42 benchmarks, highlighting its robust performance across both Named Entity Recognition (NER) and Relation Extraction (RE) tasks in English and Chinese.

Theoretical and Practical Implications

Theoretically, KnowCoder-X underscores the potential of code-based representations to unify and improve multilingual tasks. By ensuring schema and extraction consistency, the approach reduces semantic drift across languages. Practically, KnowCoder-X provides a framework that can be reused and adapted for new languages with minimal resource requirements, offering a scalable solution for a variety of multilingual environments.

Future Directions

Future work might involve expanding KnowCoder-X to a broader range of languages beyond the current focus on English and Chinese. Additionally, integrating other information extraction tasks such as Event Detection (ED) and Event Argument Extraction (EAE) into the cross-lingual alignment phase could further enhance the model's capabilities.

In conclusion, KnowCoder-X represents a significant step forward in refining multilingual IE systems, effectively bridging the gap between schema alignment and cross-lingual transfer, thereby improving performance in complex multilingual settings.
