
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values

Published 7 Apr 2025 in cs.CL | arXiv:2504.05535v1

Abstract: Aligning LLMs with human preferences has achieved remarkable success. However, existing Chinese preference datasets are limited by small scale, narrow domain coverage, and a lack of rigorous data validation. Additionally, the reliance on human annotators for instruction and response labeling significantly constrains the scalability of human preference datasets. To address these challenges, we design an LLM-based Chinese preference dataset annotation pipeline with no human intervention. Specifically, we crawled and carefully filtered 92k high-quality Chinese queries and employed 15 mainstream LLMs to generate and score chosen-rejected response pairs. Based on this pipeline, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset comprising 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained an 8B-sized Chinese Reward Model (CRM) and meticulously constructed a Chinese Reward Benchmark (CRBench). Evaluation results based on AlignBench (Liu et al., 2024) show that COIG-P significantly outperforms other Chinese preference datasets and brings significant performance improvements ranging from 2% to 12% for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series. The results on CRBench demonstrate that our CRM has strong and robust scoring ability. We apply it to filter chosen-rejected response pairs in a test split of COIG-P, and our experiments show that it is comparable to GPT-4o in identifying low-quality samples while maintaining efficiency and cost-effectiveness. Our code and data are released at https://github.com/multimodal-art-projection/COIG-P.

Summary

Evaluating the COIG-P Dataset for Application in Chinese Language Model Alignment

The development of Large Language Models (LLMs) requires high-quality datasets for aligning models with human preferences across different linguistic and cultural contexts. The paper introduces COIG-P, an extensive Chinese preference dataset curated specifically for aligning LLMs with human values. The work addresses critical gaps in existing datasets, such as limited scale, narrow domain coverage, and weak data validation, particularly within Chinese Natural Language Processing (NLP).

Methodological Advancements

The contribution of COIG-P is significant in its methodological approach to data curation. The researchers developed an LLM-based Chinese preference dataset annotation pipeline that requires no direct human intervention. The pipeline involved scraping, filtering, and validating 92,000 queries, from which 15 mainstream LLMs generated candidate responses. A scoring mechanism spanning 8 judge LLMs then selected chosen-rejected response pairs, bypassing the scalability bottleneck of human annotation. This process resulted in a dataset of 1,009,000 preference pairs across six domains: Chat, Code, Math, Logic, Novel, and Role.
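To make the selection step concrete, here is a minimal sketch of how judge-based pairing of this kind can work: each query's candidate responses are scored by several judge models, and the highest- and lowest-scoring candidates form a chosen-rejected pair. The `judge.score` interface, the averaging, and the `min_gap` threshold are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of chosen-rejected pair selection via judge LLMs.
# `judges` is assumed to expose a score(query, response) method; the
# min_gap threshold for discarding near-ties is illustrative only.
from statistics import mean

def build_preference_pair(query, responses, judges, min_gap=2.0):
    """Score each candidate response with every judge and average the scores."""
    scored = []
    for resp in responses:
        scores = [judge.score(query, resp) for judge in judges]  # assumed API
        scored.append((mean(scores), resp))
    scored.sort(key=lambda s: s[0], reverse=True)
    (top_score, chosen), (low_score, rejected) = scored[0], scored[-1]
    # Discard pairs where the judges barely distinguish the two responses,
    # since weak preference signals add noise to alignment training.
    if top_score - low_score < min_gap:
        return None
    return {"prompt": query, "chosen": chosen, "rejected": rejected}
```

Averaging over multiple judges reduces the bias of any single scoring model, which is one motivation for using a panel of LLMs rather than a lone judge.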

Impactful Results

The efficacy of COIG-P was demonstrated through comprehensive testing on AlignBench, which showed LLM performance improvements ranging from 2% to 12% across a spectrum of domains. For model series such as Qwen2/2.5-7B and Infinity-Instruct-3M-0625, this translates into tangible gains in following and aligning with human preferences. These improvements suggest the dataset's robustness in enhancing model capabilities in specific domains, particularly where prior datasets have lagged.
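As a rough illustration of how preference pairs of this shape are consumed downstream, the sketch below runs DPO training with Hugging Face TRL. The dataset id "m-a-p/COIG-P" and the column names ("prompt", "chosen", "rejected") are assumptions about the release rather than verified facts; consult the linked repository for the authors' actual training setup.

```python
# Minimal sketch of DPO training on COIG-P-style preference pairs with TRL.
# Dataset id and column names are assumed; argument names vary across TRL
# versions (older releases use tokenizer= instead of processing_class=).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # one of the evaluated model series
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("m-a-p/COIG-P", split="train")  # assumed dataset id

training_args = DPOConfig(output_dir="coigp-dpo", per_device_train_batch_size=1)
trainer = DPOTrainer(
    model=model,              # a frozen reference copy is created internally
    args=training_args,
    train_dataset=dataset,    # expects prompt / chosen / rejected columns
    processing_class=tokenizer,
)
trainer.train()
```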

Implications and Future Prospects

A critical advancement from this dataset lies in its implications for developing LLMs in Chinese contexts, previously underserved in AI research. By providing a detailed, large-scale dataset, COIG-P positions itself as invaluable for future research in aligning models to complex, multidimensional human values, particularly in Chinese linguistic environments.

Additionally, COIG-P helps refine benchmarks for human value alignment in machine learning and encourages transparency in reward modeling: the Chinese Reward Model (CRM) was trained on the dataset and evaluated against a newly constructed benchmark, CRBench. The CRM demonstrated performance competitive with closed-source models such as GPT-4o in identifying low-quality samples, while improving efficiency and cost-effectiveness, presenting a practical pathway for future data filtering and model refinement.
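A minimal sketch of the filtering role described above, assuming a sequence-classification-style reward head: the reward model scores both responses in a pair, and pairs where the chosen response does not clearly out-score the rejected one are dropped. The model path, input formatting, and margin are placeholders, not taken from the release.

```python
# Hypothetical sketch of reward-model-based pair filtering.
# The model id, pair encoding, and margin threshold are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

crm_name = "path/to/chinese-reward-model"  # placeholder, not the release id
tokenizer = AutoTokenizer.from_pretrained(crm_name)
crm = AutoModelForSequenceClassification.from_pretrained(crm_name, num_labels=1)

def reward(prompt: str, response: str) -> float:
    """Return a scalar reward for a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return crm(**inputs).logits[0, 0].item()

def keep_pair(pair: dict, margin: float = 0.5) -> bool:
    """Keep a pair only if the chosen response clearly out-scores the rejected one."""
    gap = reward(pair["prompt"], pair["chosen"]) - reward(pair["prompt"], pair["rejected"])
    return gap >= margin
```

Because a single forward pass per response replaces an API call to a large judge model, this is where the efficiency and cost advantages over GPT-4o-based filtering come from.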

In summary, COIG-P represents a substantial step forward in constructing datasets for large-scale preference alignment. It provides a formidable tool for enhancing language understanding in Chinese LLMs and for refining modeling strategies to align with diverse cultural and linguistic values. The dataset also serves as an exemplar of scalable, reproducible data curation, setting a benchmark for future research in AI alignment.
