
Harnessing Diversity for Important Data Selection in Pretraining Large Language Models

Published 25 Sep 2024 in cs.AI (arXiv:2409.16986v2)

Abstract: Data selection is of great significance in pre-training LLMs, given the variation in quality within the large-scale available training corpora. To achieve this, researchers are currently investigating the use of data influence to measure the importance of data instances, i.e., a high influence score indicates that incorporating this instance into the training set is likely to enhance the model performance. Consequently, they select the top-k instances with the highest scores. However, this approach has several limitations. (1) Computing the influence of all available data is time-consuming. (2) The selected data instances are not diverse enough, which may hinder the pre-trained model's ability to generalize effectively to various downstream tasks. In this paper, we introduce Quad, a data selection approach that considers both quality and diversity by using data influence to achieve state-of-the-art pre-training results. In particular, noting that attention layers capture extensive semantic details, we have adapted the accelerated iHVP computation methods for attention layers, enhancing our ability to evaluate the influence of data, i.e., its quality. For diversity, Quad clusters the dataset so that instances are similar within each cluster and diverse across clusters. For each cluster, if we opt to select data from it, we take some samples to evaluate the influence to prevent processing all instances. To determine which clusters to select, we utilize the classic Multi-Armed Bandit method, treating each cluster as an arm. This approach favors clusters with highly influential instances (ensuring high quality) or clusters that have been selected less frequently (ensuring diversity), thereby balancing quality and diversity.


Summary

  • The paper presents Quad, a method that enhances LLM pretraining by combining quality assessment and diversity in data selection.
  • It employs transformer attention layers and accelerated iHVP to efficiently evaluate data instance influence, reducing computational overhead.
  • The approach uses clustering and a Multi-Armed Bandit framework to balance quality and diversity, improving model generalization in downstream tasks.

The paper discusses the development of a novel data selection approach named Quad to improve the pre-training of LLMs. The focus lies on addressing limitations in traditional data selection methods that prioritize high-quality data instances based on their influence scores.

Key Limitations in Existing Approaches

  1. Time-Consuming Influence Computation:
    • Traditional methods evaluate the influence of all available data instances. This process is computationally expensive and impractical for large-scale corpora.
  2. Lack of Diversity:
    • Standard influence-based methods often result in a selection of data instances that lack sufficient diversity. This shortfall can impair a model's ability to generalize across various downstream tasks.

Introduction of Quad

To overcome these challenges, the paper proposes Quad, a method that integrates both quality and diversity into the data selection process.

Quality Assessment

  • Leveraging Attention Layers:
    • Attention layers in transformer models capture intricate semantic details. Quad exploits this by adapting accelerated inverse Hessian-vector product (iHVP) computation methods specifically for attention layers.
    • This allows efficient and effective evaluation of the influence of data instances, improving quality assessment.
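The core quantity here is an influence score that requires an inverse Hessian-vector product. A minimal sketch of how such a score can be computed, assuming a LiSSA-style iterative iHVP approximation and a toy NumPy Hessian in place of the paper's attention-layer setup (all names and parameter values are illustrative, not the authors' implementation):

```python
import numpy as np

def ihvp_lissa(hvp_fn, v, damping=0.01, scale=10.0, iters=100):
    """Approximate (H + damping*I)^{-1} v via the LiSSA recursion
    h_{t+1} = v + h_t - (H h_t + damping*h_t)/scale, whose fixed point
    is scale * (H + damping*I)^{-1} v. hvp_fn(x) must return H @ x."""
    h = v.copy()
    for _ in range(iters):
        h = v + h - (hvp_fn(h) + damping * h) / scale
    return h / scale

def influence_score(grad_train, grad_val, hvp_fn):
    """Classic influence-function score of a candidate training instance:
    I(z) = - grad_val^T H^{-1} grad_train (higher means adding z is
    expected to reduce validation loss more)."""
    return -grad_val @ ihvp_lissa(hvp_fn, grad_train)

# Toy check with an explicit positive-definite "Hessian".
rng = np.random.default_rng(0)
H = np.diag(rng.uniform(1.0, 2.0, size=8))   # stand-in for attention-layer curvature
hvp = lambda x: H @ x
g_train = rng.normal(size=8)                 # gradient of a candidate instance
g_val = rng.normal(size=8)                   # gradient on held-out reference data
score = influence_score(g_train, g_val, hvp)
```

The `scale` and `damping` hyperparameters control convergence of the Neumann-series-style recursion; they must be chosen so the Hessian's eigenvalues stay below `scale`.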

Diversity Mechanism

  • Clustering for Diversity:
    • Quad clusters the dataset into groups whose instances are similar within a cluster (intra-cluster similarity) and diverse across clusters (inter-cluster diversity).
    • Within each cluster, only a subset of data is sampled for influence evaluation, avoiding the need to evaluate all instances.
  • Multi-Armed Bandit Model:
    • Clusters are treated as arms in a Multi-Armed Bandit (MAB) framework.
    • The bandit balances exploiting clusters rich in influential instances (ensuring high quality) against exploring less frequently selected clusters (ensuring diversity).
    • The result is a dynamic, balanced selection process that promotes both high-quality and diverse data instances.
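The cluster-as-arm loop above can be sketched with a standard UCB rule. This is a hedged illustration, not the paper's algorithm: the reward is a stand-in mean influence score of a small sample from the chosen cluster, and all names and constants are assumptions.

```python
import math
import random

def select_clusters(clusters, budget, c=1.0, sample_size=4, seed=0):
    """UCB-style cluster selection: each cluster is an arm whose reward is
    the mean influence score of a small sample drawn from it. High observed
    influence (quality) or few pulls (diversity) both raise an arm's score."""
    rng = random.Random(seed)
    pulls = [0] * len(clusters)
    means = [0.0] * len(clusters)
    chosen = []
    for t in range(1, budget + 1):
        # Exploitation term + exploration bonus for rarely-pulled clusters.
        ucb = [
            means[i] + c * math.sqrt(math.log(t + 1) / (pulls[i] + 1))
            for i in range(len(clusters))
        ]
        i = max(range(len(clusters)), key=lambda k: ucb[k])
        sample = rng.sample(clusters[i], min(sample_size, len(clusters[i])))
        reward = sum(sample) / len(sample)           # stand-in mean influence
        pulls[i] += 1
        means[i] += (reward - means[i]) / pulls[i]   # incremental mean update
        chosen.extend(sample)
    return chosen, pulls

# Toy run: three clusters whose elements are their (synthetic) influence scores.
clusters = [[0.1] * 50, [0.9] * 50, [0.5] * 50]
chosen, pulls = select_clusters(clusters, budget=30)
```

In this toy run the high-influence cluster is pulled most often, while every cluster is still visited at least once, illustrating the quality/diversity trade-off the paper attributes to the MAB formulation.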

Implications

By integrating these mechanisms, Quad achieves a more effective pre-training process for LLMs. The approach is both computationally feasible and capable of enhancing the model's generalization across a diverse array of downstream tasks.

The introduction of Quad represents a significant advancement in data selection methodologies, addressing two major constraints: the computational burden of evaluating data influence and the lack of diversity in selected data instances.
