- The paper introduces GRAPE, a supervised finetuning framework that improves LLM performance by selecting instruction data aligned with the target model's pretrained distribution.
- GRAPE selects responses for SFT by choosing candidates with the highest probability according to the target model's distribution, requiring only an efficient forward pass.
- Experiments show GRAPE significantly outperforms baselines, achieving better results with less data than conventional SFT methods and other data selection approaches.
The paper "The Best Instruction-Tuning Data are Those That Fit" introduces a novel supervised finetuning (SFT) framework called GRAPE, which addresses the challenge of distribution shifts when finetuning LLMs. GRAPE selects SFT data that aligns with the target model's pretrained distribution, thereby improving performance and robustness.
The authors note that conventional SFT data often comprises instructions paired with multiple responses sampled from other LLMs, which may fall outside the target model's distribution. This can lead to diminishing returns and hurt the model's performance and robustness.
GRAPE addresses this issue by gathering responses from various LLMs for each instruction and selecting the response with the highest probability as measured by the target model. This ensures that the selected data closely aligns with the target model's pretrained distribution. After this selection, standard SFT training is performed.
The methodology is divided into two steps:
- Response Collection: Collect a pool of candidate responses from existing datasets or by sampling from multiple LLMs.
- Customization: Select the response(s) for each instruction that are closest to the pretrained distribution of the target model.
The probability of each response is calculated using the target model, and the responses are ranked accordingly. The GRAPE selection process requires only a forward pass through the candidates, making it computationally efficient.
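The two-step procedure above can be sketched compactly. The snippet below is a minimal illustration, not the authors' implementation: `logprob_fn` is a hypothetical scorer standing in for the target model, assumed to return the total log-probability the model assigns to a response given its instruction (in practice, one forward pass per candidate):

```python
def grape_select(instruction, candidates, logprob_fn):
    """Pick the candidate response to which the target model assigns the
    highest probability (equivalently, the highest log-probability)."""
    # Score each candidate with a single forward pass of the target model.
    scores = [logprob_fn(instruction, resp) for resp in candidates]
    # Return the most in-distribution response along with its score.
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Toy scorer standing in for the target model: it prefers shorter responses.
# (Purely illustrative; a real scorer would sum token log-probs from the LM.)
def toy_logprob(instruction, response):
    return -len(response)

resp, score = grape_select("2+2?", ["four", "4", "the answer is 4"], toy_logprob)
# resp == "4"
```

Because scoring needs only the forward pass (no sampling or gradient computation), the selection cost scales linearly with the number of candidate responses.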
The authors conducted experiments to validate the effectiveness of GRAPE. They first performed controlled experiments using the UltraInteract dataset, fine-tuning models such as Llama3.1-8B, Mistral-7B, and Qwen2.5-7B on GRAPE-selected data. The results showed that GRAPE significantly outperformed strong baselines, including distilling from the strongest model (up to 13.8% absolute gain) and training on 3x more data (up to 17.3% improvement).
The framework's effectiveness was further validated in realistic settings using post-training data from Tulu3 and Olmo-2. GRAPE outperformed strong baselines trained on 4.5 times as much data by 6.1%, and a state-of-the-art data selection approach by 3.9%, in average performance. Additionally, Llama3.1-8B fine-tuned with GRAPE surpassed the performance of Tulu3-SFT while using 1/3 of the data and half the number of epochs.
The authors draw an analogy from reinforcement learning (RL) and preference learning, noting the importance of matching the training data distribution with the policy. They hypothesize that SFT can benefit from aligning data with the model's base distribution to minimize distribution shift, improve data efficiency, and enhance performance.
The paper includes a discussion on the limitations of strictly sampling responses from the base model, which can lead to instability, bias reinforcement, knowledge stagnation, and overfitting. The authors advocate for gathering and selecting responses from various sources to stay in-distribution while providing effective supervision to the base model.
The paper compares GRAPE with existing perplexity-based data selection methods, highlighting that GRAPE uses probability to select responses that better match the base model's distribution for each instruction in a fixed instruction set, rather than selecting instructions based on perplexity.
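The distinction can be made concrete with a toy example. The sketch below contrasts a perplexity-style filter that drops whole instructions with GRAPE-style selection, which keeps every instruction and chooses the most probable response for each. All numbers and the `< 3.0` threshold are illustrative assumptions, not values from the paper:

```python
import math

# Toy scored pool: instruction -> list of (response, total_logprob, n_tokens).
pool = {
    "inst_a": [("r1", -12.0, 10), ("r2", -8.0, 10)],
    "inst_b": [("r3", -30.0, 10), ("r4", -25.0, 10)],
}

def perplexity(total_logprob, n_tokens):
    # Perplexity is the exponentiated average negative log-likelihood per token.
    return math.exp(-total_logprob / n_tokens)

# Perplexity-style *instruction* filtering: keep only instructions whose best
# response falls under a perplexity threshold (illustrative criterion).
kept = [inst for inst, cands in pool.items()
        if min(perplexity(lp, n) for _, lp, n in cands) < 3.0]

# GRAPE-style *response* selection: keep every instruction in the fixed set,
# but pick the most probable response under the target model for each one.
selected = {inst: max(cands, key=lambda c: c[1])[0]
            for inst, cands in pool.items()}
```

In this toy run, the perplexity filter discards `inst_b` entirely, while GRAPE retains it and simply picks its better-fitting response.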
The experimental setup involves using the UltraInteract-SFT dataset for chain-of-thought reasoning tasks, with responses collected from models like Mixtral-8x7B-Instruct, Codestral-22B, Mistral-Small, Llama-3.1-70B-Instruct, Llama-3.1-405B-Instruct, and Qwen2.5-72B-Chat. The models were evaluated on coding and math reasoning benchmarks, including HumanEval, MBPP, LeetCode, MATH, GSM-Plus, and TheoremQA.
The results demonstrate that GRAPE consistently outperforms baselines, including the original UltraInteract-SFT dataset and responses from the strongest model under consideration. The authors emphasize that customization for base models should be prioritized over identifying the highest-quality responses.
In a real-world SFT dataset experiment using Tulu-3 and Olmo-2 data, GRAPE was evaluated on benchmarks such as LeetCode, MATH, BigBenchHard (BBH), MMLU, and AlpacaEval-V2. The results show that models fine-tuned on GRAPE-selected responses outperformed strong baselines, including training over all available data.
Further experiments using OpenHermes-2.5 also showed that GRAPE-selected data yields better performance, reaffirming its effectiveness in enhancing SFT.
The paper includes a discussion of the scenario where all responses come from the same LLM, showing that GRAPE can still select better responses from a single generator. However, the authors also note that in-distribution alignment is not a silver bullet: self-distillation can lead to performance degradation due to reduced solution diversity.
Finally, the paper concludes by emphasizing that GRAPE is a simple yet highly effective approach to improve supervised fine-tuning, building on the hypothesis that instruction tuning data should better match the base model's distribution to optimize the training outcome.