
SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

Published 7 Oct 2024 in cs.CL, cs.AI, and cs.LG | arXiv:2410.05248v2

Abstract: To acquire instruction-following capabilities, LLMs undergo instruction tuning, where they are trained on instruction-response pairs using next-token prediction (NTP). Efforts to improve instruction tuning often focus on higher-quality supervised fine-tuning (SFT) datasets, typically requiring data filtering with proprietary LLMs or human annotation. In this paper, we take a different approach by proposing SFTMix, a novel Mixup-based recipe that elevates LLM instruction tuning beyond the conventional NTP paradigm, without relying on well-curated datasets. Observing that LLMs exhibit uneven confidence across the semantic representation space, we argue that examples with different confidence levels should play distinct roles in instruction tuning--confident data is prone to overfitting, while unconfident data is harder to generalize. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels, interpolates them to bridge the confidence gap, and applies a Mixup-based regularization to support learning on these additional, interpolated examples. By propagating supervision signals across confidence regions and encouraging linear behavior between them, SFTMix mitigates overfitting in confident examples while enhancing generalization in unconfident ones. We demonstrate the effectiveness of SFTMix in both instruction-following and healthcare-specific SFT tasks, with consistent improvements across LLM families and SFT datasets of varying sizes and qualities. Extensive analyses across six directions highlight SFTMix's compatibility with data selection, adaptability to compute-constrained scenarios, and scalability to broader applications.

Summary

  • The paper introduces SFTMix, a mixup-based method to enhance LLM instruction tuning by leveraging model confidence for data segmentation.
  • It integrates mixup regularization with next-token prediction to reduce overfitting and improve task generalization across various domains.
  • Experimental results show enhanced multi-turn conversation metrics and a 1.5% accuracy improvement in healthcare-specific tasks.

SFTMix: Enhancing LLM Instruction Tuning Through Mixup Regularization

The paper introduces SFTMix, a methodological advance in LLM instruction tuning that exploits a novel Mixup-based approach to improve performance without relying on well-curated datasets. Conventional instruction tuning employs next-token prediction (NTP) on high-quality supervised fine-tuning (SFT) datasets, which often necessitate expensive data filtering and preparation. SFTMix sidesteps these costs by harnessing the inherent characteristics of the data and leveraging training dynamics to improve fine-tuning efficiency and efficacy.

Methodology Overview

The novelty of SFTMix is rooted in the observation that LLM confidence varies across the semantic space during instruction tuning. By identifying data subsets based on confidence levels using perplexity metrics at multiple training checkpoints, SFTMix separates the SFT dataset into confident and relatively unconfident subsets. Mixup, traditionally used for regularization in deep learning, is adapted to this context to generate interpolated data instances from these subsets, acting as a regularization mechanism.
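The confidence-based partition described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the input format (mean token negative log-likelihoods per checkpoint), the averaging scheme, and the 50/50 cut-off are all assumptions made for the sketch.

```python
import math

def split_by_confidence(per_example_nll, frac_confident=0.5):
    """Partition examples into confident (low perplexity) and unconfident
    (high perplexity) subsets.

    per_example_nll: dict mapping example id -> list of mean token NLLs,
    one per training checkpoint (hypothetical input for illustration).
    """
    # Average perplexity across checkpoints as the confidence proxy.
    avg_ppl = {eid: math.exp(sum(nlls) / len(nlls))
               for eid, nlls in per_example_nll.items()}
    ranked = sorted(avg_ppl, key=avg_ppl.get)  # lowest perplexity first
    cut = int(len(ranked) * frac_confident)
    return ranked[:cut], ranked[cut:]          # (confident, unconfident)
```

In practice the NLLs would come from forward passes at several saved checkpoints; averaging over checkpoints smooths out single-step noise in the confidence estimate.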

The Mixup-based regularization mitigates overfitting on confident examples and propagates supervisory signals to less confident ones. By integrating this regularization with the NTP loss, SFTMix enhances generalization across a range of tasks, exhibiting robustness across LLM architectures and dataset scales.
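A combined objective of this shape can be sketched in NumPy. Everything here is an illustrative assumption rather than the paper's implementation: `logits_mix` stands in for the model's outputs on interpolated representations, the Beta(α, α) interpolation coefficient follows standard Mixup, and `beta` is a hypothetical weight balancing the Mixup term against the NTP term.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(logits, target_probs):
    """Token-level cross-entropy against (possibly soft) target distributions."""
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -(target_probs * log_probs).sum(axis=-1).mean()

def sftmix_style_loss(logits_conf, y_conf, logits_unconf, y_unconf,
                      logits_mix, alpha=0.2, beta=1.0):
    """Hypothetical combined objective: NTP loss on the confident and
    unconfident subsets plus a Mixup term on interpolated soft targets."""
    lam = rng.beta(alpha, alpha)                  # Mixup coefficient
    y_mix = lam * y_conf + (1 - lam) * y_unconf   # interpolated targets
    ntp = cross_entropy(logits_conf, y_conf) + cross_entropy(logits_unconf, y_unconf)
    mix = cross_entropy(logits_mix, y_mix)
    return ntp + beta * mix
```

The Mixup term penalizes deviations from linear behavior between the two confidence regions, which is the regularization effect the paper attributes to the interpolated examples.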

Experimental Findings

The empirical evaluation against baseline NTP instruction-tuning underscores the efficacy of SFTMix. Notable performance improvements are recorded across various instruction-following and healthcare domain-specific tasks:

  • Instruction-Following Tasks: SFTMix consistently outperformed NTP on MT-Bench and AlpacaEval-2, with gains in both single-turn and multi-turn conversation metrics and observable improvements across diverse task categories such as extraction and coding.
  • Healthcare Domain-Specific Tasks: In specialized domains, SFTMix demonstrated a 1.5% average increase in accuracy over NTP across medical benchmarks such as MedQA and PubMedQA, outperforming existing domain-specific models.

Implications and Future Directions

From a theoretical perspective, SFTMix's ability to leverage model-specific training dynamics introduces a promising pathway to reduce reliance on costly dataset curation without sacrificing performance. This technique encourages a rethinking of data utilization strategies in LLM instruction tuning.

Practical implications of SFTMix include enhanced scalability and adaptability to varied tasks, paving the way for cost-effective and efficient deployment of LLMs in both general and domain-specific contexts. The reduced overfitting and improved generalization performance underscore its potential utility in real-world applications.

Future work could explore the integration of SFTMix with parameter-efficient training methods or apply it to larger models and diverse datasets. The potential for scaling SFTMix to pre-training stages or integrating it with emerging AI methodologies could further broaden its applicability and impact on advancing NLP technologies.

In conclusion, SFTMix represents a significant methodological advance in instruction tuning, offering a refined approach to managing and exploiting training data's intrinsic variability. It delivers consistent performance enhancements, establishing its value across the spectrum of NLP applications.
