- The paper presents a two-stage influence consensus framework that retains only 20% of the training data while achieving 98.6% of full-dataset performance.
- It employs gradient-based influence estimation and a majority voting mechanism to select high-impact samples for multi-task learning.
- Empirical evaluations show ICONS outperforms several baselines, offering a resource-efficient approach for sustainable multimodal model training.
An Expert Review of ICONS: Influence Consensus for Vision-Language Data Selection
The paper "ICONS: Influence Consensus for Vision-Language Data Selection" introduces a novel methodology aimed at improving the efficiency of visual instruction tuning for multimodal LLMs. This work addresses the challenges inherent in managing and processing large-scale vision-language datasets, which often contain redundant examples that increase computational costs without contributing proportionally to model performance. The authors propose the ICONS framework—an influence consensus approach—that identifies a compact yet highly informative subset of data for multi-task learning.
Summary of Contributions
ICONS presents a two-stage data selection framework combining task-specific influence estimation and cross-task consensus to prioritize data samples that maximize performance across multiple tasks. The key contributions of the paper include:
- Task-Specific Influence Estimation: The framework starts with a "specialist" stage where the influence of each training sample is computed per task using gradient-based methods. This step leverages gradient alignment techniques to evaluate how each training data point impacts individual validation tasks.
- Cross-Task Influence Consensus: The "generalist" stage uses a majority voting mechanism to aggregate influence scores across different tasks. This approach ensures the selection of samples that exhibit broad utility rather than those tailored to specific tasks.
- Efficiency and Transferability: The proposed methodology retains only 20% of the large LLaVA-665K training set while preserving 98.6% of the performance of models trained on the full dataset. Furthermore, the resulting subset, LLaVA-ICONS-133K, transfers well to unseen tasks, maintaining robust performance across different contexts.
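The "generalist" stage described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: `consensus_select`, the tie-breaking rule, and the fixed `keep_frac` threshold are assumptions for the sake of the example. Each task casts a vote for every sample in its own top fraction by influence score, and samples are then ranked by total votes.

```python
import numpy as np

def consensus_select(task_scores, keep_frac=0.2):
    """Majority-vote consensus over per-task influence scores.

    task_scores: array of shape (n_tasks, n_samples), one influence score
    per (task, training sample) pair from the specialist stage.
    Returns the sorted indices of the selected subset.
    """
    n_tasks, n_samples = task_scores.shape
    k = max(1, int(keep_frac * n_samples))
    votes = np.zeros(n_samples, dtype=int)
    for scores in task_scores:
        # each task votes for its top-k samples by influence
        votes[np.argsort(scores)[-k:]] += 1
    # rank by vote count; break ties by mean influence across tasks
    order = np.lexsort((task_scores.mean(axis=0), votes))[::-1]
    return np.sort(order[:k])

# Toy example: sample 7 is top-ranked for all three tasks,
# sample 2 for two of them, sample 4 for only one.
task_scores = np.zeros((3, 10))
task_scores[:, 7] = 10.0
task_scores[0, 2] = 5.0
task_scores[1, 2] = 5.0
task_scores[2, 4] = 5.0
selected = consensus_select(task_scores, keep_frac=0.2)
```

Voting rewards samples that are broadly useful (sample 7, endorsed by every task) over ones that are strong for a single task (sample 4), which is exactly the cross-task consensus property the bullet above describes.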
Methodological Insights
The ICONS method is innovative in its use of gradient-based influence estimation to drive data selection. Specifically, it computes the alignment between training-sample and validation-sample gradients to estimate influence, a computationally efficient first-order alternative to classical influence functions, which require second-order (Hessian) information. The voting-based aggregation of task-specific scores into a unified selection criterion is particularly effective at identifying data that benefits multi-task objectives: it reduces the risk of overfitting the selection to individual task idiosyncrasies and enhances the model's generalization abilities.
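The first-order idea can be illustrated with a minimal sketch, assuming per-sample gradients are already available as flat vectors (in practice the paper works with gradients of a large multimodal model, typically compressed or projected; the function name and the cosine normalization here are illustrative choices, not the authors' exact formulation):

```python
import numpy as np

def influence_scores(train_grads, val_grad):
    """Gradient-alignment influence: cosine similarity between each training
    sample's gradient and the validation-task gradient.

    train_grads: array of shape (n_samples, d), one flattened gradient per sample.
    val_grad: array of shape (d,), gradient on the task's validation set.
    Positive scores mean the sample's update direction helps the task.
    """
    train_norms = np.linalg.norm(train_grads, axis=1)
    val_norm = np.linalg.norm(val_grad)
    return (train_grads @ val_grad) / (train_norms * val_norm + 1e-12)

# Synthetic demo: sample 3's gradient points in the validation direction,
# so it should receive the highest influence score.
rng = np.random.default_rng(0)
val_grad = rng.normal(size=8)
train_grads = rng.normal(size=(5, 8))
train_grads[3] = 2.0 * val_grad  # perfectly aligned with the task gradient
scores = influence_scores(train_grads, val_grad)
```

Because only inner products of first-order gradients are needed, this scales linearly in the number of samples, whereas classical influence functions would require inverse-Hessian-vector products.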
Empirical Evaluation
The authors present an extensive empirical evaluation comparing ICONS against several baseline methods, including random selection, semantic deduplication, and score-based metrics such as EL2N and perplexity. ICONS consistently outperforms these baselines across diverse benchmarks spanning visual question answering (VQAv2, GQA), text understanding (TextVQA), and scientific reasoning (ScienceQA).
Notably, models trained on the selected subset match or even exceed full-dataset performance on some tasks. The choice of subset size is supported by rigorous quantitative assessments, including ablation studies and scalability analyses, reinforcing the practical viability of ICONS.
Implications and Future Directions
The practical implications of ICONS extend to the broader field of AI model training, especially in scenarios where computational resources and data management are significant constraints. By effectively reducing the data footprint without compromising model capability, ICONS offers a pathway towards more sustainable AI development.
Theoretically, the approach invites future exploration into more advanced consensus mechanisms that might dynamically adjust the voting strategy based on task complexity or importance. Additionally, extensions could consider incorporating task-agnostic influence measures or adapting the framework to other domains beyond vision-language tasks, potentially enhancing its applicability to a wider array of multimodal datasets.
Overall, ICONS proposes a robust framework for efficient data selection, laying the groundwork for advancements in resource-efficient model training in multimodal AI systems. Its insights into influence-driven data optimization present valuable directions for future research endeavors in the intersection of vision and language learning.