Survey of Training Data Influence Analysis and Estimation
The paper "Training Data Influence Analysis and Estimation: A Survey" by Zayd Hammoudeh and Daniel Lowd provides a comprehensive overview of methods for assessing how training data shapes model predictions. As machine learning models have grown larger and more data-driven, understanding the relationship between training data and model behavior has become crucial. This is particularly true for overparameterized models, where the complexity of that relationship can obscure the origins of specific model outputs.
The authors begin from the premise that good models require good training data, especially when datasets are drawn from diverse, uncurated sources such as the internet. They emphasize that data anomalies, including distribution shifts, labeling errors, and adversarial inputs, can degrade model performance and introduce bias, sometimes with severe consequences. To address these challenges, influence analysis offers a lens for examining the contribution of each individual training example to the final model.
Overview of Influence Analysis Methods
The paper classifies influence analysis methods into two families: retraining-based methods and gradient-based estimators. Retraining-based methods, such as leave-one-out analysis, directly measure how model predictions change when a training instance is removed and the model is retrained. While conceptually straightforward, they are computationally expensive, since each influence query requires retraining. To mitigate this cost, techniques such as downsampling and Shapley-value analysis estimate expected influence from models trained on sampled subsets, avoiding exhaustive retraining.
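To make the leave-one-out idea concrete, it can be sketched for a model with a closed-form fit. The snippet below is a minimal illustration, not code from the survey: the least-squares model, the toy dataset, and the sign convention (positive influence means the point was helpful) are all assumptions made here for clarity. It measures each point's influence as the change in test loss when that point is dropped and the model is refit.

```python
import numpy as np

def fit(X, y):
    # Closed-form least-squares fit: w minimizes ||Xw - y||^2
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def loo_influence(X, y, x_test, y_test):
    """Leave-one-out influence of each training point on the squared test loss."""
    w_full = fit(X, y)
    loss_full = (x_test @ w_full - y_test) ** 2
    influences = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        w_i = fit(X[mask], y[mask])
        loss_i = (x_test @ w_i - y_test) ** 2
        # Positive influence: removing z_i raises test loss, so z_i was helpful;
        # negative influence: removing z_i lowers test loss, so z_i was harmful.
        influences.append(loss_i - loss_full)
    return np.array(influences)

# Toy data following y = 2x, with one corrupted label at index 3
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, -8.0])
infl = loo_influence(X, y, np.array([5.0]), 10.0)
```

With this convention, the most negative influence flags the corrupted label, exactly the kind of anomaly (labeling error, adversarial input) that influence analysis is meant to surface. The loop also makes the cost visible: one full retraining per training point, which is what downsampling and Shapley-value approximations amortize away.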
Gradient-based influence estimators, including influence functions and representer-point methods, avoid retraining by using first-order Taylor approximations built from the gradients of the model's loss. These estimators presume differentiability and, often, convexity; both assumptions may fail for state-of-the-art deep learning models.
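As a concrete instance of the first-order approach, the classic influence-function estimate for a differentiable, strictly convex loss approximates the effect of removing a training point z_i on the test loss as (1/n) ∇L(z_test)ᵀ H⁻¹ ∇L(z_i), where H is the Hessian of the mean training loss. The sketch below is illustrative only, not the survey's code: it uses least squares, where the gradients and Hessian have simple closed forms, and the function name and sign convention are assumptions made here.

```python
import numpy as np

def influence_function_estimates(X, y, x_test, y_test):
    """First-order (influence-function) estimate of how removing each
    training point would change the squared loss on one test point."""
    n = len(X)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Per-example gradient of the squared loss: 2 * x_i * (x_i @ w - y_i)
    grads = 2 * X * (X @ w - y)[:, None]
    # Hessian of the mean training loss for least squares: (2/n) X^T X
    H = (2.0 / n) * X.T @ X
    g_test = 2 * x_test * (x_test @ w - y_test)
    # Removing z_i shifts the parameters by roughly (1/n) H^-1 g_i, so the
    # test loss shifts by roughly (1/n) g_test^T H^-1 g_i
    # (positive = point was helpful, negative = point was harmful)
    return grads @ np.linalg.solve(H, g_test) / n

# Toy data following y = 2x, with one corrupted label at index 3
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, -8.0])
est = influence_function_estimates(X, y, np.array([5.0]), 10.0)
```

No model is ever retrained: all n estimates come from one fit plus gradient and Hessian algebra. On a convex problem like this the estimate correctly singles out the corrupted point; for deep networks the Hessian must itself be approximated, which is where the differentiability and convexity caveats above bite.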
Method Comparisons and Complexity
The authors provide a detailed comparison of the methods, outlining their assumptions, computational requirements, and applicability to different model classes, as summarized in Table~\ref{tab:InfluenceEstimators:Comparison}. Retraining-based methods generally incur high upfront computational cost but amortize it across multiple queries, whereas gradient-based estimators provide quick assessments for individual test instances but are sensitive to model and training specifics, such as hyperparameter settings and learning-rate schedules.
Challenges and Implications
Influence analysis faces several challenges, including the need for accurate estimators that scale to large datasets and for methods robust to group effects, where training examples influence the model jointly rather than independently. Models often inherit biases rooted in their training data, reflecting broader societal biases and underscoring the need for fair and interpretable AI. Influence analysis is also essential in adversarial settings, where malicious training inputs may be crafted to manipulate model outcomes.
Future Directions
The paper outlines several key areas for future research. Certified influence estimation could provide guarantees against adversarial data manipulation, while improved estimator scalability could enable real-time use in large-scale and enterprise systems. Shifting the unit of analysis from individual points to the collective impact of data subsets would align influence analysis more closely with practical applications. Finally, incorporating influence analysis into active learning and data curation pipelines offers the potential for greater data efficiency and improved model performance.
In conclusion, Hammoudeh and Lowd's survey serves as a valuable resource for experienced researchers seeking to navigate the multifaceted landscape of training data influence analysis. It underscores the importance of these methods in enhancing the transparency, reliability, and ethical alignment of machine learning systems while acknowledging the computational and theoretical complexities that accompany this field.