
Proper Dataset Valuation by Pointwise Mutual Information

Published 28 May 2024 in cs.LG and cs.GT (arXiv:2405.18253v3)

Abstract: Data plays a central role in advancements in modern artificial intelligence, with high-quality data emerging as a key driver of model performance. This has prompted the development of principled and effective data curation methods in recent years. However, existing methods largely rely on heuristics, and whether they are truly effective remains unclear. For instance, standard evaluation methods that assess a trained model's performance on specific benchmarks may incentivize assigning high scores to data that merely resembles the test set. This issue exemplifies Goodhart's law: when a measure becomes a target, it ceases to be a good measure. To address this issue, we propose an information-theoretic framework for evaluating data curation methods. We define dataset quality in terms of its informativeness about the true model parameters, formalized using the Blackwell ordering of informativeness. Under this ordering, Blackwell's theorem ensures that more informative data yields optimal models with lower expected loss on the true underlying distribution. To measure informativeness, we show that the Blackwell order can be determined by the Shannon mutual information between the curated data and the test data. To estimate this mutual information, we introduce a novel method that trains Bayesian models on embedded datasets and computes mutual information from the posteriors of model parameters. Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to data curation strategies that reduce dataset informativeness, while traditional test score-based evaluation methods may favor data curation strategies that overfit to the test set but compromise the training data's informativeness.
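The abstract's central idea — scoring a curated dataset by its Shannon mutual information with the test data, computed through Bayesian posteriors over model parameters — admits a closed form in a simple conjugate setting. The sketch below is a toy illustration, not the paper's method: it assumes both datasets are i.i.d. Gaussian draws from a shared latent parameter θ ~ N(0, τ²), so the posterior variance of θ after observing the curated data, and hence I(D_curated; D_test), can be computed analytically. All function and variable names are hypothetical.

```python
import math

def gaussian_dataset_mi(n_curated, n_test, sigma2=1.0, tau2=1.0):
    """Toy closed-form I(D_curated; D_test) under a conjugate Gaussian model.

    Assumed generative model: theta ~ N(0, tau2); every point in both the
    curated and the test dataset is an independent N(theta, sigma2) draw.
    Gaussian conjugacy gives the posterior variance of theta after seeing
    n_curated points:
        tau_c2 = 1 / (1/tau2 + n_curated/sigma2)
    and the mutual information reduces to a ratio of log-determinants of
    the test data's marginal vs. conditional covariance:
        I = 0.5 * log((sigma2 + n_test*tau2) / (sigma2 + n_test*tau_c2)).
    """
    tau_c2 = 1.0 / (1.0 / tau2 + n_curated / sigma2)
    return 0.5 * math.log((sigma2 + n_test * tau2) /
                          (sigma2 + n_test * tau_c2))
```

The closed form exhibits the behavior the abstract argues for: an empty or uninformative curated dataset yields zero mutual information (the posterior equals the prior, so tau_c2 = tau2), and the score increases monotonically as the curated data pins down θ more tightly — independent of any particular test benchmark score.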

