Promises and Pitfalls of Threshold-based Auto-labeling

Published 22 Nov 2022 in cs.LG, cs.AI, and stat.ML (arXiv:2211.12620v2)

Abstract: Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely-used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.
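
The TBAL workflow described in the abstract is straightforward to sketch in code. The following is a minimal illustration, not the authors' exact algorithm: it assumes a trained classifier exposing a scikit-learn-style `predict_proba` method, and the target auto-labeling error `eps` and the simple grid search over candidate thresholds are illustrative assumptions (the paper's contribution is the analysis of how much human-labeled validation data such a procedure needs).

```python
import numpy as np

def pick_threshold(model, X_val, y_val, eps=0.05):
    """Find the smallest confidence threshold whose error, measured only on
    the validation points that would be auto-labeled, stays below eps.

    A minimal sketch of threshold-based auto-labeling (TBAL); `eps` and the
    grid search over observed confidences are illustrative choices.
    X_val, y_val are assumed to be NumPy arrays of features / true labels.
    """
    proba = model.predict_proba(X_val)   # shape (n_val, n_classes)
    conf = proba.max(axis=1)             # model confidence per point
    pred = proba.argmax(axis=1)          # predicted labels

    for t in np.sort(np.unique(conf)):
        covered = conf >= t              # validation points above threshold
        err = (pred[covered] != y_val[covered]).mean()
        if err <= eps:                   # smallest t meeting the error target
            return t                     #   => maximum auto-labeled coverage
    return None                          # no threshold meets the target

def auto_label(model, X_unlabeled, threshold):
    """Machine-label only the unlabeled points whose confidence clears the
    chosen threshold; the rest are left for human annotation."""
    proba = model.predict_proba(X_unlabeled)
    conf = proba.max(axis=1)
    mask = conf >= threshold
    return X_unlabeled[mask], proba.argmax(axis=1)[mask]
```

The pitfall flagged in the abstract is visible even in this sketch: the error estimate is computed only on the validation points that fall above a candidate threshold, so certifying error at most eps with confidence 1 - delta requires, by a standard Hoeffding-style argument (not the paper's exact bound), on the order of log(1/delta) / (2 * eps^2) validation points in that region. When coverage above the threshold is small, the total amount of human-labeled validation data needed can therefore become prohibitive, even though the auto-labeled portion itself is accurate.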
