Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost
Abstract: Current foundation models have shown impressive performance across various tasks. However, several studies have revealed that these models are not effective for everyone due to the imbalanced geographical and economic representation of the data used in the training process. Most of this data comes from Western countries, leading to poor results for underrepresented countries. To address this issue, more data needs to be collected from these countries, but the cost of annotation can be a significant bottleneck. In this paper, we propose methods to identify the data to be annotated to balance model performance and annotation costs. Our approach first involves finding the countries with images of topics (objects and actions) most visually distinct from those already in the training datasets used by current large vision-language foundation models. Next, we identify countries with higher visual similarity for these topics and show that using data from these countries to supplement the training data improves model performance and reduces annotation costs. The resulting lists of countries and corresponding topics are made available at https://github.com/MichiganNLP/visual_diversity_budget.
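The country-selection idea in the abstract can be sketched in a few lines: represent each country's images of a topic by a mean embedding, then rank other countries by cosine similarity to a target country, so that visually similar countries can supply supplementary training data at lower annotation cost. This is a minimal illustration, not the authors' exact pipeline; in practice the embeddings would come from a vision model such as CLIP, while here random vectors stand in for them and the country names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
countries = ["A", "B", "C", "D"]
dim = 16  # stand-in embedding dimension (a real model would give e.g. 512)

# Hypothetical per-country mean image embeddings for one topic (e.g. "stove").
centroids = {c: rng.normal(size=dim) for c in countries}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(target, centroids):
    """Rank the other countries by visual similarity to the target country."""
    t = centroids[target]
    scores = {c: cosine(t, v) for c, v in centroids.items() if c != target}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Countries whose images most resemble the target's are candidate sources of
# cheaper supplementary annotations for that target.
ranking = most_similar("A", centroids)
print(ranking)
```

The same ranking could be computed per topic (object or action) rather than per country overall, which matches the paper's framing of topic-level visual distinctiveness.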