An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance
Abstract: Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities, such as images, to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we take a first step towards translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to do the task. Next, we build a two-part evaluation dataset: i) concept: comprising 600 images that are cross-culturally coherent, focusing on a single concept per image, and ii) application: comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of translated images to assess cultural relevance and meaning preservation. We find that, as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. Even the best pipelines can translate only 5% of images for some countries in the easier concept dataset, and no translation is successful for some countries in the application dataset, highlighting the challenging nature of the task. Our code and data are released here: https://github.com/simran-khanuja/image-transcreation.