Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
Abstract: Today's most accurate LLMs are trained on orders of magnitude more language data than human language learners receive, but with no supervision from the other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next-token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, it improves perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into LLMs, aligning more closely with the multimodal nature of human language acquisition.
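To make the combined objective concrete, below is a minimal PyTorch sketch of a loss in the spirit of the abstract's description: standard next-token prediction plus an InfoNCE-style contrastive term that aligns early-layer token representations with paired image embeddings. The function name, the HuggingFace-style model interface, the choice of grounding layer, the mixing weight, and the mean-pooling of token states (the paper's lexicon-level objective operates on individual token representations) are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: hypothetical helper, not the authors' code.
import torch
import torch.nn.functional as F

def lexicontrastive_loss(lm, image_encoder, tokens, images,
                         ground_layer=1, alpha=0.5, temperature=0.07):
    """Next-token prediction loss plus a contrastive grounding term that
    aligns an early layer's token representations with paired images."""
    # Assumes a HuggingFace-style causal LM that can return hidden states.
    out = lm(tokens, output_hidden_states=True)

    # (1) Standard language modeling: predict token t+1 from tokens <= t.
    logits = out.logits[:, :-1]
    lm_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                              tokens[:, 1:].reshape(-1))

    # (2) Contrastive grounding on an early layer, where representations
    # are still largely lexical. Mean-pooling over tokens here is a
    # simplification of the paper's token-level (lexicon-level) objective.
    h = out.hidden_states[ground_layer].mean(dim=1)          # (B, d)
    txt = F.normalize(h, dim=-1)
    # Image embedding dimension is assumed to match the LM hidden size;
    # a real implementation would add learned projection heads.
    img = F.normalize(image_encoder(images), dim=-1)         # (B, d)

    # Symmetric InfoNCE over the batch: matched pairs are positives.
    sims = txt @ img.t() / temperature                       # (B, B)
    labels = torch.arange(sims.size(0), device=sims.device)
    ground_loss = 0.5 * (F.cross_entropy(sims, labels) +
                         F.cross_entropy(sims.t(), labels))

    return lm_loss + alpha * ground_loss
```

Grounding only an early layer follows the abstract's observation that lexical information is encoded there, leaving later layers free to specialize for next-token prediction.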
References
- Jean-Baptiste Alayrac et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
- Suhas Arehalli and Tal Linzen. 2020. Neural language models capture some, but not all, agreement attraction effects.
- Uri Berger, Gabriel Stanovsky, Omri Abend, and Lea Frermann. 2022. A computational acquisition model for multimodal word categorization. arXiv preprint arXiv:2205.05974.
- Yonatan Bisk et al. 2020. Experience grounds language. arXiv preprint arXiv:2004.10151.
- Tom Brown et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145.
- Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46:904–911.
- Erin M. Buchanan et al. 2019. English semantic feature production norms: An extended database of 4436 concepts. Behavior Research Methods, 51:1849–1863.
- Mathilde Caron et al. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660.
- Charlotte Caucheteux and Jean-Rémi King. 2022. Brains and algorithms partially converge in natural language processing. Communications biology, 5(1):134.
- Tyler A Chang and Benjamin K Bergen. 2022. Word acquisition in neural language models. Transactions of the Association for Computational Linguistics, 10:1–16.
- Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. 2021. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568.
- Gabriella Chronis, Kyle Mahowald, and Katrin Erk. 2023. A method for studying semantic construal in grammatical constructions with interpretable contextual embedding spaces. arXiv preprint arXiv:2305.18598.
- Elizabeth M. Clerkin et al. 2017. Real-world visual statistics and infants' first-learned object names. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1711):20160055.
- Jia Deng et al. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Alexey Dosovitskiy et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Michael C Frank. 2023. Bridging the data gap between children and large language models. Trends in Cognitive Sciences.
- Michael C. Frank, Mika Braginsky, Daniel Yurovsky, and Virginia A. Marchman. 2017. Wordbank: An open repository for developmental vocabulary data. Journal of Child Language, 44(3):677–694.
- Daniela Gerz et al. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.
- Ariel Goldstein et al. 2022. Shared computational principles for language processing in humans and deep language models. Nature Neuroscience, 25(3):369–380.
- Philip A. Huebner, Elior Sulem, Cynthia Fisher, and Dan Roth. 2021. BabyBERTa: Learning more grammar with small-scale child-directed language. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 624–646.
- Talia Konkle and George A Alvarez. 2022. A self-supervised domain-general learning framework for human ventral stream representation. Nature communications, 13(1):491.
- Ranjay Krishna et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73.
- Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert. 2012. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44:978–990.
- Yinhan Liu et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Jiasen Lu et al. 2022. Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916.
- Brian MacWhinney. 2014. The CHILDES project: Tools for analyzing talk, Volume II: The database. Psychology Press.
- Eva Portelance, Michael C. Frank, and Dan Jurafsky. 2023. Learning the meanings of function words from grounded language using a visual question answering model. arXiv preprint arXiv:2308.08628.
- Alec Radford et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
- Alec Radford et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
- Enrico Santus, Anna Gladkova, Stefan Evert, and Alessandro Lenci. 2016. The CogALex-V shared task on the corpus-based identification of semantic relations. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V), pages 69–79.
- Martin Schrimpf et al. 2021. The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45):e2105646118.
- Sara E Schroer and Chen Yu. 2023. Looking is not enough: Multimodal attention supports the real-time learning of new words. Developmental Science, 26(2):e13290.
- Amanda Seidl, Michelle Indarjit, and Arielle Borovsky. 2023. Touch to learn: Multisensory input supports word learning and processing. Developmental Science, page e13419.
- Amanpreet Singh et al. 2022. FLAVA: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650.
- Jessica Sullivan et al. 2021. SAYCam: A large, longitudinal audiovisual dataset recorded from the infant's perspective. Open Mind, 5:20–29.
- Hao Tan and Mohit Bansal. 2020. Vokenization: Improving language understanding with contextualized, visual-grounded supervision. arXiv preprint arXiv:2010.06775.
- Asahi Ushio, Jose Camacho-Collados, and Steven Schockaert. 2021. Distilling relation embeddings from pre-trained language models. arXiv preprint arXiv:2110.15705.
- Ashish Vaswani et al. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Wai Keen Vong, Wentao Wang, A. Emin Orhan, and Brenden M. Lake. 2024. Grounded language acquisition through the eyes and ears of a single child. Science, 383(6682):504–511.
- Jianfeng Wang et al. 2022. GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100.
- Alex Warstadt and Samuel R Bowman. 2022. What artificial neural networks can tell us about human language acquisition. Algebraic Structures in Natural Language, pages 17–60.
- Alex Warstadt et al. 2023. Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning.
- Kelsey L West and Jana M Iverson. 2017. Language learning is hands-on: Exploring links between infants’ object manipulation and verbal input. Cognitive Development, 43:190–200.
- Ethan Gotlieb Wilcox et al. 2020. On the predictive power of neural language models for human real-time comprehension behavior. arXiv preprint arXiv:2006.01912.
- Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
- Thomas Wolf et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
- Robert Wolfe and Aylin Caliskan. 2022. Contrastive visual semantic pretraining magnifies the semantics of natural language representations. arXiv preprint arXiv:2203.07511.
- Yian Zhang, Alex Warstadt, Haau-Sing Li, and Samuel R. Bowman. 2020. When do you need billions of words of pretraining data? arXiv preprint arXiv:2011.04946.
- Chengxu Zhuang, Evelina Fedorenko, and Jacob Andreas. 2023. Visual grounding helps learn word meanings in low-data regimes. arXiv preprint arXiv:2310.13257.
- Chengxu Zhuang et al. 2022. How well do unsupervised learning algorithms model human real-time and life-long learning? Advances in Neural Information Processing Systems, 35:22628–22642.
- Chengxu Zhuang et al. 2021. Unsupervised neural network models of the ventral visual stream. Proceedings of the National Academy of Sciences, 118(3):e2014196118.