Analyzing the Roles of Language and Vision in Learning from Limited Data
Abstract: Does language help us make sense of the visual world? How important is it to actually see the world rather than to have it described in words? These basic questions about the nature of intelligence have been difficult to answer because we had only one example of an intelligent system -- humans -- and limited access to cases that isolated language or vision. However, the development of sophisticated Vision-Language Models (VLMs) by artificial intelligence researchers offers new opportunities to explore the contributions that language and vision make to learning about the world. We ablate components from the cognitive architecture of these models to identify their contributions to learning new tasks from limited data. We find that a language model leveraging all components recovers a majority of a VLM's performance, despite its lack of visual input, and that language seems to enable this by providing access to prior knowledge and reasoning.