
Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space

Published 26 Feb 2024 in cs.CL, cs.AI, and cs.CV (arXiv:2402.16832v2)

Abstract: Multimodal LLMs (MLLMs) such as LLaVA and GPT-4(V) enable general-purpose conversations about images in natural language. Because off-the-shelf MLLMs may have limited capabilities on images from domains such as dermatology and agriculture, they must be fine-tuned to unlock domain-specific applications. The prevalent architecture of current open-source MLLMs comprises two major modules: an image-language (cross-modal) projection network and a large language model (LLM). Understanding the roles these two modules play in modeling domain-specific visual attributes would inform the design of future models and streamline interpretability efforts on current ones. To this end, via experiments on 4 datasets and under 2 fine-tuning settings, we find that as the MLLM is fine-tuned, it indeed gains domain-specific visual capabilities, but the updates do not lead to the projection extracting relevant domain-specific visual attributes. Our results indicate that the domain-specific visual attributes are modeled by the LLM, even when only the projection is fine-tuned. Through this study, we offer a potential reinterpretation of the role of cross-modal projections in MLLM architectures. Project webpage: https://claws-lab.github.io/projection-in-MLLMs/
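The two-module architecture the abstract describes can be sketched as follows. This is a minimal, hedged illustration of a LLaVA-style cross-modal projection using NumPy; all dimensions (number of visual tokens, vision feature width, LLM hidden size) and the linear form of the projection are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes (assumptions, not the paper's settings):
NUM_PATCHES = 16   # visual tokens produced by a vision encoder (e.g., CLIP ViT)
VISION_DIM = 1024  # width of the vision encoder's features
LLM_DIM = 4096     # hidden size of the language model

def project_visual_tokens(patch_feats, W, b):
    """Linear cross-modal projection: (P, VISION_DIM) -> (P, LLM_DIM).

    Maps per-patch visual features into the LLM's token-embedding space so
    they can be consumed as a prefix alongside ordinary text embeddings.
    """
    return patch_feats @ W + b

# Stand-ins for real model outputs and learned weights.
patch_feats = rng.standard_normal((NUM_PATCHES, VISION_DIM))
W = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

visual_tokens = project_visual_tokens(patch_feats, W, b)

# The LLM then processes [visual tokens ; text embeddings] as one sequence.
text_embeddings = rng.standard_normal((8, LLM_DIM))  # 8 text tokens (assumed)
sequence = np.concatenate([visual_tokens, text_embeddings], axis=0)
print(sequence.shape)  # (24, 4096)
```

The paper's question concerns exactly this projection step: when fine-tuning updates `W` and `b` (the projection) versus the LLM's own weights, where do the domain-specific visual attributes end up being modeled?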
