MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples

Published 11 Dec 2023 in cs.AI, cs.CL, and cs.LG | arXiv:2312.06363v3

Abstract: Although In-Context Learning (ICL) brings remarkable performance gains to LLMs, the improvements remain lower than those from fine-tuning on downstream tasks. This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm that boosts multi-modal fine-tuning by fully leveraging the promising ICL capability of multi-modal LLMs (MM-LLMs). We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives. Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on textual-guided visual features. Moreover, leveraging the flexibility of M-Hub, we design a variety of in-context demonstrations. Extensive experiments on a diverse range of downstream multi-modal tasks demonstrate that MMICT significantly outperforms both the traditional fine-tuning strategy and the vanilla ICT method that directly takes the concatenation of all information from different modalities as input. Our implementation is available at: https://github.com/KDEGroup/MMICT.
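The abstract describes M-Hub producing two kinds of fused features: visual-guided textual features for the in-context demonstrations, and textual-guided visual features for the query that conditions generation. The following is a minimal conceptual sketch of that data flow only; every name (`m_hub`, `build_icl_sequence`, the toy additive fusion) is an illustrative assumption for exposition, not the paper's actual API or architecture.

```python
# Conceptual sketch of the MMICT input construction described in the
# abstract. The fusion here is a toy additive stand-in for M-Hub's
# learned cross-modal attention; names are hypothetical placeholders.

def m_hub(visual, text, mode):
    """Toy stand-in for the Multi-Modal Hub (M-Hub): fuses the two
    modalities differently depending on the requested objective."""
    if mode == "visual_guided_text":
        # textual features conditioned on the visual input
        return [t + 0.1 * v for v, t in zip(visual, text)]
    if mode == "text_guided_visual":
        # visual features conditioned on the textual input
        return [v + 0.1 * t for v, t in zip(visual, text)]
    raise ValueError(f"unknown mode: {mode}")

def build_icl_sequence(demos, query):
    """Concatenate in-context demonstrations (as visual-guided textual
    features) with the query's textual-guided visual features, forming
    the sequence the MM-LLM would generate from."""
    sequence = []
    for vis, txt in demos:
        sequence.extend(m_hub(vis, txt, "visual_guided_text"))
    q_vis, q_txt = query
    sequence.extend(m_hub(q_vis, q_txt, "text_guided_visual"))
    return sequence

# Toy usage with one demonstration and one query (scalar "features").
demos = [([1.0], [2.0])]
query = ([3.0], [4.0])
seq = build_icl_sequence(demos, query)
```

In a real system the fused features would be dense embedding tensors fed to a frozen MM-LLM, with only the hub (and possibly lightweight adapters) updated during tuning; the sketch only shows how the demonstration and query fusions differ in direction.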
