MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Abstract: Visual preference alignment involves training Large Vision-Language Models (LVLMs) to predict human preferences between visual inputs. This is typically achieved by using labeled datasets of chosen/rejected pairs and employing optimization algorithms like direct preference optimization (DPO). Existing visual alignment methods, primarily designed for single-image scenarios, struggle to handle the complexity of multi-image tasks due to the scarcity of diverse training data and the high cost of annotating chosen/rejected pairs. We present Multi-Image Augmented Direct Preference Optimization (MIA-DPO), a visual preference alignment approach that effectively handles multi-image inputs. MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or pic-in-pic formats, significantly reducing the costs associated with multi-image data annotation. Our observations reveal that attention values of LVLMs vary considerably across different images. We use these attention values to identify and filter out rejected responses in which the model mistakenly focused on the wrong image. This attention-aware selection constructs chosen/rejected pairs without relying on (i) human annotation, (ii) extra data, or (iii) external models or APIs. MIA-DPO is compatible with various architectures and outperforms existing methods on five multi-image benchmarks, achieving an average performance boost of 3.0% on LLaVA-v1.5 and 4.3% on the recent InternLM-XC2.5. Moreover, MIA-DPO has a minimal effect on the model's ability to understand single images.
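The two mechanisms the abstract describes — DPO training on chosen/rejected pairs, and using per-image attention mass to flag misattended responses as rejected samples — can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names, signatures, and the 0.5 attention threshold are all illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one chosen/rejected response pair.

    Each argument is the summed log-probability of the full response
    under the policy being trained (logp_*) or the frozen reference
    model (ref_logp_*); beta scales the implicit reward.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Negative log-sigmoid of the margin: the loss shrinks as the policy
    # prefers the chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def attention_ratio(attn_per_image, target_idx):
    """Fraction of the model's image-attention mass on the target image."""
    total = sum(attn_per_image)
    return attn_per_image[target_idx] / total if total > 0 else 0.0

def looks_misattended(attn_per_image, target_idx, threshold=0.5):
    """Flag a response whose attention drifted to the wrong image.

    Such a response can serve as the 'rejected' side of a preference
    pair; the threshold value here is a made-up example.
    """
    return attention_ratio(attn_per_image, target_idx) < threshold
```

A loss of log 2 ≈ 0.693 corresponds to a zero reward margin; as the policy separates chosen from rejected responses, the loss decreases. A response flagged by `looks_misattended` would be paired against a correctly attended answer to form a training pair, which is what removes the need for human annotation or external judges.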