Baichuan-Omni-1.5 Technical Report

Published 26 Jan 2025 in cs.CL, cs.SD, and eess.AS | (2501.15368v1)

Abstract: We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining approximately 500B tokens of high-quality data (text, audio, and vision). Second, an audio tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLMs. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.

Summary

  • The paper introduces Baichuan-Omni-1.5, a model that integrates end-to-end audio generation with robust visual and textual processing using a multi-stage training strategy.
  • It employs a custom Baichuan-Audio-Tokenizer and flow matching-based decoder to capture both semantic and acoustic details, enhancing multimodal interactions.
  • It achieves superior performance on benchmarks like OpenMM-Medical (83.8%) and MMLU (72.2%), demonstrating its competitive edge over similar omni-modal models.

The paper introduces Baichuan-Omni-1.5, a novel omni-modal model featuring end-to-end audio generation capabilities. The model leverages approximately 500B tokens of multimodal data, an audio-tokenizer (Baichuan-Audio-Tokenizer), and a multi-stage training strategy to achieve seamless, high-quality interaction across modalities without compromising individual modality performance. Baichuan-Omni-1.5 exhibits competitive performance, rivaling models like Qwen2-VL-72B, particularly in multimodal medical benchmarks.

The key components and contributions include:

  • A comprehensive data cleaning and synthesis pipeline for multimodal data.
  • Baichuan-Audio-Tokenizer to capture both semantic and acoustic information from audio, enhancing compatibility with MLLMs.
  • A multi-stage training strategy for effective synergy across all modalities.
  • OpenAudioBench, an open-source audio understanding and generation benchmark for evaluating end-to-end audio capabilities.
  • OpenMM-Medical, a comprehensive medical understanding benchmark on which Baichuan-Omni-1.5, using a 7B LLM, achieves SOTA performance, scoring 83.8% and surpassing Qwen2-VL-72B's 80.7%.

The architecture of Baichuan-Omni-1.5 comprises a visual branch, an audio branch, and a pre-trained LLM backbone. The visual branch employs NaViT, similar to Qwen2-VL, for processing image and video inputs, along with a two-layer MLP visual projector. The audio branch incorporates the Baichuan-Audio-Tokenizer and a flow matching-based decoder for end-to-end speech processing. The Baichuan-Audio-Tokenizer is based on Residual Vector Quantization (RVQ) with a frame rate of 12.5 Hz.
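
To make the wiring concrete, here is a minimal PyTorch sketch of the three branches. The class names, dimensions, and the injected vision_encoder, audio_tokenizer, and llm interfaces are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Two-layer MLP mapping vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class OmniModel(nn.Module):
    """Hypothetical wiring of the visual branch, audio branch, and LLM backbone."""
    def __init__(self, vision_encoder, audio_tokenizer, llm,
                 vision_dim: int = 1152, llm_dim: int = 4096, audio_vocab: int = 8192):
        super().__init__()
        self.vision_encoder = vision_encoder        # NaViT-style, variable-resolution input
        self.visual_projector = VisualProjector(vision_dim, llm_dim)
        self.audio_tokenizer = audio_tokenizer      # RVQ codec, ~12.5 Hz frame rate
        self.audio_embed = nn.Embedding(audio_vocab, llm_dim)  # audio embedding layer
        self.llm = llm                              # pre-trained text backbone
        self.audio_head = nn.Linear(llm_dim, audio_vocab)      # independent audio head

    def forward(self, images, waveform, text_embeds):
        vis = self.visual_projector(self.vision_encoder(images))       # [B, Nv, llm_dim]
        aud = self.audio_embed(self.audio_tokenizer.encode(waveform))  # [B, Na, llm_dim]
        seq = torch.cat([vis, aud, text_embeds], dim=1)  # real models interleave by position
        hidden = self.llm(inputs_embeds=seq)             # assumed to return final hidden states
        # Predicted discrete audio tokens would be rendered to waveform
        # by the flow matching-based decoder (not shown here).
        return self.audio_head(hidden)
```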

The training strategy involves a multi-stage approach (a configuration sketch follows the list):

  1. Image-Text Pretrain: Extends an LLM to process and understand visual input.
  2. Image-Audio-Text Pretrain: Expands the LLM to understand audio data in an end-to-end manner, incorporating the Baichuan-Audio-Tokenizer, a newly introduced audio embedding layer, and an independent audio head.
  3. Omni-Modal Pretrain: Trains all parameters using high-quality cross-modal interaction datasets, extending the maximum sequence length to 64k.
  4. Multimodal Supervised Fine-Tuning (SFT): Enhances the model's instruction-following capabilities across a range of tasks, utilizing a dataset of approximately 17 million data pairs across various modalities.
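
A minimal sketch of how such a staged schedule could be expressed, reusing the OmniModel sketch above. Which parameter groups train in stages 1 and 2, and the sequence lengths before 64k, are illustrative assumptions; the paper only states that the omni-modal stage trains all parameters:

```python
import torch.nn as nn

# (stage name, trainable parameter groups, max sequence length)
STAGES = [
    ("image_text_pretrain",       ["visual_projector", "llm"],   8 * 1024),  # assumed groups
    ("image_audio_text_pretrain", ["audio_embed", "audio_head"], 8 * 1024),  # assumed groups
    ("omni_modal_pretrain",       ["all"],                      64 * 1024),  # paper: 64k context
    ("multimodal_sft",            ["all"],                      64 * 1024),  # ~17M instruction pairs
]

def set_trainable(model: nn.Module, groups: list[str]) -> None:
    """Freeze every parameter, then unfreeze the requested submodule groups."""
    for param in model.parameters():
        param.requires_grad = False
    for name, param in model.named_parameters():
        if "all" in groups or any(name.startswith(g) for g in groups):
            param.requires_grad = True

# Hypothetical driver; run_stage and the per-stage datasets are placeholders.
# for stage_name, groups, max_len in STAGES:
#     set_trainable(model, groups)
#     run_stage(model, dataset_for(stage_name), max_seq_len=max_len)
```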

The model was evaluated across text, image, video, audio, medical, and omni benchmarks against proprietary models (GPT4o mini and GPT4o), open-source general models (MAP-Neo, Qwen1.5-Chat, Llama3-Instruct, and OLMo), and open-source omni-modal models (VITA-1.0, VITA-1.5, Baichuan-Omni, and MiniCPM-o 2.6).

On pure-text benchmarks (MMLU, CMMLU, AGIEval, C-Eval, and GAOKAO-Bench), Baichuan-Omni-1.5 performs strongly; on MMLU, for example, it reaches 72.2%, compared with 67.1% for Llama3-Instruct.

On image benchmarks (MMBench-EN, MMBench-CN, SEEDBench, RealWorldQA, MMMU, MathVista, TextVQA, OCRBench, ChartQA, and HallusionBench), the model outperforms the latest open-source models, VITA-1.5 and MiniCPM-o 2.6, on most tasks.

On video benchmarks (Perception-Test, MVBench, VideoMME, EgoSchema, ActivityNet-QA, and MSVD-QA), Baichuan-Omni-1.5 performs comparably to proprietary models on EgoSchema and VideoMME and is strong among open-source multimodal models.

On audio benchmarks, the evaluation uses OpenAudioBench, which comprises Reasoning QA, Llama Questions, Web Questions, TriviaQA, and AlpacaEval. In the speech-to-text (s→t) setting, Baichuan-Omni-1.5 significantly outperforms models of the same size on Reasoning QA and AlpacaEval, scoring 50 and 7.79, respectively.

Omni-modal evaluation uses OmniBench under four input setups: (1) Image + Audio, (2) Image Caption + Audio, (3) Image + Audio Transcript, and (4) Image Caption + Audio Transcript. Baichuan-Omni-1.5 outperforms the omni-modal model MiniCPM-o 2.6 in three of the four settings.
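
Read as a protocol, the four setups ablate whether each modality is given raw or as its text surrogate. A tiny sketch of how the input conditions might be assembled (the field names are hypothetical):

```python
def omnibench_conditions(image, caption, audio, transcript):
    """Enumerate the four OmniBench input conditions described above."""
    return [
        {"setup": "image + audio",        "image": image,  "audio": audio},
        {"setup": "caption + audio",      "text": caption, "audio": audio},
        {"setup": "image + transcript",   "image": image,  "text": transcript},
        {"setup": "caption + transcript", "text": caption + "\n" + transcript},
    ]
```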

Medical evaluation uses GMAI-MMBench and OpenMM-Medical, and Baichuan-Omni-1.5 achieves the highest performance on both. On OpenMM-Medical, it scores 83.8%, compared with 73.6% for MiniCPM-o 2.6.
