Baichuan-Omni-1.5 Technical Report

Published 26 Jan 2025 in cs.CL, cs.SD, and eess.AS | (2501.15368v1)

Abstract: We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining approximately 500B tokens of high-quality data (text, audio, and vision). Second, an audio tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLMs. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.

Summary

  • The paper introduces Baichuan-Omni-1.5, a model that integrates end-to-end audio generation with robust visual and textual processing using a multi-stage training strategy.
  • It employs a custom Baichuan-Audio-Tokenizer and flow matching-based decoder to capture both semantic and acoustic details, enhancing multimodal interactions.
  • It achieves superior performance on benchmarks like OpenMM-Medical (83.8%) and MMLU (72.2%), demonstrating its competitive edge over similar omni-modal models.

The paper introduces Baichuan-Omni-1.5, a novel omni-modal model featuring end-to-end audio generation capabilities. The model leverages approximately 500B tokens of multimodal data, an audio-tokenizer (Baichuan-Audio-Tokenizer), and a multi-stage training strategy to achieve seamless, high-quality interaction across modalities without compromising individual modality performance. Baichuan-Omni-1.5 exhibits competitive performance, rivaling models like Qwen2-VL-72B, particularly in multimodal medical benchmarks.

The key components and contributions include:

  • A comprehensive data cleaning and synthesis pipeline for multimodal data.
  • Baichuan-Audio-Tokenizer to capture both semantic and acoustic information from audio, enhancing compatibility with MLLMs.
  • A multi-stage training strategy for effective synergy across all modalities.
  • OpenAudioBench, an open-source audio understanding and generation benchmark for evaluating end-to-end audio capabilities.
  • OpenMM-Medical, a comprehensive medical understanding benchmark on which Baichuan-Omni-1.5, using a 7B LLM, achieves SOTA performance, scoring 83.8% and surpassing Qwen2-VL-72B's 80.7%.

The architecture of Baichuan-Omni-1.5 comprises a visual branch, an audio branch, and a pre-trained LLM backbone. The visual branch employs NaViT, similar to Qwen2-VL, for processing image and video inputs, along with a two-layer MLP visual projector. The audio branch incorporates the Baichuan-Audio-Tokenizer and a flow matching-based decoder for end-to-end speech processing. The Baichuan-Audio-Tokenizer is based on Residual Vector Quantization (RVQ) with a frame rate of 12.5 Hz.
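
To make the wiring concrete, here is a minimal PyTorch sketch of the three branches. The class names, dimensions, and the injected vision_encoder, audio_tokenizer, and llm interfaces are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Two-layer MLP mapping vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class OmniModel(nn.Module):
    """Hypothetical wiring of the visual branch, audio branch, and LLM backbone."""
    def __init__(self, vision_encoder, audio_tokenizer, llm,
                 vision_dim: int = 1152, llm_dim: int = 4096, audio_vocab: int = 8192):
        super().__init__()
        self.vision_encoder = vision_encoder        # NaViT-style, variable-resolution input
        self.visual_projector = VisualProjector(vision_dim, llm_dim)
        self.audio_tokenizer = audio_tokenizer      # RVQ codec, ~12.5 Hz frame rate
        self.audio_embed = nn.Embedding(audio_vocab, llm_dim)  # audio embedding layer
        self.llm = llm                              # pre-trained text backbone
        self.audio_head = nn.Linear(llm_dim, audio_vocab)      # independent audio head

    def forward(self, images, waveform, text_embeds):
        vis = self.visual_projector(self.vision_encoder(images))       # [B, Nv, llm_dim]
        aud = self.audio_embed(self.audio_tokenizer.encode(waveform))  # [B, Na, llm_dim]
        seq = torch.cat([vis, aud, text_embeds], dim=1)  # real models interleave by position
        hidden = self.llm(inputs_embeds=seq)             # assumed to return final hidden states
        # Predicted discrete audio tokens would be rendered to waveform
        # by the flow matching-based decoder (not shown here).
        return self.audio_head(hidden)
```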

The training strategy involves a multi-stage approach (a configuration sketch follows the list):

  1. Image-Text Pretrain: Extends an LLM to process and understand visual input.
  2. Image-Audio-Text Pretrain: Expands the LLM to understand audio data in an end-to-end manner, incorporating the Baichuan-Audio-Tokenizer, a newly introduced audio embedding layer, and an independent audio head.
  3. Omni-Modal Pretrain: Trains all parameters using high-quality cross-modal interaction datasets, extending the maximum sequence length to 64k.
  4. Multimodal Supervised Fine-Tuning (SFT): Enhances the model's instruction-following capabilities across a range of tasks, utilizing a dataset of approximately 17 million data pairs across various modalities.
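
A minimal sketch of how such a staged schedule could be expressed, reusing the OmniModel sketch above. Which parameter groups train in stages 1 and 2, and the sequence lengths before 64k, are illustrative assumptions; the paper only states that the omni-modal stage trains all parameters:

```python
import torch.nn as nn

# (stage name, trainable parameter groups, max sequence length)
STAGES = [
    ("image_text_pretrain",       ["visual_projector", "llm"],   8 * 1024),  # assumed groups
    ("image_audio_text_pretrain", ["audio_embed", "audio_head"], 8 * 1024),  # assumed groups
    ("omni_modal_pretrain",       ["all"],                      64 * 1024),  # paper: 64k context
    ("multimodal_sft",            ["all"],                      64 * 1024),  # ~17M instruction pairs
]

def set_trainable(model: nn.Module, groups: list[str]) -> None:
    """Freeze every parameter, then unfreeze the requested submodule groups."""
    for param in model.parameters():
        param.requires_grad = False
    for name, param in model.named_parameters():
        if "all" in groups or any(name.startswith(g) for g in groups):
            param.requires_grad = True

# Hypothetical driver; run_stage and the per-stage datasets are placeholders.
# for stage_name, groups, max_len in STAGES:
#     set_trainable(model, groups)
#     run_stage(model, dataset_for(stage_name), max_seq_len=max_len)
```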

The model was evaluated across text, image, video, audio, medical, and omni benchmarks against proprietary models (GPT4o mini and GPT4o), open-source general models (MAP-Neo, Qwen1.5-Chat, Llama3-Instruct, and OLMo), and open-source omni-modal models (VITA-1.0, VITA-1.5, Baichuan-Omni, and MiniCPM-o 2.6).

On pure-text benchmarks (MMLU, CMMLU, AGIEval, C-Eval, and GAOKAO-Bench), Baichuan-Omni-1.5 performs strongly; on MMLU, for example, it reaches 72.2%, compared with 67.1% for Llama3-Instruct.

On image benchmarks (MMBench-EN, MMBench-CN, SEEDBench, RealWorldQA, MMMU, MathVista, TextVQA, OCRBench, ChartQA, and HallusionBench), the model outperforms the latest open-source models, VITA-1.5 and MiniCPM-o 2.6, on most tasks.

On video benchmarks (Perception-Test, MVBench, VideoMME, EgoSchema, ActivityNet-QA, and MSVD-QA), Baichuan-Omni-1.5 performs comparably to proprietary models on EgoSchema and VideoMME and is strong among open-source multimodal models.

On audio benchmarks, the evaluation uses OpenAudioBench, which comprises Reasoning QA, Llama Questions, Web Questions, TriviaQA, and AlpacaEval. In the speech-to-text (s→t) setting, Baichuan-Omni-1.5 significantly outperforms models of the same size on Reasoning QA and AlpacaEval, scoring 50 and 7.79, respectively.

Omni-modal evaluation uses OmniBench under four input setups: (1) Image + Audio, (2) Image Caption + Audio, (3) Image + Audio Transcript, and (4) Image Caption + Audio Transcript. Baichuan-Omni-1.5 outperforms the omni-modal model MiniCPM-o 2.6 in three of the four settings.
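
Read as a protocol, the four setups ablate whether each modality is given raw or as its text surrogate. A tiny sketch of how the input conditions might be assembled (the field names are hypothetical):

```python
def omnibench_conditions(image, caption, audio, transcript):
    """Enumerate the four OmniBench input conditions described above."""
    return [
        {"setup": "image + audio",        "image": image,  "audio": audio},
        {"setup": "caption + audio",      "text": caption, "audio": audio},
        {"setup": "image + transcript",   "image": image,  "text": transcript},
        {"setup": "caption + transcript", "text": caption + "\n" + transcript},
    ]
```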

Medical evaluation uses GMAI-MMBench and OpenMM-Medical, and Baichuan-Omni-1.5 achieves the highest performance on both. On OpenMM-Medical, it scores 83.8%, compared with 73.6% for MiniCPM-o 2.6.
