ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

Published 18 May 2023 in cs.CV, cs.CL, cs.SD, and eess.AS | (2305.11172v1)

Abstract: In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.

Citations (88)

Summary

  • The paper introduces a 4-billion parameter unified model that integrates vision, audio, and language using modality adapters and shared self-attention layers.
  • It employs cross-modal contrastive and intra-modal denoising contrastive tasks to align semantic spaces across modalities and sharpen feature extraction, without relying on pretrained models for initialization.
  • Extensive experiments demonstrate state-of-the-art performance on benchmarks such as 89.8% accuracy on ImageNet and 63.0% mIoU on ADE20K.

Overview of the ONE-PEACE Model for Multi-Modal Integration

The paper "ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities" introduces a comprehensive approach for building a general representation model that seamlessly integrates and aligns data across multiple modalities, specifically vision, audio, and language. ONE-PEACE, endowed with 4 billion parameters, emphasizes scalability and extensibility, making it capable of potentially expanding to unlimited modalities.

Architectural Design

The architecture of ONE-PEACE consists of modality adapters, shared self-attention layers, and modality-specific feed-forward networks (FFNs). This design facilitates adaptability, allowing new modalities to be incorporated by adding modality-specific components while leveraging shared layers for cross-modal integration.

  • Modality Adapters: These adapters process raw input data into feature sequences for vision, audio, and language. Each modality uses distinct transformation strategies suitable for its data type.
  • Modality Fusion Encoder: Incorporates shared self-attention layers enabling interaction across modalities and modality-specific FFNs for intra-modal information extraction.
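The adapter-plus-shared-encoder layout above can be sketched as follows. This is a hypothetical, heavily simplified illustration (single-head attention, random weights, arbitrary dimensions), not the paper's implementation: the point is only that adapters and FFNs are per-modality while the attention parameters are shared.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared hidden size (illustrative)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Shared self-attention parameters (one head for brevity)
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))

# Modality-specific adapters: raw features -> shared width D
adapters = {
    "vision":   rng.standard_normal((16, D)) * 0.1,  # e.g. image patches
    "audio":    rng.standard_normal((12, D)) * 0.1,  # e.g. waveform features
    "language": rng.standard_normal((10, D)) * 0.1,  # e.g. token embeddings
}

# Modality-specific FFNs (expand, then project back)
ffns = {m: (rng.standard_normal((D, 2 * D)) * 0.1,
            rng.standard_normal((2 * D, D)) * 0.1) for m in adapters}

def encode(modality, raw):
    """Adapter -> shared self-attention -> modality FFN (with residuals)."""
    x = raw @ adapters[modality]                      # (seq, D)
    attn = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(D))
    x = x + attn @ (x @ Wv)                           # shared attention
    W1, W2 = ffns[modality]
    return x + np.maximum(x @ W1, 0) @ W2             # modality FFN

out = encode("vision", rng.standard_normal((5, 16)))
print(out.shape)  # (5, 8)
```

Because only the adapter and FFN entries are modality-specific, extending to a new modality amounts to adding one entry to each dictionary; the attention weights are reused unchanged.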

Pretraining Strategy

ONE-PEACE employs two innovative, modality-agnostic pretraining tasks:

  1. Cross-Modal Contrastive Learning: This task aligns the semantic spaces of different modalities with a contrastive objective, maximizing the similarity of matched pairs while minimizing it for mismatched ones, without relying on pretrained models for initialization.
  2. Intra-Modal Denoising Contrastive Learning: Enhances fine-grained feature extraction within each modality by combining masked prediction with contrastive learning, which improves fine-tuning performance on downstream tasks.
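Both tasks above reduce to variants of an InfoNCE-style objective with positives on the diagonal of a similarity matrix. The sketch below is an assumed, simplified form (temperatures, shapes, and the exact construction of masked/target features are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def nce_loss(sim):
    """Cross-entropy where the positive for row i is column i."""
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).mean()

def cross_modal_contrast(emb_a, emb_b, temp=0.07):
    """Align global embeddings of paired samples from two modalities."""
    sim = l2norm(emb_a) @ l2norm(emb_b).T / temp
    return (nce_loss(sim) + nce_loss(sim.T)) / 2  # symmetric

def denoising_contrast(masked_feats, target_feats, temp=0.4):
    """Contrast each masked unit's predicted feature against the
    corresponding full-view targets (positives on the diagonal)."""
    sim = l2norm(masked_feats) @ l2norm(target_feats).T / temp
    return nce_loss(sim)

img, txt = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
loss = cross_modal_contrast(img, txt)
print(loss)  # a positive scalar loss
```

The cross-modal term operates on sequence-level embeddings across modalities, while the denoising term operates on fine-grained unit features within one modality; both are modality-agnostic, which is what makes the recipe extensible.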

Experimental Insights

The effectiveness of ONE-PEACE is validated through extensive experimentation across several uni-modal and multi-modal tasks, demonstrating superior or competitive performance on datasets like ImageNet for image classification, ADE20K for semantic segmentation, and various audio and vision-language benchmarks. Noteworthy numerical results include:

  • Image Classification: Achieved 89.8% accuracy on ImageNet without using any pretrained model for initialization.
  • Semantic Segmentation: Attained 63.0% mIoU on ADE20K.
  • Audio-Text Retrieval and Audio Classification: Outperformed previous state-of-the-art models by significant margins on datasets such as AudioCaps and ESC-50.

Implications and Future Directions

The development of ONE-PEACE marks a critical step towards creating highly extensible and unified models that can handle increasingly complex and diverse data modalities. The model's architecture allows for seamless integration of new modalities, which holds potential for future applications in AI systems requiring multi-modal understanding.

The research addresses the challenge of integrating distinct modalities by leveraging a shared architecture for effective cross-modal interaction. Future work could explore additional modalities, such as video or 3D data, as well as integration with LLMs to enhance language-based interactions.

In conclusion, ONE-PEACE represents a significant stride towards realizing general representation models that can concurrently process and integrate multiple data modalities, paving the way for more intelligent and versatile AI applications.
