Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Published 27 Feb 2024 in cs.CV, cs.MM, cs.SD, and eess.AS | (2402.17723v1)

Abstract: Video and audio content creation serves as the core technique for the movie industry and professional users. Recently, existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry. In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation. We observe the powerful generation ability of off-the-shelf video or audio generation models. Thus, instead of training the giant models from scratch, we propose to bridge the existing strong models with a shared latent representation space. Specifically, we propose a multimodality latent aligner with the pre-trained ImageBind model. Our latent aligner shares a similar core as the classifier guidance that guides the diffusion denoising process during inference time. Through carefully designed optimization strategy and loss functions, we show the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at https://yzxing87.github.io/Seeing-and-Hearing/

Summary

The paper proposes a novel diffusion latent aligner that synchronizes visual and audio modalities by leveraging a shared embedding from pre-trained models.
The method combines existing single-modality models without extensive retraining and reports better scores on metrics such as FVD, KVD, and AV-align.
Experiments demonstrate significant improvements in semantic coherence and fidelity of generated multimodal content, offering a versatile approach for AI-driven multimedia creation.

Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Introduction

The paper addresses open-domain visual-audio generation: creating video and audio content that match each other. The task matters for content creation, where sound and picture must agree. Rather than training a new multimodal model, the authors build on existing, high-performance, single-modality generation models. They connect these models through a shared latent representation space, using a Multimodality Latent Aligner built on the pre-trained ImageBind model. The result is a versatile and resource-efficient approach to joint visual-audio generation that the authors report improves on existing methods.

Methods

Problem Formulation

The authors propose an optimization framework that integrates different modalities into a coherent generation process without requiring large-scale dataset training for new modalities. The process hinges on the concept of a Diffusion Latent Aligner, which uses the shared embedding space of ImageBind to guide the generation towards alignment with input conditions. This aligner acts during the denoising steps of the diffusion process, modifying latent variables to ensure compatibility between generated video and audio, or between any input and target modalities.
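One way to write the guidance described above is given below. This is a sketch under stated assumptions, not the paper's exact formulation. All symbols are notation introduced here for illustration: $z_t$ is the latent at denoising step $t$, $D$ decodes the latent into the generated modality, $E_x$ and $E_c$ are the ImageBind encoders for the generated modality and the condition $c$, $d$ is a distance in the shared embedding space, and $\lambda$ is a step size.

```latex
\mathcal{L}(z_t) = d\bigl(E_x(D(z_t)),\; E_c(c)\bigr),
\qquad
z_t \leftarrow z_t - \lambda \, \nabla_{z_t} \mathcal{L}(z_t)
```

Read this way, each denoising step is followed by a small correction that moves the latent toward content whose embedding lies closer to that of the condition, in the same spirit as classifier guidance.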

Diffusion Latent Aligner

The core of the method, the Diffusion Latent Aligner, injects alignment information during the generative process. It measures the distance between the generated content and the input condition in the ImageBind embedding space, then uses that distance as feedback to adjust the generation trajectory. Because the guidance is applied at inference time, it exploits ImageBind's multimodal embeddings without additional resource-intensive retraining.
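The feedback loop described above can be illustrated with a toy example. The sketch below is not the authors' implementation: a random linear map `W` stands in for the decoder plus ImageBind encoder, `target` stands in for the embedding of the input condition, and `denoise` is a placeholder for a pre-trained model's denoising step. All of these names, the cosine distance, and the step size are assumptions chosen for illustration.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def alignment_grad(z, W, target):
    """Gradient of cosine_distance(W @ z, target) with respect to z."""
    e = W @ z
    ne, nt = np.linalg.norm(e), np.linalg.norm(target)
    cos = float(e @ target) / (ne * nt)
    # Derivative of the cosine similarity with respect to e,
    # then the chain rule through the linear stand-in encoder e = W z.
    dcos_de = target / (ne * nt) - cos * e / ne**2
    return -(W.T @ dcos_de)

def guided_denoise(z, W, target, steps=50, lr=0.1, denoise=lambda z, t: z):
    """Alternate a (placeholder) denoising step with an alignment update."""
    for t in reversed(range(steps)):
        z = denoise(z, t)                          # hypothetical pre-trained step
        z = z - lr * alignment_grad(z, W, target)  # pull latent toward condition
    return z

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))    # stand-in for decoder + embedding encoder
target = rng.normal(size=8)     # stand-in for the condition's embedding
z0 = rng.normal(size=16)        # initial latent

z1 = guided_denoise(z0, W, target)
before = cosine_distance(W @ z0, target)
after = cosine_distance(W @ z1, target)
print(f"distance before: {before:.4f}, after: {after:.4f}")
```

In a real system the gradient would come from automatic differentiation through the generator and the embedding model rather than from a closed-form expression, and the base models would stay frozen throughout.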

Experiments

The authors validate the framework on four tasks: video-to-audio, audio-to-video, joint video-audio, and image-to-audio generation. Across these tasks, the framework generates aligned, high-quality multimodal content. The results show significant improvements on metrics such as Fréchet Video Distance (FVD), Kernel Video Distance (KVD), and audio-video alignment (AV-align), indicating higher fidelity and better semantic coherence in the generated content.
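For readers unfamiliar with FVD, the quantity underneath it is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated samples. The sketch below computes that distance with NumPy on random stand-in features. It is not the paper's evaluation code: an actual FVD computation extracts the features from a pre-trained video network, and the function and variable names here are hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (clipping guards against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))               # stand-in "real" features
same = frechet_distance(real, real)            # identical sets: ~0
shifted = frechet_distance(real, real + 3.0)   # mean shifted by 3 per dimension
print(same, shifted)
```

Lower values mean the generated feature distribution sits closer to the real one. Shifting every feature by a constant leaves the covariance unchanged, so only the squared mean difference contributes to the distance in the second call.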

Discussion and Future Directions

Implications

This research offers a practical route to multimodal content generation, with reported gains in both alignment and quality. Because it reuses existing pre-trained models instead of training new large ones, it is a cost-effective and flexible methodology for visual-audio generation tasks.

Limitations and Future Work

While the framework achieves strong performance, it inherits the limitations of the base generative models it employs, so improvements to those foundation models should carry over directly. Applying the method to additional modalities, or to more constrained or specific domains, is a further direction for research.

Conclusion

This paper presents a novel framework for open-domain visual-audio content generation that bridges the gap between pre-existing single-modality models through a shared, multimodal latent space. The approach demonstrates significant advancements in generating semantically aligned and high-quality multimodal content, marking a notable contribution to the field of AI-driven multimedia creation.