
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Published 13 Jun 2024 in cs.CV (arXiv:2406.09399v1)

Abstract: Tokenizer, serving as a translator to map the intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window and causal attention for spatial and temporal modeling. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data on a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data on multiple resolutions to learn the temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a unified framework and proves the possibility of realizing their synergy. Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. Additionally, we also show that when integrated with OmniTokenizer, both LLM-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method. Code is available at https://github.com/FoundationVision/OmniTokenizer.


Summary

  • The paper introduces OmniTokenizer, a unified tokenizer that integrates spatial and temporal processing for joint image and video tokenization.
  • It employs a spatial-temporal decoupled architecture with window and causal attention, achieving a 13% lower reconstruction FID on ImageNet and a 26% improvement on UCF-101.
  • The model’s progressive training approach and unified framework enable scalable deployment in generative tasks across both static and dynamic visual content.

A Joint Image-Video Tokenizer for Visual Generation

This paper introduces OmniTokenizer, a novel transformer-based tokenizer designed for joint image and video tokenization, addressing key limitations of previous approaches that rely on separate tokenizers for different visual modalities. The research offers a comprehensive solution that enhances visual generation models by integrating spatial and temporal learning within a unified architecture.

Methodology and Design

At the core of OmniTokenizer is a spatial-temporal decoupled architecture, which employs window attention in the spatial dimension to enable efficient local aggregation and causal attention in the temporal dimension for coherent motion modeling. Unlike conventional methods that rely on separate frameworks for images and videos, OmniTokenizer is built on a shared architecture that leverages a progressive training paradigm to harness the complementary nature of these data types.
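The two attention patterns above can be illustrated with a minimal NumPy sketch: patches within each frame are grouped into non-overlapping windows for local spatial attention, while a lower-triangular mask restricts temporal attention to past frames. The window size and tensor shapes here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def window_partition(frames, window):
    """Split each frame's patch grid into non-overlapping windows.

    frames: (T, H, W, C) patch embeddings. Spatial window attention
    is computed independently inside each (window x window) block.
    """
    T, H, W, C = frames.shape
    x = frames.reshape(T, H // window, window, W // window, window, C)
    # -> (T * num_windows, window*window, C): one attention group per window
    return x.transpose(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)

def causal_mask(t):
    """Lower-triangular mask: frame i may attend only to frames j <= i."""
    return np.tril(np.ones((t, t), dtype=bool))

frames = np.zeros((4, 8, 8, 16))        # 4 frames, 8x8 patch grid, dim 16
groups = window_partition(frames, window=4)
print(groups.shape)                     # (16, 16, 16): 4 frames x 4 windows
print(causal_mask(3).astype(int))       # [[1 0 0] [1 1 0] [1 1 1]]
```

In the full model these groupings would feed standard multi-head attention; the sketch only shows how tokens are partitioned and masked.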

The proposed training strategy involves two stages: an initial image-only training phase to build spatial encoding capabilities, followed by a multi-resolution joint training phase with both image and video data, which promotes temporal modeling proficiency. Through this progressive learning approach, OmniTokenizer unifies the tokenization process across visual modalities, facilitating more versatile and scalable generative models.
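The two-stage schedule can be sketched as a simple step generator: stage 1 yields image-only batches at a fixed resolution, and stage 2 mixes image and video batches at randomly sampled resolutions. The function name, step counts, and the concrete resolutions are illustrative assumptions, not values from the paper.

```python
import random

def progressive_schedule(image_steps, joint_steps,
                         resolutions=(128, 256, 384), seed=0):
    """Yield (stage, data_type, resolution) for the two-stage training.

    Stage 1: image-only at a fixed resolution to learn spatial encoding.
    Stage 2: mixed image/video batches at sampled resolutions to learn
    temporal dynamics. Resolutions here are hypothetical placeholders.
    """
    rng = random.Random(seed)
    for _ in range(image_steps):
        yield ("stage1", "image", resolutions[0])
    for _ in range(joint_steps):
        yield ("stage2", rng.choice(["image", "video"]),
               rng.choice(resolutions))

steps = list(progressive_schedule(image_steps=3, joint_steps=3))
```

A real training loop would consume this schedule to pick the dataloader and resize transform for each step.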

Empirical Validation

OmniTokenizer's efficacy is demonstrated through extensive experiments on datasets including ImageNet, CelebA-HQ, FFHQ, UCF-101, and Kinetics, where it achieves superior reconstruction performance. Notably, the research reports a 1.11 reconstruction FID on ImageNet and a 42 reconstruction FVD on UCF-101, outperforming existing methods by margins of 13% and 26%, respectively. This enhanced performance is indicative of the effective integration of spatial and temporal dynamics within a single model framework.

Applicability to Generative Models

When incorporated into generative frameworks, OmniTokenizer exhibits marked improvements in visual synthesis. Specifically, LLM-based approaches and diffusion models utilizing OmniTokenizer excel in generation tasks such as class-conditional and unconditional generation, as well as frame prediction. The ability to decode both static images and dynamic video sequences using a consistent set of parameters underlines the potential of the shared framework in achieving high-quality generative outputs.

Implications and Future Directions

The implications of this research are manifold. Practically, OmniTokenizer paves the way for more flexible and efficient deployments of generative models across varied visual media without necessitating distinct models for each modality. Theoretically, it expands our understanding of multi-modal learning and sets the stage for further exploration into scalable and unified architectures for diverse AI applications.

Looking forward, the scalability of OmniTokenizer suggests potential for advancements in efficiency and effectiveness as model and dataset sizes continue to grow. Additionally, future research could explore the extension of this framework to other modalities and tasks, broadening its application scope within artificial intelligence.

In conclusion, OmniTokenizer represents a significant stride in the integration of visual modalities through innovative design and training strategies, offering a robust tool for future generative model developments.
