NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

Published 20 Jul 2022 in cs.CV | (2207.09814v2)

Abstract: In this paper, we present NUWA-Infinity, a generative model for infinite visual synthesis, which is defined as the task of generating arbitrarily-sized high-resolution images or long-duration videos. An autoregressive over autoregressive generation mechanism is proposed to deal with this variable-size generation task, where a global patch-level autoregressive model considers the dependencies between patches, and a local token-level autoregressive model considers dependencies between visual tokens within each patch. A Nearby Context Pool (NCP) is introduced to cache related patches already generated as the context for the current patch being generated, which can significantly save computation costs without sacrificing patch-level dependency modeling. An Arbitrary Direction Controller (ADC) is used to decide suitable generation orders for different visual synthesis tasks and learn order-aware positional embeddings. Compared to DALL-E, Imagen and Parti, NUWA-Infinity can generate high-resolution images with arbitrary sizes and support long-duration video generation additionally. Compared to NUWA, which also covers images and videos, NUWA-Infinity has superior visual synthesis capabilities in terms of resolution and variable-size generation. The GitHub link is https://github.com/microsoft/NUWA. The homepage link is https://nuwa-infinity.microsoft.com.




Summary

The paper introduces a dual autoregressive framework that divides visual synthesis into global patch-level and local token-level generation for infinite image and video creation.
It employs a Nearby Context Pool and an Arbitrary Direction Controller to reduce computation costs and manage patch ordering dynamically.
Experimental results show superior performance in FID, CLIP-SIM, Block-FID, and FVD scores compared to baselines like Taming Transformer and MaskGIT.

Overview of "NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis"

The paper "NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis" presents a method for generating high-resolution images and long-duration videos of arbitrary size. Its model, NUWA-Infinity, differs from earlier models such as DALL·E, Imagen, and Parti, which are restricted to fixed-size outputs.

Key Contributions

NUWA-Infinity leverages an autoregressive over autoregressive framework, dissecting the synthesis process into two levels: global patch-level and local token-level generation. This dual-layer approach effectively models dependencies both between patches and within patches, enabling the creation of consistent and detailed visual outputs.

Autoregressive over Autoregressive Mechanism: A global patch-level autoregressive model captures dependencies between patches, while a local token-level autoregressive model captures dependencies between the visual tokens inside each patch. Together they keep large images and long videos consistent.
Nearby Context Pool (NCP): The NCP caches already-generated patches that are related to the patch currently being generated and supplies them as context. This reduces computation without sacrificing patch-level dependency modeling.
Arbitrary Direction Controller (ADC): The ADC decides a suitable patch generation order for each synthesis task and learns order-aware positional embeddings, which supports tasks such as outpainting in different directions.
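
As a rough illustration of how these three pieces could fit together, the Python sketch below runs an outer loop over patches and an inner loop over tokens, with a small cache standing in for the Nearby Context Pool. Everything in it is an assumption made for illustration and is not taken from the paper's code: the names `raster_order`, `NearbyContextPool` (with `select`, `add`, and `evict`), `toy_next_token`, and `generate`, the `extent` parameter, and the fixed raster order used in place of a learned ADC. The token model is a deterministic placeholder, not a transformer.

```python
from typing import Dict, List, Tuple

Patch = List[int]        # a patch is a flat list of visual token ids
Coord = Tuple[int, int]  # (row, col) of a patch in the patch grid


def raster_order(rows: int, cols: int) -> List[Coord]:
    """Stand-in for the ADC: a fixed left-to-right, top-to-bottom order."""
    return [(r, c) for r in range(rows) for c in range(cols)]


class NearbyContextPool:
    """Hypothetical cache of generated patches, limited to nearby ones."""

    def __init__(self, extent: int) -> None:
        self.extent = extent
        self.cache: Dict[Coord, Patch] = {}

    def add(self, coord: Coord, patch: Patch) -> None:
        self.cache[coord] = patch

    def select(self, coord: Coord) -> Dict[Coord, Patch]:
        """Return cached patches within `extent` rows and columns of `coord`."""
        r, c = coord
        return {
            k: v
            for k, v in self.cache.items()
            if abs(k[0] - r) <= self.extent and abs(k[1] - c) <= self.extent
        }

    def evict(self, coord: Coord) -> None:
        """Drop rows that raster order can never select again."""
        r, _ = coord
        for key in [k for k in self.cache if k[0] < r - self.extent]:
            del self.cache[key]


def toy_next_token(context: Dict[Coord, Patch], prefix: Patch, vocab_size: int) -> int:
    """Placeholder for the local token-level model (deterministic, not learned)."""
    seen = sum(sum(p) for p in context.values()) + sum(prefix)
    return (seen + len(prefix) + 1) % vocab_size


def generate(rows: int, cols: int, tokens_per_patch: int = 4,
             vocab_size: int = 16, extent: int = 1):
    pool = NearbyContextPool(extent)
    grid: Dict[Coord, Patch] = {}
    max_cached = 0
    for coord in raster_order(rows, cols):      # global, patch-level loop
        pool.evict(coord)
        context = pool.select(coord)            # only nearby patches are used
        patch: Patch = []
        for _ in range(tokens_per_patch):       # local, token-level loop
            patch.append(toy_next_token(context, patch, vocab_size))
        pool.add(coord, patch)
        grid[coord] = patch
        max_cached = max(max_cached, len(pool.cache))
    return grid, max_cached


if __name__ == "__main__":
    grid, max_cached = generate(rows=6, cols=5)
    print(len(grid), "patches generated; at most", max_cached, "cached at once")
```

In this toy setting the cache holds at most two rows of patches at any time, regardless of how many rows are generated. That is the intuition behind caching only nearby context: the cost of generating one patch does not grow with the total output size.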

Experimental Evaluation

The model is evaluated across five tasks: Unconditional Image Generation (HD), Text-to-Image (HD), Image Outpainting (HD), Image Animation (HD), and Text-to-Video (HD). Notably, NUWA-Infinity outperforms alternative approaches like Taming Transformer and MaskGIT in generating high-resolution imagery with improved visual quality and semantic consistency.

For Text-to-Image (HD), NUWA-Infinity demonstrates robust performance with significant improvements in FID and CLIP-SIM scores, even when generated outputs extend significantly beyond training image dimensions.
In Image Outpainting (HD), the model illustrates superior capability in directional image extension, achieving better Block-FID scores compared to baselines.
The Image Animation (HD) task showcases NUWA-Infinity's proficiency in generating temporally consistent video outputs, evidenced by lower FVD scores.
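
To make the idea of directional extension in outpainting concrete, the sketch below orders the unknown patches of a grid by their distance from the patches that are already known, so that each new patch always has a generated neighbor to condition on. This is only one plausible ordering rule, assumed here for illustration. The function name `outward_order` and the breadth-first strategy are hypothetical and are not claimed to be how the paper's ADC chooses its orders.

```python
from collections import deque
from typing import List, Set, Tuple

Coord = Tuple[int, int]  # (row, col) of a patch in the patch grid


def outward_order(rows: int, cols: int, known: Set[Coord]) -> List[Coord]:
    """Order unknown patches by breadth-first distance from the known region."""
    queue = deque(sorted(known))
    seen = set(known)
    order: List[Coord] = []
    while queue:
        r, c = queue.popleft()
        # Visit the up, right, down, and left neighbors of the current patch.
        for dr, dc in ((-1, 0), (0, 1), (1, 0), (0, -1)):
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


if __name__ == "__main__":
    # Extend a single known patch to the right along one row.
    print(outward_order(1, 4, {(0, 0)}))
    # Extend a known center patch in all directions on a 3x3 grid.
    print(outward_order(3, 3, {(1, 1)}))
```

The position of each patch in the returned list could serve as the index for an order-aware positional embedding, though how such embeddings are actually learned is outside what this sketch shows.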

Implications and Future Directions

The advancement presented by NUWA-Infinity is pertinent for applications requiring scalable and varied visual content generation, such as virtual design, multimedia production, and augmented reality. Its ability to seamlessly extend images and construct long-duration videos while maintaining high fidelity is particularly advantageous in these domains.

Future developments could focus on optimizing the model’s computational efficiency further, potentially integrating non-autoregressive elements to accelerate inference time. Additionally, expansion of training datasets could enhance the model’s generalization capabilities, thereby facilitating broader real-world applicability.

Conclusion

The introduction of NUWA-Infinity marks a significant progression in visual synthesis technology, addressing limitations in scalability and resolution found in previous models. Through its autoregressive framework, NUWA-Infinity not only improves upon existing methods but also sets a foundation for future research into infinitely scalable visual content generation.