- The paper introduces a training-free method for ultra-high resolution image synthesis by combining patch-wise DDIM inversion with wavelet-based detail enhancement.
- The method leverages a two-stage pipeline that generates a base image and refines it using discrete wavelet transform to preserve structural coherence and enhance textures.
- The approach outperforms prior methods by improving semantic accuracy and fine detail, benefiting applications in advertising and film production.
HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
Introduction
The paper presents HiWave, an approach to high-resolution image generation that leverages pretrained diffusion models without retraining. Training diffusion models at resolutions beyond 1024x1024 pixels demands extensive computational resources, while existing inference-time techniques suffer from object duplication and spatial incoherence. HiWave addresses these issues with a training-free, zero-shot approach that improves visual fidelity and structural coherence in ultra-high-resolution image synthesis.
Methodology
HiWave operates through a two-stage pipeline: it first generates a base image with a pretrained model, then applies patch-wise denoising diffusion implicit model (DDIM) inversion together with a novel wavelet-based detail enhancement module. The method uses the discrete wavelet transform (DWT) to treat frequency bands separately—preserving low-frequency components to maintain structural consistency while refining high-frequency components to enrich textures and fine detail.
The sampling process starts by generating an initial image with the pretrained diffusion model using standard sampling, in which noise is progressively denoised into a clean image conditioned on the text prompt. The base image is then upscaled with image-domain interpolation and encoded into latent space via a VAE. Patch-wise DDIM inversion maps the upscaled latent back to its noise representation, so that the final generated image maintains global coherence.
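The inversion step above is the standard deterministic DDIM update run forward in time (toward higher noise). The sketch below illustrates this on a toy latent; the fixed `eps` array stands in for the pretrained model's noise prediction and the alpha values are illustrative, not taken from the paper:

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """One deterministic DDIM step between cumulative noise levels.

    With alpha_to < alpha_from this denoises (sampling); with
    alpha_to > alpha_from it re-noises (inversion), mapping a clean
    latent back toward its noise representation.
    """
    # Predicted clean latent implied by the current noise estimate.
    x0 = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    # Move deterministically to the target noise level.
    return np.sqrt(alpha_to) * x0 + np.sqrt(1.0 - alpha_to) * eps

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 4))  # stand-in for one patch latent
eps = rng.standard_normal((4, 4))     # stand-in noise prediction

# Invert (alpha 0.99 -> 0.5), then sample back (0.5 -> 0.99): with a
# consistent eps estimate the round trip reconstructs the latent exactly,
# which is what makes DDIM inversion a coherent patch initialization.
noised = ddim_step(latent, eps, 0.99, 0.5)
recon = ddim_step(noised, eps, 0.5, 0.99)
assert np.allclose(recon, latent)
```

In practice `eps` comes from the pretrained denoiser at each timestep, and the step is applied per patch over the full inversion schedule.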
Implementation Details
- Base Image Generation: A base image is first generated at the model's native resolution (e.g., 1024x1024), then upscaled using Lanczos interpolation to the target high resolution (4096x4096).
- Patch-wise DDIM Inversion: Each patch of the upscaled latent is mapped back to a noise representation by running DDIM forward in time, giving all patches a coherent, mutually consistent initialization.
- Wavelet-Based Detail Enhancement: DWT is employed to guide the generation process. Low-frequency structural components are retained from the base image to ensure coherence, while high-frequency components are dynamically guided, ensuring finer detail synthesis.
- Skip Residuals: These are incorporated during early sampling steps to further preserve the global image structure without suppressing detailed synthesis.
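The frequency split at the heart of the enhancement step can be illustrated with a single-level 2D Haar transform (a stand-in for whichever wavelet the paper uses): low-frequency (LL) coefficients are taken from the base image, while the high-frequency bands (LH/HL/HH) come from the patch being refined. The function names here are illustrative, not the paper's:

```python
import numpy as np

def haar_dwt2(img):
    # Single-level 2D Haar DWT: average/difference over rows, then columns.
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # row low-pass
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # row high-pass
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0      # LL: coarse structure
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0      # LH, HL, HH: fine detail
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, (lh, hl, hh)

def haar_idwt2(ll, bands):
    # Exact inverse of haar_dwt2.
    lh, hl, hh = bands
    h2, w2 = ll.shape
    a = np.empty((h2, w2 * 2))
    d = np.empty_like(a)
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    img = np.empty((h2 * 2, w2 * 2))
    img[0::2, :], img[1::2, :] = a + d, a - d
    return img

def wavelet_blend(base, refined):
    # Keep low-frequency structure from the base image and
    # high-frequency detail from the refined patch.
    ll_base, _ = haar_dwt2(base)
    _, high_refined = haar_dwt2(refined)
    return haar_idwt2(ll_base, high_refined)
```

Because the transform is invertible, blending an image with itself returns it unchanged; in the actual pipeline the high-frequency bands are guided during sampling rather than swapped in a single post-hoc step.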
Results
HiWave effectively generates images at 4096x4096 resolution while maintaining fine details and minimizing artifacts such as duplication. Qualitative comparisons reveal superior perceptual quality over prior methods, achieving high human preference scores in user studies. The method demonstrates a notable improvement in fine detail and semantic accuracy compared to the base images, without retraining or modifying existing model architectures.
HiWave's frequency-aware sampling strategy ensures high-quality synthesis, significantly reducing common pitfalls like artifact accumulation and inconsistency across image patches. The results underscore HiWave's potential for applications demanding high-fidelity images, such as advertising and film production, where ultra-high-resolution outputs are crucial.
Conclusion
HiWave represents a significant advancement in high-resolution image generation by integrating pretrained diffusion models with discrete wavelet transform-based guidance, achieving detailed and coherent outputs without training overhead. Future directions may include exploring optimizations for real-time applications, extensions to video generation, and refining guidance strategies for even richer detail synthesis.