Fréchet Wavelet Distance: A Domain-Agnostic Metric for Image Generation

Published 23 Dec 2023 in cs.CV, cs.LG, and eess.IV | (2312.15289v3)

Abstract: Modern metrics for generative learning like Fréchet Inception Distance (FID) and DINOv2-Fréchet Distance (FD-DINOv2) demonstrate impressive performance. However, they suffer from various shortcomings, like a bias towards specific generators and datasets. To address this problem, we propose the Fréchet Wavelet Distance (FWD) as a domain-agnostic metric based on the Wavelet Packet Transform ($W_p$). FWD provides a sight across a broad spectrum of frequencies in images with a high resolution, preserving both spatial and textural aspects. Specifically, we use $W_p$ to project generated and real images to the packet coefficient space. We then compute the Fréchet distance with the resultant coefficients to evaluate the quality of a generator. This metric is general-purpose and dataset-domain agnostic, as it does not rely on any pre-trained network, while being more interpretable due to its ability to compute Fréchet distance per packet, enhancing transparency. We conclude with an extensive evaluation of a wide variety of generators across various datasets that the proposed FWD can generalize and improve robustness to domain shifts and various corruptions compared to other metrics.

Summary

The paper introduces the WPSKL metric to address FID's biases by integrating spatial and frequency-domain insights.
It computes the KL divergence between wavelet packet power spectra of real and synthesized images, capturing subtle differences that conventional methods miss.
Experimental results on datasets like CIFAR10 and CelebAHQ show WPSKL’s superior robustness and alignment with human perception.

Wavelet Packet Power Spectrum Kullback-Leibler Divergence: A New Metric for Image Synthesis

In this study, the authors address the limitations inherent in current metrics used to evaluate generative neural networks, particularly focusing on the Fréchet Inception Distance (FID). FID, while popular, has numerous weaknesses such as bias towards specific image datasets, susceptibility to slight numerical changes, and an overemphasis on low-frequency information. This paper proposes an alternative metric, the Wavelet Packet Power Spectrum Kullback-Leibler Divergence (WPSKL), which aims to provide a more comprehensive evaluation by integrating both spatial and frequency-domain insights.

Problem Motivation and Existing Shortcomings

Existing metrics like FID depend heavily on pre-trained neural networks, which introduces biases that affect their accuracy across diverse datasets and architectures. FID also assumes that the extracted features follow a Gaussian distribution, which may not always hold, and it is sensitive to computational details such as image resizing. These limitations lead to inconsistent results: small pixel changes that human observers cannot detect can cause large fluctuations in FID scores.
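To make the Gaussian assumption concrete: the Fréchet distance between two Gaussians with means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ is $\|\mu_r - \mu_g\|^2 + \mathrm{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$. The sketch below computes this quantity with NumPy for two arbitrary feature arrays. It is an illustration only, not the reference FID implementation; the function name `frechet_distance` is hypothetical, and the matrix square root is replaced by an eigenvalue-based trace, which gives the same value for covariance matrices.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) arrays.

    Assumes dim >= 2 and enough samples for a meaningful covariance estimate.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (clipped here against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Only the first two moments of the features enter the score, so any structure in the feature distribution beyond mean and covariance is ignored. That is the sense in which the Gaussian assumption matters.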

Proposed Solution: WPSKL

WPSKL leverages the Wavelet Packet Transform (WPT) to evaluate images by capturing both spatial and frequency information. Wavelets provide a fine-grained analysis of image data by decomposing it into a set of high- and low-frequency components, preserving spatial attributes while offering frequency analysis. The WPSKL metric uses the wavelet power spectrum to compute the Kullback-Leibler (KL) divergence between the distributions of real and synthesized images. This frequency-domain focus allows for the detection of differences that are invisible to FID, particularly in cases where images are perceptually similar yet differ in frequency content.
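The pipeline described above can be sketched as follows. This is one plausible reading of it and not the authors' implementation. It assumes grayscale images with side lengths divisible by `2**level`, a Haar wavelet, mean squared coefficients per packet as the "power spectrum", and a one-directional KL divergence. The summary specifies none of these choices, and all function names are hypothetical.

```python
import numpy as np


def haar_split(x):
    """One orthonormal 2D Haar analysis step over the last two axes.

    Returns four half-resolution sub-bands (low-low plus three detail bands).
    """
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    return [
        (a + b + c + d) / 2,
        (a - b + c - d) / 2,
        (a + b - c - d) / 2,
        (a - b - c + d) / 2,
    ]


def wavelet_packets(images, level=2):
    """Full wavelet packet tree: every sub-band is split again at each level.

    images: (n, height, width) -> (n, 4**level, height / 2**level, width / 2**level)
    """
    packets = [images]
    for _ in range(level):
        packets = [sub for p in packets for sub in haar_split(p)]
    return np.stack(packets, axis=1)


def packet_power_distribution(images, level=2, eps=1e-12):
    """Mean squared coefficient per packet, normalized to sum to one."""
    coeffs = wavelet_packets(images, level)
    power = (coeffs ** 2).mean(axis=(0, 2, 3)) + eps
    return power / power.sum()


def wpskl_sketch(real, fake, level=2):
    """KL divergence between the packet power distributions of two image sets."""
    p = packet_power_distribution(real, level)
    q = packet_power_distribution(fake, level)
    return float(np.sum(p * np.log(p / q)))
```

With this sketch, two image sets with identical packet power give a score of zero, while blurring one set shifts power away from the high-frequency packets and yields a positive score. Because no pre-trained network is involved, the score depends only on the images and on the chosen wavelet and decomposition level.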

Experimental Evaluation

The paper systematically applies the WPSKL metric across various generative models, including both GANs and diffusion models, using datasets like CIFAR10, CelebAHQ, and LSUN. The results indicate that WPSKL aligns more consistently with human perception than existing metrics. Notably, it is more robust to slight image perturbations, remaining stable in cases where numerical rounding or dataset-specific biases distort FID. Furthermore, a user study supports the reliability of WPSKL, as its rankings tend to agree with human judgments of image quality.

Implications and Future Directions

WPSKL offers a significant improvement in evaluating generative models, particularly in scenarios where high-level feature-based assessments such as FID are inadequate. By accounting for both spatial and frequency details, WPSKL opens new avenues for improving generative model architecture and training. Future work could explore the integration of WPSKL in real-time generative applications or its adaptation to different domains, such as video synthesis, where temporal-frequency information could further enhance its applicability.

Overall, the authors propose a rigorous, frequency-sensitive metric that can serve as a more reliable benchmark for the ongoing evaluation and enhancement of generative neural networks, promising more accurate assessments that align with human visual interpretation.
