
Learning to Generate Images with Perceptual Similarity Metrics

Published 19 Nov 2015 in cs.LG and cs.CV | (1511.06409v3)

Abstract: Deep networks are increasingly being applied to problems involving image synthesis, e.g., generating images from textual descriptions and reconstructing an input image from a compact representation. Supervised training of image-synthesis networks typically uses a pixel-wise loss (PL) to indicate the mismatch between a generated image and its corresponding target image. We propose instead to use a loss function that is better calibrated to human perceptual judgments of image quality: the multiscale structural-similarity score (MS-SSIM). Because MS-SSIM is differentiable, it is easily incorporated into gradient-descent learning. We compare the consequences of using MS-SSIM versus PL loss on training deterministic and stochastic autoencoders. For three different architectures, we collected human judgments of the quality of image reconstructions. Observers reliably prefer images synthesized by MS-SSIM-optimized models over those synthesized by PL-optimized models, for two distinct PL measures ($\ell_1$ and $\ell_2$ distances). We also explore the effect of training objective on image encoding and analyze conditions under which perceptually-optimized representations yield better performance on image classification. Finally, we demonstrate the superiority of perceptually-optimized networks for super-resolution imaging. Just as computer vision has advanced through the use of convolutional architectures that mimic the structure of the mammalian visual system, we argue that significant additional advances can be made in modeling images through the use of training objectives that are well aligned to characteristics of human perception.

Citations (167)

Summary

  • The paper demonstrates that using perceptual loss (MS-SSIM) in autoencoders results in images that are rated higher in quality by human observers compared to traditional losses like MSE and MAE.
  • It shows that perceptual similarity metrics not only enhance image reconstruction but also improve feature extraction for tasks such as image classification and super-resolution.
  • The study provides actionable insights on optimizing neural network objectives to align with human visual perception, reducing artifacts and enhancing overall image fidelity.

Summary of "Learning to Generate Images With Perceptual Similarity Metrics"

Introduction

The paper "Learning to Generate Images With Perceptual Similarity Metrics" explores the use of perceptual similarity metrics, in particular, the multiscale structural similarity score (MS-SSIM), as a loss function in image synthesis tasks. The goal is to better align the training of neural networks with human perceptual judgments of image quality, as traditional pixel-wise loss functions such as mean squared error (MSE) often lead to suboptimal perceptual quality. The research compares MS-SSIM with traditional loss measures in deterministic and probabilistic autoencoders, demonstrating superior performance in generating images that humans perceive as high-quality.

Perception-Based Error Metrics

The paper highlights that while pixel-wise loss functions like MSE are computationally convenient, they do not adequately capture human perception of image quality. Prior work on perceptual error metrics, including SSIM and its multiscale counterpart MS-SSIM, offers a more human-aligned way to evaluate image quality. The paper leverages these perceptual metrics to train image-generating networks, focusing on deterministic and stochastic autoencoders in which MS-SSIM serves as the training objective thanks to its differentiability.
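The structural-similarity computation the paper builds on can be sketched in a few lines. The version below is a deliberately simplified sketch: it uses one global statistic per image rather than the standard sliding Gaussian window, and the multiscale variant applies full SSIM at every scale (with the conventional constants and five-scale weights) instead of the contrast-structure factoring used in the original MS-SSIM formulation:

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over two images in [0, 1], using global statistics
    instead of the usual sliding Gaussian window."""
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast_structure = (2 * cov + c2) / (x.var() + y.var() + c2)
    return luminance * contrast_structure

def ms_ssim(x, y, weights=(0.0448, 0.2856, 0.3001, 0.2363, 0.1333)):
    """Simplified MS-SSIM: SSIM at five dyadic scales, combined as a
    weighted geometric mean (scores are clamped at 0 before the
    fractional power)."""
    score = 1.0
    for w in weights:
        score *= max(ssim(x, y), 0.0) ** w
        # 2x average-pool downsample before moving to the next scale
        x = (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4
        y = (y[0::2, 0::2] + y[1::2, 0::2] + y[0::2, 1::2] + y[1::2, 1::2]) / 4
    return score
```

A training loss in the spirit of the paper is then simply `1.0 - ms_ssim(target, reconstruction)`.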

Deterministic Autoencoders

Deterministic autoencoders trained with SSIM or MS-SSIM are shown to produce images that human observers consistently rate higher in quality than those optimized with MSE or mean absolute error (MAE). The study involves fully connected and convolutional architectures trained on diverse datasets, demonstrating robust performance across varying image sizes and network configurations. The results indicate a strong human preference for reconstructions from models optimized with perceptual metrics.

Figure 1: Human judgments of reconstructed images. (a) Fully connected network: Proportion of participants preferring SSIM to MSE for each of 100 image triplets. (b) Deterministic conv. network: Distribution of image quality ranking for MS-SSIM, MSE, and MAE for 1000 images from the STL-10 hold-out set.
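Because SSIM is differentiable, gradient-based training on it is straightforward, as the paper exploits. The toy loop below illustrates this by gradient ascent on a simplified global-statistics SSIM (an autodiff framework would supply the gradient in practice; here it is estimated by finite differences), pulling a flat grey image toward a random target:

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM using global (unwindowed) image statistics."""
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
            * (2 * cov + c2) / (x.var() + y.var() + c2))

rng = np.random.default_rng(0)
target = rng.random((8, 8))
recon = np.full((8, 8), 0.5)       # start from a flat grey reconstruction
before = ssim(target, recon)

eps, lr = 1e-5, 0.3
for _ in range(100):
    # Finite-difference gradient of SSIM w.r.t. each reconstruction pixel
    grad = np.zeros_like(recon)
    base = ssim(target, recon)
    for i in range(8):
        for j in range(8):
            bumped = recon.copy()
            bumped[i, j] += eps
            grad[i, j] = (ssim(target, bumped) - base) / eps
    recon += lr * grad             # ascend SSIM, i.e. descend 1 - SSIM

after = ssim(target, recon)
```

After a few dozen steps the reconstruction's SSIM against the target rises substantially above its starting value, which is the mechanism that lets MS-SSIM slot into ordinary gradient-descent training.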

Probabilistic Autoencoders

The research further explores the use of MS-SSIM within the framework of variational autoencoders (VAEs), adapted as Expected-Loss VAEs (EL-VAEs). By incorporating perceptual losses into the EL-VAE framework, the paper demonstrates that perceptually-optimized autoencoders outperform traditional models in generating novel images with high perceptual quality, as confirmed by human evaluations.

Figure 2: (a) Four randomly selected, held-out STL-10 images and their reconstructions. For these images, the MS-SSIM reconstruction was ranked as best by humans. Reconstructions are from the 128-hidden-unit VAEs. From left to right are the original image, followed by the MS-SSIM, MSE, and MAE reconstructions. (b) Four randomly selected test images where the MS-SSIM reconstruction was ranked second or third.
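The EL-VAE idea can be read as the usual variational objective with the Gaussian log-likelihood term replaced by an arbitrary reconstruction loss, such as 1 − MS-SSIM. The function below is a dependency-free sketch under that reading; `decode` and `recon_loss` are caller-supplied stand-ins for the decoder network and the chosen perceptual loss, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def el_vae_loss(x, mu, log_var, decode, recon_loss, n_samples=8):
    """Monte-Carlo estimate of an EL-VAE-style objective: the expected
    reconstruction loss under the approximate posterior q(z|x), plus the
    closed-form KL divergence from the unit-Gaussian prior."""
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a diagonal Gaussian
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * log_var) * eps   # reparameterization trick
        recon += recon_loss(x, decode(z))
    return recon / n_samples + kl
```

Swapping `recon_loss` between an ℓ2 distance and `1 - ms_ssim` is the only change needed to move between the pixel-wise and perceptually-optimized variants in this formulation.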

Image Classification and Super-Resolution

Beyond reconstruction tasks, the paper also addresses whether perceptually-based training objectives aid in acquiring image representations useful for ancillary tasks like classification. Experiments training SVMs on the bottleneck features of deterministically trained autoencoders suggest improved performance at predicting image attributes, such as identity and lighting conditions, when the autoencoders are trained with perceptual losses.
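The downstream-classification setup can be sketched as follows. Note the hedges: the paper trains an SVM on real bottleneck activations, whereas this sketch uses synthetic two-class codes and a nearest-centroid classifier as a dependency-free stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in data: two classes of 10-D "bottleneck" codes
codes_a = rng.normal(0.0, 1.0, size=(50, 10))   # e.g., one identity
codes_b = rng.normal(3.0, 1.0, size=(50, 10))   # e.g., another identity
X = np.vstack([codes_a, codes_b])
y = np.array([0] * 50 + [1] * 50)

# Nearest-centroid classifier (a minimal proxy for the paper's SVM)
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(code):
    return int(np.argmin(np.linalg.norm(centroids - code, axis=1)))

acc = np.mean([predict(code) == label for code, label in zip(X, y)])
```

The paper's claim is that codes produced by perceptually-trained encoders make this kind of downstream classifier more accurate than codes from pixel-loss-trained encoders.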

In image super-resolution tasks, perceptual losses enhance image detail recovery while reducing artifacts, achieving quality improvements recognizable by standard metrics such as PSNR and SSIM.
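The PSNR metric mentioned above is simple to compute. The snippet below defines it and scores a naive pixel-repetition 4x upsampler on a synthetic smooth image; this is an illustrative baseline for the evaluation protocol, not the paper's super-resolution network:

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# Illustrative baseline: decimate a smooth 64x64 test image by 4x,
# upsample by pixel repetition, and score the result against the original.
h = np.linspace(0.0, 1.0, 64)
img = np.outer(h, h)                              # smooth synthetic image
low = img[::4, ::4]                               # 4x decimation
up = low.repeat(4, axis=0).repeat(4, axis=1)      # nearest-neighbour upsample
score = psnr(img, up)
```

A learned super-resolution model is judged by how far it lifts PSNR (and SSIM) above such trivial baselines; the paper's finding is that perceptually-trained models improve on these metrics while also removing visible artifacts.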

Figure 3: Visual comparisons on super-resolution at a magnification factor of 4. MS-SSIM not only improves resolution but also removes artifacts, e.g., the ringing effect in the bottom row, and enhances contrast, e.g., the fabric in the third row.

Conclusion

The paper makes a compelling case for integrating perceptual similarity metrics into the training objectives of neural networks tasked with image generation, advocating a shift from traditional pixel-wise losses to objectives better aligned with human visual perception. It lays a foundation for further research into perceptual metrics and their applications across computer vision, promising advances in tasks that require detailed, perceptually coherent imagery.
