
Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

Published 25 Mar 2024 in cs.CV and cs.MM (arXiv:2403.17000v1)

Abstract: Diffusion models are just at a tipping point for the image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution, which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos, but also the temporal consistency across video frames. In this paper, we propose a novel approach, pursuing Spatial Adaptation and Temporal Coherence (SATeCo), for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Technically, SATeCo freezes all the parameters of the pre-trained UNet and VAE, and only optimizes two deliberately designed modules, spatial feature adaptation (SFA) and temporal feature alignment (TFA), in the decoders of the UNet and VAE. SFA modulates frame features via adaptively estimating affine parameters for each pixel, guaranteeing pixel-wise guidance for high-resolution frame synthesis. TFA delves into feature interaction within a 3D local window (tubelet) through self-attention, and executes cross-attention between the tubelet and its low-resolution counterpart to guide temporal feature alignment. Extensive experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.

References (57)
  1. Real-time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation. In CVPR, 2017.
  2. BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond. In CVPR, 2021.
  3. BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment. In CVPR, 2022a.
  4. Investigating Tradeoffs in Real-World Video Super-Resolution. In CVPR, 2022b.
  5. Two Deterministic Half-quadratic Regularization Algorithms for Computed Imaging. In ICIP, 1994.
  6. AnchorFormer: Point Cloud Completion from Discriminative Nodes. In CVPR, 2023.
  7. ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models. In ICCV, 2021.
  8. Perception Prioritized Training of Diffusion Models. In CVPR, 2022.
  9. Improving Diffusion Models for Inverse Problems Using Manifold Constraints. In NeurIPS, 2022.
  10. Diffusion Posterior Sampling for General Noisy Inverse Problems. In ICLR, 2023.
  11. Diffusion Models Beat GANs on Image Synthesis. In NeurIPS, 2021.
  12. Image Quality Assessment: Unifying Structure and Texture Similarity. IEEE TPAMI, 2020.
  13. Generative Diffusion Prior for Unified Image Restoration and Enhancement. In CVPR, 2023.
  14. RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution. In CVPR, 2022.
  15. Recurrent Back-Projection Network for Video Super-Resolution. In CVPR, 2019.
  16. Prompt-to-Prompt Image Editing with Cross-Attention Control. In ICLR, 2023.
  17. Video Super-Resolution via Bidirectional Recurrent Convolutional Networks. IEEE TPAMI, 2017.
  18. Video Super-Resolution with Recurrent Structure-Detail Network. In ECCV, 2020a.
  19. Video Super-resolution with Temporal Group Attention. In CVPR, 2020b.
  20. Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation. In CVPR, 2018.
  21. Denoising Diffusion Restoration Models. In NeurIPS, 2022.
  22. MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution. In ECCV, 2020.
  23. VRT: A Video Restoration Transformer. arXiv:2201.12288, 2022a.
  24. Recurrent Video Restoration Transformer with Guided Deformable Attention. In NeurIPS, 2022b.
  25. On Bayesian Adaptive Video Super Resolution. IEEE TPAMI, 2013.
  26. Learning Trajectory-Aware Transformer for Video Super-Resolution. In CVPR, 2022.
  27. Stand-Alone Inter-Frame Attention in Video Models. In CVPR, 2022a.
  28. Dynamic Temporal Filtering in Video Models. In ECCV, 2022b.
  29. PointClustering: Unsupervised Point Cloud Pre-training using Transformation Invariance in Clustering. In CVPR, 2023.
  30. VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM. arXiv:2401.01256, 2024.
  31. Diffusion Model Based Posterior Sampling for Noisy Linear Inverse Problems. arXiv:2211.12343, 2022.
  32. Making a “Completely Blind” Image Quality Analyzer. IEEE SPL, 2012.
  33. NTIRE 2019 Challenge on Video Deblurring and Super-Resolution: Dataset and Study. In CVPRW, 2019.
  34. Glide: Towards Photorealistic Image Generation and Editing with Text-guided Diffusion Models. In ICML, 2022.
  35. Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021.
  36. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022.
  37. Palette: Image-to-Image Diffusion Models. In ACM SIGGRAPH, 2022.
  38. Frame-Recurrent Video Super-Resolution. In CVPR, 2018.
  39. Rethinking Alignment in Video Super-Resolution Transformers. In NeurIPS, 2022.
  40. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In CVPR, 2016.
  41. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
  42. Pseudoinverse-guided Diffusion Models for Inverse Problems. In ICLR, 2022.
  43. TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution. In CVPR, 2020.
  44. Diffusers: State-of-the-art Diffusion Models, 2022.
  45. Exploring CLIP for Assessing the Look and Feel of Images. In AAAI, 2023a.
  46. Exploiting Diffusion Prior for Real-World Image Super-Resolution. arXiv:2305.07015, 2023b.
  47. Deep Video Super-Resolution using HR Optical Flow Estimation. IEEE TIP, 2020.
  48. EDVR: Video Restoration with Enhanced Deformable Convolutional Networks. In CVPRW, 2019.
  49. Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model. In ICLR, 2023c.
  50. Temporal Modulation Network for Controllable Space-Time Video Super-Resolution. In CVPR, 2021.
  51. Video Enhancement with Task-Oriented Flow. IJCV, 2019.
  52. Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization. arXiv:2308.14469, 2023.
  53. Progressive Fusion Video Super-Resolution Network via Exploiting Non-Local Spatio-Temporal Correlations. In ICCV, 2019.
  54. Omniscient Video Super-Resolution. In ICCV, 2021.
  55. Adding Conditional Control to Text-to-Image Diffusion Models. In ICCV, 2023.
  56. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR, 2018.
  57. Denoising Diffusion Models for Plug-and-Play Image Restoration. In CVPRW, 2023.

Summary

  • The paper proposes SATeCo, demonstrating that diffusion models with SFA and TFA can significantly enhance spatial fidelity and temporal coherence in video super-resolution.
  • It integrates a transformer-based video upscaler with SFA and TFA modules inserted into the frozen UNet and VAE decoders, which modulate pixel features and align frame information for consistent high-resolution synthesis.
  • Extensive experiments on REDS4 and Vid4 show SATeCo achieves superior perceptual metrics compared to existing methods, bridging regression and diffusion approaches.

"Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution" Essay

Introduction

The research paper entitled "Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution" proposes a novel approach, SATeCo, aimed at capitalizing on diffusion models for video super-resolution (VSR). Unlike traditional methods, SATeCo focuses on addressing the dual challenges of maintaining spatial fidelity and ensuring temporal coherence across video frames. This is achieved by introducing spatial-temporal guidance that leverages low-resolution (LR) videos for high-resolution (HR) video synthesis through a combination of transformer-based upscaling, latent-space denoising, and pixel-space reconstruction strategies.

SATeCo Architecture Overview

The SATeCo architecture comprises several key components, depicted in Figure 1. An input LR video first passes through a transformer-based video upscaler, producing a resolution-enhanced video that is encoded into the latent space by the Variational Autoencoder (VAE). Spatial feature adaptation (SFA) and temporal feature alignment (TFA) modules are inserted into the decoders of both the UNet and the VAE, while all pre-trained parameters remain frozen; only the SFA and TFA modules are optimized. These modules modulate pixel features and enforce temporal consistency through self-attention and cross-attention processes, as further detailed below.

Figure 1: An overview of our SATeCo architecture illustrating the workflow from LR video input to HR video output.
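
To make this workflow concrete, the following is a minimal, hedged sketch of the inference flow in PyTorch-style pseudocode. All component interfaces (`upscaler`, `vae`, `unet`, `scheduler`, and the `guidance` keyword) are illustrative assumptions rather than the paper's actual API; the sketch only mirrors the four stages described above.

```python
import torch

@torch.no_grad()
def sateco_inference(lr_video, upscaler, vae, unet, scheduler, steps=50):
    """Hypothetical sketch of the SATeCo inference flow (shapes illustrative).

    lr_video: (B, T, 3, h, w) low-resolution input frames.
    """
    # 1. Transformer-based video upscaler produces a resolution-enhanced video.
    up_video = upscaler(lr_video)                      # (B, T, 3, H, W)

    # 2. Encode the upscaled video into VAE latents to serve as guidance.
    guide = vae.encode(up_video)                       # (B, T, C, H/8, W/8)

    # 3. Latent-space denoising: the frozen UNet, whose decoder carries the
    #    trainable SFA/TFA modules, is conditioned on the guidance latents.
    z = torch.randn_like(guide)
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        eps = unet(z, t, guidance=guide)               # SFA/TFA consume guide
        z = scheduler.step(eps, t, z).prev_sample

    # 4. Pixel-space reconstruction: the VAE decoder, also equipped with
    #    SFA/TFA modules, maps denoised latents back to HR frames.
    return vae.decode(z, guidance=up_video)            # (B, T, 3, H, W)
```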

Spatial Feature Adaptation and Temporal Feature Alignment

The paper introduces SFA and TFA as critical modules for achieving high spatial fidelity and temporal coherence:

  • Spatial Feature Adaptation: SFA performs pixel-wise feature modulation by estimating affine parameters (a scale and a bias) from LR video latent features. This per-pixel guidance is critical for enhancing spatial fidelity during latent-space video denoising (a sketch of this idea appears after this list).
  • Temporal Feature Alignment: TFA promotes temporal coherence through tubelet-based self-attention and cross-attention. It first lets features interact within a local 3D window (tubelet) across frames, then calibrates the HR tubelet against its LR counterpart via cross-attention, addressing frame-to-frame inconsistency (see the second sketch after this list).

    Figure 2: An illustration of (c) spatial feature adaptation and (d) temporal feature alignment modules.
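
As promised above, here is a minimal sketch of an SFA-style module, assuming a FiLM-like design in which small convolutional heads predict a per-pixel scale and bias from the LR guidance features. The layer sizes and the residual `1 + scale` parameterization are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpatialFeatureAdaptation(nn.Module):
    """Sketch of an SFA-style module: predicts a per-pixel scale and bias
    from low-resolution guidance features and applies them as an affine
    modulation of the frame features."""

    def __init__(self, feat_dim, guide_dim):
        super().__init__()
        self.to_scale = nn.Sequential(
            nn.Conv2d(guide_dim, feat_dim, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )
        self.to_bias = nn.Sequential(
            nn.Conv2d(guide_dim, feat_dim, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )

    def forward(self, feat, guide):
        # feat:  (B, C, H, W) frame features inside the frozen decoder
        # guide: (B, Cg, H, W) spatially aligned LR guidance features
        scale = self.to_scale(guide)
        bias = self.to_bias(guide)
        return feat * (1.0 + scale) + bias  # pixel-wise affine modulation
```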
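
And a corresponding sketch of a TFA-style module. The tubelet partitioning, window size, and attention layout below are illustrative simplifications; a faithful implementation would follow the paper's exact window tiling and projection layers.

```python
import torch
import torch.nn as nn

class TemporalFeatureAlignment(nn.Module):
    """Sketch of a TFA-style module: tokens inside a local 3D window
    (a "tubelet") attend to each other (self-attention), then attend to
    the corresponding LR tubelet (cross-attention). Assumes T, H, W are
    divisible by the window size."""

    def __init__(self, dim, heads=4, window=(2, 8, 8)):
        super().__init__()
        self.window = window  # (frames, height, width) of a tubelet
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def _tubelets(self, x):
        # (B, T, C, H, W) -> (B * num_windows, window_tokens, C)
        B, T, C, H, W = x.shape
        wt, wh, ww = self.window
        x = x.view(B, T // wt, wt, C, H // wh, wh, W // ww, ww)
        return x.permute(0, 1, 4, 6, 2, 5, 7, 3).reshape(-1, wt * wh * ww, C)

    def forward(self, feat, lr_feat):
        # feat, lr_feat: (B, T, C, H, W); lr_feat is the LR guidance
        B, T, C, H, W = feat.shape
        wt, wh, ww = self.window
        q = self._tubelets(feat)
        kv = self._tubelets(lr_feat)
        # Self-attention inside each HR tubelet captures local dynamics.
        n = self.norm1(q)
        q = q + self.self_attn(n, n, n)[0]
        # Cross-attention to the LR tubelet aligns temporal features.
        q = q + self.cross_attn(self.norm2(q), kv, kv)[0]
        # Fold tubelet tokens back to the original (B, T, C, H, W) layout.
        q = q.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        return q.permute(0, 1, 4, 7, 2, 5, 3, 6).reshape(B, T, C, H, W)
```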

Experimental Evaluation

The paper conducts extensive experiments on the REDS4 and Vid4 datasets, demonstrating SATeCo's effectiveness. The results, summarized in Table 1, show SATeCo's superiority on perception-based metrics such as LPIPS and DISTS when compared to competing methods like VRT and StableSR (a sketch of frame-wise LPIPS computation follows Figure 3). Notably, the approach achieves comparable PSNR and SSIM scores alongside enhanced perceptual quality, affirming the model's ability to bridge the gap between traditional regression models and diffusion-based super-resolution models.

Figure 3: Six visual examples of video super-resolution results by different approaches on the REDS4 and Vid4 datasets.
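
For reference, perception-based scores such as LPIPS can be reproduced frame by frame with the publicly available `lpips` package. The following is a minimal sketch; the video tensor shapes and value ranges are assumptions about the reader's data, not the paper's evaluation script.

```python
import torch
import lpips  # pip install lpips

# LPIPS compares deep features of two images; lower is better.
loss_fn = lpips.LPIPS(net='alex')  # AlexNet backbone, as commonly reported

def video_lpips(pred, target):
    """Average LPIPS over frames. pred/target: (T, 3, H, W) in [-1, 1]."""
    with torch.no_grad():
        scores = [loss_fn(p.unsqueeze(0), t.unsqueeze(0))
                  for p, t in zip(pred, target)]
    return torch.stack(scores).mean().item()
```

DISTS, which also has a public reference implementation, is evaluated the same way: per frame, then averaged over the clip.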

Comparative Analysis and Model Evaluation

The integration of SFA and TFA within the SATeCo framework significantly improves both spatial and temporal feature learning. As highlighted in Table 2, SATeCo variants that add the SFA and TFA modules incrementally to the UNet and VAE show step-wise performance gains over the baseline, with the complete SATeCo implementation yielding the best overall results.

Figure 4: Video super-resolution results of two videos in the Vid4 dataset demonstrating temporal consistency.

Conclusion

SATeCo effectively advances the state of the art in video super-resolution by embedding spatial adaptation and temporal coherence mechanisms within diffusion models. By learning pixel-wise spatial guidance and temporal alignment from LR videos, SATeCo achieves high-quality HR video synthesis with improved spatial fidelity and temporal consistency. Future research directions may explore further optimization of the guidance mechanisms or integration with other generative models to enhance performance across diverse datasets.
