One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution

Published 18 Jun 2025 in cs.CV and cs.AI | arXiv:2506.15591v2

Abstract: It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.

Summary

  • The paper introduces DLoRAL, a dual LoRA-based Real-VSR framework that decouples temporal consistency and spatial detail enhancement via one-step diffusion.
  • It employs a Cross-Frame Retrieval module to align adjacent frame features, ensuring robust temporal priors and efficient processing.
  • Experiments demonstrate state-of-the-art performance with faster inference (~10x improvement) and superior perceptual quality compared to existing methods.

This paper introduces Dual LoRA Learning (DLoRAL), a novel framework for Real-World Video Super-Resolution (Real-VSR) that aims to generate videos with rich spatial details while maintaining temporal consistency. The core challenge in Real-VSR, especially when using pre-trained generative models like Stable Diffusion (SD), is the trade-off between detail enhancement and temporal coherence. Existing methods often sacrifice one for the other. DLoRAL addresses this by decoupling the learning of temporal consistency and spatial details using a one-step diffusion model.

The proposed DLoRAL framework leverages a pre-trained SD model and introduces two specialized Low-Rank Adaptation (LoRA) modules: a Consistency-LoRA (C-LoRA) and a Detail-LoRA (D-LoRA). These modules are trained in an alternating, dual-stage process.
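Because both LoRA branches are merged into the SD weights before inference, the adapted model runs at the cost of the plain one-step UNet. Below is a minimal sketch of that merging step using standard LoRA algebra; treating the two branches as a simple additive sum (and the `scale` factor) is an illustrative assumption, not a detail confirmed by the paper.

```python
import torch

def merge_dual_lora(base_weight, lora_pairs, scale=1.0):
    """Fold low-rank adapters into a frozen base weight: W' = W + s * sum(B @ A)."""
    merged = base_weight.clone()
    for A, B in lora_pairs:          # A: (rank, in_dim), B: (out_dim, rank)
        merged += scale * (B @ A)    # each low-rank update folded into the weight
    return merged

# Hypothetical usage for one linear layer, merging C-LoRA and D-LoRA:
# layer.weight.data = merge_dual_lora(layer.weight.data,
#                                     [(A_cons, B_cons), (A_detail, B_detail)])
```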

Key Components and Methodology:

  1. One-Step Residual Diffusion: The system builds upon a one-step residual diffusion model. Instead of multiple denoising steps, it refines the low-quality (LQ) latent code $z^{LQ}$ to a high-quality (HQ) latent code $z^{HQ}$ in a single step using the formula $z^{HQ} = z^{LQ} - \epsilon_\theta(z^{LQ})$, where $\epsilon_\theta$ is the noise prediction network. This significantly speeds up inference (see the inference sketch below).
  2. Cross-Frame Retrieval (CFR) Module: To exploit temporal information from degraded LQ inputs, the CFR module aggregates complementary information from adjacent frames. For a current frame $I_n^{LQ}$ and its preceding frame $I_{n-1}^{LQ}$, their latent codes $z_n^{LQ}$ and $z_{n-1}^{LQ}$ are processed. The CFR module first aligns $z_{n-1}^{LQ}$ to $z_n^{LQ}$'s coordinate space using SpyNet-based warping ($F_{wp}$). Then, using $1 \times 1$ convolutions, it projects $z_n^{LQ}$ to query ($Q_n$) embeddings and the aligned $F_{wp}(z_{n-1}^{LQ})$ to key ($K_{n-1}$) and value ($V_{n-1}$) embeddings. The fusion mechanism selectively attends to the top-k most similar positions and uses a learnable threshold $\tau_n[p]$ for gating (a code sketch of this fusion appears after the training description below):

     $$\bar{z}^{LQ}_n[p] = z^{LQ}_n[p] + \sum_{q \in F_{topk}[p]} \phi \left( \frac{\langle Q_n[p], K_{n-1}[q] \rangle}{\sqrt{d}} - \tau_n[p] \right) \cdot V_{n-1}[q]$$

     This produces a temporally enriched LQ latent $\bar{z}^{LQ}_n$.

  3. Dual LoRA Modules:
    • Consistency-LoRA (C-LoRA): This module, along with the CFR module, is trained during the "temporal consistency stage." It learns robust temporal representations from the fused LQ latent features $\bar{z}^{LQ}_n$.
    • Detail-LoRA (D-LoRA): This module is trained during the "detail enhancement stage." It focuses on restoring high-frequency spatial details.
  4. Dual-Stage Alternating Training:
    • Temporal Consistency Stage: The CFR and C-LoRA modules are trained while D-LoRA is frozen. The goal is to establish strong temporal coherence. The loss function $\mathcal{L}_{\text{cons}}$ includes a pixel-level loss ($\mathcal{L}_{\text{pix}}$, using $\ell_2$), an LPIPS loss ($\mathcal{L}_{\text{lpips}}$), and an optical flow loss ($\mathcal{L}_{\text{opt}}$):

      $$\mathcal{L}_{\text{cons}} = \lambda_{\text{pix}} \mathcal{L}_{\text{pix}} + \lambda_{\text{lpips}} \mathcal{L}_{\text{lpips}} + \lambda_{\text{opt}} \mathcal{L}_{\text{opt}}$$

      $$\mathcal{L}_{\text{opt}} = \left\| F(I^{HQ}_n, I^{HQ}_{n+1}) - F(I^{\text{GT}}_n, I^{\text{GT}}_{n+1}) \right\|_1$$

      where $F(\cdot,\cdot)$ denotes the estimated optical flow between two frames.

    • Detail Enhancement Stage: The CFR and C-LoRA modules are frozen, and D-LoRA is trained. The focus is on improving spatial visual quality while maintaining the learned consistency. The loss function $\mathcal{L}_{\text{enh}}$ includes the previous losses plus a Classifier Score Distillation (CSD) loss ($\mathcal{L}_{\text{csd}}$) to encourage richer details:

      $$\mathcal{L}_{\text{enh}} = \lambda_{\text{pix}} \mathcal{L}_{\text{pix}} + \lambda_{\text{lpips}} \mathcal{L}_{\text{lpips}} + \lambda_{\text{opt}} \mathcal{L}_{\text{opt}} + \lambda_{\text{csd}} \mathcal{L}_{\text{csd}}$$

These two stages are alternated iteratively. A smooth transition between stages is achieved by interpolating the loss functions over a warm-up period.
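To make the staged objectives and the warm-up blending concrete, here is a minimal PyTorch-style sketch; the `flow_net`, `lpips_fn`, and `csd_fn` callables, the loss-weight dictionary, and the linear blending schedule are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def consistency_loss(hq, gt, flow_net, lpips_fn, w):
    # hq, gt: (B, 2, C, H, W) pairs of consecutive output / ground-truth frames
    l_pix = F.mse_loss(hq, gt)                                     # pixel-level l2 loss
    l_lpips = lpips_fn(hq.flatten(0, 1), gt.flatten(0, 1)).mean()  # perceptual loss
    # optical-flow loss: flow between consecutive HQ frames should match
    # the flow between the corresponding GT frames (l1 distance)
    l_opt = (flow_net(hq[:, 0], hq[:, 1])
             - flow_net(gt[:, 0], gt[:, 1])).abs().mean()
    return w["pix"] * l_pix + w["lpips"] * l_lpips + w["opt"] * l_opt

def enhancement_loss(hq, gt, flow_net, lpips_fn, csd_fn, w):
    # detail stage: reuse L_cons and add classifier score distillation
    return consistency_loss(hq, gt, flow_net, lpips_fn, w) + w["csd"] * csd_fn(hq)

def blended_loss(step, period, warmup, l_cons, l_enh):
    # alternate stages every `period` steps (warmup <= period); linearly
    # interpolate the two objectives over the first `warmup` steps of a stage
    in_enh_stage = (step // period) % 2 == 1
    t = min((step % period) / warmup, 1.0)   # ramps 0 -> 1 after each switch
    alpha = t if in_enh_stage else 1.0 - t   # weight on the enhancement loss
    return (1 - alpha) * l_cons + alpha * l_enh
```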

The overall training pipeline is visualized in Figure 1 of the paper (https://i.imgur.com/gKkCjUf.png), which illustrates the dual-stage training of the CFR, C-LoRA, and D-LoRA modules.
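To make the CFR fusion above concrete, here is a minimal PyTorch-style sketch of the thresholded top-k cross-frame attention. The projection dimension, the top-k size, the choice of softmax for $\phi$, and the dense similarity computation are illustrative assumptions; the previous latent is assumed to be already warped into the current frame's coordinates.

```python
import torch
import torch.nn as nn

class CrossFrameRetrieval(nn.Module):
    """Sketch of CFR: fuse the (pre-aligned) previous LQ latent into the
    current one via thresholded top-k attention with a residual connection."""
    def __init__(self, channels, dim=64, topk=8):
        super().__init__()
        self.q = nn.Conv2d(channels, dim, 1)   # 1x1 query projection
        self.k = nn.Conv2d(channels, dim, 1)   # 1x1 key projection
        self.v = nn.Conv2d(channels, channels, 1)
        self.tau = nn.Conv2d(channels, 1, 1)   # learnable per-position threshold
        self.topk, self.dim = topk, dim

    def forward(self, z_cur, z_prev_aligned):
        B, C, H, W = z_cur.shape
        hw = H * W
        q = self.q(z_cur).flatten(2).transpose(1, 2)           # (B, HW, d)
        k = self.k(z_prev_aligned).flatten(2)                  # (B, d, HW)
        v = self.v(z_prev_aligned).flatten(2).transpose(1, 2)  # (B, HW, C)
        tau = self.tau(z_cur).flatten(2).transpose(1, 2)       # (B, HW, 1)

        sim = q @ k / self.dim ** 0.5 - tau          # gated similarity scores
        scores, idx = sim.topk(self.topk, dim=-1)    # keep top-k matches per position
        weights = scores.softmax(dim=-1)             # phi: softmax over retained scores
        # gather the top-k value vectors per query position (memory-heavy but
        # simple; a real implementation would index more efficiently)
        gathered = torch.gather(
            v.unsqueeze(1).expand(B, hw, hw, C),
            2, idx.unsqueeze(-1).expand(B, hw, self.topk, C))
        fused = (weights.unsqueeze(-1) * gathered).sum(dim=2)  # (B, HW, C)
        return z_cur + fused.transpose(1, 2).reshape(B, C, H, W)
```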

  5. Inference: During inference, both C-LoRA and D-LoRA are merged into the main SD UNet. The model processes LQ video frames (current frame $I_n^{LQ}$ and preceding frame $I_{n-1}^{LQ}$) through the CFR module and then the enhanced UNet in a single diffusion step to produce the HQ frame $I_n^{HQ}$. This sliding-window procedure is repeated over the video sequence.
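A hedged sketch of this sliding-window, single-step loop; `encoder`, `decoder`, `cfr`, `unet` (with both LoRAs already merged), and the flow-based `align` warp are placeholder interfaces, not the released API.

```python
import torch

@torch.no_grad()
def infer_video(frames_lq, encoder, decoder, cfr, unet, align):
    """Restore a video frame-by-frame with one diffusion step per frame."""
    outputs, prev = [], None                    # prev = (LQ frame, LQ latent)
    for frame in frames_lq:                     # each frame: (1, C, H, W)
        z_cur = encoder(frame)                  # LQ latent z^LQ
        if prev is not None:
            prev_frame, z_prev = prev
            # warp the previous latent into the current frame's coordinates
            # (flow-based alignment), then fuse it via the CFR module
            z_cur = cfr(z_cur, align(z_prev, prev_frame, frame))
        z_hq = z_cur - unet(z_cur)              # one step: z^HQ = z^LQ - eps_theta(z^LQ)
        outputs.append(decoder(z_hq))           # decode the HQ frame
        prev = (frame, z_cur)
    return outputs
```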

Implementation Details:

  • Backbone: Pre-trained Stable Diffusion V2.1.
  • Training: Batch size 16, sequence length 3, resolution $512 \times 512$, on 4 NVIDIA A100 GPUs. Adam optimizer with learning rate $5 \times 10^{-5}$.
  • Datasets:
    • Consistency Stage: REDS dataset and curated videos from Pexels (44,162 frames).
    • Enhancement Stage: LSDIR dataset, with simulated video sequences generated by random pixel-level translations.
    • Degradation: RealESRGAN degradation pipeline (blur, noise, downsampling, compression).
  • Testing Datasets: UDM10, SPMCS (synthetic), RealVSR, VideoLQ (real-world).
  • Evaluation Metrics: PSNR, SSIM, LPIPS, DISTS, MUSIQ, MANIQA, CLIPIQA, DOVER, and average warping error ($E^*_{warp}$) for temporal consistency.
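For quick reference, the reported training setup collected into a single (hypothetical) config object; the Hugging Face model id for the backbone is an assumption.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # values as reported in the paper's implementation details
    backbone: str = "stabilityai/stable-diffusion-2-1"  # assumed model id
    batch_size: int = 16
    seq_len: int = 3            # frames per training sequence
    resolution: int = 512       # 512 x 512 crops
    optimizer: str = "Adam"
    lr: float = 5e-5
    num_gpus: int = 4           # NVIDIA A100
```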

Results and Contributions:

  • Performance: DLoRAL achieves state-of-the-art performance on Real-VSR benchmarks, outperforming existing methods in perceptual quality (LPIPS, DISTS, MUSIQ, CLIPIQA, MANIQA, DOVER) while maintaining good temporal consistency ($E^*_{warp}$).
  • Efficiency: Due to the one-step diffusion and LoRA integration, DLoRAL is significantly faster (e.g., ~10x faster than Upscale-A-Video and MGLD-VSR) and has a comparable number of parameters to other efficient methods like OSEDiff.
  • Qualitative Results: Visual comparisons show DLoRAL produces sharper details, better facial reconstruction, and more legible textures compared to other methods, while temporal profiles indicate smoother transitions.
  • User Study: DLoRAL was overwhelmingly preferred by human evaluators (93 out of 120 votes) against three other diffusion-based Real-VSR methods for its balance of perceptual quality and temporal consistency.

Main Contributions:

  1. A Dual LoRA Learning (DLoRAL) paradigm for Real-VSR that decouples temporal consistency and spatial detail learning into two dedicated LoRA modules within a one-step diffusion framework.
  2. A Cross-Frame Retrieval (CFR) module to extract degradation-robust temporal priors from LQ inputs, guiding both C-LoRA and D-LoRA training.
  3. State-of-the-art performance in Real-VSR, achieving both realistic details and temporal stability efficiently.

Limitations:

  • The $8\times$ downsampling VAE inherited from SD makes it difficult to restore very fine-scale details (e.g., small text).
  • The VAE's heavy compression can also disrupt temporal coherence, making it harder to extract robust consistency priors. The authors suggest that a VAE designed specifically for Real-VSR could mitigate this.

In essence, DLoRAL provides a practical and effective approach to Real-VSR by cleverly managing the conflicting objectives of detail enhancement and temporal consistency through a dual LoRA architecture and a staged training strategy, all while ensuring efficient inference via a one-step diffusion process.
