One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
Published 18 Jun 2025 in cs.CV and cs.AI (arXiv:2506.15591v2)
Abstract: It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
The paper introduces DLoRAL, a dual LoRA-based Real-VSR framework that decouples temporal consistency and spatial detail enhancement via one-step diffusion.
It employs a Cross-Frame Retrieval module to align adjacent frame features, ensuring robust temporal priors and efficient processing.
Experiments demonstrate state-of-the-art performance with faster inference (~10x improvement) and superior perceptual quality compared to existing methods.
This paper introduces Dual LoRA Learning (DLoRAL), a novel framework for Real-World Video Super-Resolution (Real-VSR) that aims to generate videos with rich spatial details while maintaining temporal consistency. The core challenge in Real-VSR, especially when using pre-trained generative models like Stable Diffusion (SD), is the trade-off between detail enhancement and temporal coherence. Existing methods often sacrifice one for the other. DLoRAL addresses this by decoupling the learning of temporal consistency and spatial details using a one-step diffusion model.
The proposed DLoRAL framework leverages a pre-trained SD model and introduces two specialized Low-Rank Adaptation (LoRA) modules: a Consistency-LoRA (C-LoRA) and a Detail-LoRA (D-LoRA). These modules are trained in an alternating, dual-stage process.
Key Components and Methodology:
One-Step Residual Diffusion: The system builds upon a one-step residual diffusion model. Instead of performing multiple denoising steps, it refines the low-quality (LQ) latent code z_LQ into a high-quality (HQ) latent code z_HQ in a single step via z_HQ = z_LQ − ε_θ(z_LQ), where ε_θ is the noise prediction network. This significantly speeds up inference.
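The one-step residual refinement above can be sketched in a few lines; the toy residual predictor below is a hypothetical stand-in for the trained UNet:

```python
import numpy as np

def one_step_restore(z_lq, eps_theta):
    """Single-step residual refinement: the network predicts a residual
    (treated as noise) and it is subtracted from the LQ latent."""
    return z_lq - eps_theta(z_lq)

# Toy residual predictor standing in for the trained network (hypothetical).
rng = np.random.default_rng(0)
z_lq = rng.standard_normal((4, 8, 8))
eps = lambda z: 0.1 * z
z_hq = one_step_restore(z_lq, eps)
```

The entire restoration is one forward pass, which is where the speedup over multi-step diffusion samplers comes from.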
Cross-Frame Retrieval (CFR) Module: To exploit temporal information in the degraded LQ inputs, the CFR module aggregates complementary information from adjacent frames. For a current frame I_n^LQ and its preceding frame I_{n−1}^LQ, the latent codes z_n^LQ and z_{n−1}^LQ are processed as follows. The CFR module first aligns z_{n−1}^LQ to the coordinate space of z_n^LQ using SpyNet-based warping (F_wp). Then, using 1×1 convolutions, it projects z_n^LQ to query embeddings (Q_n) and the aligned F_wp(z_{n−1}^LQ) to key (K_{n−1}) and value (V_{n−1}) embeddings.
The fusion mechanism selectively attends to the top-k most similar positions for each query location and applies a learnable threshold τ_n[p] to gate out unreliable matches from the degraded previous frame.
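A minimal sketch of this top-k, threshold-gated attention over flattened spatial positions (the fixed scalar `tau` simplifies the paper's learnable per-position threshold τ_n[p]):

```python
import numpy as np

def topk_gated_attention(q, k, v, topk=2, tau=0.0):
    """For each query position, attend only to the top-k most similar key
    positions, and drop matches whose similarity falls below `tau`."""
    scores = q @ k.T                 # (Nq, Nk) similarity map
    out = np.zeros_like(q)
    for p in range(q.shape[0]):
        idx = np.argsort(scores[p])[-topk:]      # top-k positions
        sel = idx[scores[p, idx] > tau]          # threshold gating
        if sel.size == 0:
            continue                             # no reliable match: keep zeros
        w = np.exp(scores[p, sel] - scores[p, sel].max())
        w /= w.sum()                             # softmax over survivors
        out[p] = w @ v[sel]
    return out
```

With orthonormal queries and keys, each position retrieves exactly its matching value, which is the degenerate case of the retrieval behaving as an identity lookup.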
Consistency-LoRA (C-LoRA): This module, together with the CFR module, is trained during the "temporal consistency stage." It learns robust temporal representations from the fused LQ latent features z̄_n^LQ.
Detail-LoRA (D-LoRA): This module is trained during the "detail enhancement stage." It focuses on restoring high-frequency spatial details.
Dual-Stage Alternating Training:
Temporal Consistency Stage: The CFR and C-LoRA modules are trained while D-LoRA is frozen. The goal is to establish strong temporal coherence. The loss function L_cons combines a pixel-level ℓ2 loss (L_pix), an LPIPS loss (L_lpips), and an optical-flow loss (L_opt):

L_cons = λ_pix L_pix + λ_lpips L_lpips + λ_opt L_opt

L_opt = ‖F(I_n^HQ, I_{n+1}^HQ) − F(I_n^GT, I_{n+1}^GT)‖_1

where F(·,·) denotes the optical flow estimated between two consecutive frames.
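The consistency objective can be sketched directly from those formulas; the weights `lam_*` below are hypothetical placeholders, not the paper's values, and `flow_fn` stands in for the pre-trained flow estimator F:

```python
import numpy as np

def optical_flow_loss(hq_pair, gt_pair, flow_fn):
    """L_opt: L1 distance between the flow of consecutive restored (HQ)
    frames and the flow of the matching ground-truth frames."""
    return np.abs(flow_fn(*hq_pair) - flow_fn(*gt_pair)).mean()

def consistency_loss(l_pix, l_lpips, l_opt,
                     lam_pix=1.0, lam_lpips=1.0, lam_opt=0.1):
    """L_cons as the weighted sum of the three terms (weights hypothetical)."""
    return lam_pix * l_pix + lam_lpips * l_lpips + lam_opt * l_opt
```

Matching the flow fields of restored and ground-truth frame pairs, rather than the frames themselves, is what makes this term target motion coherence specifically.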
Detail Enhancement Stage: The CFR and C-LoRA modules are frozen, and D-LoRA is trained. The focus is on improving spatial visual quality while preserving the learned consistency. The loss function L_enh includes the previous losses plus a Classifier Score Distillation (CSD) loss (L_csd) to encourage richer details.
These two stages are alternated iteratively. A smooth transition between stages is achieved by interpolating the loss functions over a warm-up period.
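One way to realize this alternation with loss interpolation is a cyclic schedule that linearly blends the stage losses over a warm-up window; the period and warm-up lengths below are hypothetical, as the paper only states that the losses are interpolated at each switch:

```python
def stage_weights(step, period=1000, warmup=100):
    """Return (w_cons, w_enh): mixing weights for the consistency-stage
    and enhancement-stage losses at a given training step."""
    pos = step % (2 * period)
    if pos < period:                      # consistency stage
        w_cons = min(pos / warmup, 1.0)   # ramp enhancement -> consistency
    else:                                 # enhancement stage
        w_cons = 1.0 - min((pos - period) / warmup, 1.0)
    return w_cons, 1.0 - w_cons
```

The total loss at each step would then be `w_cons * L_cons + w_enh * L_enh`, so neither objective is switched off abruptly.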
The overall training pipeline is visualized in Figure 1 of the paper:
https://i.imgur.com/gKkCjUf.png
Image from Figure 1 of the paper, illustrating the dual-stage training of CFR, C-LoRA, and D-LoRA.
Inference: During inference, both C-LoRA and D-LoRA are merged into the main SD UNet. The model processes the LQ video in a sliding-window fashion: each current frame I_n^LQ, together with its preceding frame I_{n−1}^LQ, is passed through the CFR module and then the merged UNet in a single diffusion step to produce the HQ frame I_n^HQ.
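The two inference-time ideas, folding the LoRA branches into the base weights and sliding over frame pairs, can be sketched as follows (`restore_pair` is a hypothetical stand-in for CFR plus the merged one-step UNet):

```python
import numpy as np

def merge_lora(w, lora_a, lora_b, scale=1.0):
    """Fold a trained LoRA branch into the base weight: W' = W + scale * B @ A.
    After merging both C-LoRA and D-LoRA this way, inference runs at the
    cost of the base model alone."""
    return w + scale * (lora_b @ lora_a)

def restore_video(frames, restore_pair):
    """Sliding-window inference: each HQ frame is produced from the current
    LQ frame and its predecessor (the first frame is paired with itself)."""
    out, prev = [], frames[0]
    for cur in frames:
        out.append(restore_pair(prev, cur))
        prev = cur
    return out
```

Merging rather than keeping the LoRA branches as separate adapters is what keeps the per-frame cost identical to the plain SD UNet.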
Implementation Details:
Backbone: Pre-trained Stable Diffusion V2.1.
Training: Batch size 16, sequence length 3, resolution 512×512, on 4 NVIDIA A100 GPUs, using the Adam optimizer with a learning rate of 5×10^−5.
Datasets:
Consistency Stage: REDS dataset and curated videos from Pexels (44,162 frames).
Enhancement Stage: LSDIR dataset, with simulated video sequences generated by random pixel-level translations.
Evaluation Metrics: PSNR, SSIM, LPIPS, DISTS, MUSIQ, MANIQA, CLIPIQA, DOVER, and the average warping error (E*_warp) for temporal consistency.
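The warping-error metric is the only non-standard one in the list; it is commonly computed as the mean difference between each frame and its predecessor warped toward it. A sketch under that assumption, with `flows` and `warp` standing in for a flow estimator and a flow-based warping operator:

```python
import numpy as np

def warping_error(frames, flows, warp):
    """Average warping error: mean absolute difference between each frame
    and its flow-warped predecessor, averaged over the sequence."""
    errs = [np.abs(frames[i + 1] - warp(frames[i], flows[i])).mean()
            for i in range(len(frames) - 1)]
    return float(np.mean(errs))
```

A perfectly consistent static sequence scores zero; flicker and motion artifacts raise the score.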
Results and Contributions:
Performance: DLoRAL achieves state-of-the-art performance on Real-VSR benchmarks, outperforming existing methods in perceptual quality (LPIPS, DISTS, MUSIQ, CLIPIQA, MANIQA, DOVER) while maintaining good temporal consistency (E*_warp).
Efficiency: Due to one-step diffusion and LoRA merging, DLoRAL is significantly faster (e.g., ~10× faster than Upscale-A-Video and MGLD-VSR) and has a parameter count comparable to other efficient methods such as OSEDiff.
Qualitative Results: Visual comparisons show DLoRAL produces sharper details, better facial reconstruction, and more legible textures compared to other methods, while temporal profiles indicate smoother transitions.
User Study: DLoRAL was overwhelmingly preferred by human evaluators (93 out of 120 votes) against three other diffusion-based Real-VSR methods for its balance of perceptual quality and temporal consistency.
Main Contributions:
A Dual LoRA Learning (DLoRAL) paradigm for Real-VSR that decouples temporal consistency and spatial detail learning into two dedicated LoRA modules within a one-step diffusion framework.
A Cross-Frame Retrieval (CFR) module to extract degradation-robust temporal priors from LQ inputs, guiding both C-LoRA and D-LoRA training.
State-of-the-art performance in Real-VSR, achieving both realistic details and temporal stability efficiently.
Limitations:
The 8× downsampling VAE inherited from SD makes it difficult to restore very fine-scale details (e.g., small text).
The VAE's heavy compression might disrupt temporal coherence, making robust consistency prior extraction harder. The authors suggest a VAE specifically designed for Real-VSR could mitigate this.
In essence, DLoRAL provides a practical and effective approach to Real-VSR by cleverly managing the conflicting objectives of detail enhancement and temporal consistency through a dual LoRA architecture and a staged training strategy, all while ensuring efficient inference via a one-step diffusion process.