One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
Published 18 Jun 2025 in cs.CV and cs.AI (arXiv:2506.15591v2)
Abstract: It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
The paper introduces DLoRAL, a dual LoRA-based Real-VSR framework that decouples temporal consistency and spatial detail enhancement via one-step diffusion.
It employs a Cross-Frame Retrieval module to align adjacent frame features, ensuring robust temporal priors and efficient processing.
Experiments demonstrate state-of-the-art performance with faster inference (~10x improvement) and superior perceptual quality compared to existing methods.
This paper introduces Dual LoRA Learning (DLoRAL), a novel framework for Real-World Video Super-Resolution (Real-VSR) that aims to generate videos with rich spatial details while maintaining temporal consistency. The core challenge in Real-VSR, especially when using pre-trained generative models like Stable Diffusion (SD), is the trade-off between detail enhancement and temporal coherence. Existing methods often sacrifice one for the other. DLoRAL addresses this by decoupling the learning of temporal consistency and spatial details using a one-step diffusion model.
The proposed DLoRAL framework leverages a pre-trained SD model and introduces two specialized Low-Rank Adaptation (LoRA) modules: a Consistency-LoRA (C-LoRA) and a Detail-LoRA (D-LoRA). These modules are trained in an alternating, dual-stage process.
Key Components and Methodology:
One-Step Residual Diffusion: The system builds upon a one-step residual diffusion model. Instead of performing multiple denoising steps, it refines the low-quality (LQ) latent code z_LQ into a high-quality (HQ) latent code z_HQ in a single step via z_HQ = z_LQ − ε_θ(z_LQ), where ε_θ is the noise prediction network. This significantly speeds up inference.
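The one-step residual refinement above can be sketched in a few lines; the toy residual predictor below is a hypothetical stand-in for the trained UNet:

```python
import numpy as np

def one_step_restore(z_lq, eps_theta):
    """Single-step residual refinement: the network predicts a residual
    (treated as noise) and it is subtracted from the LQ latent."""
    return z_lq - eps_theta(z_lq)

# Toy residual predictor standing in for the trained network (hypothetical).
rng = np.random.default_rng(0)
z_lq = rng.standard_normal((4, 8, 8))
eps = lambda z: 0.1 * z
z_hq = one_step_restore(z_lq, eps)
```

The entire restoration is one forward pass, which is where the speedup over multi-step diffusion samplers comes from.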
Cross-Frame Retrieval (CFR) Module: To exploit temporal information in the degraded LQ inputs, the CFR module aggregates complementary information from adjacent frames. For a current frame I_n^LQ and its preceding frame I_{n−1}^LQ, the latent codes z_n^LQ and z_{n−1}^LQ are processed as follows. The CFR module first aligns z_{n−1}^LQ to the coordinate space of z_n^LQ using SpyNet-based warping (F_wp). Then, using 1×1 convolutions, it projects z_n^LQ to query embeddings (Q_n) and the aligned F_wp(z_{n−1}^LQ) to key (K_{n−1}) and value (V_{n−1}) embeddings.
The fusion mechanism selectively attends to the top-k most similar positions for each query location and applies a learnable threshold τ_n[p] to gate out unreliable matches from the degraded previous frame.
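A minimal sketch of this top-k, threshold-gated attention over flattened spatial positions (the fixed scalar `tau` simplifies the paper's learnable per-position threshold τ_n[p]):

```python
import numpy as np

def topk_gated_attention(q, k, v, topk=2, tau=0.0):
    """For each query position, attend only to the top-k most similar key
    positions, and drop matches whose similarity falls below `tau`."""
    scores = q @ k.T                 # (Nq, Nk) similarity map
    out = np.zeros_like(q)
    for p in range(q.shape[0]):
        idx = np.argsort(scores[p])[-topk:]      # top-k positions
        sel = idx[scores[p, idx] > tau]          # threshold gating
        if sel.size == 0:
            continue                             # no reliable match: keep zeros
        w = np.exp(scores[p, sel] - scores[p, sel].max())
        w /= w.sum()                             # softmax over survivors
        out[p] = w @ v[sel]
    return out
```

With orthonormal queries and keys, each position retrieves exactly its matching value, which is the degenerate case of the retrieval behaving as an identity lookup.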
Consistency-LoRA (C-LoRA): This module, together with the CFR module, is trained during the "temporal consistency stage." It learns robust temporal representations from the fused LQ latent features z̄_n^LQ.
Detail-LoRA (D-LoRA): This module is trained during the "detail enhancement stage." It focuses on restoring high-frequency spatial details.
Dual-Stage Alternating Training:
Temporal Consistency Stage: The CFR and C-LoRA modules are trained while D-LoRA is frozen. The goal is to establish strong temporal coherence. The loss function L_cons combines a pixel-level ℓ2 loss (L_pix), an LPIPS loss (L_lpips), and an optical-flow loss (L_opt):

L_cons = λ_pix L_pix + λ_lpips L_lpips + λ_opt L_opt

L_opt = ‖F(I_n^HQ, I_{n+1}^HQ) − F(I_n^GT, I_{n+1}^GT)‖_1

where F(·,·) denotes the optical flow estimated between two consecutive frames.
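The consistency objective can be sketched directly from those formulas; the weights `lam_*` below are hypothetical placeholders, not the paper's values, and `flow_fn` stands in for the pre-trained flow estimator F:

```python
import numpy as np

def optical_flow_loss(hq_pair, gt_pair, flow_fn):
    """L_opt: L1 distance between the flow of consecutive restored (HQ)
    frames and the flow of the matching ground-truth frames."""
    return np.abs(flow_fn(*hq_pair) - flow_fn(*gt_pair)).mean()

def consistency_loss(l_pix, l_lpips, l_opt,
                     lam_pix=1.0, lam_lpips=1.0, lam_opt=0.1):
    """L_cons as the weighted sum of the three terms (weights hypothetical)."""
    return lam_pix * l_pix + lam_lpips * l_lpips + lam_opt * l_opt
```

Matching the flow fields of restored and ground-truth frame pairs, rather than the frames themselves, is what makes this term target motion coherence specifically.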
Detail Enhancement Stage: The CFR and C-LoRA modules are frozen, and D-LoRA is trained. The focus is on improving spatial visual quality while preserving the learned consistency. The loss function L_enh includes the previous losses plus a Classifier Score Distillation (CSD) loss (L_csd) to encourage richer details.
These two stages are alternated iteratively. A smooth transition between stages is achieved by interpolating the loss functions over a warm-up period.
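One way to realize this alternation with loss interpolation is a cyclic schedule that linearly blends the stage losses over a warm-up window; the period and warm-up lengths below are hypothetical, as the paper only states that the losses are interpolated at each switch:

```python
def stage_weights(step, period=1000, warmup=100):
    """Return (w_cons, w_enh): mixing weights for the consistency-stage
    and enhancement-stage losses at a given training step."""
    pos = step % (2 * period)
    if pos < period:                      # consistency stage
        w_cons = min(pos / warmup, 1.0)   # ramp enhancement -> consistency
    else:                                 # enhancement stage
        w_cons = 1.0 - min((pos - period) / warmup, 1.0)
    return w_cons, 1.0 - w_cons
```

The total loss at each step would then be `w_cons * L_cons + w_enh * L_enh`, so neither objective is switched off abruptly.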
The overall training pipeline is visualized in Figure 1 of the paper:
https://i.imgur.com/gKkCjUf.png
Image from Figure 1 of the paper, illustrating the dual-stage training of CFR, C-LoRA, and D-LoRA.
Inference: During inference, both C-LoRA and D-LoRA are merged into the main SD UNet. The model processes the LQ video in a sliding-window fashion: each current frame I_n^LQ, together with its preceding frame I_{n−1}^LQ, is passed through the CFR module and then the merged UNet in a single diffusion step to produce the HQ frame I_n^HQ.
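The two inference-time ideas, folding the LoRA branches into the base weights and sliding over frame pairs, can be sketched as follows (`restore_pair` is a hypothetical stand-in for CFR plus the merged one-step UNet):

```python
import numpy as np

def merge_lora(w, lora_a, lora_b, scale=1.0):
    """Fold a trained LoRA branch into the base weight: W' = W + scale * B @ A.
    After merging both C-LoRA and D-LoRA this way, inference runs at the
    cost of the base model alone."""
    return w + scale * (lora_b @ lora_a)

def restore_video(frames, restore_pair):
    """Sliding-window inference: each HQ frame is produced from the current
    LQ frame and its predecessor (the first frame is paired with itself)."""
    out, prev = [], frames[0]
    for cur in frames:
        out.append(restore_pair(prev, cur))
        prev = cur
    return out
```

Merging rather than keeping the LoRA branches as separate adapters is what keeps the per-frame cost identical to the plain SD UNet.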
Implementation Details:
Backbone: Pre-trained Stable Diffusion V2.1.
Training: Batch size 16, sequence length 3, resolution 512×512, on 4 NVIDIA A100 GPUs, using the Adam optimizer with a learning rate of 5×10^−5.
Datasets:
Consistency Stage: REDS dataset and curated videos from Pexels (44,162 frames).
Enhancement Stage: LSDIR dataset, with simulated video sequences generated by random pixel-level translations.
Evaluation Metrics: PSNR, SSIM, LPIPS, DISTS, MUSIQ, MANIQA, CLIPIQA, DOVER, and the average warping error (E*_warp) for temporal consistency.
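The warping-error metric is the only non-standard one in the list; it is commonly computed as the mean difference between each frame and its predecessor warped toward it. A sketch under that assumption, with `flows` and `warp` standing in for a flow estimator and a flow-based warping operator:

```python
import numpy as np

def warping_error(frames, flows, warp):
    """Average warping error: mean absolute difference between each frame
    and its flow-warped predecessor, averaged over the sequence."""
    errs = [np.abs(frames[i + 1] - warp(frames[i], flows[i])).mean()
            for i in range(len(frames) - 1)]
    return float(np.mean(errs))
```

A perfectly consistent static sequence scores zero; flicker and motion artifacts raise the score.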
Results and Contributions:
Performance: DLoRAL achieves state-of-the-art performance on Real-VSR benchmarks, outperforming existing methods in perceptual quality (LPIPS, DISTS, MUSIQ, CLIPIQA, MANIQA, DOVER) while maintaining good temporal consistency (E*_warp).
Efficiency: Due to one-step diffusion and LoRA merging, DLoRAL is significantly faster (e.g., ~10× faster than Upscale-A-Video and MGLD-VSR) and has a parameter count comparable to other efficient methods such as OSEDiff.
Qualitative Results: Visual comparisons show DLoRAL produces sharper details, better facial reconstruction, and more legible textures compared to other methods, while temporal profiles indicate smoother transitions.
User Study: DLoRAL was overwhelmingly preferred by human evaluators (93 out of 120 votes) against three other diffusion-based Real-VSR methods for its balance of perceptual quality and temporal consistency.
Main Contributions:
A Dual LoRA Learning (DLoRAL) paradigm for Real-VSR that decouples temporal consistency and spatial detail learning into two dedicated LoRA modules within a one-step diffusion framework.
A Cross-Frame Retrieval (CFR) module to extract degradation-robust temporal priors from LQ inputs, guiding both C-LoRA and D-LoRA training.
State-of-the-art performance in Real-VSR, achieving both realistic details and temporal stability efficiently.
Limitations:
The 8× downsampling VAE inherited from SD makes it difficult to restore very fine-scale details (e.g., small text).
The VAE's heavy compression might disrupt temporal coherence, making robust consistency prior extraction harder. The authors suggest a VAE specifically designed for Real-VSR could mitigate this.
In essence, DLoRAL provides a practical and effective approach to Real-VSR by cleverly managing the conflicting objectives of detail enhancement and temporal consistency through a dual LoRA architecture and a staged training strategy, all while ensuring efficient inference via a one-step diffusion process.