DicFace: Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration

Published 16 Jun 2025 in cs.CV | (2506.13355v1)

Abstract: Video face restoration faces a critical challenge in maintaining temporal consistency while recovering fine facial details from degraded inputs. This paper presents a novel approach that extends Vector-Quantized Variational Autoencoders (VQ-VAEs), pretrained on static high-quality portraits, into a video restoration framework through variational latent space modeling. Our key innovation lies in reformulating discrete codebook representations as Dirichlet-distributed continuous variables, enabling probabilistic transitions between facial features across frames. A spatio-temporal Transformer architecture jointly models inter-frame dependencies and predicts latent distributions, while a Laplacian-constrained reconstruction loss combined with perceptual (LPIPS) regularization enhances both pixel accuracy and visual quality. Comprehensive evaluations on blind face restoration, video inpainting, and facial colorization tasks demonstrate state-of-the-art performance. This work establishes an effective paradigm for adapting intensive image priors, pretrained on high-quality images, to video restoration while addressing the critical challenge of flicker artifacts. The source code has been open-sourced and is available at https://github.com/fudan-generative-vision/DicFace.


Summary

The paper introduces DicFace, which reformulates discrete codebook representations as Dirichlet-distributed continuous variables to enable smooth temporal transitions.
It integrates a spatio-temporal Transformer and Laplacian-constrained loss, yielding significant improvements in PSNR and LPIPS metrics.
Experiments on benchmarks like VFHQ confirm DicFace’s effectiveness in mitigating flickers and enhancing spatial consistency in video face restoration.


Introduction

The paper "DicFace: Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration" addresses video face restoration, where the goal is to recover fine facial details from degraded inputs while keeping the output temporally consistent. It extends Vector-Quantized Variational Autoencoders (VQ-VAEs), pretrained on static high-quality images, to video restoration by modeling the latent space variationally. The authors reformulate the discrete codebook representation as Dirichlet-distributed continuous variables, which allows smooth transitions between frames and reduces temporal flicker. A spatio-temporal Transformer models inter-frame dependencies and predicts the latent distributions. Training uses a Laplacian-constrained reconstruction loss combined with a perceptual (LPIPS) term, targeting both pixel accuracy and visual quality.
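As a rough illustration of what a Laplacian-constrained reconstruction term can look like, the sketch below computes the negative log-likelihood of target pixels under a Laplace distribution centered on the prediction. With a fixed scale this reduces to an L1 loss up to a constant. The function names and the fixed `scale` parameter are assumptions made for illustration; the paper's exact loss and its weighting against LPIPS are not given in this summary.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences of pixel values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def laplace_nll(pred, target, scale=1.0):
    """Mean negative log-likelihood of target under Laplace(pred, scale).

    Per pixel: |target - pred| / scale + log(2 * scale).
    `scale` is a hypothetical fixed hyperparameter, not a value from the paper.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return l1_loss(pred, target) / scale + math.log(2.0 * scale)


# Toy usage: two pixels, each off by 0.5.
pred = [0.5, 0.5]
target = [0.0, 1.0]
nll = laplace_nll(pred, target, scale=0.5)
```

Because the Laplace density has heavier tails than a Gaussian, the resulting absolute-error penalty is less sensitive to outlier pixels than a squared-error loss.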

Framework Overview

The framework processes a sequence of degraded video frames with three core components: an encoder that extracts spatial features, a spatio-temporal Transformer that models dependencies within and across frames, and a decoder that reconstructs high-quality frames from a weighted mixture of learnable codebook entries.

Figure 1: Overview of the DicFace framework with encoder, Transformer, and decoder components processing low-quality frames.

The Transformer captures both spatial and temporal dependencies and predicts the parameters of a Dirichlet distribution over the mixture weights of the latent codes. Because these weights vary continuously, the latent representation can change smoothly from frame to frame, which improves temporal coherence. The continuous formulation is trained with an evidence lower bound (ELBO) objective that balances reconstruction fidelity against regularization of the latent distribution.
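The idea of a Dirichlet-weighted mixture of codebook entries can be sketched in a few lines. The example below draws mixture weights from a Dirichlet distribution (by normalizing Gamma draws) and forms a latent vector as a convex combination of codebook vectors. The tiny codebook, the concentration values in `alpha`, and all function names are hypothetical; in the paper the concentrations would be predicted by the Transformer.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_k, 1) draws."""
    if any(a <= 0 for a in alpha):
        raise ValueError("all concentration parameters must be positive")
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_mean(alpha):
    """Expected mixture weights: alpha_k / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


def mix_codebook(weights, codebook):
    """Convex combination of codebook vectors under the given weights."""
    dim = len(codebook[0])
    return [sum(w * entry[d] for w, entry in zip(weights, codebook))
            for d in range(dim)]


# Hypothetical 3-entry codebook of 2-D vectors and predicted concentrations.
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = [8.0, 1.0, 1.0]

rng = random.Random(0)
weights = sample_dirichlet(alpha, rng)       # non-negative, sums to 1
latent = mix_codebook(weights, codebook)     # stochastic latent
mean_latent = mix_codebook(dirichlet_mean(alpha), codebook)  # deterministic latent
```

A larger concentration on one entry pulls the latent toward that code, while more balanced concentrations give a blend, so small changes in `alpha` produce small changes in the latent.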

Methodology

The method section first describes the vector-quantized autoencoder, which reconstructs images from latent feature maps, and then replaces its discrete code assignments with continuous embeddings that are modeled probabilistically using a Dirichlet distribution.
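For orientation, a generic ELBO with a Dirichlet posterior over mixture weights can be written as follows. The symbols are assumptions introduced here, not the paper's notation: $w$ denotes the mixture weights, $x$ the degraded input, $y$ the high-quality target, $\alpha$ the predicted concentrations, and $\beta$ the prior concentrations. The KL term between two Dirichlet distributions has the standard closed form shown, with $\alpha_0 = \sum_k \alpha_k$, $\beta_0 = \sum_k \beta_k$, and $\psi$ the digamma function.

```latex
\mathcal{L}_{\mathrm{ELBO}}
  = \mathbb{E}_{q_\phi(w \mid x)}\!\left[\log p_\theta(y \mid w)\right]
  - \mathrm{KL}\!\left(q_\phi(w \mid x) \,\|\, p(w)\right),
\qquad
q_\phi(w \mid x) = \mathrm{Dir}(w; \alpha), \quad p(w) = \mathrm{Dir}(w; \beta)

\mathrm{KL}\!\left(\mathrm{Dir}(\alpha) \,\|\, \mathrm{Dir}(\beta)\right)
  = \ln \Gamma(\alpha_0) - \sum_{k} \ln \Gamma(\alpha_k)
  - \ln \Gamma(\beta_0) + \sum_{k} \ln \Gamma(\beta_k)
  + \sum_{k} (\alpha_k - \beta_k)\left(\psi(\alpha_k) - \psi(\alpha_0)\right)
```

Under a Laplace likelihood for $p_\theta(y \mid w)$, the expected log-likelihood term corresponds to an absolute-error reconstruction loss up to constants.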

Figure 2: Facial restoration model exhibits improved performance over state-of-the-art single-task solutions.

This reformulation allows smoother transitions across video frames. With hard code assignments, a small change in the encoder features between frames can switch the selected codebook entry and produce an abrupt change in the output. Embedding the codebook in a variational framework with continuous weights reduces this source of flicker.
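The difference between hard and continuous code assignment can be seen in a toy example. Below, two consecutive frames produce nearly identical features that straddle the boundary between two codes: nearest-neighbor lookup jumps from one code to the other, whereas a soft convex combination barely moves. The softmax-over-distances weighting is a simple stand-in chosen for illustration, not the paper's Dirichlet prediction, and the 1-D codebook and `temperature` value are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_assign(feature, codebook):
    """Standard VQ lookup: return the nearest codebook entry."""
    idx = min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))
    return codebook[idx]


def soft_assign(feature, codebook, temperature=0.5):
    """Convex combination of codes with softmax weights over negative distances."""
    logits = [-sq_dist(feature, c) / temperature for c in codebook]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    return [sum(w * c[d] for w, c in zip(weights, codebook))
            for d in range(len(feature))]


codebook = [[0.0], [1.0]]
frame_a, frame_b = [0.49], [0.51]  # nearly identical consecutive features

hard_jump = abs(hard_assign(frame_b, codebook)[0] - hard_assign(frame_a, codebook)[0])
soft_jump = abs(soft_assign(frame_b, codebook)[0] - soft_assign(frame_a, codebook)[0])
```

Here the hard assignment changes by the full distance between the two codes, while the soft assignment changes by roughly the same small amount as the input features.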

Results and Comparisons

DicFace is evaluated on benchmark datasets such as VFHQ across three tasks: blind face restoration, inpainting, and colorization. The reported results show higher PSNR and lower LPIPS than competing methods, along with better temporal consistency.
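For readers unfamiliar with PSNR, the following minimal sketch computes it from the mean squared error between a reference and a restored signal, using the standard definition 10·log10(MAX² / MSE). It operates on flat lists of pixel values in [0, 1]; the paper's exact evaluation protocol (color space, cropping, averaging over frames) is not specified in this summary, and LPIPS is omitted because it requires a pretrained network.

```python
import math


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and equally long")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy usage: every pixel is off by 0.1, so MSE = 0.01.
reference = [0.0, 0.0, 0.0, 0.0]
restored = [0.1, 0.1, 0.1, 0.1]
score = psnr(reference, restored)
```

Higher PSNR indicates a smaller pixel-wise error, whereas LPIPS is a perceptual distance for which lower is better.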

Figure 3: Qualitative comparison with state-of-the-art methods, demonstrating superior detail recovery and temporal consistency across challenging conditions.

Temporal Stability and Invariance

The paper also reports improved temporal stability for DicFace, supported by better TLME scores and by qualitative comparisons under occlusions and dynamic poses.

Figure 4: Comparison of temporal stability via MSE at facial landmarks, emphasizing improved consistency with DicFace.
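A landmark-based comparison of this kind can be sketched as a per-frame mean squared error between landmark positions in the restored video and in a reference video. The sketch assumes landmarks have already been extracted by some external detector and are given as lists of (x, y) tuples per frame; the function name and data layout are hypothetical, and this is not claimed to be the paper's TLME definition.

```python
def landmark_mse_per_frame(restored, reference):
    """Per-frame mean squared Euclidean distance between matching landmarks.

    restored, reference: lists of frames; each frame is a list of (x, y) tuples
    in the same landmark order.
    """
    if len(restored) != len(reference):
        raise ValueError("videos must have the same number of frames")
    errors = []
    for frame_r, frame_g in zip(restored, reference):
        if len(frame_r) != len(frame_g) or not frame_r:
            raise ValueError("frames must have matching, non-empty landmark sets")
        total = sum((xr - xg) ** 2 + (yr - yg) ** 2
                    for (xr, yr), (xg, yg) in zip(frame_r, frame_g))
        errors.append(total / len(frame_r))
    return errors


# Toy usage: two frames with two landmarks each; the second frame drifts by one pixel.
reference = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 1.0)]]
restored = [[(0.0, 0.0), (1.0, 1.0)], [(1.0, 0.0), (1.0, 2.0)]]
per_frame = landmark_mse_per_frame(restored, reference)
```

Plotting such per-frame errors over time makes jitter visible as spikes, while a temporally stable method yields a flatter curve.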

Ablation Studies

Ablation studies isolate the contribution of the main design choices and indicate that the Dirichlet-based variational modeling is important for temporal coherence across video frames.

Figure 5: Mitigated temporal jitter in restoration, highlighting DicFace's enhancement of spatial consistency and temporal continuity.

Conclusion

The work presents a framework for transferring pretrained image priors to video, addressing the long-standing problem of flicker artifacts in video face restoration. By connecting discrete codebook representations with continuous video dynamics through a probabilistic formulation, DicFace achieves coherent, high-fidelity video restoration and suggests directions for further research in generative video models.
