HyperReenact: One-Shot Face Reenactment
- HyperReenact is a neural face reenactment framework that uses a hypernetwork on top of StyleGAN2 to achieve one-shot, artifact-free pose retargeting.
- It employs a curriculum learning strategy through inversion, self, and cross reenactment phases to progressively refine identity features and desired facial pose.
- Experimental evaluations on VoxCeleb benchmarks demonstrate state-of-the-art performance in fidelity, identity preservation, and robustness under extreme pose variations.
HyperReenact is a neural face reenactment framework designed to generate realistic talking head images of a source identity, driven by a target facial pose (including 3D head orientation and expression), from a single source frame (one-shot) without requiring subject-specific fine-tuning. HyperReenact directly leverages the photorealistic generation and disentanglement properties of a pretrained StyleGAN2 generator, in conjunction with a hypernetwork that jointly refines identity details and retargets facial pose. This approach eliminates reliance on external latent-code editing methods that often produce artifacts, particularly in the presence of extreme head pose variations. HyperReenact establishes state-of-the-art performance for both self and cross-subject reenactment on standard video benchmarks, demonstrating robustness and fidelity under challenging conditions (Bounareli et al., 2023).
1. Problem Formulation and Motivation
The one-shot reenactment task is defined as follows: given a single source frame and a target frame (potentially from different identities), synthesize a new image that preserves the identity of the source but exhibits the facial pose of the target. Previous state-of-the-art approaches fall into two categories: encoder-decoder/flow-based architectures, which often produce artifacts under significant pose changes, and GAN-based methods that operate in the latent space but suffer from the “reconstruction–editability trade-off” when subjected to global edits such as large head rotations.
HyperReenact addresses these limitations by (1) using a pretrained StyleGAN2 generator whose extended latent space is known to support disentanglement of identity and pose and (2) introducing a task-specific hypernetwork that adapts the generator weights to simultaneously refine identity features and achieve pose retargeting, without the need for subject-specific fine-tuning or external editing modules.
2. Methodological Framework
HyperReenact follows a curriculum learning schedule spanning three stages:
- Phase 1 (Real-Image Inversion): Using identical source and target frames, the method learns to recover identity features that are typically lost during off-the-shelf inversion (e.g., with e4e) by updating the StyleGAN2 weights.
- Phase 2 (Self-Reenactment): Training proceeds with pairs of frames from the same subject that differ in pose, teaching the model to transfer pose while preserving identity.
- Phase 3 (Cross-Subject Reenactment): Distinct-identity pairs are used to suppress identity leakage from the target and support robust cross-subject transfer.
The StyleGAN2 generator and inversion encoder remain frozen, with learning restricted to a Reenactment Module (RM) and a hypernetwork that predict multiplicative updates to the generator’s convolution weights. This design enables HyperReenact to perform realistic reenactment across substantial pose differences without observable artifacts.
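The three-phase curriculum above amounts to changing only how training pairs are sampled while the generator and encoder stay frozen. A minimal sketch of that pair-sampling logic follows; the function and phase names are ours for illustration, not the paper's implementation:

```python
import random

def sample_pair(dataset, phase):
    """Sample a (source, target) training pair for one curriculum phase.
    dataset: dict mapping subject_id -> list of frames."""
    if phase == "inversion":
        # Phase 1: source == target, so the model learns to restore
        # identity detail lost by off-the-shelf inversion.
        subj = random.choice(list(dataset))
        frame = random.choice(dataset[subj])
        return frame, frame
    if phase == "self":
        # Phase 2: same subject, two different frames (different poses).
        subj = random.choice(list(dataset))
        src, tgt = random.sample(dataset[subj], 2)
        return src, tgt
    if phase == "cross":
        # Phase 3: different subjects, to suppress identity leakage
        # from the target into the output.
        s_subj, t_subj = random.sample(list(dataset), 2)
        return random.choice(dataset[s_subj]), random.choice(dataset[t_subj])
    raise ValueError(f"unknown phase: {phase}")
```

In training, the phase would advance on a fixed schedule, so each stage builds on representations learned in the previous one.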
3. Latent Inversion and Hypernetwork Architecture
For a real image, e4e is used to obtain a latent code in the extended latent space W+, which is then passed to the frozen generator to reconstruct the image. Rather than editing latent codes directly (which is unreliable for extreme pose modification), HyperReenact uses a hypernetwork that modifies selected convolutional layers of StyleGAN2.
Key architectural elements:
- Feature Extraction:
  - Source identity features: a 7×7 feature map extracted from a pretrained ArcFace face-recognition network.
  - Target pose features: a 7×7 feature map extracted from a pretrained DECA 3D face-reconstruction network.
- Feature Fusion (Reenactment Module): The two feature maps are blended through learned per-channel scale and shift parameters, in the style of SPADE, producing a joint driving feature.
- Reenactment Blocks (RBs): The hypernetwork comprises one Reenactment Block per modified StyleGAN2 convolutional layer; each block predicts a kernel-shaped offset Δθ_l that is applied multiplicatively to the original kernel θ_l, i.e., θ̂_l = θ_l ⊙ (1 + Δθ_l). Setting Δθ_l = 0 recovers the pretrained weights, so this formulation allows smooth interpolation between the original and fully adapted weights.
- Image Synthesis: The reenacted image is synthesized by passing the source latent code through the generator with the updated weights, ŷ = G(w; θ̂), where the updated weights θ̂ encode both identity refinement and pose retargeting, achieving artifact-free reenactment even under extreme pose shifts.
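The two core operations of Section 3, SPADE-style feature fusion and the multiplicative weight update, can be sketched in a few lines. The shapes and helper names below are our assumptions; in the actual model, the scale/shift parameters and the offsets are predicted by learned convolutional networks:

```python
import numpy as np

def spade_fuse(f_id, gamma, beta):
    """SPADE-style fusion: modulate the source identity feature map
    with pose-conditioned per-channel scale (gamma) and shift (beta).
    f_id is (C, 7, 7); gamma and beta broadcast as (C, 1, 1)."""
    return gamma * f_id + beta

def apply_offset(theta, delta):
    """Reenactment Block update: theta_hat = theta * (1 + delta).
    delta = 0 leaves the pretrained kernel untouched, so the update
    interpolates smoothly between original and adapted weights."""
    return theta * (1.0 + delta)
```

Because the update is multiplicative around the pretrained weights rather than a replacement, small predicted offsets perturb the frozen generator only gently, which is what preserves its photorealistic prior.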
4. Optimization and Training Protocol
HyperReenact’s end-to-end training (with the StyleGAN2 generator and e4e encoder frozen) uses a composite loss function combining pixel-level, perceptual, identity, shape, and gaze consistency terms. Specifically:
- Pixel-wise loss (L_pix): penalizes per-pixel reconstruction errors.
- LPIPS loss (L_lpips): encourages perceptual similarity.
- Identity loss (L_id): cosine embedding similarity using ArcFace features.
- Shape (3DMM) loss (L_sh): penalizes 3D facial shape mismatches via DECA.
- Gaze loss (L_gaze): ensures gaze-direction consistency.
The full objective in the inversion phase is a weighted sum of these terms, L = λ_pix·L_pix + λ_lpips·L_lpips + λ_id·L_id + λ_sh·L_sh + λ_gaze·L_gaze, with additional terms added for later phases; the specific weight values λ follow the original paper (Bounareli et al., 2023).
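A minimal sketch of how such a composite objective is assembled is shown below. The helper names and embedding shapes are ours; the real losses operate on network activations (LPIPS features, ArcFace embeddings, DECA shape parameters) rather than raw arrays:

```python
import numpy as np

def pixel_loss(y_hat, y):
    """L_pix: mean absolute (L1) reconstruction error between images."""
    return float(np.abs(y_hat - y).mean())

def identity_loss(e_hat, e_src):
    """L_id: 1 - cosine similarity between ArcFace-style embeddings
    of the reenacted image and the source."""
    cos = e_hat @ e_src / (np.linalg.norm(e_hat) * np.linalg.norm(e_src))
    return float(1.0 - cos)

def total_loss(terms, weights):
    """Weighted sum L = sum_k lambda_k * L_k over the active terms."""
    return sum(weights[k] * terms[k] for k in terms)
```

The phase-dependent objective then reduces to changing which keys appear in `terms` as the curriculum advances.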
5. Experimental Evaluation and Benchmarks
Experiments are conducted on VoxCeleb1 and VoxCeleb2 at 256×256 resolution, using pretrained StyleGAN2 and e4e models tailored to VoxCeleb1. The effectiveness of HyperReenact is evaluated with the following metrics:
| Metric | Description |
|---|---|
| CSIM | Cosine similarity of ArcFace embeddings (identity) |
| LPIPS | Learned perceptual similarity (reconstruction) |
| FID / FVD | Fréchet distances for realism/temporal consistency |
| APD | Average pose distance (degrees) |
| AED | Average expression distance |
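The two most frequently cited metrics in the table can be sketched directly. The function names and input conventions are ours, assuming precomputed ArcFace embeddings and head-pose angles in degrees:

```python
import numpy as np

def csim(e_a, e_b):
    """CSIM: cosine similarity between ArcFace identity embeddings
    of the source and the reenacted image (higher is better)."""
    return float(e_a @ e_b / (np.linalg.norm(e_a) * np.linalg.norm(e_b)))

def apd(pose_a, pose_b):
    """APD: mean absolute difference of head-pose angles
    (yaw, pitch, roll) in degrees (lower is better)."""
    return float(np.mean(np.abs(np.asarray(pose_a) - np.asarray(pose_b))))
```

LPIPS, FID, and FVD follow the same pattern but require the respective pretrained feature networks, so they are omitted from this sketch.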
HyperReenact is compared to baselines such as X2Face, FOMM, Neural, Fast BL, PIR, LSR, FD, LIA, Dual, ROME, StyleHEAT, and StyleMask. Notable results include:
- Self-Reenactment (VoxCeleb1): Highest CSIM (0.71), best APD (0.5°), and best AED (5.1); among the top two in LPIPS, FID, and FVD.
- Cross-Subject Reenactment: Highest CSIM (0.68), best APD (0.5°), and best AED (equivalent to best baseline).
- Extreme Pose Transfer (>15°): Highest CSIM (0.58 vs. 0.53) and lowest APD (0.9° vs. 1.1°).
- User Study (30 participants, 20 self and 20 cross pairs): HyperReenact is preferred in 67.8% of cases, significantly higher than competitors.
6. Ablation Studies, Insights, and Limitations
Ablation experiments demonstrate:
- Curriculum Learning: Progressing from inversion to self and then cross reenactment increases CSIM by 0.02 and reduces APD and AED by approximately 0.2° and 1.6, respectively.
- Cross-Subject Fine-Tuning: Incorporating mixed cross-subject batches boosts cross-subject CSIM from 0.53 to 0.68.
- Gaze Loss: Consistently reduces gaze-direction errors by 0.1–0.15 radians.
Key limitations include reduced fidelity for rare accessories (e.g., hats, eyeglasses) and inaccurate background details, likely due to their underrepresentation in the VoxCeleb datasets. The approach inherits StyleGAN2’s fixed background and lighting assumptions, constraining variability in those domains. HyperReenact is thus most suited to scenarios where background and illumination conditions match the StyleGAN2 training distribution.
7. Concluding Perspective and Theoretical Implications
HyperReenact represents the first approach to one-shot face reenactment that jointly refines source identity inversion and retargets pose by learning to update the weights of a frozen StyleGAN2 via a dedicated hypernetwork. This design delivers state-of-the-art quantitative results across benchmarks and provides robustness to extreme pose variations, all without any per-identity fine-tuning. The methodology provides a framework for subsequent work in generative manipulation by combining disentangled latent representations, hypernetworks for conditional editing, and curriculum learning protocols (Bounareli et al., 2023).