HyperReenact: One-Shot Face Reenactment
- HyperReenact is a neural face reenactment framework that uses a hypernetwork on top of StyleGAN2 to achieve one-shot, artifact-free pose retargeting.
- It employs a curriculum learning strategy through inversion, self, and cross reenactment phases to progressively refine identity features and desired facial pose.
- Experimental evaluations on VoxCeleb benchmarks demonstrate state-of-the-art performance in fidelity, identity preservation, and robustness under extreme pose variations.
HyperReenact is a neural face reenactment framework designed to generate realistic talking head images of a source identity, driven by a target facial pose (including 3D head orientation and expression), from a single source frame (one-shot) without requiring subject-specific fine-tuning. HyperReenact directly leverages the photorealistic generation and disentanglement properties of a pretrained StyleGAN2 generator, in conjunction with a hypernetwork that jointly refines identity details and retargets facial pose. This approach eliminates reliance on external latent-code editing methods that often produce artifacts, particularly in the presence of extreme head pose variations. HyperReenact establishes state-of-the-art performance for both self and cross-subject reenactment on standard video benchmarks, demonstrating robustness and fidelity under challenging conditions (Bounareli et al., 2023).
1. Problem Formulation and Motivation
The one-shot reenactment task is defined as follows: given a single source frame and a target frame (potentially from different identities), synthesize a new image that preserves the identity of the source but exhibits the facial pose of the target. Previous state-of-the-art approaches fall into two categories: encoder-decoder/flow-based architectures, which often produce artifacts under significant pose changes, and GAN-based methods that operate in the latent space but suffer from the “reconstruction–editability trade-off” when subjected to global edits such as large head rotations.
HyperReenact addresses these limitations by (1) using a pretrained StyleGAN2 generator whose extended latent space is known to support disentanglement of identity and pose and (2) introducing a task-specific hypernetwork that adapts the generator weights to simultaneously refine identity features and achieve pose retargeting, without the need for subject-specific fine-tuning or external editing modules.
2. Methodological Framework
HyperReenact follows a curriculum learning schedule spanning three stages:
- Phase 1 (Real-Image Inversion): Using identical source and target frames, the method learns to recover identity features that are typically lost during off-the-shelf inversion (e.g., with e4e) by updating the StyleGAN2 weights.
- Phase 2 (Self-Reenactment): Training proceeds with pairs of frames from the same subject that differ in pose, teaching the model to transfer pose while preserving identity.
- Phase 3 (Cross-Subject Reenactment): Distinct-identity pairs are used to suppress identity leakage from the target and support robust cross-subject transfer.
The StyleGAN2 generator and inversion encoder remain frozen, with learning restricted to a Reenactment Module (RM) and a hypernetwork that predict multiplicative updates to the generator’s convolution weights. This design enables HyperReenact to perform realistic reenactment across substantial pose differences without observable artifacts.
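The three-phase curriculum above amounts to changing only how training pairs are sampled while the generator and encoder stay frozen. A minimal sketch of that pair-sampling logic follows; the function and phase names are ours for illustration, not the paper's implementation:

```python
import random

def sample_pair(dataset, phase):
    """Sample a (source, target) training pair for one curriculum phase.
    dataset: dict mapping subject_id -> list of frames."""
    if phase == "inversion":
        # Phase 1: source == target, so the model learns to restore
        # identity detail lost by off-the-shelf inversion.
        subj = random.choice(list(dataset))
        frame = random.choice(dataset[subj])
        return frame, frame
    if phase == "self":
        # Phase 2: same subject, two different frames (different poses).
        subj = random.choice(list(dataset))
        src, tgt = random.sample(dataset[subj], 2)
        return src, tgt
    if phase == "cross":
        # Phase 3: different subjects, to suppress identity leakage
        # from the target into the output.
        s_subj, t_subj = random.sample(list(dataset), 2)
        return random.choice(dataset[s_subj]), random.choice(dataset[t_subj])
    raise ValueError(f"unknown phase: {phase}")
```

In training, the phase would advance on a fixed schedule, so each stage builds on representations learned in the previous one.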
3. Latent Inversion and Hypernetwork Architecture
For a real image, e4e is used to obtain a latent code in the extended latent space W+, which is then passed to the frozen generator to reconstruct the image. Rather than editing latent codes directly (which is unreliable for extreme pose modification), HyperReenact uses a hypernetwork that modifies selected convolutional layers of StyleGAN2.
Key architectural elements:
- Feature Extraction:
  - Source identity features: a 7×7 feature map extracted from a pretrained ArcFace face-recognition network.
  - Target pose features: a 7×7 feature map extracted from a pretrained DECA 3D face-reconstruction network.
- Feature Fusion (Reenactment Module): The two feature maps are blended through learned per-channel scale and shift parameters, in the style of SPADE, producing a joint driving feature.
- Reenactment Blocks (RBs): The hypernetwork comprises one Reenactment Block per modified StyleGAN2 convolutional layer; each block predicts a kernel-shaped offset Δθ_l that is applied multiplicatively to the original kernel θ_l, i.e., θ̂_l = θ_l ⊙ (1 + Δθ_l). Setting Δθ_l = 0 recovers the pretrained weights, so this formulation allows smooth interpolation between the original and fully adapted weights.
- Image Synthesis: The reenacted image is synthesized by passing the source latent code through the generator with the updated weights, ŷ = G(w; θ̂), where the updated weights θ̂ encode both identity refinement and pose retargeting, achieving artifact-free reenactment even under extreme pose shifts.
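The two core operations of Section 3, SPADE-style feature fusion and the multiplicative weight update, can be sketched in a few lines. The shapes and helper names below are our assumptions; in the actual model, the scale/shift parameters and the offsets are predicted by learned convolutional networks:

```python
import numpy as np

def spade_fuse(f_id, gamma, beta):
    """SPADE-style fusion: modulate the source identity feature map
    with pose-conditioned per-channel scale (gamma) and shift (beta).
    f_id is (C, 7, 7); gamma and beta broadcast as (C, 1, 1)."""
    return gamma * f_id + beta

def apply_offset(theta, delta):
    """Reenactment Block update: theta_hat = theta * (1 + delta).
    delta = 0 leaves the pretrained kernel untouched, so the update
    interpolates smoothly between original and adapted weights."""
    return theta * (1.0 + delta)
```

Because the update is multiplicative around the pretrained weights rather than a replacement, small predicted offsets perturb the frozen generator only gently, which is what preserves its photorealistic prior.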
4. Optimization and Training Protocol
HyperReenact’s end-to-end training (with the StyleGAN2 generator and e4e encoder frozen) uses a composite loss function combining pixel-level, perceptual, identity, shape, and gaze consistency terms. Specifically:
- Pixel-wise loss (L_pix): penalizes per-pixel reconstruction errors.
- LPIPS loss (L_lpips): encourages perceptual similarity.
- Identity loss (L_id): cosine embedding similarity using ArcFace features.
- Shape (3DMM) loss (L_sh): penalizes 3D facial shape mismatches via DECA.
- Gaze loss (L_gaze): ensures gaze-direction consistency.
The full objective in the inversion phase is a weighted sum of these terms, L = λ_pix·L_pix + λ_lpips·L_lpips + λ_id·L_id + λ_sh·L_sh + λ_gaze·L_gaze, with additional terms added for later phases; the specific weight values λ follow the original paper (Bounareli et al., 2023).
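A minimal sketch of how such a composite objective is assembled is shown below. The helper names and embedding shapes are ours; the real losses operate on network activations (LPIPS features, ArcFace embeddings, DECA shape parameters) rather than raw arrays:

```python
import numpy as np

def pixel_loss(y_hat, y):
    """L_pix: mean absolute (L1) reconstruction error between images."""
    return float(np.abs(y_hat - y).mean())

def identity_loss(e_hat, e_src):
    """L_id: 1 - cosine similarity between ArcFace-style embeddings
    of the reenacted image and the source."""
    cos = e_hat @ e_src / (np.linalg.norm(e_hat) * np.linalg.norm(e_src))
    return float(1.0 - cos)

def total_loss(terms, weights):
    """Weighted sum L = sum_k lambda_k * L_k over the active terms."""
    return sum(weights[k] * terms[k] for k in terms)
```

The phase-dependent objective then reduces to changing which keys appear in `terms` as the curriculum advances.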
5. Experimental Evaluation and Benchmarks
Experiments are conducted on VoxCeleb1 and VoxCeleb2 at 256×256 resolution, using pretrained StyleGAN2 and e4e models tailored to VoxCeleb1. The effectiveness of HyperReenact is evaluated with the following metrics:
| Metric | Description |
|---|---|
| CSIM | Cosine similarity of ArcFace embeddings (identity) |
| LPIPS | Learned perceptual similarity (reconstruction) |
| FID / FVD | Fréchet distances for realism/temporal consistency |
| APD | Average pose distance (degrees) |
| AED | Average expression distance |
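The two most frequently cited metrics in the table can be sketched directly. The function names and input conventions are ours, assuming precomputed ArcFace embeddings and head-pose angles in degrees:

```python
import numpy as np

def csim(e_a, e_b):
    """CSIM: cosine similarity between ArcFace identity embeddings
    of the source and the reenacted image (higher is better)."""
    return float(e_a @ e_b / (np.linalg.norm(e_a) * np.linalg.norm(e_b)))

def apd(pose_a, pose_b):
    """APD: mean absolute difference of head-pose angles
    (yaw, pitch, roll) in degrees (lower is better)."""
    return float(np.mean(np.abs(np.asarray(pose_a) - np.asarray(pose_b))))
```

LPIPS, FID, and FVD follow the same pattern but require the respective pretrained feature networks, so they are omitted from this sketch.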
HyperReenact is compared to baselines such as X2Face, FOMM, Neural, Fast BL, PIR, LSR, FD, LIA, Dual, ROME, StyleHEAT, and StyleMask. Notable results include:
- Self-Reenactment (VoxCeleb1): Highest CSIM (0.71), best APD (0.5°), and best AED (5.1); among the top two in LPIPS, FID, and FVD.
- Cross-Subject Reenactment: Highest CSIM (0.68), best APD (0.5°), and best AED (equivalent to best baseline).
- Extreme Pose Transfer (>15°): Highest CSIM (0.58 vs. 0.53) and lowest APD (0.9° vs. 1.1°).
- User Study (30 participants, 20 self and 20 cross pairs): HyperReenact is preferred in 67.8% of cases, significantly higher than competitors.
6. Ablation Studies, Insights, and Limitations
Ablation experiments demonstrate:
- Curriculum Learning: Progressing from inversion to self and then cross reenactment increases CSIM by 0.02 and reduces APD and AED by approximately 0.2° and 1.6, respectively.
- Cross-Subject Fine-Tuning: Incorporating mixed cross-subject batches boosts cross-subject CSIM from 0.53 to 0.68.
- Gaze Loss: Consistently reduces gaze-direction errors by 0.1–0.15 radians.
Key limitations include reduced fidelity for rare accessories (e.g., hats, eyeglasses) and inaccurate background details, likely due to their underrepresentation in the VoxCeleb datasets. The approach inherits StyleGAN2’s fixed background and lighting assumptions, constraining variability in those domains. HyperReenact is thus most suited to scenarios where background and illumination conditions match the StyleGAN2 training distribution.
7. Concluding Perspective and Theoretical Implications
HyperReenact represents the first approach to one-shot face reenactment that jointly refines source identity inversion and retargets pose by learning to update the weights of a frozen StyleGAN2 via a dedicated hypernetwork. This design delivers state-of-the-art quantitative results across benchmarks and provides robustness to extreme pose variations, all without any per-identity fine-tuning. The methodology provides a framework for subsequent work in generative manipulation by combining disentangled latent representations, hypernetworks for conditional editing, and curriculum learning protocols (Bounareli et al., 2023).