
ESGaussianFace: Audio-Driven Facial Animation

Updated 12 January 2026
  • ESGaussianFace is a framework for creating high-fidelity, audio-driven facial animations with emotional and stylistic modulation using 3D Gaussian Splatting.
  • It employs a multi-stage training procedure and spatial attention modules to achieve real-time synthesis with robust lip synchronization and emotion accuracy.
  • The method demonstrates superior image fidelity and emotion classification metrics, outperforming existing 2D and 3D neural rendering baselines.

ESGaussianFace is a technical framework for emotional and stylized audio-driven facial animation, leveraging a 3D Gaussian Splatting backbone for 3D-consistent, high-fidelity, real-time talking head synthesis. It incorporates audio-driven emotion and style modulation using spatial attention, explicit deformation modules, and a staged training procedure. The method achieves state-of-the-art perceptual quality, lip synchronization, emotion accuracy, and style transfer capabilities, and offers efficient multi-view rendering and robust control over both expression and artistic stylization (Ma et al., 5 Jan 2026).

1. 3D Gaussian Splatting Representation in ESGaussianFace

ESGaussianFace represents a canonical 3D face as a set of $N$ anisotropic Gaussian splats:

$$\mathcal G_{\mathrm{can}} = \{ (\mu_i, \Sigma_i, \alpha_i, \mathrm{sh}_i) \mid i = 1, \ldots, N \}$$

where $\mu_i \in \mathbb R^3$ is the 3D center, $\Sigma_i \in \mathbb R^{3 \times 3}$ the spatial covariance (determining anisotropic extent), $\alpha_i \in [0,1]$ the opacity weight, and $\mathrm{sh}_i$ the coefficients for view-dependent radiance encoded in a spherical harmonic basis.

Each Gaussian's contribution at point $x$ and view direction $\omega$ is given by:

$$g_i(x) = \exp\!\left(-\tfrac{1}{2} (x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i)\right)$$

$$c_i(x, \omega) = \mathrm{SH}(\mathrm{sh}_i, \omega) \cdot g_i(x)$$

Rendering is performed by projecting the splats into the image plane, sorting by depth, and performing alpha-compositing:

$$C_{\mathrm{pixel}} = \sum_{i=1}^N c_i \alpha_i' \prod_{j<i}(1 - \alpha_j')$$

where $\alpha_i'$ is the projected per-pixel opacity, computed from the projected 2D covariance.

This explicit representation offers real-time, 3D-consistent, and viewpoint-invariant facial synthesis. Modulating the underlying splats allows parametric animation, stylization, and flexible expression control across frames and cameras (Ma et al., 5 Jan 2026).
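The splatting formulas above can be sketched numerically. This is a minimal NumPy illustration of the Gaussian weight and front-to-back compositing, not the paper's rasterizer; `gaussian_weight` and `composite` are hypothetical helper names introduced here:

```python
import numpy as np

def gaussian_weight(x, mu, cov):
    """Unnormalized Gaussian g_i(x) = exp(-0.5 (x - mu)^T cov^{-1} (x - mu))."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))

def composite(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted splats:
    C_pixel = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    pixel = np.zeros(3)
    transmittance = 1.0  # running prod_{j<i} (1 - a_j)
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= 1.0 - a
    return pixel
```

A splat centered exactly on the query point contributes weight 1, and compositing two half-opaque splats shows the nearer one dominating, as the transmittance product implies.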

2. Emotion-Audio-Guided Spatial Attention

To enable emotion- and style-dependent facial dynamics, ESGaussianFace applies spatial attention mechanisms conditioned on audio and emotion codes. Two streams of features are computed:

  • Audio features: Frame-wise audio is encoded by a pretrained DeepSpeech network, yielding a context window $\mathbf a^{t-l:t+l}$.
  • Emotion features: Expression is encoded from image or video frames using a 3DMM extractor, providing a 64-dimensional code $\mathbf e$, a blink intensity $y$ (AU45), and a pose vector $v \in \mathbb R^{12}$.

These are fused into the per-Gaussian tri-plane features using a stack of $B$ "Emotion–Audio-guided Spatial Attention Modules" (ESAMs). Each module consists of cross-attention layers that alternately attend to audio and emotion features, followed by feed-forward blocks. After $B$ rounds, the updated embedding $\mathbf z_B$ encodes spatial, auditory, and emotional context for each splat:

  • Audio-guided attention:

$$\mathbf z_b' = \mathbf z_{b-1} + \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

  • Emotion-guided attention (layer after audio):

$$\mathbf z_b'' = \mathbf z_b' + \mathrm{softmax}\!\left(\frac{Q' K'^\top}{\sqrt{d}}\right) V'$$

  • Feed-forward residual:

$$\mathbf z_b = \mathbf z_b'' + \mathrm{FFN}(\mathbf z_b'')$$

This alternating cross-attention enables fine-grained spatiotemporal fusion of linguistic and emotional visual cues (Ma et al., 5 Jan 2026).
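One ESAM round can be sketched as follows. This is a structural illustration only: the learned Q/K/V projections and FFN weights of the paper are replaced here by identity maps and a ReLU, so only the residual attention pattern of the three equations above is reproduced:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z, context, d):
    """Single-head cross-attention with identity projections (illustrative):
    z attends to `context` and adds the attended values residually."""
    scores = softmax(z @ context.T / np.sqrt(d))
    return z + scores @ context

def esam_block(z, audio_feat, emo_feat):
    """One ESAM round: audio-guided attention, then emotion-guided
    attention, then a feed-forward residual (toy ReLU stand-in for FFN)."""
    d = z.shape[-1]
    z = cross_attention(z, audio_feat, d)   # z'_b
    z = cross_attention(z, emo_feat, d)     # z''_b
    return z + np.maximum(z, 0.0)           # z_b = z''_b + FFN(z''_b)
```

Stacking this block $B$ times over the per-splat embeddings mirrors the module stack described above.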

3. 3D Gaussian Deformation Predictors

Deformation modules modulate the parameters of each 3D Gaussian to realize audio-driven lip dynamics, emotion-specific expressions, and artistic style:

  • Positional Encoding: Each splat's position $\mu_i$ is embedded via a small MLP for spatial localization.
  • Emotion Deformation Predictor ($P_{\mathrm{emo}}$): Takes the concatenated embedding $[\mathbf z_B; \mathbf e; \mathbf p]$ and regresses offsets in Gaussian parameters (center, scale, rotation, color, opacity):

$$[\Delta\mu;\ \Delta s;\ \Delta q;\ \Delta\mathrm{sh};\ \Delta\alpha] = P_{\mathrm{emo}}([\mathbf z_B; \mathbf e; \mathbf p])$$

yielding deformed Gaussians $\mathcal G_{\mathrm{emo}} = \mathcal G_{\mathrm{can}} + \Delta\mathcal G_{\mathrm{emo}}$.

  • Style Deformation Predictor ($P_{\mathrm{sty}}$): Incorporates a 128-D style code $\mathbf s = E_s(I_s)$ from a reference style image (encoded via StyleGAN). It concatenates $[\mathbf z_B; \mathbf e; \mathbf p; \mathbf s]$ and predicts additional offsets:

$$\mathcal G_{\mathrm{sty}} = \mathcal G_{\mathrm{emo}} + \Delta\mathcal G_{\mathrm{sty}}$$

  • The final stylized, emotional face is rendered as $\hat I_{\mathrm{sty}} = \mathcal R(\mathcal G_{\mathrm{sty}})$.

This multi-stage deformation process allows seamless blending of audio, emotion, and style attributes in the 3D parametric representation (Ma et al., 5 Jan 2026).
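A toy version of the emotion deformation predictor illustrates the input concatenation and output split. The layer sizes, random weights, and a 3-channel (DC-term-only) $\Delta\mathrm{sh}$ slot are assumptions for this sketch, not the paper's architecture:

```python
import numpy as np

def predict_emotion_offsets(z_B, e, p, w1, w2):
    """Regress per-Gaussian offsets from [z_B; e; p], in the spirit of P_emo.
    Output columns: dmu(3) | ds(3) | dq(4) | dsh(3, DC term only) | dalpha(1)."""
    n = z_B.shape[0]
    x = np.concatenate(
        [z_B,
         np.broadcast_to(e, (n, e.shape[0])),   # 64-D expression code, shared
         np.broadcast_to(p, (n, p.shape[0]))],  # pose vector, shared
        axis=1,
    )
    h = np.maximum(x @ w1, 0.0)  # tiny 2-layer MLP stand-in for P_emo
    out = h @ w2                 # shape (n, 14)
    return np.split(out, [3, 6, 10, 13], axis=1)
```

The returned offsets would then be added to the canonical splats, giving $\mathcal G_{\mathrm{emo}} = \mathcal G_{\mathrm{can}} + \Delta\mathcal G_{\mathrm{emo}}$ as above.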

4. Multi-Stage Learning Procedure

Jointly training all modules is unstable, so ESGaussianFace employs a progressive, three-stage strategy:

  • Stage 1 (Neutral + Lip Synchronization): Train the canonical generator and $P_{\mathrm{emo}}$ on neutral videos, using the loss:

$$L_{\mathrm{neu}} = \lambda_1 \|I - \hat I\|_1 + \lambda_2 L_{\mathrm{per}} + \lambda_3 L_{\mathrm{SSIM}} + \lambda_4 L_{\mathrm{lip}} + \lambda_5 L_{\mathrm{ld}} + \lambda_6 L_{\mathrm{smo}}$$

These terms enforce RGB reconstruction, perceptual similarity, SSIM, mouth-region accuracy, landmark matching, and temporal smoothness.

  • Stage 2 (Emotion Variation): Freeze the generator; train the fusion modules and $P_{\mathrm{emo}}$ on videos with varying emotion at fixed audio, enforcing emotion-sensitive yet accurate deformation.
  • Stage 3 (Style Integration): Generate ground-truth stylized videos via VToonify$(I_{\mathrm{emo}}, I_s)$. Train $P_{\mathrm{sty}}$ with a style-alignment loss:

$$L_{\mathrm{sty}} = L_{\mathrm{neu}} + \lambda_{\mathrm{exs}} \|E_s(I_{\mathrm{sty}}) - E_s(\hat I_{\mathrm{sty}})\|_1$$

This sequential schedule isolates the learning of lip-sync, emotion, and style to avoid entanglement and improve stability (Ma et al., 5 Jan 2026).
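The Stage-1 objective is a plain weighted sum. A minimal sketch, assuming the auxiliary terms are computed elsewhere and the $\lambda$ weights are free hyperparameters (their values are not given in the summary):

```python
import numpy as np

def stage1_loss(I, I_hat, aux, lam):
    """L_neu = lam1*||I - I_hat||_1 + lam2*L_per + lam3*L_SSIM
             + lam4*L_lip + lam5*L_ld + lam6*L_smo
    `aux` holds precomputed auxiliary loss values; `lam` has six weights."""
    l1 = np.abs(I - I_hat).mean()  # per-pixel L1 reconstruction term
    terms = [l1] + [aux[k] for k in ("per", "ssim", "lip", "ld", "smo")]
    return sum(w * t for w, t in zip(lam, terms))
```

Stages 2 and 3 reuse this structure, with Stage 3 adding the style-alignment term $\lambda_{\mathrm{exs}} \|E_s(I_{\mathrm{sty}}) - E_s(\hat I_{\mathrm{sty}})\|_1$ on top.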

5. Experimental Protocol and Results

Datasets: MEAD (8 emotions, 60 actors) and RAVDESS (8 emotions, 24 actors) provide emotional audio-video data. Style exemplars come from high-resolution portrait image sets.

Baselines: 2D-pixel methods (MakeItTalk, Wav2Lip, Audio2Head), emotion-driven 2D methods (EAMM, EAT, DreamTalk, EDTalk), 3D NeRF/3DGS (AD-NeRF, SyncTalk, GaussianTalker, Style²Talker).

Metrics:

  • PSNR↑, SSIM↑, FID↓, LPIPS↓ (image fidelity)
  • Sync↑ (SyncNet confidence, lip-sync)
  • LMD↓ (landmark distance)
  • Acc$_{\mathrm{emo}}$↑ (emotion classification accuracy)
  • FPS (inference speed)
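For concreteness, PSNR (the first fidelity metric above) can be computed between a rendered frame and its ground truth as:

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((np.asarray(img, float) - np.asarray(ref, float)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```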

Quantitative Findings (MEAD):

| Method | PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ | Sync↑ | LMD↓ | Acc$_{\mathrm{emo}}$↑ | FPS |
|---|---|---|---|---|---|---|---|---|
| MakeItTalk | 27.60 | 0.60 | 53.09 | 0.059 | 4.50 | 5.21 | 19.38 | 15 |
| Wav2Lip | 27.82 | 0.67 | 46.37 | 0.051 | 5.85 | 5.07 | 16.80 | 15 |
| GaussianTalker | 28.91 | 0.82 | 22.29 | 0.038 | 5.20 | 4.98 | 52.62 | 77 |
| ESGaussianFace | 31.87 | 0.90 | 16.93 | 0.028 | 6.22 | 2.83 | 75.94 | 69 |

Ablation Studies: Removal of modules (e.g., ESAM, position embedding, expression code, multi-stage schedule) consistently degrades all performance metrics, demonstrating the necessity of the full architecture.

Qualitative Observations:

  • Produces sharp lips, accurate emotion-specific details (eyes/eyebrows), temporally stable, and style-accurate results.
  • Supports smooth interpolation between any pair of emotion or style codes, enabling continuous control of expression and appearance.

6. Comparative Context and Technical Impact

ESGaussianFace is the first 3D Gaussian Splatting system to combine emotional and style-conditioned facial animation in an audio-driven, multi-view-consistent setting. In contrast to methods that use only pixel-based approaches or static 3D morphable models, ESGaussianFace allows direct parametric manipulation of the underlying scene representation at the level of geometry and appearance, retaining real-time synthesis capability.

Its technical innovations—audio+emotion fusion via spatial attention, per-Gaussian deformation modules, and staged training—define new best practices in dynamic, stylized facial synthesis. The design achieves substantially higher fidelity, lip synchronization, and emotion accuracy than both 2D and 3D neural rendering baselines, and supports interactive stylization and frame-accurate control (Ma et al., 5 Jan 2026).

A plausible implication is that the principles of ESGaussianFace (modular deformation prediction, attention-based modality fusion, explicit splatting representation) can be extended beyond facial animation to controllable avatars, AR/VR telepresence, data augmentation, and high-fidelity character generation for virtual environments.

