
ESGaussianFace: Audio-Driven Facial Animation

Updated 12 January 2026
  • ESGaussianFace is a framework for creating high-fidelity, audio-driven facial animations with emotional and stylistic modulation using 3D Gaussian Splatting.
  • It employs a multi-stage training procedure and spatial attention modules to achieve real-time synthesis with robust lip synchronization and emotion accuracy.
  • The method demonstrates superior image fidelity and emotion classification metrics, outperforming existing 2D and 3D neural rendering baselines.

ESGaussianFace is a technical framework for emotional and stylized audio-driven facial animation, leveraging a 3D Gaussian Splatting backbone for 3D-consistent, high-fidelity, real-time talking head synthesis. It incorporates audio-driven emotion and style modulation using spatial attention, explicit deformation modules, and a staged training procedure. The method achieves state-of-the-art perceptual quality, lip synchronization, emotion accuracy, and style transfer capabilities, and offers efficient multi-view rendering and robust control over both expression and artistic stylization (Ma et al., 5 Jan 2026).

1. 3D Gaussian Splatting Representation in ESGaussianFace

ESGaussianFace represents a canonical 3D face as a set of $N$ anisotropic Gaussian splats:

$$\mathcal G_{\mathrm{can}} = \{ (\mu_i, \Sigma_i, \alpha_i, \mathrm{sh}_i) \mid i = 1, \ldots, N \}$$

where $\mu_i \in \mathbb R^3$ is the 3D center, $\Sigma_i \in \mathbb R^{3 \times 3}$ the spatial covariance (determining anisotropic extent), $\alpha_i \in [0,1]$ the opacity weight, and $\mathrm{sh}_i$ the coefficients for view-dependent radiance encoded in a spherical harmonic basis.

Each Gaussian's contribution at point $x$ and view direction $\omega$ is given by:

$$g_i(x) = \exp\!\left(-\tfrac{1}{2} (x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i)\right)$$

$$c_i(x, \omega) = \mathrm{SH}(\mathrm{sh}_i, \omega) \cdot g_i(x)$$

Rendering is performed by projecting the splats into the image plane, sorting by depth, and performing alpha-compositing:

$$C_{\mathrm{pixel}} = \sum_{i=1}^N c_i \alpha_i' \prod_{j<i}(1 - \alpha_j')$$

where $\alpha_i'$ is the projected per-pixel opacity, computed from the projected 2D covariance.

This explicit representation offers real-time, 3D-consistent, and viewpoint-invariant facial synthesis. Modulating the underlying splats allows parametric animation, stylization, and flexible expression control across frames and cameras (Ma et al., 5 Jan 2026).
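The splatting formulas above can be sketched numerically. This is a minimal NumPy illustration of the Gaussian weight and front-to-back compositing, not the paper's rasterizer; `gaussian_weight` and `composite` are hypothetical helper names introduced here:

```python
import numpy as np

def gaussian_weight(x, mu, cov):
    """Unnormalized Gaussian g_i(x) = exp(-0.5 (x - mu)^T cov^{-1} (x - mu))."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))

def composite(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted splats:
    C_pixel = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    pixel = np.zeros(3)
    transmittance = 1.0  # running prod_{j<i} (1 - a_j)
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= 1.0 - a
    return pixel
```

A splat centered exactly on the query point contributes weight 1, and compositing two half-opaque splats shows the nearer one dominating, as the transmittance product implies.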

2. Emotion-Audio-Guided Spatial Attention

To enable emotion- and style-dependent facial dynamics, ESGaussianFace applies spatial attention mechanisms conditioned on audio and emotion codes. Two streams of features are computed:

  • Audio features: Frame-wise audio is encoded by a pretrained DeepSpeech network, yielding a context window $\mathbf a^{t-l:t+l}$.
  • Emotion features: Expression is encoded from image or video frames using a 3DMM extractor, providing a 64-dimensional code $\mathbf e$, a blink intensity $y$ (AU45), and a pose vector $v \in \mathbb R^{12}$.

These are fused into the per-Gaussian tri-plane features using a stack of $B$ "Emotion–Audio-guided Spatial Attention Modules" (ESAMs). Each module consists of cross-attention layers that alternately attend to audio and emotion features, followed by feed-forward blocks. After $B$ rounds, the updated embedding $\mathbf z_B$ encodes spatial, auditory, and emotional context for each splat:

  • Audio-guided attention:

$$\mathbf z_b' = \mathbf z_{b-1} + \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

  • Emotion-guided attention (layer after audio):

$$\mathbf z_b'' = \mathbf z_b' + \mathrm{softmax}\!\left(\frac{Q' K'^\top}{\sqrt{d}}\right) V'$$

  • Feed-forward residual:

$$\mathbf z_b = \mathbf z_b'' + \mathrm{FFN}(\mathbf z_b'')$$

This alternating cross-attention enables fine-grained spatiotemporal fusion of linguistic and emotional visual cues (Ma et al., 5 Jan 2026).
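One ESAM round can be sketched as follows. This is a structural illustration only: the learned Q/K/V projections and FFN weights of the paper are replaced here by identity maps and a ReLU, so only the residual attention pattern of the three equations above is reproduced:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z, context, d):
    """Single-head cross-attention with identity projections (illustrative):
    z attends to `context` and adds the attended values residually."""
    scores = softmax(z @ context.T / np.sqrt(d))
    return z + scores @ context

def esam_block(z, audio_feat, emo_feat):
    """One ESAM round: audio-guided attention, then emotion-guided
    attention, then a feed-forward residual (toy ReLU stand-in for FFN)."""
    d = z.shape[-1]
    z = cross_attention(z, audio_feat, d)   # z'_b
    z = cross_attention(z, emo_feat, d)     # z''_b
    return z + np.maximum(z, 0.0)           # z_b = z''_b + FFN(z''_b)
```

Stacking this block $B$ times over the per-splat embeddings mirrors the module stack described above.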

3. 3D Gaussian Deformation Predictors

Deformation modules modulate the parameters of each 3D Gaussian to realize audio-driven lip dynamics, emotion-specific expressions, and artistic style:

  • Positional Encoding: Each splat's position $\mu_i$ is embedded via a small MLP for spatial localization.
  • Emotion Deformation Predictor ($P_{\mathrm{emo}}$): Takes the concatenated embedding $[\mathbf z_B; \mathbf e; \mathbf p]$ and regresses offsets in Gaussian parameters (center, scale, rotation, color, opacity):

$$[\Delta\mu;\ \Delta s;\ \Delta q;\ \Delta\mathrm{sh};\ \Delta\alpha] = P_{\mathrm{emo}}([\mathbf z_B; \mathbf e; \mathbf p])$$

yielding deformed Gaussians $\mathcal G_{\mathrm{emo}} = \mathcal G_{\mathrm{can}} + \Delta\mathcal G_{\mathrm{emo}}$.

  • Style Deformation Predictor ($P_{\mathrm{sty}}$): Incorporates a 128-D style code $\mathbf s = E_s(I_s)$ from a reference style image (encoded via StyleGAN). It concatenates $[\mathbf z_B; \mathbf e; \mathbf p; \mathbf s]$ and predicts additional offsets:

$$\mathcal G_{\mathrm{sty}} = \mathcal G_{\mathrm{emo}} + \Delta\mathcal G_{\mathrm{sty}}$$

  • The final stylized, emotional face is rendered as $\hat I_{\mathrm{sty}} = \mathcal R(\mathcal G_{\mathrm{sty}})$.

This multi-stage deformation process allows seamless blending of audio, emotion, and style attributes in the 3D parametric representation (Ma et al., 5 Jan 2026).
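A toy version of the emotion deformation predictor illustrates the input concatenation and output split. The layer sizes, random weights, and a 3-channel (DC-term-only) $\Delta\mathrm{sh}$ slot are assumptions for this sketch, not the paper's architecture:

```python
import numpy as np

def predict_emotion_offsets(z_B, e, p, w1, w2):
    """Regress per-Gaussian offsets from [z_B; e; p], in the spirit of P_emo.
    Output columns: dmu(3) | ds(3) | dq(4) | dsh(3, DC term only) | dalpha(1)."""
    n = z_B.shape[0]
    x = np.concatenate(
        [z_B,
         np.broadcast_to(e, (n, e.shape[0])),   # 64-D expression code, shared
         np.broadcast_to(p, (n, p.shape[0]))],  # pose vector, shared
        axis=1,
    )
    h = np.maximum(x @ w1, 0.0)  # tiny 2-layer MLP stand-in for P_emo
    out = h @ w2                 # shape (n, 14)
    return np.split(out, [3, 6, 10, 13], axis=1)
```

The returned offsets would then be added to the canonical splats, giving $\mathcal G_{\mathrm{emo}} = \mathcal G_{\mathrm{can}} + \Delta\mathcal G_{\mathrm{emo}}$ as above.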

4. Multi-Stage Learning Procedure

Jointly training all modules is unstable, so ESGaussianFace employs a progressive, three-stage strategy:

  • Stage 1 (Neutral + Lip Synchronization): Train the canonical generator and $P_{\mathrm{emo}}$ on neutral videos, using the loss:

$$L_{\mathrm{neu}} = \lambda_1 \|I - \hat I\|_1 + \lambda_2 L_{\mathrm{per}} + \lambda_3 L_{\mathrm{SSIM}} + \lambda_4 L_{\mathrm{lip}} + \lambda_5 L_{\mathrm{ld}} + \lambda_6 L_{\mathrm{smo}}$$

These terms enforce RGB reconstruction, perceptual similarity, SSIM, mouth-region accuracy, landmark matching, and temporal smoothness.

  • Stage 2 (Emotion Variation): Freeze the generator; train the fusion modules and $P_{\mathrm{emo}}$ on videos with varying emotion at fixed audio, enforcing emotion-sensitive yet accurate deformation.
  • Stage 3 (Style Integration): Generate ground-truth stylized videos via VToonify$(I_{\mathrm{emo}}, I_s)$. Train $P_{\mathrm{sty}}$ with a style-alignment loss:

$$L_{\mathrm{sty}} = L_{\mathrm{neu}} + \lambda_{\mathrm{exs}} \|E_s(I_{\mathrm{sty}}) - E_s(\hat I_{\mathrm{sty}})\|_1$$

This sequential schedule isolates the learning of lip-sync, emotion, and style to avoid entanglement and improve stability (Ma et al., 5 Jan 2026).
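The Stage-1 objective is a plain weighted sum. A minimal sketch, assuming the auxiliary terms are computed elsewhere and the $\lambda$ weights are free hyperparameters (their values are not given in the summary):

```python
import numpy as np

def stage1_loss(I, I_hat, aux, lam):
    """L_neu = lam1*||I - I_hat||_1 + lam2*L_per + lam3*L_SSIM
             + lam4*L_lip + lam5*L_ld + lam6*L_smo
    `aux` holds precomputed auxiliary loss values; `lam` has six weights."""
    l1 = np.abs(I - I_hat).mean()  # per-pixel L1 reconstruction term
    terms = [l1] + [aux[k] for k in ("per", "ssim", "lip", "ld", "smo")]
    return sum(w * t for w, t in zip(lam, terms))
```

Stages 2 and 3 reuse this structure, with Stage 3 adding the style-alignment term $\lambda_{\mathrm{exs}} \|E_s(I_{\mathrm{sty}}) - E_s(\hat I_{\mathrm{sty}})\|_1$ on top.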

5. Experimental Protocol and Results

Datasets: MEAD (8 emotions, 60 actors) and RAVDESS (8 emotions, 24 actors) provide emotional audio-video data. Style exemplars come from high-resolution portrait image sets.

Baselines: 2D-pixel methods (MakeItTalk, Wav2Lip, Audio2Head), emotion-driven 2D methods (EAMM, EAT, DreamTalk, EDTalk), 3D NeRF/3DGS (AD-NeRF, SyncTalk, GaussianTalker, Style²Talker).

Metrics:

  • PSNR↑, SSIM↑, FID↓, LPIPS↓ (image fidelity)
  • Sync↑ (SyncNet confidence, lip-sync)
  • LMD↓ (landmark distance)
  • Acc$_{\mathrm{emo}}$↑ (emotion classification accuracy)
  • FPS (inference speed)
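For concreteness, PSNR (the first fidelity metric above) can be computed between a rendered frame and its ground truth as:

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((np.asarray(img, float) - np.asarray(ref, float)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```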

Quantitative Findings (MEAD):

| Method | PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ | Sync↑ | LMD↓ | Acc$_{\mathrm{emo}}$↑ | FPS |
|---|---|---|---|---|---|---|---|---|
| MakeItTalk | 27.60 | 0.60 | 53.09 | 0.059 | 4.50 | 5.21 | 19.38 | 15 |
| Wav2Lip | 27.82 | 0.67 | 46.37 | 0.051 | 5.85 | 5.07 | 16.80 | 15 |
| GaussianTalker | 28.91 | 0.82 | 22.29 | 0.038 | 5.20 | 4.98 | 52.62 | 77 |
| ESGaussianFace | 31.87 | 0.90 | 16.93 | 0.028 | 6.22 | 2.83 | 75.94 | 69 |

Ablation Studies: Removal of modules (e.g., ESAM, position embedding, expression code, multi-stage schedule) consistently degrades all performance metrics, demonstrating the necessity of the full architecture.

Qualitative Observations:

  • Produces sharp lips, accurate emotion-specific details (eyes/eyebrows), temporally stable, and style-accurate results.
  • Supports smooth interpolation between any pair of emotion or style codes, enabling continuous control of expression and appearance.

6. Comparative Context and Technical Impact

ESGaussianFace is the first 3D Gaussian Splatting system to combine emotional and style-conditioned facial animation in an audio-driven, multi-view-consistent setting. In contrast to methods that use only pixel-based approaches or static 3D morphable models, ESGaussianFace allows direct parametric manipulation of the underlying scene representation at the level of geometry and appearance, retaining real-time synthesis capability.

Its technical innovations—audio+emotion fusion via spatial attention, per-Gaussian deformation modules, and staged training—define new best practices in dynamic, stylized facial synthesis. The design achieves substantially higher fidelity, lip synchronization, and emotion accuracy than both 2D and 3D neural rendering baselines, and supports interactive stylization and frame-accurate control (Ma et al., 5 Jan 2026).

A plausible implication is that the principles of ESGaussianFace (modular deformation prediction, attention-based modality fusion, explicit splatting representation) can be extended beyond facial animation to controllable avatars, AR/VR telepresence, data augmentation, and high-fidelity character generation for virtual environments.

