
3DGesPolicy: Unified Mesh Gesture Synthesis

Updated 2 February 2026
  • 3DGesPolicy is a unified framework combining SMPL-X and FLAME models to represent full-body, hand, and facial dynamics for co-speech gesture synthesis.
  • It utilizes the BEAT2 dataset with multimodal motion capture to generate high-fidelity meshes and enable benchmarked training of holistic gesture generation models.
  • The framework supports innovative applications in co-speech animation, avatar control, and expressive virtual agents by connecting audio inputs with spatially detailed gesture outputs.

3DGesPolicy denotes methodologies and datasets enabling unified, mesh-level modeling for full-body 3D co-speech gesture synthesis and analysis, with explicit parameterization of body, hand, and facial dynamics. In contemporary gesture generation, accurate representation of spatial-temporal correlations between articulated body movements and expressive speech is a central objective. The BEAT2 (BEAT-SMPLX-FLAME) dataset, as employed by EMAGE, establishes a comprehensive resource and protocol for encoding high-fidelity multimodal motion data, facilitating benchmarked training and evaluation for holistic gesture generation and related downstream applications (Liu et al., 2023).

1. Foundations and Core Components

BEAT2 integrates the SMPL-X and FLAME models (two distinct, widely adopted parameterizations) into a unified mesh-based representation for each frame. The SMPL-X component encodes 55 joint rotations (axis-angle or 6D) for the full body and hands, alongside a 3D root translation, supporting nuanced modeling of local and global body dynamics. FLAME provides 100 expression coefficients and 3 jaw parameters, controlling a detailed facial mesh of 5,023 vertices. The integration is realized by driving SMPL-X's linear-blend skinning with the shape and pose parameters (β, θ), and feeding FLAME's head decoder F(·) with the corresponding expression vector ε and jaw configuration, producing a watertight mesh per frame that is suitable for direct use in machine learning models and graphics pipelines.
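Concretely, the unified per-frame representation can be viewed as one flat parameter vector. A minimal bookkeeping sketch of that layout (function and constant names are illustrative, not dataset field names):

```python
import numpy as np

# Per-frame dimensions as described above (illustrative constants).
POSE_DIMS = 55 * 3   # SMPL-X joint rotations, axis-angle -> 165
TRANS_DIMS = 3       # global root translation
EXPR_DIMS = 100      # FLAME expression coefficients
JAW_DIMS = 3         # FLAME jaw parameters

def pack_frame(pose, trans, expr, jaw):
    """Concatenate one frame's body, face, and translation parameters."""
    assert pose.shape == (55, 3) and trans.shape == (3,)
    assert expr.shape == (100,) and jaw.shape == (3,)
    return np.concatenate([pose.reshape(-1), trans, expr, jaw])

frame = pack_frame(np.zeros((55, 3)), np.zeros(3),
                   np.zeros(100), np.zeros(3))
print(frame.shape)  # (271,)
```

Packing all modalities into a single 271-dimensional vector per frame is what makes the representation convenient as direct model input.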

2. Data Collection and Processing Pipeline

The BEAT2 corpus is constructed via multi-modal motion capture, combining marker-based body tracking and facial blendshape inference. Recordings utilize a Vicon 16-camera system (78 markers, 60 Hz), yielding sub-millimeter tracking error, paired with Apple iPhone 12 Pro ARKit facial capture (51 FACS blendshape weights at 60 Hz). MoSh++ is employed to fit a 300-dimensional SMPL-X shape vector and per-frame pose to the sparse markers, minimizing a differentiable surface-to-marker loss and enforcing anthropometric consistency (e.g., head/neck ratio ≈ 1/7 body height, finger dorsiflexion [0°, 90°], joint outlier truncation and smoothing). FLAME expression parameters are optimized from ARKit weights using handcrafted templates and a learned linear map W ∈ ℝ^(51×103), minimizing the L2 vertex error for each frame.
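A map of this shape (51 ARKit weights to 100 expression + 3 jaw parameters) can be estimated by ordinary least squares over paired frames. A minimal sketch on synthetic data; note the actual BEAT2 optimization minimizes per-vertex L2 error through the FLAME decoder, which this sketch does not reproduce:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: T frames of 51 ARKit blendshape weights and the
# corresponding 103 FLAME parameters (100 expression + 3 jaw).
T = 1000
arkit = rng.uniform(0.0, 1.0, size=(T, 51))
W_true = rng.normal(size=(51, 103))
flame = arkit @ W_true  # pretend ground-truth targets

# Least-squares estimate of W in R^{51 x 103}: min ||arkit @ W - flame||_F
W_hat, *_ = np.linalg.lstsq(arkit, flame, rcond=None)

print(W_hat.shape)                 # (51, 103)
print(np.allclose(W_hat, W_true))  # exact recovery on noiseless data
```

With real data the targets are noisy and the loss is measured on decoded vertices rather than parameters, but the linear-map structure is the same.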

Cleaning involves censoring noisy finger data (reducing the dataset from 76 h to 60 h), visual inspection for ARKit dropouts, and foot-skate cleanup using a pretrained Kovar foot-sliding predictor. This multi-stage approach ensures both mesh fidelity and consistency across diverse speakers and motion types.
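A Kovar-style foot-skate check can be approximated by flagging frames where a foot joint is labeled in contact yet still moves horizontally. A heuristic sketch, using the contact-label format described below for contacts.npy (the threshold is illustrative, not the pipeline's actual value):

```python
import numpy as np

def find_foot_skate(foot_xyz, contacts, max_slide_m=0.005):
    """Flag frames where a contacting foot slides horizontally.

    foot_xyz: [T x 3] world positions of one foot joint.
    contacts: [T] binary contact labels for that foot.
    """
    # Horizontal (XZ-plane, Y-up convention) displacement per frame step.
    dxz = np.diff(foot_xyz[:, [0, 2]], axis=0)
    slide = np.linalg.norm(dxz, axis=1)  # [T-1]
    # A step counts only if the foot is in contact on both of its frames.
    in_contact = contacts[1:].astype(bool) & contacts[:-1].astype(bool)
    return np.where(in_contact & (slide > max_slide_m))[0] + 1

# Toy example: foot planted for 5 frames, then slides while "in contact".
pos = np.zeros((6, 3)); pos[5, 0] = 0.02  # 2 cm jump on the last frame
labels = np.ones(6)
print(find_foot_skate(pos, labels))       # [5]
```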

3. Dataset Structure, Parameterization, and Formats

BEAT2 comprises 1,762 sequences (totaling 60 hours) from 25 speakers (13 M / 12 F, ages 20–50, mixed ethnicities), with an average sequence length of 65.7 seconds (~3,942 frames per clip). Parameter files are structured as follows:

File               Data type             Key contents
body_params.npz    NumPy archive (.npz)  betas [T × 300], poses [T × 55 × 3], trans [T × 3]
head_params.npz    NumPy archive (.npz)  expr [T × 100], jaw [T × 3]
contacts.npy       NumPy array (.npy)    foot contact labels [T × 4]
transcript.json    JSON                  aligned text, word-level timestamps

Parameter conventions are as follows: global translation is expressed in camera space (Y-up), and joint rotations are in the parent-joint frame, with axes X (right), Y (up), Z (forward). Frame-level gesture "masks" are annotated for masked-reconstruction tasks. Data splits for the Standard set are 85% train, 7.5% validation, 7.5% test.
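The 85/7.5/7.5 split can be reproduced deterministically over sequence IDs. An illustrative sketch; if official split files are distributed with the dataset, those should take precedence:

```python
import random

def split_sequences(seq_ids, seed=0):
    """Partition sequence IDs into 85% train / 7.5% val / 7.5% test."""
    ids = sorted(seq_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle
    n = len(ids)
    n_train = int(n * 0.85)
    n_val = int(n * 0.075)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train, val, test = split_sequences([f"seq_{i:04d}" for i in range(1762)])
print(len(train), len(val), len(test))  # 1497 132 133
```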

4. Statistical and Technical Overview

Key dataset metrics:

Statistic               Value
Total duration          60 hours
Speakers                25 (13 M / 12 F)
Sequences               1,762
Avg. sequence length    65.7 s (~3,942 frames)
SMPL-X pose dims        55 joints × 3 = 165
SMPL-X shape dims       β: 300 (PCA)
FLAME expression dims   ε: 100
FLAME jaw dims          3
FLAME mesh vertices     5,023

A plausible implication is that this parameterization enables direct training of transformer and VQ-VAE architectures on mesh-level motion, connecting audio and textual modalities with spatially precise gesture synthesis.
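For VQ-VAE-style training on these parameters, each per-frame vector is quantized to its nearest codebook entry. A minimal sketch of that lookup step (codebook size and dimensions are illustrative, not values from EMAGE):

```python
import numpy as np

def vq_lookup(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry.

    frames:   [T x D] per-frame motion parameters (e.g. D = 271).
    codebook: [K x D] learned code vectors.
    """
    # Squared Euclidean distances via broadcasting: [T x K]
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 271))
# Frames close to codes 3, 7, 3 should quantize back to those indices.
frames = codebook[[3, 7, 3]] + 0.01 * rng.normal(size=(3, 271))
print(vq_lookup(frames, codebook))  # [3 7 3]
```

In a full VQ-VAE the codebook is learned jointly with an encoder and decoder; the lookup above is the discrete bottleneck that turns continuous mesh-level motion into token sequences.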

5. Applications and Example Usage

The unified dataset structure facilitates end-to-end learning for holistic gesture generation, co-speech animation, avatar control, and multimodal linguistic-motion analysis without retargeting or format conversion. Researchers can efficiently reconstruct meshes using the SMPL-X and FLAME Python APIs. For instance:

import numpy as np
import torch
from smplx import SMPLX, FLAME

data = np.load('speaker_01/seq_0001/body_params.npz')
betas = torch.from_numpy(data['betas']).float()  # [T x 300]
poses = torch.from_numpy(data['poses']).float()  # [T x 55 x 3]
trans = torch.from_numpy(data['trans']).float()  # [T x 3]

head = np.load('speaker_01/seq_0001/head_params.npz')
expr = torch.from_numpy(head['expr']).float()    # [T x 100]
jaw  = torch.from_numpy(head['jaw']).float()     # [T x 3]

# num_betas / num_expression_coeffs match BEAT2's parameter dimensions;
# use_pca=False so hand poses are passed as raw axis-angle rotations.
smplx_model = SMPLX(model_path='models/smplx', gender='neutral',
                    num_betas=300, use_pca=False)
flame_model = FLAME(model_path='models/flame', num_expression_coeffs=100)

meshes = []
for t in range(poses.shape[0]):
    # SMPL-X splits the 55 rotations into global orient (1), body (21),
    # jaw (1), eyes (2), and hands (15 each); every input needs a batch dim.
    p = poses[t]
    out_body = smplx_model(betas=betas[t:t+1],
                           global_orient=p[0:1].reshape(1, 3),
                           body_pose=p[1:22].reshape(1, 63),
                           jaw_pose=p[22:23].reshape(1, 3),
                           leye_pose=p[23:24].reshape(1, 3),
                           reye_pose=p[24:25].reshape(1, 3),
                           left_hand_pose=p[25:40].reshape(1, 45),
                           right_hand_pose=p[40:55].reshape(1, 45),
                           transl=trans[t:t+1])
    out_head = flame_model(expression=expr[t:t+1], jaw_pose=jaw[t:t+1])
    body_verts = out_body.vertices.detach().cpu().numpy()
    head_verts = out_head.vertices.detach().cpu().numpy()
    meshes.append((body_verts, head_verts))

This structure allows mesh-level synthesis to be flexibly customized and extended for downstream tasks, including audio-synchronized gesture generation that exploits spatial-temporal motion input.

6. Licensing, Access, and Community Impact

BEAT2 is released under the CC-BY-4.0 license for research use. The dataset is distributed via university servers and AWS S3; the total size is approximately 350 GB. Public access is provided through https://pantomatrix.github.io/EMAGE/, with required attribution to Liu et al., "EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling" (CVPR 2024) (Liu et al., 2023). The community-standardized, textured-mesh format is intended to catalyze reproducible research, benchmark establishment, and method comparison in the field of co-speech gesture generation.

7. Research Significance and Future Directions

The 3DGesPolicy exemplified by BEAT2 and EMAGE establishes a new normative protocol for mesh-level, multimodal, and holistic co-speech datasets. By combining SMPL-X and FLAME in a single parameterization, the approach supports the joint modeling of facial expressions, body motion, hand articulation, and global translation with high temporal and spatial resolution. A plausible implication is that such datasets will drive next-generation research in expressive virtual agents, embodied conversational AI, and multimodal communication analysis, enabling further exploration of context-dependent gesture synthesis and multimodal representation learning without format fragmentation or the limitations of legacy skeleton-based corpora. This framework is positioned to facilitate systematized progress in gesture-based social intelligence, avatar personalization, and parameter-efficient modeling of human communicative motion.
