
BEAT2: Speech-Driven Gesture Benchmark

Updated 2 February 2026
  • BEAT2 is a comprehensive multimodal dataset that integrates synchronized 3D full-body, hand, and facial motion with high-fidelity audio and precise text/phoneme alignments.
  • It employs a unified SMPL-X and FLAME representation with advanced preprocessing, motion synchronization, and quantitative benchmarking protocols.
  • The dataset supports robust gesture synthesis research with standardized splits, detailed per-frame annotations, and extensive evaluation benchmarks.

BEAT2 is a large-scale, multimodal motion-capture dataset designed for speech-driven holistic gesture research, providing synchronized 3D full-body, hand, and facial motion, high-fidelity audio, and precise text/phoneme alignments. It extends earlier benchmarks by unifying the mesh-based SMPL-X body representation with FLAME face modeling, supports fine-grained study and generation of full-body co-speech motion, and has become the standard for quantitative, community-driven evaluation in automated gesture generation.

1. Dataset Scope, Scale, and Structure

BEAT2 comprises high-quality motion-capture recordings of English conversational speech; the original BEAT corpus it builds on featured 30 speakers and a total duration of approximately 76 hours. The widely adopted “BEAT2-Standard” subset, as emphasized in EMAGE (Liu et al., 2024) and FastTalker, selects data from 25 subjects (12 female, 13 male), offering 1,762 sequences averaging 65.66 seconds each, for a total of about 60 hours. This total partitions into approximately 27 hours of “standard” speech and 33 hours of “additional” conversational recordings.

Recording took place in a Vicon-style optical motion-capture studio (78 markers; body captured at 120 Hz), with a synchronized close-talk microphone (audio at 16 kHz) and, in the “EMAGE” protocol, iPhone ARKit-based facial capture at 60 Hz. All principal modalities—body, face, hands, speech—are hardware-timestamped and temporally aligned. Each sequence is a continuous conversational monologue on everyday topics such as hobbies, with clip durations ranging from a few seconds up to tens of seconds (Guo et al., 2024, Liu et al., 2023, Sha et al., 26 Jan 2026).

2. Modalities, Representation, and Annotation

Body and Face

The backbone mesh representation fuses SMPL-X (body: root translation, 55-joint axis-angle pose or 6D rotation vectors, finger joints) with FLAME (head: 100-dim facial expression plus 3D jaw pose). The combination is registered in a canonical T-pose. Fitting is performed with MoSh++ and includes corrections for physiological and topological plausibility (neck length, finger constraints, temporal smoothing over ±10 frames) (Liu et al., 2023). Mesh topology is standardized: SMPL-X (10,475 vertices, ~20k faces), FLAME (5,023 vertices, ~9.6k faces).

Audio and Speech Alignment

Audio is stored as 16 kHz raw waveform, aligned per frame with the motion data. Speech annotations use forced alignment via Montreal Forced Aligner, providing word and phoneme boundaries as TextGrid files. For modeling, the extracted phoneme sequence and frame-level durations (log-scaled) serve as supervised targets; pitch is extracted through Continuous Wavelet Transform and energy as per-frame L2-norm of the STFT amplitude (Guo et al., 2024).
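The per-frame energy feature described above can be sketched in plain NumPy. The 1024-sample window and the 30 Hz-matched hop are illustrative choices, not values fixed by the dataset:

```python
import numpy as np

def frame_energy(wave: np.ndarray, frame_len: int = 1024, hop: int = 533) -> np.ndarray:
    """Per-frame energy as the L2 norm of the STFT magnitude.

    hop = 533 ~ 16000 / 30, i.e. one energy value per 30 Hz gesture frame
    (frame_len and hop are illustrative, not prescribed by BEAT2).
    """
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        seg = wave[i * hop : i * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(seg))     # STFT magnitude for this frame
        energies[i] = np.linalg.norm(mag)  # L2 norm = per-frame energy
    return energies
```

Pitch extraction via Continuous Wavelet Transform is more involved and omitted here.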

Gesture Encoding and Quantization

All joint rotations and FLAME parameters are encoded per frame. For gesture modeling, continuous motion channels can be quantized using VQ-VAE codebooks (256 indices × 256-dimensional vectors; Guo et al., 2024), or pose is converted to Rot6D encoding for deep learning pipelines (Liu et al., 2023, Sha et al., 26 Jan 2026). Each frame further includes four binary foot-contact labels and per-frame semantic annotations (“SimLabel”: topic, gesture category) (Sha et al., 26 Jan 2026).
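The Rot6D encoding mentioned above is conventionally the first two columns of a joint's rotation matrix (Zhou et al.'s continuous 6D representation); a minimal NumPy sketch for a single axis-angle vector via Rodrigues' formula:

```python
import numpy as np

def axis_angle_to_rot6d(aa: np.ndarray) -> np.ndarray:
    """Convert one axis-angle vector (3,) to the 6D rotation representation:
    the first two columns of the rotation matrix, concatenated."""
    theta = np.linalg.norm(aa)
    if theta < 1e-8:
        R = np.eye(3)  # near-zero rotation: identity
    else:
        k = aa / theta  # unit rotation axis
        K = np.array([[0, -k[2], k[1]],
                      [k[2], 0, -k[0]],
                      [-k[1], k[0], 0]])  # cross-product matrix
        # Rodrigues' rotation formula
        R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    return R[:, :2].T.reshape(6)  # columns 0 and 1 -> 6 numbers
```

Column-ordering conventions vary between codebases; any fixed convention works as long as encoding and decoding agree.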

Table: Core Data Channels and Sampling Rates

Modality         Representation                    Rate
Body             SMPL-X (axis-angle / Rot6D)       120 / 60 / 30 Hz†
Face             FLAME (100 PCA expr. + 3 jaw)     60 / 30 Hz†
Audio            16 kHz raw waveform               16,000 Hz
Phonemes/words   alignment boundaries (TextGrid)   per frame

† Standard body/face sequences stored or decoded at 30–60 Hz, depending on downstream tasks.

3. Preprocessing and Alignment Protocols

Preprocessing follows a series of established pipelines. Text transcripts are cleaned, phonetized, and aligned via MFA, yielding phoneme durations, which are regressed in the $\log$-domain during model training. Audio is resampled and temporally aligned to gesture frames by $s_k = s_{r,\mathrm{audio}} / \mathrm{fps}_{\mathrm{gestures}}$ (Liu et al., 2023). MoSh++ fitting minimizes sums of marker and surface distances, priors, velocity constraints, and tissue softening.
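The audio-to-gesture alignment above reduces to a fixed samples-per-frame ratio; a minimal sketch, with default rates taken from the text:

```python
def samples_per_gesture_frame(sr_audio: int = 16_000, fps_gesture: int = 30) -> float:
    """Audio samples covered by one gesture frame: s_k = sr_audio / fps_gesture."""
    return sr_audio / fps_gesture

def audio_slice_for_frame(k: int, sr_audio: int = 16_000, fps_gesture: int = 30) -> tuple:
    """Half-open sample range [start, end) of gesture frame k in the waveform."""
    step = sr_audio / fps_gesture
    return int(round(k * step)), int(round((k + 1) * step))
```

Rounding per frame (rather than accumulating an integer hop) keeps long sequences from drifting out of sync.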

Facial blendshape weights captured via ARKit are fit to FLAME parameters via linear regression: $W$ is optimized such that $b_{\mathrm{FLAME}} = W^\top b_{\mathrm{ARKit}}$ minimizes $\|v_{\mathrm{FLAME}}(b_{\mathrm{FLAME}}) - v_{\mathrm{ARKit}}\|_2$.
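A simplified sketch of this regression as an ordinary least-squares fit. Note the simplification: this version fits $W$ against target FLAME coefficients directly, whereas the actual pipeline optimizes against mesh-vertex distances. The shapes (52 ARKit blendshapes, 100 + 3 FLAME coefficients) are illustrative assumptions:

```python
import numpy as np

def fit_blendshape_mapping(b_arkit: np.ndarray, b_flame: np.ndarray) -> np.ndarray:
    """Least-squares fit of W so that, frame-wise, b_flame ~= W^T b_arkit.

    b_arkit: (n_frames, 52)  ARKit blendshape weights (row per frame)
    b_flame: (n_frames, 103) target FLAME coefficients (100 expr + 3 jaw)
    Returns W with shape (52, 103), minimizing ||b_arkit @ W - b_flame||_F.
    """
    W, *_ = np.linalg.lstsq(b_arkit, b_flame, rcond=None)
    return W
```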

Rigid motion/face-body alignment and mesh smoothing are applied to guarantee artifact-free rendering. Segments with large mesh penetration, self-intersection, or “flickering” are excluded before evaluation (Nagy et al., 3 Nov 2025).

4. Splitting, Storage, and Data Access

Standard splits partition the dataset (on subject-wise basis) into 85% train, 7.5% validation, and 7.5% test (Liu et al., 2023, Sha et al., 26 Jan 2026). Some works reference an 80/10/10 split for the 26h “standard” BEAT2 subset (Guo et al., 2024), but 85/7.5/7.5 is canonical for the 27h+33h full corpus. Segment selections for human evaluation frequently involve specific slicing (e.g., sampling 108 sentence-long test segments, 7.7–12.0 s; mean ≈ 10.7 s, encompassing all speakers) (Nagy et al., 3 Nov 2025).
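A subject-wise 85/7.5/7.5 partition can be sketched as follows. This is illustrative only; published benchmarking should use the official splits distributed with the dataset:

```python
import random

def subject_wise_split(subject_ids, seed: int = 0):
    """Partition subjects (not clips) into ~85% train, 7.5% val, 7.5% test,
    so no speaker appears in more than one split."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n = len(ids)
    n_train = round(0.85 * n)
    n_val = round(0.075 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```

Splitting by subject, rather than by clip, prevents speaker-identity leakage between train and test.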

Core files include:

  • body_params.npz: SMPL-X body parameters (shape, pose, translation)
  • head_params.npz: FLAME face parameters (expression, jaw pose)
  • Audio: .wav files (aligned, 16 kHz)
  • TextGrid: word/phoneme boundaries
  • Optionally: per-frame mesh exports (.ply/.obj)
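
A hypothetical loader for the files listed above might look as follows. The `.npz` key names (`poses`, `trans`, `expressions`, `jaw_pose`) are assumptions for illustration, since the exact archive schema is not documented here:

```python
import numpy as np

def load_sequence(body_path: str, head_path: str) -> dict:
    """Load one BEAT2 sequence from its body/head parameter archives.
    Key names below are assumed, not taken from official documentation."""
    body = np.load(body_path)
    head = np.load(head_path)
    return {
        "pose": body["poses"],        # (T, 55*3) axis-angle body pose
        "trans": body["trans"],       # (T, 3) root translation
        "expr": head["expressions"],  # (T, 100) FLAME expression coefficients
        "jaw": head["jaw_pose"],      # (T, 3) jaw rotation
    }
```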

Access is via https://pantomatrix.github.io/EMAGE/ under a CC BY-NC-SA research-only license (Liu et al., 2023).

5. Benchmarking Protocols and Community Impact

BEAT2 has established itself as the anchor benchmark for large-scale, comparative studies of speech-driven gesture synthesis. Models such as FastTalker, 3DGesPolicy, and EMAGE all use BEAT2-standard splits for quantitative benchmarking. The data’s mesh-level annotation enables head–face–body co-speech modeling at a fidelity not previously available (Liu et al., 2023, Guo et al., 2024, Sha et al., 26 Jan 2026).

Human evaluation is rigorously protocolized (Nagy et al., 3 Nov 2025): for “motion realism,” video pairs (audio muted) are compared using Elo scoring over pairwise rankings; for “speech–gesture alignment,” identical motion is paired with matched/mismatched audio, and weighted vote proportions quantify alignment. Key resources—5+ hours of generated SMPL-X outputs, 750+ identically rendered videos, open-source Blender visualization scripts—are released to standardize future evaluation and remove the need for model reimplementation (Nagy et al., 3 Nov 2025).
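Elo scoring over pairwise comparisons can be sketched with the standard update rule; the study's exact K-factor and rating initialization are not specified here, so the values below are illustrative:

```python
def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """One Elo update after a pairwise video comparison.
    The winner's rating rises by k * (1 - expected win probability)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))  # logistic expectation
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta  # zero-sum update
```

Iterating this over all pairwise judgments yields a per-model realism ranking.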

6. Research Applications and Modeling Practices

BEAT2 supports a range of state-of-the-art architectures for text-to-gesture, speech-to-gesture, and multitask generation, including transformer-based masked autoencoding (EMAGE), diffusion-policy-driven action control (3DGesPolicy), and end-to-end multimodal speech-gesture synthesis (FastTalker).

Models leverage all available modality alignments—audio rhythms/content, phoneme-bounded text, body/face motions (as Rot6D or VQ-VAE quantized codes), and additional per-frame semantic annotations—for cross-modal and temporal learning. Formal loss functions for pitch, energy, and duration (e.g., $L_{\mathrm{duration}} = \|\log(d) - M_d(P_d(f_{\mathrm{pho}}))\|_2^2$) are standardized across works (Guo et al., 2024). Benchmarking reveals the necessity for disentangled metrics: apparent advances in motion realism do not always correspond to improved speech–motion alignment, emphasizing BEAT2’s role in measuring these axes independently (Nagy et al., 3 Nov 2025).
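The log-domain duration loss can be sketched directly; here `d_pred_log` stands in for the model branch $M_d(P_d(f_{\mathrm{pho}}))$, which already predicts in the log domain:

```python
import numpy as np

def duration_loss(d_true: np.ndarray, d_pred_log: np.ndarray) -> float:
    """Squared L2 loss between log ground-truth phoneme durations and the
    model's log-domain prediction: L = || log(d) - d_pred_log ||_2^2."""
    return float(np.sum((np.log(d_true) - d_pred_log) ** 2))
```

Regressing durations in log-space keeps long phonemes from dominating the loss.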

7. Licensing, Limitations, and Ongoing Developments

BEAT2 is released for non-commercial research use, with citation requirements to the EMAGE CVPR 2024 publication (Liu et al., 2023). The current release omits detailed per-session hardware metadata, cross-corpus normalization, and manual utterance/gesture segmentations beyond sentence-level evaluation slices. Five subjects with noisy finger data are excluded from the “standard” release.

A plausible implication is that future expansions or forks may target increased demographic diversity, multimodal sensor integration, or finer segmentation of interaction dynamics. The dataset’s mesh foundation and standardized benchmarks have positioned it as a central resource for methodological innovation and reproducible comparison in holistic speech-driven gesture generation.
