
Social-MAE: Transformer Autoencoders for Social Data

Updated 7 January 2026
  • Social-MAE is a transformer-based masked autoencoder framework that learns robust social behavior representations from audiovisual and multi-person motion data.
  • The design leverages high token masking and cross-modal fusion, enhancing data efficiency and generalization in complex social perception tasks.
  • Practical implementations demonstrate state-of-the-art performance in emotion recognition, pose forecasting, social grouping, and action understanding.

Social-MAE refers to a family of transformer-based masked autoencoder frameworks designed for learning representations of social behaviors in both multimodal (face and voice) and multi-person motion contexts. These frameworks leverage self-supervised masked modeling to address the challenge of limited annotated data in higher-level social tasks such as emotion recognition, laughter detection, apparent personality estimation, multi-person pose forecasting, social grouping, and social action understanding. Notably, the name "Social-MAE" is used independently in two prominent streams: audiovisual modeling for social perception (Bohy et al., 24 Aug 2025) and motion-based social representation learning (Ehsanpour et al., 2024). Both approaches share the underlying principle of learning robust social representations by reconstructing masked input signals, but are instantiated for different data modalities and downstream objectives.

1. Architectural Foundations and Modalities

Audiovisual Multimodal Social-MAE

The Social-MAE model for synchronized face and voice input (Bohy et al., 24 Aug 2025) is a three-stage transformer-based autoencoder based on an extension of the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE). Its architecture employs a mid-fusion scheme:

  • Modality encoders: Separate stacks (11 transformer layers each) process audio (log-Mel spectrogram) and video (face-cropped frames) independently.
  • Tokenization: Audio is converted to a 128×1024 log-Mel spectrogram and patchified into 512 tokens of 16×16 patches; video is segmented into eight 224×224 RGB frames and patchified into spatiotemporal cubes (1568 tokens).
  • Joint encoder and decoder: A joint encoder (1 transformer layer) fuses modality outputs; an 8-layer joint decoder reconstructs both modalities from the pooled latent state and mask tokens.
  • Embedding details: Each token receives a 768-dimensional linear projection, learned positional embedding, and a learned modality embedding. All transformer blocks employ 768-dim embedding, 12 self-attention heads (64 dim/head), and MLPs with inner dimension 3072.

Masking is independently applied to each modality at a high random ratio (75%), with only 25% of tokens visible, encouraging inter-modality inference.
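The tokenization and masking scheme above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the paper's code: the helper names, RNG seed, and float type are our own, while the shapes (128×1024 spectrogram, 16×16 patches, 75% mask ratio) follow the text.

```python
import numpy as np

def patchify(spec, patch=16):
    """Split a (freq, time) spectrogram into flattened patch tokens."""
    f, t = spec.shape
    tokens = spec.reshape(f // patch, patch, t // patch, patch)
    return tokens.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(num_tokens, ratio=0.75, rng=None):
    """Return (visible, masked) token indices under random masking."""
    rng = rng or np.random.default_rng(0)
    perm = rng.permutation(num_tokens)
    n_vis = int(num_tokens * (1 - ratio))
    return perm[:n_vis], perm[n_vis:]

spec = np.random.randn(128, 1024).astype(np.float32)   # log-Mel spectrogram
tokens = patchify(spec)                                 # 8 * 64 = 512 tokens
visible, masked = random_mask(len(tokens))
print(tokens.shape, len(visible), len(masked))          # (512, 256) 128 384
```

With a 75% ratio only 128 of the 512 audio tokens reach the encoder, which is what forces the model to infer masked content from the other modality.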

Multi-person Motion-based Social-MAE

For multi-person motion, Social-MAE (Ehsanpour et al., 2024) processes joint trajectories over time:

  • Input: A sequence of T frames for N individuals, each with J body joints. Joint trajectories are normalized to remove global translation by referencing the pelvis joint at the final frame.
  • Frequency domain encoding: Each trajectory is projected into the frequency domain using a discrete cosine transform (DCT), retaining K coefficients for each trajectory.
  • Embedding: Tokens are enhanced with joint-type, person-identity, and global-position embeddings (all 1024-dim).
  • Masking (“tube masking”): For a randomly selected 50% of (person, joint) pairs, the entire DCT trajectory is masked, enforcing cross-person and cross-joint dependency.
  • Transformer encoder/decoder: The encoder (6 layers, 8 heads, 1024-dim) operates only on unmasked tokens. The decoder (3 layers, 4 heads, 1024-dim) processes encoder output plus learnable mask tokens to reconstruct the full trajectory set.

2. Self-Supervised Pre-training Objectives and Losses

Audiovisual Social-MAE

Pre-training combines two objectives (Bohy et al., 24 Aug 2025):

  • Masked reconstruction loss: For masked tokens, the mean squared error (MSE) between reconstructed and original embeddings:

$$L_\text{rec} = \frac{1}{|\Omega_\text{mask}|} \sum_{i \in \Omega_\text{mask}} \lVert \hat{M}_i - x_i \rVert^2$$

  • InfoNCE-style contrastive loss: Enforces alignment of average-pooled unimodal latent representations between audio ($c^a$) and video ($c^v$):

$$L_\text{contra} = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp\left(\mathrm{sim}(c^v_i, c^a_i)/\tau\right)}{\sum_{j=1}^N \exp\left(\mathrm{sim}(c^v_i, c^a_j)/\tau\right)}$$

with τ = 0.1.

  • Unified loss: The total pre-training objective is $L = L_\text{rec} + \lambda_c L_\text{contra}$, with $\lambda_c = 1.0$.
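A minimal numpy sketch of the contrastive half of this objective, using the stated τ = 0.1 and λ_c = 1.0: matched audio/video pairs sit on the diagonal of the cosine-similarity matrix, and InfoNCE pushes them above the mismatched pairs in each row. The placeholder `l_rec` value and batch size are assumptions for illustration.

```python
import numpy as np

def info_nce(c_v, c_a, tau=0.1):
    """InfoNCE over a batch of pooled video (c_v) and audio (c_a) latents."""
    c_v = c_v / np.linalg.norm(c_v, axis=1, keepdims=True)
    c_a = c_a / np.linalg.norm(c_a, axis=1, keepdims=True)
    sim = c_v @ c_a.T / tau                              # (N, N) similarities
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # matched pairs on diag

rng = np.random.default_rng(0)
c_v = rng.standard_normal((8, 768))
c_a = rng.standard_normal((8, 768))
l_rec = 0.42                                             # placeholder MSE value
loss = l_rec + 1.0 * info_nce(c_v, c_a)                  # lambda_c = 1.0
print(loss)
```

For perfectly aligned latents (c_v equal to c_a) the contrastive term approaches zero; for unrelated random latents it sits near log N, which is what drives the pooled representations of the two modalities together.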

Motion-based Social-MAE

The only pre-training loss is a reconstruction MSE on masked DCT trajectories (Ehsanpour et al., 2024):

$$L_\text{rec} = \frac{1}{|M|} \sum_{(n,j) \in M} \lVert X_{1:T}^{n,j} - \hat{X}_{1:T}^{n,j} \rVert^2$$

No contrastive or adversarial losses are included.
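A minimal sketch of this loss: the squared trajectory error is summed per masked (person, joint) token and averaged over the masked set M. The shapes here (39 tokens, 30-frame xyz trajectories) are illustrative assumptions.

```python
import numpy as np

def masked_mse(x, x_hat, mask_idx):
    """MSE over masked token rows; x, x_hat have shape (tokens, T, 3)."""
    diff = (x[mask_idx] - x_hat[mask_idx]).reshape(len(mask_idx), -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

x = np.random.default_rng(0).standard_normal((39, 30, 3))
print(masked_mse(x, x.copy(), np.arange(19)))   # 0.0 (perfect reconstruction)
print(masked_mse(x, x + 1.0, np.arange(19)))    # ~90.0 (unit error per element)
```

Note that visible tokens contribute nothing to the loss, so all gradient signal comes from reconstructing the fully hidden trajectories.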

3. Pre-training Data and Protocols

Audiovisual Social-MAE

  • Dataset: VoxCeleb2 (1M utterances from over 6,000 speakers and 145 nationalities).
  • Augmentations: Random horizontal flips and random cropping/scaling for video. SpecAugment is omitted, since MAE token masking already corrupts the audio input.
  • Optimization: 25 epochs, AdamW (β₁=0.9, β₂=0.999, ϵ=1e-8), batch size 128, learning rate 1e-4 decaying by 0.5 every 5 epochs, weight decay 0.05.
  • Masking rationale: High (75%) random token masking is justified to foster cross-token and inter-modal structural learning.
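The stated step decay (halving the rate every 5 epochs from a base of 1e-4) can be written as a one-line schedule; the function name is our own.

```python
def lr_at(epoch, base=1e-4, gamma=0.5, step=5):
    """Step-decay schedule: multiply base lr by gamma every `step` epochs."""
    return base * gamma ** (epoch // step)

print([lr_at(e) for e in (0, 5, 10, 24)])
# [0.0001, 5e-05, 2.5e-05, 6.25e-06]
```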

Motion-based Social-MAE

  • Datasets: Pre-trained on combined 3DPW, BMLmovi, BMLrub, and CMU MoCap, covering both 2D and 3D pose.
  • Optimization: Pre-training for 800 epochs, Adam (lr = 1e-4 decaying to 1e-5).
  • Efficiency: Tube masking halves the number of tokens the encoder attends over; the shallow decoder and DCT token reduction support efficient end-to-end learning. Social-MAE remains more data-efficient and robust to limited labeled data during fine-tuning, especially for long-term prediction tasks.

4. Fine-tuning Protocols and Downstream Task Configurations

Audiovisual Social-MAE

For all downstream tasks, the 8-layer decoder is removed. The encoders and joint encoder are retained, with a task-specific linear head added. No masking is applied during fine-tuning.
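A sketch of this fine-tuning setup in numpy: encoder token outputs are pooled and passed through a task-specific linear head. The mean-pooling choice, random weights, and the 6-class output (the six CREMA-D emotion categories) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Unmasked token outputs of the joint encoder: 512 audio + 1568 video tokens.
tokens = rng.standard_normal((512 + 1568, 768))
pooled = tokens.mean(axis=0)                         # (768,) pooled embedding

# Task-specific linear head (e.g. 6 emotion classes for CREMA-D).
W = rng.standard_normal((6, 768)) * 0.01
b = np.zeros(6)
logits = W @ pooled + b
print(logits.shape)                                  # (6,)
```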

Downstream tasks and metrics:

| Task | Dataset | Epochs | Batch | Metrics | Best Social-MAE Score |
|---|---|---|---|---|---|
| Emotion Recognition | CREMA-D | 20 | 8 | F1 (micro/macro) | AV micro 0.837*, AV macro 0.842* |
| Apparent Personality Estimation | ChaLearn FI | 10 | 8 | Acc. (1 − MAE) | Avg. 0.903* |
| Smile & Laughter Detection | NDC-ME | 10 | 8 | F1 | Audio+Video 0.776*, Audio 0.546*, Video 0.728* |

*: Statistically significant (p < 1e-5) improvement over all baselines for emotion and laughter tasks.

Motion-based Social-MAE

After pre-training, the decoder is replaced by a downstream-specific head and the encoder is fine-tuned end-to-end for 256 epochs with Adam (lr=1e-3 → 1e-4).

Downstream tasks:

  1. Pose forecasting: Predict future DCT coefficients for all joints of all people with a token-wise linear regressor, aggregating auxiliary losses from all encoder layers.
  2. Social grouping: Predict group assignments via binary affinity matrices and estimate the number of social groups from pooled person embeddings and MLPs, supervised with BCE and a Laplacian eigenvalue count penalty.
  3. Social action understanding: Predict multi-label per-person action labels ("pose-based", softmax; "interaction", sigmoid).

5. Empirical Results and State-of-the-Art Performance

  • Emotion Recognition (CREMA-D): Social-MAE achieves multimodal (AV) F1-scores of 0.837 (micro) and 0.842 (macro), surpassing all baselines and previous SOTA, with large gains attributed to the in-domain pre-training and multi-frame video encoding.
  • Apparent Personality Estimation (ChaLearn FI): Social-MAE achieves an average accuracy of 90.3%, marginally below the SOTA of 91.3% (NJU-LAMDA), but with tenfold fewer epochs; multi-frame input provides substantive improvement for 4 of 5 Big-5 traits.
  • Smile & Laughter Detection (NDC-ME): Outperforms supervised and CAV-MAE baselines in all settings; AV F1-score increases by 18 points over the baseline.
  • Ablations demonstrate that the use of multiple video frames and re-pre-training on social video data are crucial for both loss reduction and downstream performance improvements.
  • Zero-shot reconstruction on unseen datasets yields plausible masked face reconstructions, indicating robust learned representations (see Fig. 2 of original).
  • Pose forecasting (SoMoF, CMU-Mocap, MuPoTS-3D): Achieves new SOTA results (e.g., overall VIM 48.6 vs. prior 49.4; reductions in MPJPE at all time horizons). Pre-training improves robustness and generalization, particularly at long horizons and limited data.
  • Social grouping (JRDB-Act): Best mean AP (32.4) on test, outperforming prior methods and scratch-trained baselines.
  • Social action detection (JRDB-Act): Outperforms prior work in mAP (11.2 test), with a +1.3 mAP advantage over scratch.
  • Ablations confirm the tube-masking optimality at 50%, decoder depth of 3, and the positive effect of comprehensive pre-training data coverage.
  • Data efficiency: Outperforms scratch with substantially fewer fine-tuning samples.

6. Distinctive Characteristics, Insights, and Practical Considerations

  • Representation learning: Both frameworks demonstrate that masked modeling—whether on audiovisual tokens or DCT-encoded multi-person trajectories—yields representations with strong transferability to high-level, socially relevant downstream tasks.
  • Masking and fusion strategy: High random mask ratios and mid-level fusion architectures (for multimodal data) or tube masking (for multi-person motion) are central for enforcing the exploitation of inter-token and cross-modality/cross-individual dependencies.
  • Data efficiency: Pre-training with Social-MAE configurations markedly improves performance in low-supervision and long-horizon scenarios.
  • Practical deployment: Efficient encoder-decoder splits, frequency-domain reductions, and relatively shallow transformer stacks keep training and inference costs competitive with corresponding baseline architectures.
  • Generalization: Representations learned via Social-MAE are robust as evidenced by zero-shot reconstructions and transfer performance across diverse social perception and behavioral analysis tasks.

Social-MAE, in both multimodal and multi-person instantiations, provides a unified strategy for social perception, achieving state-of-the-art results with minimal task-specific architectural modifications during fine-tuning and establishing the value of masked autoencoding for social representation learning in complex human-centered scenarios (Bohy et al., 24 Aug 2025, Ehsanpour et al., 2024).
