ES-MoE for Lifelong Empathic Motion Generation
- The paper introduces ES-MoE, a novel framework integrating causal-guided emotion decoupling with scenario-adapted mixture-of-experts to achieve continual emotional motion generation.
- It employs VQ-VAE tokenization, low-rank adaptation, and a dedicated gating network to efficiently learn and adapt to new emotional scenarios while minimizing forgetting.
- Quantitative evaluations show improved metrics such as FID, R-Precision, and reduced forgetting compared to existing multi-task and sequential adaptation methods.
Emotion-Transferable and Scenario-Adapted Mixture of Experts (ES-MoE) is a framework for lifelong empathic motion generation and expressive style transfer. It is designed to let artificial agents, especially LLMs, encode, transfer, and generate emotional expression across a continually expanding set of scenarios while avoiding catastrophic forgetting and preserving the fidelity of both seen and novel emotional cues. ES-MoE unifies causal-guided emotion decoupling with scenario-adapted, expert-augmented parameterization, drawing on mixture-of-experts (MoE) architectures and low-rank adaptation for continual learning (Wang et al., 22 Dec 2025). Closely related work in expressive text-to-speech demonstrates MoE-based style encoding for cross-domain emotional and prosodic generalization (Jawaid et al., 2024).
1. Core Challenges in Lifelong Empathic Generation
The central challenges in articulated, human-centric motion and expressive generation are (a) emotion decoupling and (b) scenario adaptation, which underpin the requirements of generalization and memory retention in lifelong learning settings:
- Emotion Decoupling: Involves extracting and reusing shared affective signals (e.g., slumped posture for “sad”) that are invariant across scenario boundaries, rather than entangling these with scenario-specific movement idiosyncrasies. Failure to separate core emotional cues from contextual features compromises zero-shot transfer and leads to misinterpretation—e.g., reading emotionally neutral actions in novel domains as affective.
- Scenario Adaptation: Requires acquiring new, scenario-specific movement schemas (e.g., sports, dance, acrobatics) while preserving prior scenario knowledge, thus preventing catastrophic forgetting. Embodied agents must maintain previous emotional mappings (how “happy” or “sad” manifest in “daily life” versus “shows”) even as their behavioral repertoire expands.
These challenges are formalized within the LLM-Centric Lifelong Empathic Motion Generation (L2-EMG) task, where models are continually trained on a sequence of motion datasets differing in scenario and emotion, demanding both efficient knowledge uptake and robust transfer (Wang et al., 22 Dec 2025).
2. ES-MoE Architecture: Tokenization, Causal Decoupling, and Expert Adaptation
2.1 Motion Tokenization via VQ-VAE
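A vector-quantized variational autoencoder (VQ-VAE) transforms raw 3D joint sequences into discrete “motion tokens”, providing a symbolic interface for downstream LLMs. The VQ-VAE comprises an encoder, a codebook, a quantizer, and a decoder, optimized by a three-term objective (weighting coefficients, if any, are not shown):
$$\mathcal{L}_{\text{VQ}} = \mathcal{L}_{\text{rec}} + \mathcal{L}_{\text{emb}} + \mathcal{L}_{\text{commit}}$$
where $\mathcal{L}_{\text{rec}}$ (reconstruction), $\mathcal{L}_{\text{emb}}$ (embedding), and $\mathcal{L}_{\text{commit}}$ (codebook commitment) enforce a quantized, discrete latent structure amenable to compositional reasoning.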
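The quantization step of such a tokenizer can be sketched as a nearest-neighbor codebook lookup. The following is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `quantize` and `commitment_term`, the array shapes, and the toy codebook are hypothetical, and the encoder and decoder networks are omitted.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (T, D) encoder outputs for T frames (hypothetical shapes)
    codebook: (K, D) code vectors
    Returns (tokens, quantized) with shapes (T,) and (T, D).
    """
    # Squared Euclidean distance between every latent and every code: (T, K)
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    tokens = np.argmin(dists, axis=1)   # discrete motion tokens
    quantized = codebook[tokens]        # vectors that would be passed to the decoder
    return tokens, quantized

def commitment_term(latents, quantized):
    """Mean squared gap between encoder outputs and their assigned codes."""
    return float(np.mean((latents - quantized) ** 2))

# Toy usage: three 2-D latents against a three-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [4.0, 4.5]])
tokens, quantized = quantize(latents, codebook)
print(tokens, commitment_term(latents, quantized))
```

The token indices are what a downstream LLM would consume; the commitment-style term shows how far the encoder outputs sit from their assigned codes.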
2.2 Causal-Guided Emotion Decoupling Block
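The causal-guided emotion decoupling block (CGED) operationalizes front-door adjustment [Pearl 2018] to separate causal emotion features from scenario-sensitive confounders:
- Causal Structure (Fig. 3c): input features → decoupled features → emotion label, with the decoupled features serving as the mediator.
- Front-Door Adjustment: The effect of the input on the emotion label is estimated by routing through the mediator and averaging over the input distribution, which removes the bias introduced by the confounders.
- NWGM Approximation: The expectation required by the adjustment is approximated with attention-based reweighting (normalized weighted geometric mean), with its terms computed from parametric projections and global cluster centroids (see Table 1 below).
An emotion-constraint loss, supervised by cross-entropy on the emotion label, imposes supervision on the decoupled emotion representations.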
| CGED Variable | Mechanism |
|---|---|
| Input features (e.g., motion tokens) | Projected to queries/keys |
| Decoupled emotion features | Parametric/probabilistic mapping |
| Confounder variables (scenario/identity) | Cluster centers (K-means) |
| Emotion label | Cross-entropy supervised |
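For reference, the textbook form of the front-door adjustment that CGED builds on is given below. The symbols are assumptions introduced here for readability ($X$ for the input features, $M$ for the decoupled emotion features, $Y$ for the emotion label) and need not match the notation of Wang et al.

```latex
P\big(Y \mid do(X = x)\big) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P\big(Y \mid m, x'\big)\, P(x')
```

The outer sum routes the effect through the mediator $M$, and the inner sum averages over the input distribution. Attention-based approximations such as NWGM are typically used so that these sums need not be evaluated explicitly and the averaging is instead carried out at the feature level.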
2.3 Scenario-Adapted Expert Construction
For each scenario, a LoRA-based expert is constructed from low-rank parameter updates, and the experts are aggregated into the forward model. In this aggregation, the shared backbone (e.g., LLaMA2) is combined with the experts' low-rank updates, each scaled by the gating weight assigned to that expert.
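A gate-weighted combination of LoRA experts on top of a frozen backbone can be sketched for a single linear layer as follows. This is an illustrative NumPy sketch, not the paper's implementation: the function name `lora_moe_forward`, the shapes, and the additive form (backbone output plus a weighted sum of low-rank updates) are assumptions based on how LoRA mixtures are commonly written.

```python
import numpy as np

def lora_moe_forward(x, w_backbone, experts, gates):
    """Backbone output plus a gate-weighted sum of low-rank expert updates.

    x:          (d_in,) input vector
    w_backbone: (d_out, d_in) frozen shared weight (stand-in for one backbone layer)
    experts:    list of (B, A) pairs with B: (d_out, r) and A: (r, d_in)
    gates:      (num_experts,) gating weights, one per expert
    """
    out = w_backbone @ x
    for gate, (b, a) in zip(gates, experts):
        out = out + gate * (b @ (a @ x))   # low-rank update B A x, scaled by its gate
    return out

# Toy usage with three scenario experts of rank 2.
rng = np.random.default_rng(0)
d_in, d_out, rank = 6, 4, 2
w0 = rng.normal(size=(d_out, d_in))
# Common LoRA initialization: B starts at zero so a new expert is initially a no-op.
experts = [(np.zeros((d_out, rank)), rng.normal(size=(rank, d_in))) for _ in range(3)]
x = rng.normal(size=d_in)
y = lora_moe_forward(x, w0, experts, gates=np.array([0.2, 0.3, 0.5]))
```

Because each expert only stores two thin matrices, adding a scenario grows the parameter count by a small amount relative to the backbone.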
Gating Network
The gating network computes each expert's weight from three ingredients: the emotion-highlighted input from CGED, an orthogonally initialized expert key, and a down-up projection for key alignment. New scenario experts are added online; only the newly added expert's parameters are trainable on the current scenario's data, while predecessors are frozen (optionally stochastically masked).
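The gating idea can be illustrated with a small sketch. Everything specific here is an assumption rather than the paper's formula: the softmax over key similarities, the `tanh` bottleneck used as the down-up projection, and the helper names `orthogonal_keys`, `gate_weights`, and `add_expert` are hypothetical. For simplicity the keys are generated in one batch via QR; an online variant would orthogonalize each new key against the existing ones.

```python
import numpy as np

def orthogonal_keys(num_experts, dim, rng):
    """Return (num_experts, dim) keys with orthonormal rows (requires num_experts <= dim)."""
    q, _ = np.linalg.qr(rng.normal(size=(dim, num_experts)))
    return q.T

def gate_weights(x, keys, w_down, w_up):
    """Softmax gate over experts from an emotion-highlighted input x."""
    aligned = w_up @ np.tanh(w_down @ x)   # down-up projection (assumed form)
    scores = keys @ aligned                # one similarity score per expert
    exp = np.exp(scores - scores.max())    # subtract max for numerical stability
    return exp / exp.sum()

def add_expert(trainable):
    """Freeze every existing expert and append one new trainable expert."""
    return [False] * len(trainable) + [True]

rng = np.random.default_rng(0)
dim, bottleneck = 8, 2
keys = orthogonal_keys(3, dim, rng)
w_down = rng.normal(size=(bottleneck, dim))
w_up = rng.normal(size=(dim, bottleneck))
gates = gate_weights(rng.normal(size=dim), keys, w_down, w_up)
trainable = add_expert([False, True])   # the new third expert is the only trainable one
```

Orthogonal keys keep the experts' routing directions from overlapping at initialization, which is one plausible motivation for the orthogonal initialization mentioned above.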
2.4 Optimization and Lifelong Learning Protocol
Training proceeds in two stages: stage 1 optimizes the VQ-VAE quantizer loss, and stage 2 optimizes a scenario-conditioned MoE loss. The stage-2 loss combines a token/sequence modeling loss with an emotion-constraint loss that enforces emotion discriminability in the decoupled representations. Lifelong learning proceeds by incrementally instantiating new experts and recomputing gated aggregations as new scenarios are observed.
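A combined stage-2 objective of this kind can be sketched as a token-level cross-entropy plus a weighted emotion cross-entropy. This is a minimal sketch under assumptions: the weighting factor `emo_weight`, the function names, and the use of plain cross-entropy for both terms are illustrative choices, not values or definitions taken from the paper.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy over the rows of `logits` for integer class `targets`."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def stage2_loss(token_logits, token_targets, emotion_logits, emotion_labels,
                emo_weight=1.0):
    """Sequence-modeling loss plus a weighted emotion-constraint loss."""
    lm = cross_entropy(token_logits, token_targets)       # next-token prediction
    emo = cross_entropy(emotion_logits, emotion_labels)   # emotion discriminability
    return lm + emo_weight * emo

# Toy usage: 5 motion tokens over a 16-entry vocabulary, 1 sample with 6 emotions.
rng = np.random.default_rng(0)
loss = stage2_loss(rng.normal(size=(5, 16)), np.array([1, 4, 2, 9, 0]),
                   rng.normal(size=(1, 6)), np.array([3]), emo_weight=0.5)
```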
3. Datasets and Continual Learning Splits
ES-MoE is benchmarked on composite and scenario-varied datasets designed for lifelong empathic motion generation (Wang et al., 22 Dec 2025):
- Motion Tokenizer: Trained on the union of EmotionalT2M (with six basic emotions) and Motion-X, both containing text–motion pairs.
- Continual Scenario Training Set: Eight scenarios—Daily Life, Sports, Dance, Shows, Game, Animation, Instrument Play, Acrobatics—totaling 19,916 samples, each with labeled emotions (Sad, Angry, Happy, Fear, etc.).
- Training Regimes: Two splits:
  - Unseen: Sequential fine-tuning to stress continual adaptation and retention.
  - Mixed: Each subset augments a primary scenario with a smoothed fraction of data from other scenarios.
| Split | Description | Sequence |
|---|---|---|
| Unseen | Sequential scenario expansion | Scenarios introduced one at a time |
| Mixed | Primary scenario plus fractions of others | Each subset: primary scenario + data from other scenarios |
Train/validation/test ratio per subset is 0.80/0.05/0.15.
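The 0.80/0.05/0.15 partition of one scenario subset can be sketched as below. The function name `split_subset`, the fixed seed, and the choice to let the test portion absorb rounding are assumptions for illustration; the paper's actual splitting procedure is not described here.

```python
import random

def split_subset(samples, ratios=(0.80, 0.05, 0.15), seed=0):
    """Shuffle `samples` and cut them into train/validation/test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)     # seeded for reproducibility
    n_train = round(ratios[0] * len(items))
    n_val = round(ratios[1] * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]         # remainder absorbs any rounding
    return train, val, test

train, val, test = split_subset(range(100))
print(len(train), len(val), len(test))
```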
4. Evaluation Metrics, Comparative Results, and Ablations
Comprehensive evaluation covers diversity, emotion transfer, and forgetting, using the following quantitative metrics:
- Average FID (AF): Lower is better. Measures distributional similarity to ground truth motions.
- R-Precision (AR): Higher indicates better alignment between generated and reference sequences.
- Diversity (AD) / Multimodality (AMM): Higher is desirable for diverse, multi-tone outputs.
- Emotion-Weighted F1 (AWF): Captures accuracy conditioned on emotion.
- Forgetting Rate (FR): Lower (including negative) is better; indicates retention of previous knowledge.
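To make the notion of forgetting concrete, the sketch below computes one common continual-learning definition: for each earlier scenario, the gap between its best score at any earlier stage and its score after the final stage, averaged over scenarios. This is an assumption for illustration and may differ from the exact FR formula in Wang et al.; the function name `average_forgetting` and the score matrix are hypothetical. Under this definition a negative value means earlier scenarios improved after later training.

```python
def average_forgetting(scores):
    """Average forgetting from a stage-by-scenario score matrix.

    scores[i][j] is a higher-is-better metric on scenario j measured after
    training stage i. Entries with j > i (scenario not yet seen) are ignored.
    """
    last = len(scores) - 1
    if last < 1:
        return 0.0                          # nothing can be forgotten yet
    drops = []
    for j in range(last):
        best_earlier = max(scores[i][j] for i in range(j, last))
        drops.append(best_earlier - scores[last][j])
    return sum(drops) / len(drops)

# Toy usage: three training stages, three scenarios (made-up numbers).
toy_scores = [
    [0.90, 0.00, 0.00],
    [0.80, 0.70, 0.00],
    [0.60, 0.75, 0.50],
]
print(average_forgetting(toy_scores))
```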
ES-MoE, using a LLaMA2 7B backbone and LoRA adaptation, surpasses multitask learning (MTL), sequential LoRA, Learning without Forgetting with LoRA (LwF-LoRA), EPI, O-LoRA, Prog-Prompt, and SAPT across most metrics (e.g., AF = 1.89 for ES-MoE vs. 2.12 for SAPT; FR = −1.03 vs. −0.54) (Wang et al., 22 Dec 2025).
Ablation studies indicate that removing the CGED block or the emotion loss significantly reduces AWF and AR, while replacing the scenario-based MoE with a single LoRA expert markedly worsens AF and FR. Together these results show that both decoupling and expert modularity are necessary.
Qualitative motion visualizations and FID(i → j) heatmaps confirm that ES-MoE yields coordinated, emotion-aligned motions with nuanced, scenario-specific expressivity and minimal forgetting of prior domains.
5. Relation to Mixture of Experts in Expressive TTS
The ES-MoE approach builds conceptually on mixture-of-experts methods such as StyleMoE for expressive text-to-speech synthesis (Jawaid et al., 2024). In StyleMoE, style encoding is handled via MoE layers in which each expert models a portion of the style space (emotion, timbre, prosody), with a gating network performing sparse top-$k$ routing based on reference speech features. This design enables subspace specialization and improved zero-shot style transfer, even in unseen speech domains. Style embeddings, learned implicitly, govern emotional and prosodic dimensions without explicit emotion labels, paralleling ES-MoE’s emphasis on transferability and adaptation.
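Sparse top-k routing of the kind described for StyleMoE can be sketched generically as follows. This is a standard sparse softmax gate written for illustration, not StyleMoE's exact gating function; the name `top_k_gate` and the toy logits are assumptions.

```python
import numpy as np

def top_k_gate(logits, k):
    """Keep the k largest gate logits, softmax over them, and zero the rest."""
    logits = np.asarray(logits, dtype=float)
    top = np.argsort(logits)[-k:]                   # indices of the k largest logits
    weights = np.zeros_like(logits)
    exp = np.exp(logits[top] - logits[top].max())   # stable softmax over survivors
    weights[top] = exp / exp.sum()
    return weights

# Toy usage: four experts, route to the best two.
weights = top_k_gate([1.0, 3.0, 2.0, 0.0], k=2)
print(weights)
```

Only the selected experts receive nonzero weight, so the remaining experts need not be evaluated for that input.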
Each method is evaluated with a metric suite suited to its domain (intelligibility, quality, prosody, and style fidelity for TTS; the motion metrics above for ES-MoE), and both report improvements over strong baselines in their respective fields.
6. Extensions, Future Directions, and Broader Impact
The ES-MoE framework provides a transferable architecture for emotion transfer and scenario adaptation in lifelong, high-dimensional sequence generation. Potential future directions include:
- Generalization to Other Embodied Tasks: Extending ES-MoE to human-object interactions, 4D character animation, or multi-agent coordination.
- Closed-Loop Real-Time Feedback: Integrating haptic or emotion sensors for robotic agents operating in continually shifting environments.
- Multimodal Lifelong Learning: Adding modalities such as audio-driven motion, speech-to-emotion alignment, and cross-modal style transfer.
- Advanced Causal Modeling: Incorporating more elaborate causal graphs accounting for latent intent and environment-driven confounders.
The ES-MoE methodology demonstrates that causal disentanglement and modular parameterization together enable both robust emotion transfer and durable, scenario-dependent skill acquisition in AI-driven generative models (Wang et al., 22 Dec 2025, Jawaid et al., 2024).