
ES-MoE for Lifelong Empathic Motion Generation

Updated 29 December 2025
  • The paper introduces ES-MoE, a novel framework integrating causal-guided emotion decoupling with scenario-adapted mixture-of-experts to achieve continual emotional motion generation.
  • It employs VQ-VAE tokenization, low-rank adaptation, and a dedicated gating network to efficiently learn and adapt to new emotional scenarios while minimizing forgetting.
  • Quantitative evaluations show improved metrics such as FID, R-Precision, and reduced forgetting compared to existing multi-task and sequential adaptation methods.

Emotion-Transferable and Scenario-Adapted Mixture of Experts (ES-MoE) is a methodological paradigm for lifelong empathic motion generation and expressive style transfer. ES-MoE is designed to enable artificial agents, especially LLMs, to robustly encode, transfer, and generate emotional expression across a continually expanding set of scenarios—while avoiding catastrophic forgetting and preserving the fidelity of both seen and novel emotional cues. The ES-MoE framework unifies causal-guided emotion decoupling with scenario-adapted, expert-augmented parameterization, leveraging both mixture-of-experts (MoE) architectures and recent advances in low-rank adaptation for continual lifelong learning (Wang et al., 22 Dec 2025). Closely related work in expressive text-to-speech demonstrates MoE-based style encoding for cross-domain emotional and prosodic generalization (Jawaid et al., 2024).

1. Core Challenges in Lifelong Empathic Generation

The central challenges in articulated, human-centric motion and expressive generation are (a) emotion decoupling and (b) scenario adaptation, which underpin the requirements of generalization and memory retention in lifelong learning settings:

  • Emotion Decoupling: Involves extracting and reusing shared affective signals (e.g., slumped posture for “sad”) that are invariant across scenario boundaries, rather than entangling these with scenario-specific movement idiosyncrasies. Failure to separate core emotional cues from contextual features compromises zero-shot transfer and leads to misinterpretation—e.g., reading emotionally neutral actions in novel domains as affective.
  • Scenario Adaptation: Requires acquiring new, scenario-specific movement schemas (e.g., sports, dance, acrobatics) while preserving prior scenario knowledge, thus preventing catastrophic forgetting. Embodied agents must maintain previous emotional mappings (how “happy” or “sad” manifest in “daily life” versus “shows”) even as their behavioral repertoire expands.

These challenges are formalized within the LLM-Centric Lifelong Empathic Motion Generation (L2-EMG) task, where models are continually trained on a sequence of motion datasets differing in scenario and emotion, demanding both efficient knowledge uptake and robust transfer (Wang et al., 22 Dec 2025).

2. ES-MoE Architecture: Tokenization, Causal Decoupling, and Expert Adaptation

2.1 Motion Tokenization via VQ-VAE

A vector-quantized variational autoencoder (VQ-VAE) transforms raw 3D joint sequences $m_o$ into discrete “motion tokens” $m_t$, providing a symbolic interface for downstream LLMs. The VQ-VAE comprises an encoder $\mathcal{E}$, codebook $\mathcal{C}$, quantizer $\text{Quan}(\cdot)$, and decoder $\mathcal{D}$, optimized by

$$L_{vq} = L_{re} + L_{embed} + L_{commit}$$

where $L_{re}$ (reconstruction), $L_{embed}$ (embedding), and $L_{commit}$ (codebook commitment) enforce quantized, discrete latent structure amenable to compositional reasoning.
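
The quantization step and the three loss terms can be sketched as follows. This is a minimal illustration assuming the standard VQ-VAE formulation (nearest-neighbor codebook lookup with stop-gradient terms), not the paper's implementation; the identity encoder and decoder, the 2-D "frames", and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


def vq_loss(motion, recon, latents, codebook, tokens):
    """L_vq = L_re + L_embed + L_commit, each averaged over frames."""
    n = len(latents)
    l_re = sum(sq_dist(m, r) for m, r in zip(motion, recon)) / n
    # With autograd, L_embed stops gradients on the encoder output (so it
    # updates the codebook) and L_commit stops gradients on the codebook (so
    # it updates the encoder). Without autograd both have the same value.
    l_embed = sum(sq_dist(codebook[t], z) for t, z in zip(tokens, latents)) / n
    l_commit = l_embed
    return l_re + l_embed + l_commit


# Toy setup (hypothetical): identity encoder and decoder on 2-D frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
motion = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]
latents = motion                          # identity encoder
tokens = quantize(latents, codebook)      # discrete motion tokens
recon = [codebook[t] for t in tokens]     # identity decoder
loss = vq_loss(motion, recon, latents, codebook, tokens)
print(tokens, round(loss, 4))
```

The list `tokens` is what a downstream LLM would consume; the decoder maps generated tokens back to joint sequences.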

2.2 Causal-Guided Emotion Decoupling Block

The causal-guided emotion decoupling block (CGED) operationalizes front-door adjustment [Pearl 2018] to separate causal emotion features ($M$) from scenario-sensitive confounders ($C$):

  • Causal Structure (Fig. 3c): $X$ (input features) → $M$ (decoupled features) → $Y$ (emotion label); $X \leftarrow C \rightarrow Y$.
  • Front-Door Adjustment:

$$P(Y \mid do(X{=}x)) = \sum_m P(M{=}m \mid X{=}x) \sum_{x'} P(X{=}x')\, P(Y \mid X{=}x', M{=}m) \quad \text{(Eqn 1)}$$

  • NWGM Approximation: Attention-based reweighting yields

$$P(Y \mid do(X{=}x)) \approx \text{Softmax}(\phi(s_x, s_m))$$

with $s_x, s_m$ computed from parametric projections and global cluster centroids (see Table 1 below). The emotion-constraint loss

$$L_{emo} = \text{CrossEntropy}(y_e, \hat{y}_e)$$

imposes supervision on decoupled emotion representations.

| CGED Variable | Description | Mechanism |
|---|---|---|
| $X$ | Input features (e.g., motion tokens) | Projected to queries/keys |
| $M$ | Decoupled emotion features | Parametric/probabilistic mapping |
| $C$ | Confounder variables (scenario/identity) | Cluster centers (K-means) |
| $Y$ | Emotion label | Cross-entropy supervised |
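
To make the front-door formula (Eqn 1) concrete, the sketch below evaluates it exactly on a small discrete example. The probability tables are made-up numbers chosen for illustration; ES-MoE itself approximates this sum with attention rather than enumerating it.

```python
def front_door(x, p_x, p_m_given_x, p_y_given_xm):
    """P(Y=1 | do(X=x)) via front-door adjustment over a discrete mediator M.

    p_x[x']             : P(X = x')
    p_m_given_x[x][m]   : P(M = m | X = x)
    p_y_given_xm[x'][m] : P(Y = 1 | X = x', M = m)
    """
    total = 0.0
    for m, p_m in enumerate(p_m_given_x[x]):
        # Inner sum: average over x' to block the back-door path through C.
        inner = sum(p_x[xp] * p_y_given_xm[xp][m] for xp in range(len(p_x)))
        total += p_m * inner
    return total


# Hypothetical binary tables (X, M, Y each take values 0 or 1).
p_x = [0.5, 0.5]
p_m_given_x = [[0.8, 0.2], [0.25, 0.75]]
p_y_given_xm = [[0.1, 0.6], [0.3, 0.8]]

effect_x0 = front_door(0, p_x, p_m_given_x, p_y_given_xm)
effect_x1 = front_door(1, p_x, p_m_given_x, p_y_given_xm)
print(round(effect_x0, 3), round(effect_x1, 3))
```

For $x = 0$ the inner sums are 0.2 and 0.7, so the result is $0.8 \cdot 0.2 + 0.2 \cdot 0.7 = 0.30$; for $x = 1$ it is $0.25 \cdot 0.2 + 0.75 \cdot 0.7 = 0.575$.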

2.3 Scenario-Adapted Expert Construction

For each scenario $S_i$, a LoRA-based expert is constructed with low-rank parameter updates $\Delta\theta_i = A_i B_i$, aggregated into the forward model:

$$\theta' = \theta_0 + \sum_{j=1}^{i} W_j \Delta\theta_j \quad \text{(Eqn 3)}$$

Here, $\theta_0$ is the shared backbone (e.g., LLaMA2), and $W_j$ the gating weight assigned to expert $j$.
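
The aggregation in Eqn 3 can be illustrated on a single weight matrix. The sketch below is a pure-Python toy with rank-1 experts and hand-picked gating weights; the shapes, values, and function names are assumptions for illustration, and a real model would apply this per adapted layer with tensor operations.

```python
def matmul(a, b):
    """Multiply matrix a (n x r) by matrix b (r x m), both as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_experts(theta0, experts, weights):
    """Return theta0 + sum_j W_j * (A_j @ B_j) without modifying theta0."""
    merged = [row[:] for row in theta0]
    for (a, b), w in zip(experts, weights):
        delta = matmul(a, b)  # low-rank update of expert j
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += w * value
    return merged


theta0 = [[1.0, 0.0], [0.0, 1.0]]            # frozen backbone weight (toy)
experts = [
    ([[1.0], [0.0]], [[0.0, 2.0]]),          # expert 1: A is 2x1, B is 1x2
    ([[0.0], [1.0]], [[4.0, 0.0]]),          # expert 2
]
weights = [0.25, 0.75]                        # gating weights W_1, W_2
merged = merge_experts(theta0, experts, weights)
print(merged)
```

Because each expert stores only $A_j$ and $B_j$, adding a scenario costs far fewer parameters than a full copy of the backbone.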

Gating Network

The gating network computes

$$W_i = \text{Gate}(h, K_i) = \frac{\exp(q(h)\cdot K_i)}{\sum_{j=1}^{i} \exp(q(h)\cdot K_j)} \quad \text{(Eqn 4)}$$

with $h$ the emotion-highlighted input from CGED, $K_i$ an orthogonally-initialized expert key, and $q(\cdot)$ a down-up projection for key alignment. New scenario experts are added online; only the new $\{A_i, B_i, K_i\}$ are trainable on $D_i$ while predecessors are frozen (optionally stochastically masked).
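
A minimal sketch of the gating computation in Eqn 4 follows. It assumes the projected query $q(h)$ is already available as a vector (the down-up projection is omitted) and uses unit basis vectors as a stand-in for orthogonally initialized keys; both choices are illustrative.

```python
import math


def gate(q_h, keys):
    """Softmax over dot products between the projected query and expert keys."""
    scores = [sum(a * b for a, b in zip(q_h, key)) for key in keys]
    peak = max(scores)                        # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # two existing experts
q_h = [2.0, 0.0, 0.0]                         # query aligned with expert 1
w_two = gate(q_h, keys)

keys.append([0.0, 0.0, 1.0])                  # a new scenario adds a key
w_three = gate(q_h, keys)
print([round(w, 3) for w in w_two], [round(w, 3) for w in w_three])
```

Because the softmax normalizes over all keys seen so far, adding an expert redistributes some weight away from existing experts even when the query is unchanged.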

2.4 Optimization and Lifelong Learning Protocol

Overall training integrates (stage 1) VQ-VAE quantizer loss $L_{vq}$ and (stage 2) scenario-conditioned MoE loss

$$L = L_{LLM} + \lambda L_{emo}$$

where $L_{LLM}$ is token/sequence modeling loss and $L_{emo}$ enforces emotion discriminability in decoupled representations. Lifelong learning proceeds by incrementally instantiating new experts and recomputing gated aggregations as new scenarios are observed.
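
The bookkeeping behind this protocol can be sketched as below: each new scenario gets a fresh trainable expert while earlier experts are frozen, and the stage-2 objective combines the two losses. The class names, the `trainable` flag, and the value of $\lambda$ are hypothetical; the source does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Expert:
    scenario: str
    trainable: bool = True


@dataclass
class ExpertPool:
    experts: list = field(default_factory=list)

    def add_expert(self, scenario):
        """Freeze all existing experts, then append a trainable one."""
        for expert in self.experts:
            expert.trainable = False
        new_expert = Expert(scenario)
        self.experts.append(new_expert)
        return new_expert


def total_loss(l_llm, l_emo, lam):
    """Stage-2 objective: L = L_LLM + lambda * L_emo."""
    return l_llm + lam * l_emo


pool = ExpertPool()
for scenario in ["Daily Life", "Sports", "Dance"]:
    pool.add_expert(scenario)                 # training on D_i would go here

flags = [expert.trainable for expert in pool.experts]
print(flags, total_loss(2.0, 0.5, lam=0.5))   # lam=0.5 is an arbitrary choice
```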

3. Datasets and Continual Learning Splits

ES-MoE is benchmarked on composite and scenario-varied datasets designed for lifelong empathic motion generation (Wang et al., 22 Dec 2025):

  • Motion Tokenizer: Trained on the union of EmotionalT2M (with six basic emotions) and Motion-X, both containing text–motion pairs.
  • Continual Scenario Training Set: Eight scenarios—Daily Life, Sports, Dance, Shows, Game, Animation, Instrument Play, Acrobatics—totaling 19,916 samples, each with labeled emotions (Sad, Angry, Happy, Fear, etc.).
  • Training Regimes: Two splits:
    • Unseen: Sequential fine-tuning $S_1 \to S_2 \to \ldots \to S_8$ to stress continual adaptation and retention.
    • Mixed: Each subset $D_i$ augments a primary scenario with a smoothed fraction of data from other scenarios.

| Split | Description | Sequence |
|---|---|---|
| Unseen | Sequential scenario expansion | $S_1 \to S_2 \to \ldots \to S_8$ |
| Mixed | Primary scenario plus fractions of other scenarios | Each $D_i$: primary + others |

Train/validation/test ratio per subset is 0.80/0.05/0.15.
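
A per-subset split with these ratios might look like the sketch below. The shuffling, the fixed seed, and the rounding rule are assumptions; the source states only the ratios.

```python
import random


def split_subset(samples, ratios=(0.80, 0.05, 0.15), seed=0):
    """Shuffle one scenario subset and cut it into train/val/test parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)        # seeded for reproducibility
    n_train = round(ratios[0] * len(items))
    n_val = round(ratios[1] * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    held_out = items[n_train + n_val:]        # remainder goes to test
    return train, val, held_out


train, val, held_out = split_subset(range(200))
print(len(train), len(val), len(held_out))
```

Assigning the remainder to the test part guarantees that every sample lands in exactly one split even when the ratios do not divide the subset size evenly.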

4. Evaluation Metrics, Comparative Results, and Ablations

Comprehensive evaluation covers diversity, emotion transfer, and forgetting, using the following quantitative metrics:

  • Average FID (AF): Lower is better. Measures distributional similarity to ground truth motions.
  • R-Precision (AR): Higher indicates better alignment between generated and reference sequences.
  • Diversity (AD) / Multimodality (AMM): Higher is desirable for diverse, multi-tone outputs.
  • Emotion-Weighted F1 (AWF): Captures accuracy conditioned on emotion.
  • Forgetting Rate (FR): Lower (more negative) is better; it indicates how well knowledge from earlier scenarios is retained.
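
As an illustration of how a retrieval-style metric such as R-Precision is typically computed, the sketch below ranks candidates by distance in a shared embedding space and checks whether the matching item appears in the top $k$. It is a simplified version under stated assumptions: embeddings are given, the match for query $i$ sits at candidate index $i$, and the toy vectors are invented. The paper's exact evaluation protocol may differ.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def r_precision(query_embs, cand_embs, k=3):
    """Fraction of queries whose matching candidate ranks in the top k."""
    hits = 0
    for i, query in enumerate(query_embs):
        ranked = sorted(
            range(len(cand_embs)), key=lambda j: sq_dist(query, cand_embs[j])
        )
        hits += i in ranked[:k]               # True counts as 1
    return hits / len(query_embs)


# Toy 2-D embeddings: pairs 0 and 1 are well aligned, pairs 2 and 3 are not.
text_embs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
motion_embs = [[0.1, 0.0], [0.9, 0.1], [4.0, 4.0], [0.2, 0.9]]

top1 = r_precision(text_embs, motion_embs, k=1)
top3 = r_precision(text_embs, motion_embs, k=3)
print(top1, top3)
```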

ES-MoE, using a LLaMA2 7B backbone and LoRA adaptation, surpasses multitask learning (MTL), sequential LoRA, Learning without Forgetting with LoRA (LwF-LoRA), EPI, O-LoRA, Prog-Prompt, and SAPT across most metrics (e.g., AF = 1.89 for ES-MoE vs. 2.12 for SAPT; FR = -1.03 vs. -0.54) (Wang et al., 22 Dec 2025).

Ablation studies indicate that removing the CGED block or the emotion loss ($L_{emo}$) results in significant drops in AWF and AR, while omitting the scenario-based MoE (i.e., using a single LoRA expert) leads to markedly worse AF and FR, demonstrating the necessity of both decoupling and expert modularity.

Qualitative motion visualizations and FID(i → j) heatmaps confirm that ES-MoE yields coordinated, emotion-aligned motions with nuanced, scenario-specific expressivity and minimal forgetting of prior domains.

5. Relation to Mixture of Experts in Expressive TTS

The ES-MoE approach builds conceptually on mixture-of-experts methods such as StyleMoE for expressive text-to-speech synthesis (Jawaid et al., 2024). In StyleMoE, style encoding is handled via MoE layers in which each expert models a portion of the style space (emotion, timbre, prosody), with a gating network performing sparse top-$k$ routing based on reference speech features. This design enables subspace specialization and improved zero-shot style transfer, even in unseen speech domains. Style embeddings, learned implicitly, govern emotional and prosodic dimensions without explicit emotion labels, paralleling ES-MoE’s emphasis on transferability and adaptation.
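
Sparse top-$k$ routing of the kind described here is commonly implemented by keeping the $k$ highest gate scores and renormalizing over them. The sketch below follows that common formulation; whether StyleMoE renormalizes in exactly this way is an assumption, and the scores are invented.

```python
import math


def top_k_route(scores, k):
    """Softmax over the k largest scores; all other experts get weight 0."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)        # subtract max for stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]


gate_scores = [2.0, 0.5, 2.0, -1.0]           # one score per style expert
routing = top_k_route(gate_scores, k=2)
print(routing)
```

Only the selected experts need to be evaluated for a given input, which is what makes the routing sparse. By contrast, the ES-MoE gate in Eqn 4 is a dense softmax over all scenario experts.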

Each method is evaluated with a metric suite suited to its domain: intelligibility, quality, prosody, and style fidelity for TTS, and FID, R-Precision, diversity, emotion accuracy, and forgetting for motion generation. Both report state-of-the-art results relative to strong baselines in their respective fields.

6. Extensions, Future Directions, and Broader Impact

The ES-MoE framework provides a transferable architecture for emotion transfer and scenario adaptation in lifelong, high-dimensional sequence generation. Potential future directions include:

  • Generalization to Other Embodied Tasks: Extending ES-MoE to human-object interactions, 4D character animation, or multi-agent coordination.
  • Closed-Loop Real-Time Feedback: Integrating haptic or emotion sensors for robotic agents operating in continually shifting environments.
  • Multimodal Lifelong Learning: Adding modalities such as audio-driven motion, speech-to-emotion alignment, and cross-modal style transfer.
  • Advanced Causal Modeling: Incorporating more elaborate causal graphs accounting for latent intent and environment-driven confounders.

The ES-MoE methodology demonstrates that causal disentanglement and modular parameterization together enable both robust emotion transfer and durable, scenario-dependent skill acquisition in AI-driven generative models (Wang et al., 22 Dec 2025, Jawaid et al., 2024).
