
Emotion-LLaMAv2: Multimodal Emotion Recognition

Updated 30 January 2026
  • The paper introduces an end-to-end multimodal architecture that fuses raw audio and video signals with conv-attention pre-fusion for precise emotion recognition.
  • It employs a two-stage curriculum instruction tuning strategy, transitioning from categorical perception to chain-of-thought reasoning to enhance performance.
  • The framework achieves state-of-the-art results on the MMEVerse benchmark by integrating advanced multiview encoding and diverse multimodal datasets.

Emotion-LLaMAv2 is a multimodal LLM architecture and framework designed for robust, instruction-tuned emotion recognition and reasoning from complex multimodal signals. Developed as a successor to the initial Emotion-LLaMA, it integrates advanced end-to-end multiview encoding, sophisticated cross-modal fusion, and curriculum-based instruction tuning to address key limitations in emotional AI: lack of scalable high-quality data, suboptimal multimodal fusion, and limited reasoning ability. Together with the MMEVerse benchmark—a consolidated, re-annotated corpus of twelve major emotion datasets—Emotion-LLaMAv2 establishes a state-of-the-art evaluation and modeling pipeline for affective computing, human-computer interaction, and multimodal emotion research (Peng et al., 23 Jan 2026).

1. End-to-End Multiview Encoder and Multimodal Tokenization

Emotion-LLaMAv2 begins with an end-to-end, feedforward encoder that processes raw audio and video streams without external face detectors or late feature extraction. The architecture utilizes:

  • Audio Encoder: Processes the raw waveform $A \in \mathbb{R}^{T \times 1}$ (typically 16 kHz) with HuBERT or Whisper, producing high-level time-embedded tokens.
    • Temporal pooling is applied, $T^a = \text{Pool1D}(u^a) \in \mathbb{R}^{N_a \times d}$, yielding audio tokens.
  • Global Visual Encoder: Extracts high-level spatial features from a "middle" frame $f_{\text{mid}}$ of the video using EVA-ViT, yielding $N_{\text{glo}}$ global patch tokens.
  • Temporal Visual Encoder: Consumes $F'$ uniformly sampled frames via VideoMAE or frame-wise EVA, followed by spatial pooling to produce $N_{\text{temp}}$ temporal tokens.

All stream outputs are mapped to a unified $d$-dimensional latent space via small MLP projections $\phi$, constructing two intermediate tensors: a channel/depth-wise blend $\mathbf{F}_d$ and a length-3 token stack $\mathbf{F}_s$.
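The projection step above can be sketched in PyTorch. This is a minimal illustration, not the released implementation: the encoder output widths, token counts, and the pooling used to align token lengths for the depth-wise blend are all assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the multiview projection step: three encoder outputs (audio,
# global visual, temporal visual) are mapped into a shared d-dim space and
# combined two ways. All dimensions below are illustrative assumptions.
d = 768                            # unified latent width (assumed)
N_a, N_glo, N_temp = 64, 32, 32    # per-stream token counts (assumed)

# Small per-stream MLP projections (the phi maps in the text)
phi_a   = nn.Sequential(nn.Linear(1024, d), nn.GELU(), nn.Linear(d, d))
phi_glo = nn.Sequential(nn.Linear(1408, d), nn.GELU(), nn.Linear(d, d))
phi_tmp = nn.Sequential(nn.Linear(768, d),  nn.GELU(), nn.Linear(d, d))

u_a   = torch.randn(1, N_a, 1024)    # pooled audio tokens (HuBERT/Whisper)
u_glo = torch.randn(1, N_glo, 1408)  # EVA-ViT patch tokens, middle frame
u_tmp = torch.randn(1, N_temp, 768)  # pooled VideoMAE temporal tokens

t_a, t_glo, t_tmp = phi_a(u_a), phi_glo(u_glo), phi_tmp(u_tmp)

# F_s: the streams stacked along the token axis (the "token stack").
F_s = torch.cat([t_a, t_glo, t_tmp], dim=1)   # (1, N_a + N_glo + N_temp, d)

# F_d: a channel/depth-wise blend; it needs equal token counts, so here
# each stream is adaptively pooled to a common length first (assumption).
def pool_tokens(x: torch.Tensor, n: int) -> torch.Tensor:
    return nn.functional.adaptive_avg_pool1d(x.transpose(1, 2), n).transpose(1, 2)

n_common = 32
F_d = torch.cat(
    [pool_tokens(t_a, n_common), pool_tokens(t_glo, n_common), pool_tokens(t_tmp, n_common)],
    dim=-1,
)  # (1, n_common, 3 * d)

print(F_s.shape, F_d.shape)
```

The two tensors carry the same information in different layouts: $\mathbf{F}_s$ preserves per-stream token granularity, while $\mathbf{F}_d$ aligns the streams position-wise for depth mixing.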

This multiview encoder captures both nuanced frame-level facial cues and audio dynamics, substantially extending the representational expressivity and directness over previous approaches requiring explicit face detection or single-modality context (Peng et al., 23 Jan 2026).

2. Conv-Attention Pre-Fusion Module

A dedicated pre-fusion module enables rich interaction between local and global multimodal features prior to entering the LLM backbone:

  • Attention Branch: Projects the mixed-token tensors into query, key, and value representations through learned $d \times d$ maps ($Q, K, V$), applies scaled dot-product attention, and produces an attended feature matrix $\mathbf{F}_{\text{attn}}$.
  • Convolutional Branch: Iteratively applies 1D convolution with a gated "Switch" activation over the depth-concatenated features, accumulating local temporal and spatial patterns into $\mathbf{F}_{\text{conv}}^{N}$.

The outputs of both branches are fused as $u_f = \mathbf{F}_{\text{attn}} + \mathbf{F}_{\text{conv}}^{N}$, providing a potent joint multimodal embedding. The design omits explicit positional encoding, relying on token order and the convolutional receptive field to encode local temporal structure (Peng et al., 23 Jan 2026).
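A minimal PyTorch sketch of the two branches and their additive fusion follows. The kernel size, layer count, and sigmoid-gated form of the "Switch" activation are assumptions for illustration; the paper's exact definitions are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of conv-attention pre-fusion: a scaled dot-product attention branch
# plus an iterated, gated 1D-conv branch, fused by addition. Hyperparameters
# and the gate form are illustrative assumptions.
class ConvAttnPreFusion(nn.Module):
    def __init__(self, d: int, n_conv_layers: int = 2):
        super().__init__()
        self.q = nn.Linear(d, d, bias=False)   # d x d query projection
        self.k = nn.Linear(d, d, bias=False)   # d x d key projection
        self.v = nn.Linear(d, d, bias=False)   # d x d value projection
        # Each conv layer doubles channels; half gates the other half.
        self.convs = nn.ModuleList(
            nn.Conv1d(d, 2 * d, kernel_size=3, padding=1) for _ in range(n_conv_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, d)
        # Attention branch: no positional encoding; token order is implicit.
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = F.softmax(q @ k.transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)
        f_attn = attn @ v

        # Convolutional branch with a gated activation ("Switch"-style gate,
        # assumed sigmoid here), accumulating local patterns layer by layer.
        h = x.transpose(1, 2)                  # (B, d, N) for Conv1d
        for conv in self.convs:
            a, g = conv(h).chunk(2, dim=1)     # split into signal and gate
            h = a * torch.sigmoid(g)
        f_conv = h.transpose(1, 2)

        return f_attn + f_conv                 # additive fusion u_f

fusion = ConvAttnPreFusion(d=64)
u_f = fusion(torch.randn(2, 16, 64))
print(u_f.shape)
```

The padding keeps the token length fixed, so both branches emit tensors of identical shape and the additive fusion is well defined.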

3. Perception-to-Cognition Curriculum Instruction Tuning

The training protocol is organized as a two-stage, curriculum-tuned instruction paradigm, supported by Low-Rank Adaptation (LoRA) for efficient backbone updates:

  • Stage 1 (Perception / Recognition): Prompts focus only on categorical emotion recognition (e.g., "Identify the emotion from {anger, fear, ...}.") using a standard cross-entropy loss

$$\mathcal{L}_{\text{recog}} = -\sum_{c=1}^{C} y_c \log p(c \mid \text{features})$$

for $C$ native classes. This stage lasts for the initial 50K training steps.

  • Stage 2 (Cognition / Reasoning): Prompts require chain-of-thought reasoning in a structured "...<answer>" response format. A language modeling objective is introduced:

$$\mathcal{L}_{\text{reason}} = -\sum_{t=1}^{T} \log p(x_t^* \mid X, x_{<t}^*)$$

The curriculum schedule switches from pure recognition ($\alpha(t)=1$) to pure reasoning ($\alpha(t)=0$) after 50K steps. LoRA is applied to the LLaMA2-chat-7B backbone (rank 64, $\alpha=16$, peak LR $1\times 10^{-4}$), training with mixed precision on large-scale audio-visual-textual input without modifying the frozen base weights outside the adapters (Peng et al., 23 Jan 2026).

4. MMEVerse Benchmark: Large-Scale, Unified Multimodal Dataset

Emotion-LLaMAv2 is trained and evaluated on the MMEVerse benchmark, a unified, large-scale corpus aggregating twelve publicly available tri-modal datasets into 130K training and 36K testing clips. Key aspects:

  • Datasets: Includes MER2023, MELD, IEMOCAP, CAER, E³, DFEW, MAFW, MC-EIU, CMU-MOSI, CMU-MOSEI, CH-SIMS v2, and BOLD, covering basic emotion, sentiment, intent, and VAD categories across diverse domains and languages.
  • Annotation Pipeline: A multi-agent protocol combines OpenFace AU peak-frame detection, Qwen2.5-VL for visual context, and Qwen2-Audio for audio tone, integrates lexical subtitles, and finishes with GPT-4o consolidation and selective human verification (600-sample subset, $\kappa \approx 0.65$). Outputs are re-formatted as concise, instruction-ready packets for cross-dataset uniformity (Peng et al., 23 Jan 2026).
  • No Unified Label Space: Each dataset retains its native class ontology, with evaluation performed on native splits; there is no forced mapping to a global taxonomy.
  • Quality Control: Prompts are designed for precision; model-based and human validations ensure multimodal consistency (modal agreement $> 0.90$).
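The hard hand-off between the two training objectives in Section 3 can be sketched as a step-indexed loss mixer. Only the 50K-step switch point and the $\alpha(t)$ endpoints come from the text; the function names and the combined-loss form are illustrative assumptions.

```python
# Illustrative sketch of the perception-to-cognition curriculum weighting:
# alpha(t) = 1 (pure recognition) before the switch step and 0 (pure
# reasoning) after it, per the text. Everything else here is assumed.
SWITCH_STEP = 50_000

def alpha(step: int) -> float:
    """Curriculum weight: 1.0 during Stage 1, 0.0 during Stage 2."""
    return 1.0 if step < SWITCH_STEP else 0.0

def curriculum_loss(step: int, loss_recog: float, loss_reason: float) -> float:
    """Blend the recognition and reasoning objectives by alpha(t)."""
    a = alpha(step)
    return a * loss_recog + (1.0 - a) * loss_reason

# Stage 1: only the cross-entropy recognition loss contributes.
print(curriculum_loss(10_000, loss_recog=0.8, loss_reason=2.5))  # -> 0.8
# Stage 2: only the language-modeling reasoning loss contributes.
print(curriculum_loss(60_000, loss_recog=0.8, loss_reason=2.5))  # -> 2.5
```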
5. Experimental Results and Comparative Evaluation

Emotion-LLaMAv2 demonstrates state-of-the-art performance across diverse emotion tasks assessed on the MMEVerse-Bench and EMER Reasoning benchmarks. Core findings:

| Benchmark / Metric          | AffectGPT | Qwen2.5-7B | Emotion-LLaMAv2 |
|-----------------------------|-----------|------------|-----------------|
| MER-UniBench (9 sets)       | 74.77%    | 67.16%     | 78.52%          |
| MMEVerse-Bench (18 sets)    | 54.11%    | 52.73%     | 66.63%          |
| EMER Reasoning (Clue/Label) | 5.87/5.79 | —          | 7.30/7.14       |

  • Ablation Studies: Removing Conv-Attention pre-fusion reduces test accuracy by approximately 0.9 percentage points; single-stage joint training, as opposed to the two-stage curriculum, results in a drop of roughly 2.4 points. Whisper-based audio encoding outperforms HuBERT (66.05% vs. 61.44%). Optimal performance is achieved with 16 video frames and 64 audio tokens per sample.
  • Evaluation Metrics: Native metrics are used: hit rate/accuracy (emotion, intent), weighted F1 (sentiment), mean Average Precision (multi-label), and Clue/Label Overlap (reasoning), mirroring the diverse dataset label spaces and task ontologies (Peng et al., 23 Jan 2026).

6. Architectural Advances and Theoretical Implications

Emotion-LLaMAv2 departs from prior MLLMs in three dimensions:

  • Direct End-to-End Encoding: Reduces the error propagation and information loss intrinsic to external face detectors and late fusion, allowing facial micro-expressions, prosody, and complex temporal cues to be captured holistically.
  • Conv-Attention Pre-fusion: Enables simultaneous modeling of local (short-term) and global (long-range) interactions between modalities, as opposed to implicit or late multimodal attention strategies.
  • Curriculum Instruction Tuning: The perception-to-cognition curriculum achieves stepwise enhancement, first training core recognition and then free-form causal reasoning, evidenced by improved Clue/Label Overlap and cross-task generalization.
A plausible implication is that multimodal LLMs trained under this pipeline can serve as foundation models for high-fidelity human emotional understanding across human-robot interaction, affective feedback devices, and clinical applications. The modular pre-fusion stage and instruction-based multi-task curriculum facilitate extensibility to future advances in audio-visual modeling and instruction generation.

7. Future Directions and Limitations

Emotion-LLaMAv2 provides an extensible research and benchmark platform, but the current architecture and dataset composition leave areas for further development:

  • Scalability: Expansion to even larger, richly annotated, cross-lingual emotion datasets.
  • Autonomous Modal Reasoning: Deeper integration of self-supervised and causal reasoning across longer timescales and richer sensory streams.
  • Label Generalization: MMEVerse's native-label approach reflects real deployment heterogeneity; future work might explore flexible transfer mapping and zero-shot adaptation.
  • Quality Control: While multi-agent annotation yields high agreement, further automation, granularity, and interpretability in annotation and model output remain active research goals.

Through these advances, Emotion-LLaMAv2 and MMEVerse set a comprehensive standard for the next generation of emotion-aware, instruction-tuned multimodal LLMs, capable of consistent recognition and reasoning in unconstrained human environments (Peng et al., 23 Jan 2026).