Emotion-LLaMAv2: Multimodal Emotion Recognition
- The paper introduces an end-to-end multimodal architecture that fuses raw audio and video signals with conv-attention pre-fusion for precise emotion recognition.
- It employs a two-stage curriculum instruction tuning strategy, transitioning from categorical perception to chain-of-thought reasoning to enhance performance.
- The framework achieves state-of-the-art results on the MMEVerse benchmark by integrating advanced multiview encoding and diverse multimodal datasets.
Emotion-LLaMAv2 is a multimodal LLM architecture and framework designed for robust, instruction-tuned emotion recognition and reasoning from complex multimodal signals. Developed as a successor to the initial Emotion-LLaMA, it integrates advanced end-to-end multiview encoding, sophisticated cross-modal fusion, and curriculum-based instruction tuning to address key limitations in emotional AI: lack of scalable high-quality data, suboptimal multimodal fusion, and limited reasoning ability. Together with the MMEVerse benchmark—a consolidated, re-annotated corpus of twelve major emotion datasets—Emotion-LLaMAv2 establishes a state-of-the-art evaluation and modeling pipeline for affective computing, human-computer interaction, and multimodal emotion research (Peng et al., 23 Jan 2026).
1. End-to-End Multiview Encoder and Multimodal Tokenization
Emotion-LLaMAv2 begins with an end-to-end, feedforward encoder that processes raw audio and video streams without external face detectors or late feature extraction. The architecture utilizes:
- Audio Encoder: Processes the raw waveform (typically 16 kHz) with HuBERT or Whisper, producing high-level time-embedded tokens; temporal pooling is then applied, yielding a compact sequence of audio tokens.
- Global Visual Encoder: Extracts high-level spatial features from a “middle” frame of the video using EVA-ViT, yielding global patch tokens.
- Temporal Visual Encoder: Consumes uniformly sampled frames via VideoMAE or frame-wise EVA, followed by spatial pooling to produce temporal tokens.
All stream outputs are mapped into a unified latent space via small MLP projections, constructing two intermediate tensors: one blended along the channel/depth axis (depth-wise concatenation) and one formed by stacking the three token streams along the sequence axis.
This multiview encoder captures both nuanced frame-level facial cues and audio dynamics, substantially extending the representational expressivity and directness over previous approaches requiring explicit face detection or single-modality context (Peng et al., 23 Jan 2026).
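The three-stream projection into a shared latent space can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the per-stream feature dimensions, hidden size, latent dimension, and ReLU activation are illustrative placeholders, not the paper's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64  # unified latent dimension (illustrative, not the paper's value)

def make_mlp(d_in, d_hidden, d_out):
    """Small two-layer MLP parameters for one stream's projection."""
    return (rng.normal(size=(d_in, d_hidden)) * 0.02, np.zeros(d_hidden),
            rng.normal(size=(d_hidden, d_out)) * 0.02, np.zeros(d_out))

def mlp_project(x, w1, b1, w2, b2):
    """Project stream features into the shared D-dimensional latent space."""
    h = np.maximum(x @ w1 + b1, 0.0)  # ReLU stands in for the actual activation
    return h @ w2 + b2

# Hypothetical per-stream features: audio (HuBERT/Whisper-like), global patch
# tokens from the middle frame (EVA-ViT-like), pooled frame tokens (VideoMAE-like).
audio_feats    = rng.normal(size=(64, 768))    # 64 pooled audio tokens
global_feats   = rng.normal(size=(257, 1024))  # CLS + 256 patch tokens
temporal_feats = rng.normal(size=(16, 768))    # 16 spatially pooled frames

streams = {"audio": audio_feats, "global": global_feats, "temporal": temporal_feats}
projected = {name: mlp_project(x, *make_mlp(x.shape[1], 256, D))
             for name, x in streams.items()}

# Stack the three projected streams along the sequence axis.
token_stack = np.concatenate(
    [projected["audio"], projected["global"], projected["temporal"]], axis=0)
print(token_stack.shape)  # (337, 64)
```

A depth-wise (channel) blend, the other intermediate tensor described above, would additionally require pooling the streams to a common token count before concatenating along the feature axis.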
2. Conv-Attention Pre-Fusion Module
A dedicated pre-fusion module enables rich interaction between local and global multimodal features prior to entering the LLM backbone:
- Attention Branch: Projects the mixed-token tensors into query, key, and value representations, applies scaled dot-product attention, and produces an attended feature matrix.
- Convolutional Branch: Iteratively applies 1D convolutions with a gated “Switch” activation over the depth-concatenated features, accumulating local temporal and spatial patterns.
The outputs of both branches are fused into a potent, joint multimodal embedding. The design omits explicit positional encoding, relying on token order and the convolutional receptive field to encode local temporal structure (Peng et al., 23 Jan 2026).
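The two branches can be sketched as below. This is a hedged NumPy sketch: the token count, kernel size, number of conv iterations, additive fusion, and the interpretation of the gated “Switch” activation as SiLU (x·sigmoid(x)) are all assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def silu(x):
    """Assumed gated 'Switch' activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def attention_branch(x, wq, wk, wv):
    """Scaled dot-product self-attention over the token sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[1]))
    return scores @ v

def conv_branch(x, kernels):
    """Iterated 1D convolution (kernel size 3, same padding) with gated activation."""
    out = x
    for w in kernels:  # w has shape (3, D, D): one (D, D) matrix per tap
        padded = np.pad(out, ((1, 1), (0, 0)))
        out = silu(sum(padded[i:i + out.shape[0]] @ w[i] for i in range(3)))
    return out

T, D = 32, 64  # token count and latent dim (illustrative)
x = rng.normal(size=(T, D))
wq, wk, wv = (rng.normal(size=(D, D)) * 0.05 for _ in range(3))
kernels = [rng.normal(size=(3, D, D)) * 0.05 for _ in range(2)]

# Additive fusion of the two branches (assumed; the paper's exact fusion
# formula is not reproduced here).
fused = attention_branch(x, wq, wk, wv) + conv_branch(x, kernels)
print(fused.shape)  # (32, 64)
```

Note that neither branch adds positional embeddings: the conv branch injects locality through its receptive field, consistent with the design choice described above.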
3. Perception-to-Cognition Curriculum Instruction Tuning
The training protocol is organized as a two-stage, curriculum-tuned instruction paradigm, supported by Low-Rank Adaptation (LoRA) for efficient backbone updates:
- Stage 1 (Perception / Recognition): Prompts focus only on categorical emotion recognition (e.g., "Identify the emotion from {anger, fear, ...}.") using a standard cross-entropy loss over each dataset's native classes. This stage lasts for the initial 50K training steps.
- Stage 2 (Cognition / Reasoning): Prompts require chain-of-thought reasoning in a structured reasoning-then-answer format terminating in an `<answer>` tag, trained with a language-modeling (next-token) objective over the full reasoning trace. The curriculum schedule switches from pure recognition to pure reasoning after 50K steps. LoRA is applied to the LLaMA2-chat-7B backbone (rank 64), training with mixed precision on large-scale audio-visual-textual input without modifying the frozen base weights outside the adapters (Peng et al., 23 Jan 2026).

4. MMEVerse Benchmark: Large-Scale, Unified Multimodal Dataset

Emotion-LLaMAv2 is trained and evaluated on the MMEVerse benchmark, a unified, large-scale corpus aggregating twelve publicly available tri-modal datasets into 130K training and 36K testing clips. Key aspects:

- Datasets: Includes MER2023, MELD, IEMOCAP, CAER, E³, DFEW, MAFW, MC-EIU, CMU-MOSI, CMU-MOSEI, CH-SIMS v2, and BOLD, covering basic emotion, sentiment, intent, and VAD categories across diverse domains and languages.
- Annotation Pipeline: A multi-agent protocol combines OpenFace AU peak-frame detection, Qwen2.5-VL for visual context, and Qwen2-Audio for audio tone, integrates lexical subtitles, and concludes with GPT-4o consolidation and selective human verification on a 600-sample subset. Outputs are re-formatted as concise, instruction-ready packets for cross-dataset uniformity (Peng et al., 23 Jan 2026).
- No Unified Label Space: Each dataset retains its native class ontology, with evaluation performed on native splits; there is no forced mapping to a global taxonomy.
- Quality Control: Prompts are designed for precision; model-based and human validations ensure high multimodal consistency across streams.

5. Experimental Results and Comparative Evaluation

Emotion-LLaMAv2 demonstrates state-of-the-art performance across diverse emotion tasks assessed on the MMEVerse-Bench and EMER Reasoning benchmarks.
Core findings:

| Benchmark / Metric          | AffectGPT | Qwen2.5-7B | Emotion-LLaMAv2 |
|-----------------------------|-----------|------------|-----------------|
| MER-UniBench (9 sets)       | 74.77%    | 67.16%     | 78.52%          |
| MMEVerse-Bench (18 sets)    | 54.11%    | 52.73%     | 66.63%          |
| EMER Reasoning (Clue/Label) | 5.87/5.79 | —          | 7.30/7.14       |

- Ablation Studies: Removing the Conv-Attention pre-fusion module reduces test accuracy by approximately 0.9 percentage points; single-stage joint training, as opposed to the two-stage curriculum, results in a ~2.4 point drop. Whisper-based audio encoding outperforms HuBERT (66.05% vs. 61.44%). Optimal performance is achieved with 16 video frames and 64 audio tokens per sample.
- Evaluation Metrics: Native metrics are used throughout: hit rate/accuracy (emotion, intent), weighted F1 (sentiment), mean Average Precision (multi-label), and Clue/Label Overlap (reasoning), mirroring the diverse dataset label spaces and task ontologies (Peng et al., 23 Jan 2026).

6. Architectural Advances and Theoretical Implications

Emotion-LLaMAv2 departs from prior MLLMs in three dimensions:

- Direct End-to-End Encoding: Reduces the error propagation and information loss intrinsic to external face detectors and late fusion, allowing facial micro-expressions, prosody, and complex temporal cues to be captured holistically.
- Conv-Attention Pre-Fusion: Enables simultaneous modeling of local (short-term) and global (long-range) interactions between modalities, as opposed to implicit or late multimodal attention strategies.
- Curriculum Instruction Tuning: The perception-to-cognition curriculum achieves stepwise enhancement, first training core recognition, then free-form causal reasoning, evidenced by improved Clue/Label Overlap and cross-task generalization.
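The perception-to-cognition schedule discussed above can be sketched as a hard stage switch at 50K steps. Everything here is illustrative: the function names, prompt wording, and the assumption of a hard (rather than mixed) switch are hypothetical, not the paper's exact templates.

```python
def curriculum_stage(step: int, switch_step: int = 50_000) -> str:
    """Assumed hard perception-to-cognition switch: recognition-only prompts
    before switch_step, chain-of-thought reasoning prompts afterwards."""
    return "perception" if step < switch_step else "cognition"

def build_prompt(sample: dict, step: int) -> str:
    """Build a training prompt for the current curriculum stage.
    Prompt templates are hypothetical paraphrases of the two stages."""
    if curriculum_stage(step) == "perception":
        # Stage 1: categorical recognition only.
        return f"Identify the emotion from {{anger, fear, ...}} for clip {sample['id']}."
    # Stage 2: chain-of-thought reasoning ending in an <answer> tag.
    return (f"Explain, step by step, the audio-visual emotional cues in clip "
            f"{sample['id']}, then give the final emotion inside <answer></answer>.")

print(curriculum_stage(0))        # perception
print(curriculum_stage(60_000))   # cognition
```

A soft schedule that gradually mixes the two prompt pools would be a natural variant, but the two-stage description above implies a clean handover at the 50K-step boundary.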
A plausible implication is that multimodal LLMs trained under this pipeline can serve as foundational models for high-fidelity human emotional understanding across human-robot interaction, affective feedback devices, and clinical applications. The modular pre-fusion stage and instruction-based multi-task curriculum facilitate extensibility to future advances in audio-visual modeling and instruction generation.

7. Future Directions and Limitations

Emotion-LLaMAv2 provides an extensible research and benchmark platform, but the current architecture and dataset composition present areas for further development:

- Scalability: Expansion to even larger, richly annotated, cross-lingual emotion datasets.
- Autonomous Modal Reasoning: Deeper integration of self-supervised and causal reasoning across longer timescales and richer sensory streams.
- Label Generalization: MMEVerse's native-label approach reflects true deployment heterogeneity; future work might explore flexible transfer mapping and zero-shot adaptation.
- Quality Control: While multi-agent annotation provides high agreement, further automation, granularity, and interpretability in annotation and model output remain active research goals.

Through these advances, Emotion-LLaMAv2 and MMEVerse set a comprehensive standard for the next generation of emotion-aware, instruction-tuned multimodal LLMs, capable of consistent recognition and reasoning in unconstrained human environments (Peng et al., 23 Jan 2026).