Emotion-Audio-Guided Spatial Attention
- Emotion-audio-guided spatial attention involves neural systems amplifying emotion-rich zones in audio for effective emotion recognition.
- MFHCA leverages log-Mel spectrograms and hierarchical patterns to boost accuracy in speech emotion recognition through advanced attention mechanisms.
- VAANet combines visual and auditory features using sequential attention layers, enhancing emotion recognition performance across diverse modalities.
Emotion-audio-guided spatial attention refers to neural frameworks that employ attention mechanisms—typically spatial, temporal, and hierarchical—to selectively emphasize emotion-salient regions within audio-derived or multimodal representations for emotion recognition tasks. The paradigm is exemplified by recent advances in speech emotion recognition and video emotion recognition, notably the MFHCA architecture (Jiao et al., 2024) for auditory modalities and VAANet (Zhao et al., 2020) for audiovisual contexts. These methods allocate computational focus to regions or time segments most predictive of emotional state, often leveraging spectrogram and learned feature representations.
1. Multi-Spatial Fusion (MF) Module (MFHCA)
MF operates exclusively on log-Mel spectrogram inputs $X \in \mathbb{R}^{C \times F \times T}$, where $C$ indexes channels over the Mel bands, $F$ denotes frequency bins, and $T$ covers time frames. The procedure begins with parallel convolutions emphasizing distinct axes:
- Temporal: $X_t = \mathrm{Conv}_{1 \times k}(X)$, with the kernel elongated along the time axis
- Frequency: $X_f = \mathrm{Conv}_{k \times 1}(X)$, with the kernel elongated along the frequency axis
The results are concatenated and processed through serial Global Receptive Field (GRF) blocks. Each GRF block produces output $Y = \mathcal{A}(X) + \mathcal{R}(X)$, combining a spatial attention branch $\mathcal{A}$ and a large-context residual branch $\mathcal{R}$.
The spatial attention branch $\mathcal{A}$ employs “strip” pooling for context aggregation along each axis: $z^{f}_{c,i} = \frac{1}{T}\sum_{j=1}^{T} X_{c,i,j}$ and $z^{t}_{c,j} = \frac{1}{F}\sum_{i=1}^{F} X_{c,i,j}$. These pooled vectors are mixed, activated with Swish, and projected into axis-specific sigmoid-gated attention masks $M_f \in (0,1)^{C \times F \times 1}$ and $M_t \in (0,1)^{C \times 1 \times T}$. The reweighting operation $\mathcal{A}(X) = X \odot M_f \odot M_t$ focuses the network’s response on emotion-rich time-frequency zones.
The large-context residual branch $\mathcal{R}$ applies $\mathcal{R}(X) = \mathrm{Up}\big(\mathrm{Conv}(\mathrm{Down}(X))\big)$, capturing broad contextual dependencies through downsampling, convolution, and upsampling.
After the GRF blocks and an optional projection, the feature map is reshaped to a sequence $f_{\rm spec} \in \mathbb{R}^{T \times d}$ for subsequent processing.
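As a concrete illustration, the strip-pooling reweighting of the spatial attention branch can be sketched in NumPy. This is a minimal version: the mixing/projection step and Swish activation of the actual GRF block are omitted, and the gates are derived directly from the pooled strips.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def strip_pool_attention(x):
    """Minimal strip-pooling spatial attention over a (C, F, T) feature map.

    Pools along each axis, turns the pooled strips into sigmoid gates,
    and reweights the input so salient time-frequency zones dominate.
    """
    C, F, T = x.shape
    z_f = x.mean(axis=2)            # (C, F): pooled along the time axis
    z_t = x.mean(axis=1)            # (C, T): pooled along the frequency axis
    m_f = sigmoid(z_f)[:, :, None]  # (C, F, 1) frequency-axis gate in (0, 1)
    m_t = sigmoid(z_t)[:, None, :]  # (C, 1, T) time-axis gate in (0, 1)
    return x * m_f * m_t            # broadcast reweighting

x = np.random.randn(4, 64, 100)   # toy (C, F, T) feature map
y = strip_pool_attention(x)
print(y.shape)  # (4, 64, 100)
```

Because both gates lie in (0, 1), the output magnitude at every time-frequency cell is at most that of the input; the attention can only suppress, never amplify, which is the usual behavior of sigmoid-gated masks.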
2. Hierarchical Cooperative Attention (HCA) Module (MFHCA)
The HCA module fuses spectrogram features $f_{\rm spec} \in \mathbb{R}^{T \times d}$ with self-supervised Hubert representations $f_{\rm hubert} \in \mathbb{R}^{L \times d}$ and their BiLSTM-encoded variant $f'_{\rm hubert} \in \mathbb{R}^{L \times d}$. The fusion mechanism proceeds as follows:
- Compute cross-modal attention scores: $A = \mathrm{softmax}(f_{\rm spec} \, f'_{\rm hubert}^{T}) \in \mathbb{R}^{T \times L}$
- Use the scores to attend to Hubert’s raw features: $f_{\rm att} = A\, f_{\rm hubert} \in \mathbb{R}^{T \times d}$
- Fuse the attended audio context into the spectrogram features, e.g. by concatenation along the feature axis: $f_{\rm fused} = [\,f_{\rm spec} \,\|\, f_{\rm att}\,]$. No additional gating or residual connections are added; hierarchical fusion comes solely from attention over progressively contextualized audio embeddings.
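A minimal NumPy sketch of this attention-based fusion follows. The function name `hca_fuse` and the choice of concatenation as the final fusion step are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def hca_fuse(f_spec, f_hubert, f_hubert_ctx):
    """Sketch of HCA-style cross-modal attention fusion.

    f_spec:       (T, d) spectrogram-branch sequence
    f_hubert:     (L, d) raw Hubert features
    f_hubert_ctx: (L, d) BiLSTM-contextualized Hubert features
    Returns a (T, 2d) fused sequence (fusion by concatenation assumed).
    """
    scores = f_spec @ f_hubert_ctx.T             # (T, L) similarity logits
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # row-wise softmax over L
    attended = attn @ f_hubert                   # (T, d) attended audio context
    return np.concatenate([f_spec, attended], axis=1)

T, L, d = 100, 50, 32
rng = np.random.default_rng(0)
fused = hca_fuse(rng.standard_normal((T, d)),
                 rng.standard_normal((L, d)),
                 rng.standard_normal((L, d)))
print(fused.shape)  # (100, 64)
```

Note that the attention scores are computed against the BiLSTM-contextualized features but applied to the raw Hubert features, mirroring the two-level hierarchy described above.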
3. Spatial and Channel-wise Attention in Multimodal Networks (VAANet)
VAANet implements sequential attention layers in both the visual and audio streams. For video, spatial attention over each segment feature map $X_n \in \mathbb{R}^{H \times W \times C}$ is computed by scoring every spatial location, normalizing the scores with a softmax, and pooling the map as the attention-weighted sum over locations.
After spatial pooling, channel-wise attention is computed analogously over feature channels. Temporal attention then aggregates the segment-level features of each modality into a single clip-level vector.
Fusion concatenates attended visual and audio vectors before classification.
There is no direct audio-to-spatial fusion for visual attention; attention masks are derived from unimodal representations in the published VAANet instantiation.
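The spatial-then-temporal pooling described above can be sketched in NumPy. Here `w_s` and `w_t` stand in for learned scoring weights; VAANet's actual attention heads are learned layers, so this is only a schematic of the pooling pattern (channel-wise attention is omitted for brevity).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_segments(feats, w_s, w_t):
    """Spatial then temporal soft attention over segment features.

    feats: (N, H*W, C) per-segment spatial feature maps
    w_s:   (C,) spatial scoring vector (hypothetical learned weights)
    w_t:   (C,) temporal scoring vector (hypothetical learned weights)
    Returns a single (C,) clip-level descriptor.
    """
    # Spatial attention: score each location, softmax, weighted sum.
    s_scores = softmax(feats @ w_s, axis=1)               # (N, H*W)
    seg_vecs = (s_scores[..., None] * feats).sum(axis=1)  # (N, C)
    # Temporal attention over the N segment vectors.
    t_scores = softmax(seg_vecs @ w_t, axis=0)            # (N,)
    return (t_scores[:, None] * seg_vecs).sum(axis=0)     # (C,)

rng = np.random.default_rng(1)
v = attend_segments(rng.standard_normal((8, 49, 512)),  # 8 segments, 7x7 grid
                    rng.standard_normal(512),
                    rng.standard_normal(512))
print(v.shape)  # (512,)
```

The same pattern runs independently in the audio stream; the two attended vectors are then concatenated for classification, as stated above.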
4. Training Objectives and Loss Formulations
MFHCA utilizes standard cross-entropy over the final predictions, $\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N} \log p_{i, y_i}$; there are no auxiliary or attention-specific supervisory losses.
VAANet introduces the polarity-consistent cross-entropy: $\mathcal{L}_{PCCE} = -\frac{1}{N}\sum_{i=1}^N (1 + \lambda\, g(\hat{y}_i, y_i)) \sum_{c=1}^C \mathbb{1}_{[c = y_i]}\,\log p_{i,c}$, where $g(\hat{y}_i, y_i) = 1$ when the predicted class $\hat{y}_i$ and the label $y_i$ have opposing polarities (and $0$ otherwise), and $\lambda$ is a penalty weight.
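The polarity penalty can be sketched as follows. This is a minimal NumPy version; the function name `pcce_loss`, the default $\lambda$ value, and the use of hard argmax predictions for $g$ are illustrative assumptions.

```python
import numpy as np

def pcce_loss(probs, labels, polarity, lam=0.5):
    """Polarity-consistent cross-entropy (PCCE) sketch.

    probs:    (N, C) predicted class probabilities
    labels:   (N,) ground-truth class indices
    polarity: (C,) class polarity, e.g. +1 positive / -1 negative
    lam:      penalty weight lambda (value here is illustrative)

    When the predicted class has the opposite polarity to the label,
    the per-sample negative log-likelihood is scaled by (1 + lam).
    """
    n = len(labels)
    preds = probs.argmax(axis=1)                                    # hard y-hat
    opposite = (polarity[preds] != polarity[labels]).astype(float)  # g(.)
    nll = -np.log(probs[np.arange(n), labels] + 1e-12)
    return float(np.mean((1.0 + lam * opposite) * nll))

probs = np.array([[0.2, 0.8]])   # one sample, two classes
labels = np.array([0])           # true class 0 (polarity +1)
polarity = np.array([1, -1])     # class polarities
print(pcce_loss(probs, labels, polarity))  # penalized: opposite-polarity prediction
```

With `lam=0` this reduces to plain cross-entropy; the extra term only activates on polarity-inconsistent mistakes, which is what steers the attention toward polarity-discriminative regions.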
5. Empirical Performance of Emotion-Audio-Guided Spatial Attention
MFHCA achieves significant improvements over previous audio-only models on IEMOCAP (4-class), reporting a weighted accuracy (WA) of 74.24% and an unweighted accuracy (UA) of 74.57%. Ablation studies indicate:
| Configuration | WA (%) | UA (%) |
|---|---|---|
| Spec alone | 62.13 | 62.25 |
| Spec + MF | 73.72 | 74.53 |
| Spec + HCA | 73.19 | 73.72 |
| Full (MF + HCA) | 74.24 | 74.57 |
Per the ablation, MF contributes an approximately 11.6-point WA gain over the spectrogram-only configuration, with HCA providing a further improvement of roughly 0.5 points in the full model. t-SNE projections corroborate enhanced class separability with HCA.
In VAANet, experimental results on VideoEmotion-8 and Ekman-6 verify that combined spatial, channel, and temporal attention modules in visual and audio streams yield improved accuracy over previous pipelines, with explicit polarity penalties guiding more discriminative attention allocation (Zhao et al., 2020).
6. Architectural and Training Hyperparameters
MFHCA’s MF module processes log-Mel spectrograms with parallel convolutions and stacked GRF blocks, while HCA utilizes both spectrogram and Hubert features followed by a three-layer classifier. VAANet’s backbone comprises a 3D ResNet-101 (visual, pre-trained on Kinetics) and a 2D ResNet-18 (audio, pre-trained on ImageNet), with the video divided into a fixed number of temporal segments, each with a fixed number of frames at a fixed input resolution. Training uses a batch size of 32 for 150 epochs with the Adam optimizer, and visual-stream data augmentation includes random cropping and horizontal flipping. Attention modules are implemented via learned linear layers or convolutions.
7. Significance and Implications
Emotion-audio-guided spatial attention frameworks, as instantiated in MFHCA and VAANet, enable models to selectively amplify segments and regions most relevant to emotion inference, yielding quantifiable improvements in accuracy. MFHCA demonstrates that multi-spatial fusion via axis-aware masks, combined with hierarchical audio feature integration, is effective for speech emotion recognition. VAANet extends these principles to the visual domain through modular attention schemes. A plausible implication is that further development of cross-modal attention and spatial-temporal alignment strategies will continue to enhance emotion recognition systems, particularly in complex multimodal environments (Jiao et al., 2024, Zhao et al., 2020).