Multimodal Aspect-Based Sentiment Analysis
- MABSA is defined as extracting aspect terms and assigning sentiment polarity by jointly processing text-image pairs using advanced cross-modal techniques.
- It employs innovative fusion strategies, including context-aware attention and prompt-controlled extraction, to mitigate noisy social media inputs.
- Recent advancements integrate syntactic cues and LLM-generated rationales to enhance explainability and performance on benchmarks like Twitter-15 and Twitter-17.
Multimodal Aspect-Based Sentiment Analysis (MABSA) extends classic aspect-based sentiment analysis by jointly leveraging both text and visual modalities, typically text–image pairs from social media, to extract aspect terms and determine their associated sentiment polarity. The task has gained substantial traction in recent years as the volume and complexity of multimodal user-generated content has grown, demanding advanced techniques for fine-grained opinion mining. The need to integrate complementary cues from text and images, while overcoming challenges posed by noisy or misaligned modalities, has driven a series of innovations in MABSA architectures, denoising strategies, context modeling, explainability mechanisms, and evaluation frameworks.
1. Problem Definition and Subtask Decomposition
Multimodal Aspect-Based Sentiment Analysis (MABSA) is formally defined as the task of extracting all aspect terms from a multimodal input and classifying each aspect with a corresponding sentiment (Rafiuddin et al., 17 Jul 2025, Song, 2024, Peng et al., 2023, Ling et al., 2022). This is operationalized via three canonical subtasks:
- Multimodal Aspect Term Extraction (MATE): Sequence labeling over the text to identify aspect boundaries, often via BIO tagging.
- Multimodal Aspect-Oriented Sentiment Classification (MASC): Sentiment polarity prediction for given aspects using both text and image features.
- Joint Multimodal Aspect-Sentiment Analysis (JMASA): Simultaneous extraction and classification of aspect–sentiment pairs in an end-to-end pipeline.
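The BIO tagging used by MATE can be illustrated with a minimal span decoder that converts tag sequences into aspect spans (the tokens and tags below are illustrative, not drawn from any benchmark):

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into (aspect_text, start, end) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                    # beginning of a new aspect term
            if start is not None:
                spans.append((" ".join(tokens[start:i]), start, i))
            start = i
        elif tag == "I" and start is not None:
            continue                      # inside the current aspect term
        else:                             # "O" (or stray "I") closes any open span
            if start is not None:
                spans.append((" ".join(tokens[start:i]), start, i))
            start = None
    if start is not None:
        spans.append((" ".join(tokens[start:]), start, len(tokens)))
    return spans

tokens = ["The", "camera", "quality", "is", "great", "but", "battery", "dies"]
tags   = ["O",   "B",      "I",       "O",  "O",     "O",   "B",       "O"]
print(bio_to_spans(tokens, tags))
# [('camera quality', 1, 3), ('battery', 6, 7)]
```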
Dataset benchmarks such as Twitter-15 and Twitter-17 are standard, with annotated tweet–image pairs and multiple aspects per instance (Rafiuddin et al., 17 Jul 2025, Zhao et al., 2023, Yang et al., 2024, Song et al., 2024, Doan et al., 2024). Evaluation metrics include Precision, Recall, and F1 for pair extraction, and Accuracy or Macro-F1 for polarity classification.
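Pair-extraction evaluation reduces to exact-match counting over predicted (aspect, sentiment) tuples; a minimal sketch of the per-instance computation (corpus-level micro scores would aggregate TP/FP/FN across instances):

```python
def pair_f1(pred_pairs, gold_pairs):
    """Precision/Recall/F1 over exact (aspect, sentiment) pair matches."""
    pred, gold = set(pred_pairs), set(gold_pairs)
    tp = len(pred & gold)                         # exact-match true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [("camera quality", "POS"), ("battery", "NEG")]
pred = [("camera quality", "POS"), ("battery", "POS")]   # wrong polarity on "battery"
print(pair_f1(pred, gold))
# (0.5, 0.5, 0.5)
```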
2. Multimodal Representation and Fusion Mechanisms
Early work on MABSA relied on separate pre-trained text and image encoders (e.g., BERT + ResNet or Faster R-CNN), merged by co-attention or concatenation (Ling et al., 2022). However, recent models exploit sophisticated cross-modal fusion:
- Context-aware and adaptive attention (AdaptiSent): Computes per-token linguistic and visual relevance scores, combines them via a learnable mixing coefficient, and injects the result as an adaptive bias into the scaled dot-product attention (Rafiuddin et al., 17 Jul 2025). Dynamic modality weighting coefficients enable selective fusion based on task relevance.
- Prompt-controlled fusion (DQPSA): Uses the prompt as a dual query to extract prompt-aware visual and textual features. Cross-attention over “description” and “prompt” tokens enables aspect-specific selection of image areas (Peng et al., 2023). An Energy-based Pairwise Expert (EPE) models span boundaries jointly rather than tagging start and end positions independently.
- Vision-language pretraining (VLP-MABSA): Unified encoder–decoder architecture, pre-trained jointly on masked language modeling (MLM), masked region modeling (MRM), textual/visual aspect–opinion generation, and multimodal sentiment prediction (Ling et al., 2022).
- Pipeline architectures (PTA): Decomposes MABSA into sequential extraction and classification stages, using aspect predictions to guide visual attention and translation-based alignment (TBA) for semantic consistency between modalities (Song et al., 2024).
The effectiveness of aspect-aware fusion is further corroborated by models that employ selective attention, graph-based reasoning, and adaptive masking to suppress noise and enhance cross-modal alignments (Zhou et al., 2023, Zhao et al., 2023, Lawan et al., 2024).
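The adaptive-bias idea behind context-aware attention can be sketched in simplified form (an illustrative reconstruction, not AdaptiSent's exact formulation; the mixing coefficient `lam` stands in for the learnable bias weight):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_attention(Q, K, V, ling_rel, vis_rel, lam=0.5):
    """Scaled dot-product attention with an additive relevance bias.
    ling_rel / vis_rel: per-key relevance scores, shape (n_keys,)."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                 # (n_queries, n_keys)
    bias = lam * ling_rel + (1 - lam) * vis_rel   # mixed modality relevance
    weights = softmax(logits + bias, axis=-1)     # bias shifts attention mass
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out = adaptive_attention(Q, K, V, ling_rel=np.zeros(4), vis_rel=np.ones(4))
print(out.shape)  # (2, 8)
```

A large relevance score on one key concentrates attention on it regardless of query–key similarity, which is the intended denoising effect: relevant image regions or tokens are boosted, irrelevant ones suppressed.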
3. Denoising Strategies for Noisy or Irrelevant Images
Social media images are frequently misaligned with or irrelevant to the accompanying text, introducing substantial noise into multimodal fusion. Several frameworks address this with principled denoising methods:
- Multi-grained Curriculum Learning (M2DF): Computes sentence-level and aspect-level noise metrics via CLIP similarity and Mask-RCNN object alignment, and dynamically schedules training examples based on their quantified noise levels, avoiding hard thresholds that risk discarding informative samples (Zhao et al., 2023). A dynamic scheduler switches between the coarse- and fine-grained curricula.
- Hybrid Curriculum Denoising and Aspect-Enhanced Denoising (DualDe): Filters training samples based on composite static and model-dependent difficulty scores, learning first from highly similar text–image pairs. Aspect-guided attention suppresses noise within retained samples by leveraging candidate aspects and affective priors (Doan et al., 2024).
- Data-uncertainty-aware weighting (UA-MABSA): Explicitly estimates image quality (brightness, resolution, text crowding) and cross-modal relevance to adjust the sample’s contribution to the loss during training (Yang et al., 2024). This mechanism prioritizes high-quality, highly-aligned samples and down-weights noisy ones.
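The curriculum idea behind M2DF-style denoising can be sketched as a simple "baby steps" scheduler that unlocks noisier samples over epochs (an illustrative simplification; `noise` would come from, e.g., 1 − CLIP similarity):

```python
def curriculum_pools(samples, noise_scores, epochs):
    """Sort samples by noise (cleanest first) and grow the training
    pool each epoch until the full dataset is unlocked."""
    order = sorted(range(len(samples)), key=lambda i: noise_scores[i])
    n = len(samples)
    pools = []
    for e in range(1, epochs + 1):
        k = max(1, round(n * e / epochs))   # fraction of data unlocked at epoch e
        pools.append([samples[i] for i in order[:k]])
    return pools

samples = ["s1", "s2", "s3", "s4"]
noise   = [0.9, 0.1, 0.4, 0.7]   # e.g. 1 - CLIP text-image similarity
print(curriculum_pools(samples, noise, epochs=2))
# [['s2', 's3'], ['s2', 's3', 's4', 's1']]
```

No sample is ever discarded outright; noisy examples simply enter training later, which is the key difference from hard-threshold filtering.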
Ablation studies in these works demonstrate significant performance drops when denoising modules or aspect-guided attention are removed, confirming their central role in robust MABSA execution.
4. Syntactic and Contextual Modeling
Recent advances have recognized the criticality of exploiting linguistic structure (dependency parsing, POS tags, NER) to refine sentiment cues and aspect boundaries:
- Dependency-structure augmented scoping (DASCO): Dual GNN branches over semantic and syntactic graphs, extracting aspect-specific scopes from dependency trees and applying adaptive contrastive losses to align cross-graph representations and filter semantic noise (Liu et al., 15 Apr 2025).
- Gated mLSTM architectures (GateMABSA): Sequentially integrates multimodal fusion, syntactic gating via dependency parses, and semantic similarity/positional gating to precisely aggregate sentiment context for each aspect (Lawan et al., 29 Sep 2025).
- Dependency-guided explanation generation (Explainable MABSA): Pruning and textualizing the aspect-centered dependency tree, then passing as prompts to MLLMs to generate aspect-grounded rationales jointly with sentiment prediction (Wang et al., 11 Jan 2026).
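Aspect-centered scope extraction from a dependency tree, the common primitive behind these scoping and explanation-prompting methods, can be sketched without a parser by assuming head indices are given (a toy parse for illustration, not the papers' implementations):

```python
def aspect_scope(tokens, heads, aspect_idx):
    """Collect the dependency subtree rooted at the aspect token.
    heads[i] is the index of token i's syntactic head (-1 for root)."""
    children = {i: [] for i in range(len(tokens))}
    for i, h in enumerate(heads):
        if h >= 0:
            children[h].append(i)
    scope, stack = set(), [aspect_idx]
    while stack:                      # depth-first walk over the subtree
        i = stack.pop()
        scope.add(i)
        stack.extend(children[i])
    return [tokens[i] for i in sorted(scope)]

# "the food was surprisingly good" -- heads from a toy parse:
# det->food, food->was (root), adv->good, good->was
tokens = ["the", "food", "was", "surprisingly", "good"]
heads  = [1, 2, -1, 4, 2]
print(aspect_scope(tokens, heads, 1))  # ['the', 'food']
```

In practice the subtree is pruned and/or textualized before being fed to a GNN branch or an MLLM prompt, but the scope itself is this kind of head-pointer traversal.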
Direct evaluation confirms improved aspect–opinion alignment, higher aspect extraction recall, and enhanced explainability when dependency information is incorporated.
5. LLMs and Rationales Integration
LLMs, while powerful in general multimodal QA or conversation, exhibit significant limitations in fine-grained MABSA tasks (Song, 2024). Out-of-the-box performance from Llama2, LLaVA, or ChatGPT is typically 10–15 F1 points below state-of-the-art supervised models, and suffers from high latency and suboptimal aspect coverage. However, new research directions attempt to fuse LLM reasoning capacity with the efficiency of smaller language models (SLMs):
- LLM-generated rationales as soft prompts (LRSA): LLMs (e.g., Gemini Pro) generate textual explanations of image/text cues, injected into SLM backbones and fused by dual cross-attention, yielding consistent improvements in extraction and classification (Cao et al., 20 May 2025).
- Conversational MABSA via structured prompting and LLM ensembling: Hierarchical prompting pipelines and fallback ensemble systems combine outputs from fine-tuned Qwen3-8B, Gemini-2.5-pro, and GPT-4.1-mini for multi-turn aspect–opinion–sentiment–rationale extraction and sentiment flip detection (Gao et al., 27 Dec 2025).
- Event deconstruction and RL: LLMs decomposing complex text into sub-events, followed by RL agent optimization for sequential aspect–sentiment prediction, simplifying scenarios with multiple entities and dynamic polarities (Huang et al., 2024).
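A minimal prompt-construction sketch for rationale elicitation of the kind these pipelines use (the format and field names are illustrative, not the papers' actual templates):

```python
def build_rationale_prompt(text, image_caption, aspects):
    """Assemble a structured prompt asking an LLM for per-aspect
    sentiment plus a one-sentence rationale."""
    aspect_list = "\n".join(f"- {a}" for a in aspects)
    return (
        "Text: " + text + "\n"
        "Image description: " + image_caption + "\n"
        "For each aspect below, output: aspect | sentiment "
        "(positive/neutral/negative) | one-sentence rationale.\n"
        "Aspects:\n" + aspect_list
    )

prompt = build_rationale_prompt(
    "Loving the new camera, but the battery drains fast.",
    "A smartphone on a desk.",
    ["camera", "battery"],
)
print(prompt)
```

The generated rationales are then either parsed into structured predictions or, as in LRSA, re-encoded and fused into an SLM backbone via cross-attention.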
These approaches indicate that LLM-guided rationales can augment structured SLM models in MABSA, but direct application of LLMs remains suboptimal for extraction or real-time deployment.
6. Cognitive, Aesthetic, and Explainability Extensions
Current frontiers in MABSA push beyond pure extraction/classification toward holistic understanding, causal reasoning, and explainability:
- Cognitive and aesthetic causality (Chimera): Incorporates fine-grained patch–word alignment, translation of visual features to textual descriptions, and multi-task rationale generation (semantic causes and impressions) using LLM supervision, allowing the model to infer both semantic and affective drivers of sentiment (Xiao et al., 22 Apr 2025).
- Explainable generative MABSA: MLLMs produce aspect-level free-text explanations using carefully constructed prompts—dependency textualizations—improving both accuracy and human interpretability (Wang et al., 11 Jan 2026).
Performance comparisons show such models consistently outperform LLM zero-shot baselines and prior SOTA sentiment classifiers, especially in explanation generation metrics (BLEU, ROUGE, BERTScore).
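The explanation-generation metrics cited above are standard n-gram overlap scores; ROUGE-1 F1, for instance, reduces to unigram overlap between generated and reference rationales (a from-scratch sketch for intuition, not the official scorer):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 between two explanation strings."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())       # clipped unigram matches
    if not overlap:
        return 0.0
    p = overlap / sum(c.values())         # precision over candidate tokens
    rec = overlap / sum(r.values())       # recall over reference tokens
    return 2 * p * rec / (p + rec)

print(rouge1_f1("the camera is praised for sharp photos",
                "the camera is praised because photos look sharp"))
# 0.8
```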
7. Summary Table: Performance Highlights Across SOTA MABSA Models
| Model | Twitter-15 F1 | Twitter-17 F1 | Key Mechanism | Reference |
|---|---|---|---|---|
| AdaptiSent | 71.9 | 71.6 | Context-adaptive cross-modal attention | (Rafiuddin et al., 17 Jul 2025) |
| PTA | 74.1 | 73.7 | Pipeline w/ translation-based alignment | (Song et al., 2024) |
| DQPSA | 71.9 | 70.6 | Prompt-dual query & energy-based spans | (Peng et al., 2023) |
| DASCO | 75.0 | - | Syntactic/semantic scoping via graphs | (Liu et al., 15 Apr 2025) |
| DualKanbaFormer | 75.5 | 71.5 | KANs + SSMs + gated sparse attention | (Lawan et al., 2024) |
| Chimera | 77.9 | 74.6 | Cognitive/aesthetic rationale learning | (Xiao et al., 22 Apr 2025) |
| LRSA(VLP) | 68.2 | 69.2 | LLM rationale injection | (Cao et al., 20 May 2025) |
| UA-MABSA | 74.5 | 70.2 | Data uncertainty-aware weighting | (Yang et al., 2024) |
| GateMABSA | 75.7 | 71.5 | Gated syntactic/semantic fusion | (Lawan et al., 29 Sep 2025) |
8. Open Problems and Research Directions
Open challenges remain in robust multimodal alignment (especially in the presence of noisy or missing modalities), scalable and efficient inference, generalization to conversations and event-driven scenarios, and transparent explainability. Promising research trajectories include:
- Lightweight attention architectures (Linformer, Performer) for computational efficiency (Rafiuddin et al., 17 Jul 2025).
- Advanced denoising (curriculum, event deconstruction, RL) and aspect/image relevance modeling (Doan et al., 2024, Huang et al., 2024).
- Integration of external commonsense/affect knowledge bases (SenticNet, ConceptNet) and causal-graph structures for deeper reasoning (Rafiuddin et al., 17 Jul 2025, Xiao et al., 22 Apr 2025).
- Augmented prompting and instruction tuning for LLMs, rationales as auxiliary supervision, and end-to-end explainable generative paradigms (Wang et al., 11 Jan 2026, Gao et al., 27 Dec 2025, Cao et al., 20 May 2025).
In summary, Multimodal Aspect-Based Sentiment Analysis represents a technically vibrant research field driven by the combination of advanced contextual fusion mechanisms, robust denoising strategies, syntactic/semantic reasoning, and LLM-guided explainability. State-of-the-art models achieve strong gains on standard social-media benchmarks, and ongoing innovations continue to expand the depth and transparency of multimodal sentiment understanding.