
Multimodal Sarcasm Explanation (MuSE)

Updated 4 February 2026
  • Multimodal Sarcasm Explanation (MuSE) is a conditional generation task that produces natural language rationales for sarcastic content using cross-modal cues.
  • It employs advanced fusion techniques including attention, graph-based semantic reasoning, and target-aware token integration to capture sarcasm nuances.
  • Applications span improved sarcasm detection and emotion recognition, though challenges remain in target identification, data scarcity, and dynamic knowledge integration.

Multimodal Sarcasm Explanation (MuSE) is a conditional generation task that seeks to produce natural language explanations for the implicit irony or ridicule present in multimodal posts or dialogues—typically those containing both textual and visual components, and, in dialogue settings, possibly also audio. Unlike traditional sarcasm detection, which yields a binary label, MuSE constructs a rationale that explicitly identifies and interprets modality-specific cues and their incongruity. The field has evolved from initial attention-based encoder-decoder models and handcrafted object-label augmentations to graph-based semantic reasoning, explicit sentiment modeling, probabilistic causal chains, and (recently) target-guided and instruction-augmented LLM pipelines. This article surveys the formal definition, leading architectures, datasets, evaluation paradigms, key findings, and open research avenues for MuSE.

1. Task Definition and Problem Formulation

The MuSE task is formally defined as follows: Given a sarcastic input sample—usually an image V and an associated caption C (social media), or a sequence of utterances with possible visual/audio alignment (dialogue)—the system must generate a free-form natural language explanation E that explicates the ironic intent with respect to the multimodal context (Goel et al., 11 Feb 2025, Desai et al., 2021, Kumar et al., 2022, Kumar et al., 2022).

A canonical formulation is E^* = \arg\max_E \, p_\theta(E \mid V, C, (T)), where T denotes the explicit target of sarcasm when available (entity, event, or concept), and p_\theta parametrizes the model. In dialogue tasks, V and C are replaced by sequences; Sarcasm Explanation in Dialogue (SED) approaches extend MuSE to code-mixed, multi-party interaction with possibilities for audio and video input (Kumar et al., 2022, Kumar et al., 2022, Ouyang et al., 2024, Guo et al., 28 Jan 2026).

MuSE is evaluated both as a pure generation task (natural-language explanation) and often as a downstream enhancement to sarcasm detection, humor identification, or emotion recognition classifiers (Kumar et al., 2022).
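The arg-max formulation above can be illustrated with a toy example. Here `p_theta` is a hypothetical stand-in scorer (not any published model's actual objective); real MuSE systems parametrize it with a sequence-to-sequence model such as BART and decode approximately rather than enumerating candidates.

```python
def p_theta(explanation, image_tags, caption, target=None):
    """Toy stand-in for log p_theta(E | V, C, (T)): rewards explanations
    that mention visual cues, caption words, and the sarcasm target."""
    words = set(explanation.lower().split())
    score = sum(1.0 for tag in image_tags if tag in words)
    score += sum(0.5 for w in caption.lower().split() if w in words)
    if target is not None and target.lower() in words:
        score += 2.0
    return score - 0.01 * len(words)  # mild length penalty

def explain(candidates, image_tags, caption, target=None):
    """E* = argmax_E p_theta(E | V, C, (T)) over a candidate pool."""
    return max(candidates, key=lambda e: p_theta(e, image_tags, caption, target))

candidates = [
    "what a lovely day",
    "the author mocks the rain despite calling it a lovely day",
]
best = explain(candidates, image_tags=["rain"],
               caption="what a lovely day", target="rain")
```

The target term T is parenthesized in the formulation because, as discussed below, most datasets provide it only as an optional annotation.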

2. Datasets and Annotation Paradigms

The progress of MuSE research is closely tied to two main datasets: MORE and WITS, each built for different modalities and domains.

  • MORE (Multimodal Sarcasm Explanation Dataset): 3,510 English sarcastic image-caption pairs, each annotated with a free-form rationale focusing on cross-modal incongruity (Desai et al., 2021). MORE+ augments these with explicit human-annotated sarcasm targets (Goel et al., 11 Feb 2025).
    • Splits: 2,983 train, 175 val, 352 test.
    • Average caption 19.7 tokens, explanation 15.4, target 4.2.
  • WITS (Why Is This Sarcastic?): 2,240 code-mixed (Hindi-English) multi-party dialogues from television scripts, annotated with sarcasm triggers, targets, and explanations (Kumar et al., 2022, Kumar et al., 2022, Ouyang et al., 2024, Guo et al., 28 Jan 2026).

Additional benchmarks for VLM-powered methods include MuSE (Desai et al.), MMSD2.0, and SarcNet, varying in balance, explanation coverage, and language (Basnet et al., 13 Oct 2025).

3. Core Model Architectures and Key Innovations

The following table summarizes major architectural milestones in MuSE research:

| Model | Fusion Approach | External Knowledge | Target Awareness | Evaluation Domain |
|---|---|---|---|---|
| ExMore (Desai et al., 2021) | Cross-modal transformer | — | — | IMAGE+TEXT (MORE) |
| TEAM (Jing et al., 2023) | Multi-source semantic graph | ConceptNet | — | IMAGE+TEXT (MORE) |
| TURBO (Goel et al., 11 Feb 2025) | Target-augmented shared fusion | ConceptNet | Explicit | IMAGE+TEXT (MORE+) |
| MOSES (Kumar et al., 2022) | Spotlight-aware modal fusion | — | Indirect | DIALOGUE (WITS) |
| EDGE (Ouyang et al., 2024) | Sentiment-enhanced graph | SenticNet | Indirect | DIALOGUE (WITS) |
| MuVaC (Guo et al., 28 Jan 2026) | Causal ATF + variational chain | — | Indirect | DIALOGUE (WITS) |
| VLM+MuSE (Wang et al., 5 Aug 2025) | Prompt-based pipeline | ConceptNet | — | IMAGE+TEXT (MORE) |

ExMore

Employs a cross-modal Transformer encoder that aligns textual and visual features (VGG-19 image features paired with a BART encoder-decoder), optimized with a negative log-likelihood objective for explanation generation (Desai et al., 2021).

TEAM

Introduces multi-source semantic graphs comprising caption tokens, object-level metadata (from Faster R-CNN), and one-hop ConceptNet concepts. Feature embeddings are fused via stacked GCN layers, residual addition, and auto-regressive generation (Jing et al., 2023). Object-metadata and knowledge integration yield a BLEU-4 lift from 4.26 to 33.16.
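The stacked-GCN fusion at the heart of TEAM can be sketched as follows. This is an illustrative NumPy implementation of one generic graph-convolution step with a residual connection; the node set (caption tokens, object metadata, ConceptNet concepts), dimensions, and layer wiring are simplified relative to the paper.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step with residual addition:
    H' = ReLU(D^-1/2 (A + I) D^-1/2 H W) + H.
    H: (n, d) node features; A: (n, n) adjacency; W: (d, d) weights."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0) + H  # ReLU + residual

# Toy graph: 2 caption tokens, 1 object label, 1 ConceptNet concept
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H2 = gcn_layer(H, A, rng.normal(size=(8, 8)))
```

In TEAM, the propagated node states are then consumed by an auto-regressive decoder for explanation generation.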

TURBO

Explicitly models sarcasm targets, integrating them as special tokens and embeddings throughout enrichment, graph convolution, and a shared fusion mechanism atop BART (Goel et al., 11 Feb 2025). Visual streams are stratified into low (BLIP captions), medium (YOLOv9 object tags), and high (ViT patch embeddings) levels. Target supervision and shared fusion boost performance by approximately 3.3% over TEAM on BLEU, ROUGE, and METEOR.
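The two target-aware ingredients can be sketched schematically. The gating form below is a generic gated fusion, not TURBO's exact mechanism (which also conditions on target embeddings), and the `<target>` marker tokens are hypothetical placeholders for the paper's special tokens.

```python
import numpy as np

def shared_fusion(text_repr, visual_repr, Wg):
    """Generic gated shared fusion: g = sigmoid(Wg [t; v]),
    fused = g * t + (1 - g) * v. Simplified relative to TURBO."""
    concat = np.concatenate([text_repr, visual_repr])
    g = 1.0 / (1.0 + np.exp(-(Wg @ concat)))
    return g * text_repr + (1.0 - g) * visual_repr

def with_target_tokens(caption, target):
    """Append the annotated sarcasm target wrapped in (hypothetical)
    special marker tokens, so target words stay distinguishable."""
    return f"{caption} <target> {target} </target>"

fused = shared_fusion(np.ones(4), np.zeros(4), np.zeros((4, 8)))
enriched = with_target_tokens("what lovely weather", "the rain")
```

With a zero gate matrix the sigmoid outputs 0.5 everywhere, so the fused vector is the plain average of the two streams; training learns to bias the gate toward the more informative modality per dimension.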

MOSES

Employs spotlight-aware, multimodal context attention for code-mixed dialogue. BART-based encoder-decoders integrate text, audio, and visual features via context-aware attention and global information fusion, with explicit pronunciation embeddings for code-mixed robustness (Kumar et al., 2022).

EDGE

Constructs a sentiment-enhanced context graph where nodes include utterance tokens, utterance-level sentiment (via BabelSenticNet), and video/audio sentiment (via joint cross-attention based inference). GCN propagation across these nodes strengthens modeling of subtle sentiment contrasts, a hallmark of sarcasm in conversation (Ouyang et al., 2024).

MuVaC

Formulates MuSE jointly with detection as a structural causal model: multimodal features M generate explanations E, which induce latent features F feeding into the sarcasm label Y. Variational learning aligns q(F \mid E') (from gold explanations) with p(F \mid \hat{E}) (from generated explanations), enforcing consistency between detection and explanation (Guo et al., 28 Jan 2026).
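The alignment between the two latent-feature distributions is typically realized as a KL term inside an ELBO. Assuming diagonal-Gaussian posteriors (a common variational choice; the paper's exact parametrization may differ), the term looks like this:

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians:
    the alignment penalty between q(F|E') and p(F|E_hat)."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

mu, lv = np.zeros(4), np.zeros(4)
identical = kl_diag_gauss(mu, lv, mu, lv)        # zero when posteriors match
shifted = kl_diag_gauss(np.ones(4), lv, mu, lv)  # positive when they diverge
```

Minimizing this term pushes the features induced by generated explanations toward those induced by gold explanations, which is what makes detection and explanation mutually consistent.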

VLM-Driven Prompt Approaches (MuSE pipeline)

A training-free mechanism extracts fine-grained objects (Fast R-CNN), augments with ConceptNet knowledge, and assembles LVLM prompts ("Image objects: ... Related concepts: ... Caption: ... Explain why this is ironic.") for zero-shot explanation generation (Wang et al., 5 Aug 2025, Basnet et al., 13 Oct 2025). Modular ablations reveal the complementary effect of both object and knowledge augmentation.
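The prompt-assembly step of such a training-free pipeline is easy to sketch. The helper name and exact formatting below are illustrative; only the template wording is taken from the quote above, and object detection and ConceptNet retrieval are assumed to have already run.

```python
def build_muse_prompt(objects, concepts, caption):
    """Assemble a structured LVLM prompt from detected image objects,
    retrieved ConceptNet concepts, and the post caption."""
    return (
        f"Image objects: {', '.join(objects)}\n"
        f"Related concepts: {', '.join(concepts)}\n"
        f"Caption: {caption}\n"
        "Explain why this is ironic."
    )

prompt = build_muse_prompt(
    objects=["umbrella", "flooded street"],
    concepts=["rain", "bad weather"],
    caption="Perfect day for a picnic!",
)
```

Because the pipeline is prompt-only, swapping the backbone LVLM requires no retraining, which is what makes the modular ablations (objects only, knowledge only, both) cheap to run.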

4. Evaluation Metrics and Comparative Outcomes

MuSE research employs multidimensional evaluation on both automatic and human axes:

  • Automatic metrics: BLEU-n (n=1..4), ROUGE-{1,2,L}, METEOR, BERTScore, Sent-BERT cosine (Desai et al., 2021, Jing et al., 2023, Goel et al., 11 Feb 2025). Some works include CLIPScore for vision-language grounding (Basnet et al., 13 Oct 2025).
    • Example: TURBO achieves BLEU-4 = 35.26 vs. TEAM 33.16; ROUGE-L = 53.12 vs. 50.58 (Goel et al., 11 Feb 2025).
    • Sentiment- and knowledge-aware models show marked improvements over text-only or plain-fusion baselines (+2–3% gains on ROUGE and BLEU).
  • Human evaluation: Criteria typically include fluency, semantic adequacy, target mention, negative connotation, and relevance to sarcasm. TURBO surpasses TEAM by up to +10.6% on target presence (Goel et al., 11 Feb 2025).
  • Qualitative diagnostics: Large vision-language models (VLMs) and LVLM-augmented pipelines perform competitively in zero/one-shot setups but commonly omit explicit targets or misattribute the origin of the sarcasm (Wang et al., 5 Aug 2025, Basnet et al., 13 Oct 2025).
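As a minimal illustration of the n-gram overlap family of metrics listed above, here is modified unigram precision, i.e. BLEU-1 without the brevity penalty and higher-order n-grams that full BLEU adds (production evaluations use a standard toolkit rather than hand-rolled code):

```python
from collections import Counter

def bleu1(hypothesis, reference):
    """Modified unigram precision: clipped hypothesis-token matches
    divided by hypothesis length (BLEU-1 sans brevity penalty)."""
    hyp, ref = hypothesis.split(), reference.split()
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    clipped = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return clipped / max(len(hyp), 1)

score = bleu1(
    "the caption mocks the lovely rainy weather",
    "the explanation says the caption mocks rainy weather",
)
```

Here six of the seven hypothesis tokens find clipped matches in the reference ("lovely" does not), giving 6/7. Clipping is what stops a degenerate hypothesis like "the the the" from scoring highly.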

5. External Knowledge, Targeting, and Causal Reasoning

External information and causal modeling are central in closing the explanatory gap between literal and intended meaning in MuSE:

  • External knowledge: ConceptNet is extensively used for expanding object and caption tokens, introducing background concepts crucial to the sarcasm's logic (Jing et al., 2023, Goel et al., 11 Feb 2025, Wang et al., 5 Aug 2025). Integration is mainly via graph-structured augmentation and prompt concatenation.
  • Sarcasm target modeling: TURBO is the first to systematically encode the explicit sarcasm target throughout the model stack, appending it to enriched sequences and learning a target-aware fusion (Goel et al., 11 Feb 2025). Empirically, target inclusion is necessary for accurate rationale generation.
  • Causal and variational objectives: MuVaC enforces a causal chain between explanation generation and sarcasm detection via ELBO optimization, penalizing explanation-detection inconsistency at the level of latent explanation features (Guo et al., 28 Jan 2026). This enforces that explanations contain features essential for valid detection.
  • Sentiment reasoning: EDGE’s context graph incorporates utterance and multimodal affect signals, with propagation enhanced by semantic-sentiment weighted edges, reflecting the essential role of sentiment incongruity in sarcasm (Ouyang et al., 2024).
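The one-hop ConceptNet expansion used by the knowledge-augmented models above amounts to a neighbor lookup per token. The edge dictionary below is a toy stand-in (the listed relations are illustrative, not real ConceptNet data); actual pipelines query the ConceptNet graph itself.

```python
# Toy ConceptNet-style edge store (illustrative data only).
EDGES = {
    "rain": ["wet", "bad weather", "umbrella"],
    "picnic": ["outdoors", "sunny day", "food"],
}

def one_hop_expand(tokens, edges, limit=3):
    """Expand caption/object tokens with up to `limit` one-hop
    neighbor concepts each, in token order."""
    expanded = []
    for t in tokens:
        expanded.extend(edges.get(t, [])[:limit])
    return expanded

concepts = one_hop_expand(["rain", "picnic"], EDGES)
```

Because the lookup is deterministic and context-agnostic, the same token always pulls in the same concepts, which is precisely the granularity limitation flagged in Section 7.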

6. Large Vision-Language Models and Structured Rationales

Recent advancements leverage instruction-tuned vision-language models and few-shot prompting:

  • MuSE pipeline (training-free): LVLMs (LLaVA, InstructBLIP, etc.) are augmented with object, attribute, and commonsense enrichments via structured prompts to elicit explanations zero-shot. Performance gains are modest but consistent, with object + knowledge ablation studies showing complementary importance (Wang et al., 5 Aug 2025).
  • Dual expert and rationale distillation: MiDRE employs internal (content-only) and external (Chain-of-Thought, LVLM-derived) experts, adaptively gated at each encoder layer, surfacing structured rationales as stepwise human-readable justifications for the sarcasm verdict. Chain-of-Thought rationales encode world knowledge and contextual associations lacking in deep feature-based reasoning (Jana et al., 6 Jul 2025).
  • Evaluation limitations: Despite respectable BLEU and ROUGE, VLM explanations often underperform on grounding and fluency relative to tailored fusion models. Only models with explicit target or knowledge integration reliably mention the correct focus of ridicule and the required negative connotation (Basnet et al., 13 Oct 2025).

7. Limitations, Challenges, and Future Directions

Contemporary MuSE research highlights the following unresolved challenges:

  • Data limitations: Current datasets (3–5K examples) restrict model scaling, especially for deep NLG architectures (Desai et al., 2021).
  • Target acquisition: Most models rely on human-annotated sarcasm targets; robust, fully-automatic target prediction remains open (Goel et al., 11 Feb 2025).
  • Knowledge granularity: External knowledge integration is deterministic and context-agnostic; learning dynamic, context-sensitive retrieval is a natural extension (Goel et al., 11 Feb 2025).
  • OCR and embedded text: Lack of explicit OCR feature encoding in most models hinders full exploitation of textual cues in images (Goel et al., 11 Feb 2025).
  • Causal evaluation: Causal pathways are modeled in MuVaC, but broader adoption across architectures is pending (Guo et al., 28 Jan 2026).
  • Sentiment fusion: More sophisticated modeling of dynamic, multi-scale sentiment is needed for dialogue and code-mixed settings (Ouyang et al., 2024).
  • Generalization to VLMs: Off-the-shelf VLMs detect sarcasm moderately but fail to generate fully grounded, human-style explanations without fine-tuning or retrieval augmentation (Basnet et al., 13 Oct 2025, Wang et al., 5 Aug 2025).

Future research is expected to explore: end-to-end fusion of target and context with dynamic retrieval, retrieval-augmented or hypergraph-based fusion architectures, multi-stage or Chain-of-Thought prompting for richer rationales, integration of OCR signals, and human-in-the-loop evaluation for iterative refinement (Goel et al., 11 Feb 2025, Basnet et al., 13 Oct 2025, Wang et al., 5 Aug 2025). The combinatorial complexity of the task, the need for both cross-modal and commonsense reasoning, and the ongoing evolution of foundational models ensure this remains a rapidly advancing research frontier.
