Multimodal Fact Verification
- Multimodal Fact Verification is the automated process that integrates text and visual evidence to assess the veracity of claims.
- It employs advanced models like ViT and SBERT-MPNet to jointly analyze heterogeneous signals and resolve inconsistencies.
- Benchmark datasets like FACTIFY 2 use balanced labels and rigorous protocols to enhance evaluation and guide model improvements.
Multimodal fact verification is the automated determination of whether a claim is true, false, or unverified by jointly reasoning over textual and visual (and sometimes other) evidence. The central challenge is that misinformation often exploits inconsistencies or manipulative associations across modalities—such as text–image pairs in social media posts, or satirical articles with provocative imagery. Multimodal verification systems must fuse signals from heterogeneous sources, resolve entailment at both the linguistic and perceptual level, and, in advanced frameworks, provide interpretable rationales and fine-grained entailment explanations.
1. Multimodal Fact Verification Datasets
The development of multimodal fact verification research has been driven by large, systematically constructed datasets, among which the FACTIFY series is the most prominent early resource. FACTIFY 2, a benchmark dataset of 50,000 claim–document pairs, is notable for its balanced label coverage and rigorous annotation protocol (Suryavardan et al., 2023).
Dataset Construction
- Sources: Claims and supporting documents are collected from real news (major verified news Twitter handles), fake news and fact-checking sites (Snopes, Factly, Boom), and satirical news (Fauxy, EmpireNews).
- Pairing and Annotation: Document–claim pairs are constructed via text and image similarity: SBERT (paraphrase-MiniLM-L6-v2) is used for text; ResNet-50 embeddings and histogram-based similarity for images. Each image is also run through OCR for embedded textual content.
- Label Schema: Three coarse classes—Support, No-Evidence, Refute—subdivided by whether entailment is established unimodally or only via cross-modal agreement.
| Label | Count | Description |
|---|---|---|
| Support_Multimodal | 10,000 | Both text and image support the claim |
| Support_Text | 10,000 | Only text supports; images mismatch |
| Insufficient_Multimodal | 10,000 | Only images match; text is insufficient |
| Insufficient_Text | 10,000 | Text is topically relevant but non-entailing |
| Refute | 10,000 | Text or image contradicts the claim |
Annotation is guided by explicit rules: contradiction by any modality is prioritized; agreement in both modalities results in a Support_Multimodal label; partial matches are mapped to intermediate (Insufficient*) classes. Inter-annotator reliability achieves a Fleiss’ κ of 0.82, indicating strong agreement (Suryavardan et al., 2023).
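The similarity-based pairing step described above can be sketched as follows. This is a minimal illustration that assumes text embeddings (e.g., from SBERT paraphrase-MiniLM-L6-v2) and image embeddings (e.g., from ResNet-50) are already computed; the 0.5/0.5 weighting is a hypothetical choice, not a detail from the paper:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def pair_score(text_sim: float, image_sim: float,
               w_text: float = 0.5, w_image: float = 0.5) -> float:
    """Combine text and image similarity into one pairing score.
    The weights here are illustrative assumptions."""
    return w_text * text_sim + w_image * image_sim

# Toy stand-ins for precomputed embeddings: 384-d for MiniLM text
# vectors, 2048-d for ResNet-50 image vectors.
rng = np.random.default_rng(0)
claim_text, doc_text = rng.normal(size=384), rng.normal(size=384)
claim_img, doc_img = rng.normal(size=2048), rng.normal(size=2048)

score = pair_score(cosine(claim_text, doc_text),
                   cosine(claim_img, doc_img))
```

In the actual pipeline the top-scoring documents per claim would be retained as candidate pairs before annotation.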
2. Problem Formalism and Label Taxonomy
FACTIFY 2 frames multimodal fact verification as a five-way entailment task. Given a claim $(t_c, i_c)$ and a candidate document $(t_d, i_d)$, where $t$ denotes text and $i$ an image, the classification model must assign one of the five fine-grained labels. The critical definitions:
- Support_Text: $t_d$ entails $t_c$; the images do not match.
- Support_Multimodal: Both $t_d$ entails $t_c$ and $i_d$ matches $i_c$.
- Insufficient_Text: $t_c$ is not entailed by $t_d$ (only topical overlap), and the images do not match.
- Insufficient_Multimodal: $t_c$ is not entailed by $t_d$, but $i_d$ matches $i_c$.
- Refute: $t_d$ and/or $i_d$ contradict the claim $(t_c, i_c)$.
This schema supports granular evaluation of cross-modal entailment and directly exposes errors stemming from partial agreement, visual distractors, and textual subtleties.
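The taxonomy, combined with the annotation priority rules (contradiction first, then full cross-modal agreement, then partial matches), can be expressed as a small decision procedure. The boolean flags are illustrative abstractions of the entailment/match judgments, not an API from the paper:

```python
def factify2_label(text_entails: bool, image_matches: bool,
                   contradiction: bool) -> str:
    """Map unimodal entailment/match flags to a FACTIFY 2 label.

    Mirrors the stated annotation priority: contradiction by any
    modality wins; then agreement in both modalities; then partial
    matches fall into the Insufficient_* classes.
    """
    if contradiction:
        return "Refute"
    if text_entails and image_matches:
        return "Support_Multimodal"
    if text_entails:                    # text supports, images mismatch
        return "Support_Text"
    if image_matches:                   # images match, text insufficient
        return "Insufficient_Multimodal"
    return "Insufficient_Text"          # topical overlap only
```

Encoding the rules this way makes the priority ordering explicit, which is exactly where models make systematic errors (see Section 5).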
3. Model Architectures and Training Protocols
The FACTIFY 2 benchmark introduces a modular fusion architecture as a baseline. Its design exemplifies the requirements for robust multimodal reasoning:
- Text Encoder: Both claim and document texts are encoded with SBERT-MPNet (resulting in 768-dim vectors).
- Image Encoder: Both images are encoded with ViT-Base (768-dim pooled outputs).
- Joint Representation: Concatenate the claim/document text embeddings and the claim/document image embeddings into a single $4 \times 768 = 3072$-dimensional vector.
- Classification Head: A linear projection yields a joint hidden representation $h$, which precedes a 5-way softmax classifier.
Training specifics: AdamW optimizer, batch size 32, 5 epochs, dropout 0.1, with early stopping based on macro-F1. This architecture reached a macro-F1 of 0.65 on the test set (Suryavardan et al., 2023).
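The late-fusion head above can be sketched in a few lines. This is a minimal numpy illustration with random stand-ins for the encoder outputs and untrained projection weights; it shows the shapes and data flow, not the trained model:

```python
import numpy as np

DIM = 768          # per-encoder output size (SBERT-MPNet / ViT-Base)
NUM_CLASSES = 5
rng = np.random.default_rng(42)

def fuse_and_classify(t_claim, t_doc, i_claim, i_doc, W, b):
    """Concatenate the four 768-d embeddings, then apply a linear
    projection and softmax, as in the baseline classification head."""
    joint = np.concatenate([t_claim, t_doc, i_claim, i_doc])  # 3072-d
    logits = W @ joint + b
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Random stand-ins for encoder outputs and (untrained) head weights.
embs = [rng.normal(size=DIM) for _ in range(4)]
W = rng.normal(scale=0.01, size=(NUM_CLASSES, 4 * DIM))
b = np.zeros(NUM_CLASSES)
probs = fuse_and_classify(*embs, W, b)    # distribution over 5 labels
```

In training, the encoders stay as feature extractors while the projection and classifier weights are optimized against the five-way labels.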
4. Evaluation Metrics, Results, and Comparative Performance
Evaluation is performed via precision, recall, and F1 for each class, summarized by the unweighted macro-average $\mathrm{Macro\text{-}F1} = \frac{1}{5}\sum_{c=1}^{5} \mathrm{F1}_c$.
Key results on FACTIFY 2 test set, using the baseline ViT + SBERT-MPNet:
- Macro-F1: 0.65 over five classes.
- F1 by coarse class: Support 0.68, No-Evidence 0.64, Refute 0.62.
Comparative models achieve up to 0.6499 Macro-F1 (ViT + SBERT-MPNet), with marked improvements over ResNet-50 + SBERT baselines (0.4504–0.4727). The gap highlights the significance of vision transformers and strong contextualized text encoders for this domain (Suryavardan et al., 2023).
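The macro-F1 used throughout can be computed directly from predictions; this library-free sketch averages per-class F1 over all five labels (classes absent from the predictions contribute an F1 of 0, which is the standard convention):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

LABELS = ["Support_Multimodal", "Support_Text",
          "Insufficient_Multimodal", "Insufficient_Text", "Refute"]
```

Because the macro-average weights all five classes equally, a model cannot inflate its score by favoring the easier Support classes over Refute.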
5. Error Analysis and Open Problems
Qualitative and quantitative analyses pinpoint core bottlenecks:
- Visual Distractors: Generic or non-distinctive images often lead to false positives in Support_Multimodal or Insufficient_Multimodal.
- Textual Ambiguity: Distinguishing Insufficient_Text from Support_Text demands fine-grained reading comprehension—modal verbs or hedges are frequent confounders.
- Modality Conflict Handling: When text supports but image contradicts (or vice versa), misclassification as Support_Multimodal is common; annotation prioritizes contradiction, but models make systematic errors.
- Error Modes: Visual confusion, insufficient textual distinctions, and improper resolution to "Insufficient" classes persist as recurrent weaknesses.
6. Future Research Directions
Several critical unsolved challenges and research priorities emerge:
- Explainability: Models currently lack human-readable justifications for their verdicts, particularly for "Refute" labels.
- Synthetic and Adversarial Cases: Augmenting the Refute class with adversarial (auto-generated or swapped) images could stress-test visual reasoning and robustness.
- Cross-Modal Pretraining: Stronger alignment between text and image representations (vision-language models such as CLIP, ViLT) is hypothesized to yield improvements.
- Temporal Reasoning: Detecting stale or context-shifted images is out of reach for existing systems, but crucial for real-world deployment.
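One of the directions above, augmenting the Refute class with swapped images, can be prototyped with a derangement over existing pairs so that every claim receives an image from a different example. This is a hypothetical augmentation scheme for illustration, not part of the FACTIFY 2 pipeline:

```python
import random

def swap_images(pairs, seed=0):
    """Create adversarial Refute-style examples by pairing each claim
    with an image drawn from a *different* example (a derangement),
    so text and image are deliberately mismatched.

    `pairs` is a list of dicts with "claim" and "image" keys; needs
    at least two examples for a derangement to exist.
    """
    rng = random.Random(seed)
    idx = list(range(len(pairs)))
    while True:                         # rejection-sample a derangement
        shuffled = idx[:]
        rng.shuffle(shuffled)
        if all(i != j for i, j in zip(idx, shuffled)):
            break
    return [
        {"claim": pairs[i]["claim"],
         "image": pairs[j]["image"],
         "label": "Refute"}
        for i, j in zip(idx, shuffled)
    ]
```

Such synthetic mismatches stress-test whether a model actually consults the image, rather than defaulting to text-only cues.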
7. Significance and Applications
The FACTIFY 2 resource and methodology exemplify the maturation of multimodal fact verification into a rigorous, large-scale evaluation task. Balanced label distribution, explicit entailment schema, and accessible baselines provide a foundation for advancing cross-modal misinformation detection. The macro-F1 ceiling of 0.65 indicates significant headroom for model improvement, especially under adversarial or ambiguous conditions. As the problem space expands to more modalities (video, audio), real-world deployment contexts (social media, journalism), and ever-larger scale, these principles and benchmarks will guide the field’s evolution (Suryavardan et al., 2023).