
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

Published 4 Dec 2024 in cs.CV | (2412.03735v2)

Abstract: Multimodal LLMs (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that MLLM visual encoders often fail to distinguish visually different yet semantically similar video pairs, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. It consists of 5,002 videos, paired to highlight cases prone to hallucinations. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. Comprehensive testing shows that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference. Our results show that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VidHalluc benchmark and DINO-HEAL code are available at https://people-robots.github.io/vidhalluc.

Summary

  • The paper introduces VidHalluc, a benchmark for evaluating hallucinations in multimodal video models by assessing action, temporal sequence, and scene transitions.
  • It employs a semi-automated pipeline combining GPT-4 annotations with human review to ensure reliable evaluation of hallucination types.
  • The study proposes DINO-HEAL, a training-free method that improves spatial understanding using saliency maps, revealing strengths and challenges of current MLLMs.

Overview of VidHalluc: Evaluating Hallucinations in MLLMs

Multimodal LLMs (MLLMs) have achieved notable progress in video understanding tasks, especially in terms of semantic reasoning and instruction-following. However, the phenomenon of hallucination—where models generate plausible yet factually incorrect content—poses challenges that require dedicated attention in the domain of video processing. The VidHalluc benchmark, introduced as the largest of its kind, specifically addresses this issue by evaluating hallucinations in MLLMs across three critical dimensions: action, temporal sequence, and scene transition.

Benchmark Construction and Methodology

VidHalluc comprises 5,002 videos, paired on the basis of high semantic similarity and low visual similarity so that each pair is prone to hallucination. The benchmark employs a semi-automated pipeline in which GPT-4-generated annotations are filtered through human review to ensure accuracy in action and scene recognition.

Figure 1: Overview of the VidHalluc benchmark construction process, illustrating the multi-step selection and validation of video pairs for hallucination evaluation.
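The pairing criterion can be sketched as follows. This is a minimal illustration, not the paper's implementation: `clip_embs` and `dino_embs` stand in for precomputed CLIP/SigLIP and DINOv2 video embeddings, and the thresholds are assumed values for demonstration.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_pairs(clip_embs, dino_embs, sem_min=0.8, vis_max=0.5):
    """Keep video pairs that are semantically close (per CLIP/SigLIP
    embeddings) but visually dissimilar (per DINOv2 embeddings).
    Thresholds are illustrative, not the paper's."""
    pairs = []
    n = len(clip_embs)
    for i in range(n):
        for j in range(i + 1, n):
            sem = cosine(clip_embs[i], clip_embs[j])
            vis = cosine(dino_embs[i], dino_embs[j])
            if sem >= sem_min and vis <= vis_max:
                pairs.append((i, j))
    return pairs
```

The two-filter design is the key idea: high semantic similarity makes the pair confusable at the caption level, while low visual similarity guarantees the videos actually differ.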

Types of Hallucinations

The benchmark assesses three primary types of hallucinations:

  1. Action Hallucination: This type occurs when MLLMs identify actions that differ significantly from the actual content. Action recognition capabilities are assessed through Binary QA and multiple-choice questions, each designed to probe the model's ability to discern correct actions amidst adversarial queries.
  2. Temporal Sequence Hallucination: Models are evaluated on their capacity to accurately represent the order of events within a video, addressing potential misinterpretations of chronological sequences.
  3. Scene Transition Hallucination: This type involves the model's ability to detect and describe transitions between distinct scenes, highlighting potential inaccuracies in scene recognition.

    Figure 2: Examples of the three hallucination types within the VidHalluc benchmark.

Mitigation Strategy: DINO-HEAL

To mitigate these hallucinations, the authors introduce DINO-HEAL, a training-free method that reweights a model's visual features at inference time using spatial saliency maps from DINOv2, steering the encoder's attention toward salient regions.

Figure 3: DINO-HEAL pipeline illustrating the integration of DINOv2 saliency maps to improve spatial weighting in video frames.
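The reweighting idea can be sketched as below. This is a simplified version under stated assumptions: it takes per-frame patch features and a per-patch saliency score (in the paper derived from DINOv2; here treated as given), and scales each patch feature by its normalized saliency. The exact formulation in the paper may differ.

```python
import numpy as np

def reweight_features(patch_feats, saliency):
    """Rescale patch features by a normalized saliency map.

    patch_feats: (num_patches, dim) visual-encoder features for one frame.
    saliency:    (num_patches,) non-negative saliency scores, e.g. derived
                 from DINOv2 attention.

    Min-max normalization keeps the weights in [0, 1]; a uniform map
    (zero range) leaves the features unchanged. Training-free: this runs
    purely at inference time.
    """
    s = saliency - saliency.min()
    rng = s.max()
    w = s / rng if rng > 0 else np.ones_like(s)
    return patch_feats * w[:, None]  # broadcast one weight per patch
```

Because the operation is a per-patch rescaling of already-computed features, it can be dropped in front of many frozen encoders (e.g. CLIP or SigLIP backbones) without touching their weights.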

Experimental Results

Extensive evaluations reveal that most current MLLMs are vulnerable to all three types of hallucination. Proprietary models such as GPT-4o outperformed open-source alternatives, suggesting advantages from larger training datasets and more advanced fine-tuning methods.

Figure 4: Comparative results on VidHalluc, demonstrating the range of performance across different models and hallucination types.

Conclusion

VidHalluc serves as an essential resource for understanding and mitigating hallucinations in video-based MLLMs. The introduction of DINO-HEAL marks a significant step forward, providing an efficient method for enhancing model robustness without the need for retraining. Future research efforts should focus on expanding the scope of hallucination categories and refining methodologies to improve temporal and spatial modeling in MLLM architectures.

Explain it Like I'm 14

What’s this paper about?

This paper looks at a common problem in AI systems that watch videos: “hallucinations.” That’s when an AI confidently says something that isn’t true about what it saw. The authors build a big test set (called VidHalluc) to check how often video AIs make three kinds of mistakes:

  • Action mistakes: saying the wrong thing is happening.
  • Order mistakes: getting the sequence of events wrong.
  • Scene-change mistakes: missing or misreporting when the place changes.

They also introduce a simple add-on (called DINO-HEAL) that helps these AIs focus on the most important parts of each video frame so they hallucinate less—without retraining the whole model.

What questions were the researchers asking?

  • Can we create a large, reliable test to catch when video AIs “make stuff up,” especially about actions, timing, and scene changes?
  • How well do today’s top video-LLMs handle these tricky cases?
  • Is there a quick, training-free way to make these models less likely to hallucinate?

How did they do it?

Building the VidHalluc benchmark

The team created VidHalluc, the largest test set of its kind:

  • Size: 5,002 videos and 9,295 questions.
  • Focus: Three types of hallucinations—actions, event order, and scene transitions.

To make the test both challenging and fair, they paired videos that mean similar things but look different. Think of two videos both about “making tea,” but one shows boiling water first and the other starts with putting tea leaves in a cup. These “same idea, different looks” pairs are good at tricking models.

Here’s how they built it (in everyday terms):

  • Two “checkers” picked video pairs:
    • Meaning checker: finds videos that share a similar overall idea (like CLIP/SigLIP models do).
    • Looks checker: finds videos that visually look different (like DINOv2 does).
  • Automatic helper: GPT-4 read the original video captions to pull out actions and scenes.
  • Human review: People double-checked to make sure the actions and scenes really matched or differed as expected.
  • Question types:
    • Action hallucination: yes/no questions and multiple-choice (“What is the main action?”).
    • Order hallucination: sorting questions to get the order of events right.
    • Scene-change hallucination: “Did a scene change happen?” and “From what to what?”
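One way the adversarial yes/no questions could be assembled from the paired annotations is sketched below. This is a hypothetical illustration: the templates and field names are assumptions, not the benchmark's exact format.

```python
def make_binary_qa(video_action, paired_action):
    """Build a positive and an adversarial negative yes/no question.

    The negative question reuses the action from the paired video, which
    is semantically similar but visually different -- exactly the case
    that trips up models. Question templates are illustrative only.
    """
    return [
        {"question": f"Is the person {video_action} in the video?",
         "answer": "yes"},
        {"question": f"Is the person {paired_action} in the video?",
         "answer": "no"},
    ]
```

For example, a "making tea" video whose pair starts with steeping rather than boiling would yield a positive question about the action actually shown and a negative one about its near-miss counterpart.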

What is DINO-HEAL, and how does it help?

DINO-HEAL is like giving the AI a highlighter for each frame:

  • It uses “saliency maps” (heatmaps of what’s important) from a vision model called DINOv2.
  • These maps reweight the video features so the model pays more attention to key regions (like hands, tools, or moving objects) and less to unimportant background.
  • It’s training-free: you don’t have to retrain the whole model—just apply this at test time.
  • It works with many common video encoders (like CLIP and SigLIP).

Why this helps: Many video AIs rely on visual encoders trained to match images with text. That can make them focus on general context (“kitchen,” “stadium”) instead of the tiny details that prove exactly what action is happening or when it happens. DINO-HEAL nudges the focus back toward the most important bits.

What did they find, and why is it important?

  • Most models hallucinate across all three areas, especially with action and order.
    • Many did much better on multiple-choice action questions than yes/no questions. With choices, models can compare; with yes/no, they must be sure—and that’s where they slip up.
    • For event order, many models fell below 50% accuracy and often collapsed separate actions into one, missing the sequence.
    • For scene changes, most models struggled to reliably detect and describe transitions.
  • Bigger isn’t always better. A larger model or more frames didn’t automatically mean better performance. Models with higher-resolution vision modules generally did better than those with lower resolution.
  • Commercial models did best overall. GPT-4o and Gemini-1.5-Pro were stronger and closer to human performance, with GPT-4o near human levels on several tasks.
  • DINO-HEAL improves results without training:
    • Action questions (yes/no): around +5% improvement for some models.
    • Event order: big gains (about +12% and +19% for two models tested).
    • Scene-change: smaller gains, likely because the method emphasizes foreground objects (people and things) more than background shifts.

Why this matters:

  • Video AIs are used in real-world tools (assistive tech, training videos, robots, safety systems). If they hallucinate about what’s happening or when, outcomes can be wrong or even unsafe.
  • A large, realistic benchmark helps developers spot and fix these weaknesses.
  • A simple, training-free fix like DINO-HEAL is practical for teams without massive computing resources.

What’s the bigger picture?

VidHalluc gives the community a tough, real-world test for video understanding that focuses on things videos are uniquely good at—actions over time and scene changes—not just static objects. The results show we still have work to do: models can be fooled when videos have similar meanings but different looks, and they’re especially shaky on the order of events.

DINO-HEAL shows a promising, easy-to-use way to cut down hallucinations by helping models “look” in the right places. Going forward, the authors suggest:

  • Expanding to more types of hallucinations.
  • Combining spatial and temporal “highlighters” to better capture both what is important and when it matters.

In short, this research gives us a clearer, more honest picture of what video AIs can and can’t do today—and a practical step toward making them more trustworthy.
