
Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects

Published 8 May 2025 in cs.CV, cs.AI, cs.CY, cs.HC, and cs.RO | arXiv:2505.05318v1

Abstract: The rapid adoption of Vision Language Models (VLMs), pre-trained on large image-text and video-text datasets, calls for protecting and informing users about when to trust these systems. This survey reviews studies on trust dynamics in user-VLM interactions, through a multi-disciplinary taxonomy encompassing different cognitive science capabilities, collaboration modes, and agent behaviours. Literature insights and findings from a workshop with prospective VLM users inform preliminary requirements for future VLM trust studies.

Summary

Mapping User Trust in Vision Language Models

In the paper "Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects," the authors present a critical examination of Vision Language Models (VLMs) and their rapid integration into diverse applications, especially those demanding user trust. The study explores the dynamics of trust in user-VLM interactions, acknowledging the unique methodological and interdisciplinary challenges these models pose. The paper opens with an analysis of the transformative influence VLMs have had in reducing dependence on curated datasets, thereby enabling zero-shot inference in novel contexts. This capability, although beneficial, introduces significant risks in high-stakes environments such as autonomous systems and healthcare robotics, where errors can have severe consequences.
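To make the idea of zero-shot inference concrete, the sketch below classifies an image against free-form text labels with a pretrained CLIP model via the Hugging Face transformers library. The model name, image path, and candidate labels are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: zero-shot image classification with a pretrained
# vision-language model (CLIP). Model name, image path, and labels are
# illustrative; no task-specific training data is required.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
labels = ["a photo of a pedestrian", "a photo of an empty street"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarities -> probabilities
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The labels can describe concepts the model never saw paired with this image during training, which is precisely what makes such systems attractive in novel contexts and what raises the trust questions the survey examines.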

The theoretical framework for trustworthy AI employed in the paper is rooted in the foundational concepts of ability, benevolence, and integrity, as articulated in Mayer's ABI model. The authors adapt this framework to VLMs by incorporating concepts from cognitive science such as Situated Cognition and Theory of Mind, systematically exploring how VLM capabilities (intuitive physics and psychology, causality, and compositionality) affect user trust. The framework also examines collaborative modes of human-AI interaction, including collaborative planning, learning, and ideation, which are crucial for building trust.

Methodologically, the paper follows an extensive literature review protocol to identify relevant studies and datasets within the expansive domain of VLMs, focusing explicitly on trust and integrity concerns. Through this review, it identifies significant trends centred on hallucinations and adversarial vulnerabilities, while noting a relative paucity of research into users' perception of VLM performance and into broader cognitive capabilities such as causal reasoning. The authors also survey benchmarks in which scene graphs serve as promising structures for reconciling visual and textual inputs, enhancing the legibility and explicability of model responses.
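As a rough illustration of the scene-graph idea, the sketch below models a scene as typed objects (nodes) and labelled relations (edges) whose triples can be checked against a VLM's textual description. All class and field names here are hypothetical, chosen for illustration rather than drawn from any surveyed benchmark's API.

```python
# Minimal sketch of a scene graph: objects as nodes, relations as labelled
# edges. All names are illustrative, not from the surveyed papers.
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str                                 # e.g. "person"
    bbox: tuple[float, float, float, float]   # (x, y, w, h) in image coordinates


@dataclass
class Relation:
    subject: SceneObject
    predicate: str                            # e.g. "holding", "left of"
    obj: SceneObject


@dataclass
class SceneGraph:
    objects: list[SceneObject] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

    def triples(self) -> list[tuple[str, str, str]]:
        """Text triples that can be compared against a VLM's description."""
        return [(r.subject.name, r.predicate, r.obj.name) for r in self.relations]


person = SceneObject("person", (10, 20, 50, 120))
cup = SceneObject("cup", (40, 60, 15, 20))
graph = SceneGraph([person, cup], [Relation(person, "holding", cup)])
print(graph.triples())  # [('person', 'holding', 'cup')]
```

Because the triples are both machine-checkable and human-readable, a structure like this can ground a model's free-form answer in explicit visual evidence, which is the legibility benefit the survey highlights.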

A notable aspect of the paper is its report on an exploratory workshop designed to probe user trust in VLMs through direct interaction. Design and engineering experts evaluated VLM responses against human responses on tasks derived from the STAR benchmark, revealing discrepancies in the models' visual perception and reasoning abilities and underscoring participants' distrust of existing VLM capabilities. The workshop yielded insights into the significance of interaction and engagement for assessing trust, suggesting that future studies should prioritise user agency, diverse modalities, and continuous trust tracking.
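A minimal sketch of the kind of model-versus-human comparison the workshop performed might look like the following. The record format and the simple agreement metric are assumptions for illustration; the STAR benchmark's actual data format and the workshop's evaluation protocol differ.

```python
# Minimal sketch: compare VLM answers with human answers on situated-reasoning
# questions. The record format is illustrative, not STAR's actual schema.
from dataclasses import dataclass


@dataclass
class Item:
    question: str
    human_answer: str   # answer chosen by workshop participants
    model_answer: str   # answer produced by the VLM


def agreement_rate(items: list[Item]) -> float:
    """Fraction of items where the model and human answers coincide."""
    if not items:
        return 0.0
    matches = sum(i.model_answer == i.human_answer for i in items)
    return matches / len(items)


items = [
    Item("What did the person pick up?", "the cup", "the cup"),
    Item("What will the person do next?", "sit down", "open the door"),
]
print(f"model-human agreement: {agreement_rate(items):.2f}")  # 0.50
```

A single agreement score is of course only a starting point; the workshop's emphasis on continuous trust tracking suggests repeating such comparisons over the course of an interaction rather than computing one static figure.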

In terms of implications, the authors advocate a multidisciplinary approach to VLM deployment in real-world scenarios, involving stakeholders from technical, philosophical, and ethical backgrounds. They caution against over-reliance on language modalities, urging the integration of graphical representations for holistic trust assessments. The paper emphasizes bridging user expectations with regulatory frameworks to safeguard against ethical and performance oversights in VLMs, ultimately guiding research towards more inclusive, representative methodologies that ensure accountable and transparent AI development.

Overall, "Mapping User Trust in Vision Language Models" stands as a comprehensive exploration of the varied dimensions underpinning trust in the fast-evolving field of AI, proposing a systematic taxonomy and outlining prerequisites that could pave the way for nuanced, substantive research across domains. As VLMs become increasingly prevalent, the paper serves as a critical resource for harmonizing technological innovation with public trust, ensuring ethical deployment and facilitating robust human-AI collaboration in complex tasks.
