
SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

Published 28 Oct 2024 in cs.CV | (2410.21411v1)

Abstract: Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. While current methods adopt the paradigm of training a dedicated network end-to-end using labeled image data, they are limited in terms of generalizability and interpretability. To address these issues, we first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of LLMs within a modular framework, providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then utilize LLMs for text-based reasoning. SocialGPT introduces systematic design principles to adapt VFMs and LLMs separately and bridge their gaps. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, as LLMs can generate language-based explanations for the decisions. The manual prompt design process for LLMs at the reasoning phase is tedious and an automated prompt optimization method is desired. As we essentially convert a visual classification task into a generative task of LLMs, automatic prompt optimization encounters a unique long prompt optimization issue. To address this issue, we further propose the Greedy Segment Prompt Optimization (GSPO), which performs a greedy search by utilizing gradient information at the segment level. Experimental results show that GSPO significantly improves performance, and our method also generalizes to different image styles. The code is available at https://github.com/Mengzibin/SocialGPT.


Summary

  • The paper introduces SocialGPT, a framework that leverages Vision Foundation Models and LLMs to generate social 'stories' from images for zero-shot relation recognition.
  • It employs a novel Greedy Segment Prompt Optimization algorithm to refine prompt segments and enhance LLM reasoning performance.
  • Experiments on PIPA and PISC benchmarks validate SocialGPT's ability to generalize across diverse image styles without dedicated training.

Insights into SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

The paper "SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization" introduces a modular framework for recognizing social relationships in images using a novel combination of Vision Foundation Models (VFMs) and LLMs. This approach, termed SocialGPT, addresses the limitations of existing end-to-end deep learning methods that often struggle with generalization and interpretability.

Social relationship recognition identifies categories such as friends, spouses, and colleagues from input images. Traditional approaches train specialized networks directly on labeled image data. These techniques are restrictive: they depend heavily on annotations and offer little interpretability, raising concerns about how well they transfer to unseen scenarios.

The SocialGPT framework rests on the premise of combining the perception abilities of VFMs with the reasoning prowess of LLMs. This separation lets the LLM reason over image content that has been transformed into descriptive text, applying its language-based reasoning to draw conclusions about social relations. Because SocialGPT reframes visual classification as a generative task for LLMs, it also surfaces a distinctive challenge: optimizing the long prompts that drive the reasoning stage.

At the heart of SocialGPT is a two-phase process that leverages the complementary strengths of VFMs and LLMs. The VFMs perceive visual content and translate it into a 'social story', a textual representation of the image that includes both general information and specific, task-oriented descriptions. Symbols that refer to specific people and objects keep the subsequent reasoning unambiguous and efficient.
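The story-assembly step above can be sketched in a few lines. Note that the perception outputs here are canned stand-ins for what VFMs (e.g. SAM for segmentation, BLIP-2 for captioning) would produce, and the function name and formatting are illustrative, not the authors' actual code.

```python
# Sketch of social-story assembly with symbolic person tags (P1, P2, ...).
# The captions below are hard-coded stand-ins for VFM perception outputs.

def build_social_story(scene_caption, person_descriptions):
    """Combine a scene-level caption with per-person descriptions,
    tagging each person with a symbol so later reasoning steps can
    refer to them unambiguously."""
    lines = [f"Scene: {scene_caption}"]
    for i, desc in enumerate(person_descriptions, start=1):
        lines.append(f"P{i}: {desc}")
    return "\n".join(lines)

story = build_social_story(
    "Two people sit at an office desk reviewing documents.",
    ["an adult in a suit, pointing at a chart",
     "an adult taking notes on a laptop"],
)
print(story)
```

The symbolic tags (P1, P2) let the downstream LLM refer back to individuals without re-describing them, which keeps the prompt compact.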

The study introduces a structured prompt, coined SocialPrompt, for instructing LLMs in the reasoning task. SocialPrompt comprises four segments: system, expectation, context, and guidance. Together, these segments help the LLM understand the task, anticipate the form of the expected output, and see an example of reasoning from perception to conclusion. This systematic design lets SocialGPT achieve competitive zero-shot results without additional training on dedicated data.
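A minimal sketch of assembling the four segments into one prompt follows. The segment wording is invented for illustration; only the system/expectation/context/guidance structure follows the paper's description.

```python
# Illustrative four-segment SocialPrompt assembly; segment texts are made up.
SEGMENTS = {
    "system": "You are an expert at inferring social relations from text.",
    "expectation": "Answer with exactly one relation label, e.g. 'friends'.",
    "context": "The social story below describes people tagged P1, P2, ...",
    "guidance": ("Example: if P1 and P2 wear matching uniforms in a workplace, "
                 "conclude 'colleagues'."),
}

def assemble_prompt(segments, story):
    # Fixed segment order: task framing first, then output format,
    # then input description, then a worked example.
    order = ("system", "expectation", "context", "guidance")
    body = "\n\n".join(segments[k] for k in order)
    return f"{body}\n\nSocial story:\n{story}"

prompt = assemble_prompt(SEGMENTS, "Scene: two adults at a desk.")
print(prompt)
```

Keeping the segments as separate named units (rather than one monolithic string) is what makes the segment-level optimization in the next section possible.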

To address the challenge of optimizing lengthy prompts, the authors propose the Greedy Segment Prompt Optimization (GSPO) algorithm. GSPO maintains a candidate set for each segment and performs a greedy search, ranking candidates with gradient information computed at the segment level. This segment-wise optimization significantly improves performance and helps SocialGPT adapt to different image styles and tasks, yielding more precise and contextually relevant outputs.
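The greedy loop at the core of this idea can be sketched as follows. This is a simplified stand-in: the real GSPO ranks candidates using segment-level gradient information, whereas here an arbitrary `score` callable plays that role.

```python
# Minimal greedy segment search in the spirit of GSPO: sweep the segments in
# turn, try each candidate replacement, and keep whichever scores best while
# the other segments stay fixed. score() is a toy proxy for the gradient-based
# ranking signal used in the paper.

def greedy_segment_search(segments, candidates, score):
    """segments: dict name -> current text; candidates: dict name -> list of
    alternatives; score: callable(dict) -> float, higher is better."""
    best = dict(segments)
    for name in segments:
        for cand in candidates.get(name, []):
            trial = dict(best)
            trial[name] = cand
            if score(trial) > score(best):
                best = trial
    return best

# Toy objective: prefer prompts that spell out the expected output format.
score = lambda segs: sum(text.count("label") for text in segs.values())
result = greedy_segment_search(
    {"system": "You analyze stories.", "expectation": "Answer briefly."},
    {"expectation": ["Answer with one relation label.",
                     "Reply with a single relation, please."]},
    score,
)
print(result["expectation"])  # → "Answer with one relation label."
```

Searching one segment at a time keeps the candidate space tractable, which is exactly why decomposing the long prompt into segments matters for optimization.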

SocialGPT is validated on two benchmarks, the PIPA and PISC datasets. It achieves competitive accuracy without training on either dataset, demonstrating strong zero-shot capability. Ablation studies show the contribution of each component of the social story generation and reaffirm the efficacy of the modular framework. The framework also generalizes to diverse image styles, such as sketches and cartoons, which traditional models fail to handle.

The practical implications of SocialGPT are manifold. It opens avenues for deployment in applications requiring human-like interpretative abilities on visual content, ranging from security systems to social media analytics. Theoretically, this research paves the way for more integrated approaches leveraging multi-modal AI, marrying vision and language to address complex cognitive tasks.

From a forward-looking perspective, advancements in foundation models, both visual and linguistic, will likely continue to enhance SocialGPT's efficiency and accuracy. Addressing foundational model biases and ensuring ethical use of AI technologies are identified as critical areas for future exploration, aligning with broader AI safety and fairness objectives.
