
SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

Published 28 Oct 2024 in cs.CV | (2410.21411v1)

Abstract: Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. While current methods adopt the paradigm of training a dedicated network end-to-end using labeled image data, they are limited in terms of generalizability and interpretability. To address these issues, we first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of LLMs within a modular framework, providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then utilize LLMs for text-based reasoning. SocialGPT introduces systematic design principles to adapt VFMs and LLMs separately and bridge their gaps. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, as LLMs can generate language-based explanations for the decisions. The manual prompt design process for LLMs at the reasoning phase is tedious and an automated prompt optimization method is desired. As we essentially convert a visual classification task into a generative task of LLMs, automatic prompt optimization encounters a unique long prompt optimization issue. To address this issue, we further propose the Greedy Segment Prompt Optimization (GSPO), which performs a greedy search by utilizing gradient information at the segment level. Experimental results show that GSPO significantly improves performance, and our method also generalizes to different image styles. The code is available at https://github.com/Mengzibin/SocialGPT.


Summary

  • The paper introduces SocialGPT, a framework that leverages Vision Foundation Models and LLMs to generate social 'stories' from images for zero-shot relation recognition.
  • It employs a novel Greedy Segment Prompt Optimization algorithm to refine prompt segments and enhance LLM reasoning performance.
  • Experiments on PIPA and PISC benchmarks validate SocialGPT's ability to generalize across diverse image styles without dedicated training.

Insights into SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

The paper "SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization" introduces a modular framework for recognizing social relationships in images using a novel combination of Vision Foundation Models (VFMs) and LLMs. This approach, termed SocialGPT, addresses the limitations of existing end-to-end deep learning methods that often struggle with generalization and interpretability.

Social relationship recognition identifies categories such as friends, spouses, and colleagues from input images. Traditional approaches train specialized networks directly on labeled image data. These techniques are restrictive: they depend heavily on annotations and offer little interpretability, raising concerns about how well they transfer to unseen scenarios.

The SocialGPT framework rests on the premise of combining the perception abilities of VFMs with the reasoning prowess of LLMs. This separation lets the LLM reason over image content that has been transformed into descriptive text, applying its language-based reasoning to draw conclusions about social relations. Because SocialGPT reframes visual classification as a generative task for LLMs, it also surfaces a distinctive challenge: optimizing the long prompts that drive the reasoning stage.

At the heart of SocialGPT is a two-phase process that leverages the complementary strengths of VFMs and LLMs. The VFMs perceive visual content and translate it into a 'social story', a textual representation of the image that includes both general information and specific, task-oriented descriptions. Symbols that refer to specific people and objects keep the subsequent reasoning unambiguous and efficient.
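The story-assembly step above can be sketched in a few lines. Note that the perception outputs here are canned stand-ins for what VFMs (e.g. SAM for segmentation, BLIP-2 for captioning) would produce, and the function name and formatting are illustrative, not the authors' actual code.

```python
# Sketch of social-story assembly with symbolic person tags (P1, P2, ...).
# The captions below are hard-coded stand-ins for VFM perception outputs.

def build_social_story(scene_caption, person_descriptions):
    """Combine a scene-level caption with per-person descriptions,
    tagging each person with a symbol so later reasoning steps can
    refer to them unambiguously."""
    lines = [f"Scene: {scene_caption}"]
    for i, desc in enumerate(person_descriptions, start=1):
        lines.append(f"P{i}: {desc}")
    return "\n".join(lines)

story = build_social_story(
    "Two people sit at an office desk reviewing documents.",
    ["an adult in a suit, pointing at a chart",
     "an adult taking notes on a laptop"],
)
print(story)
```

The symbolic tags (P1, P2) let the downstream LLM refer back to individuals without re-describing them, which keeps the prompt compact.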

The study introduces a structured prompt, coined SocialPrompt, for instructing LLMs in the reasoning task. SocialPrompt comprises four segments: system, expectation, context, and guidance. Together, these segments help the LLM understand the task, anticipate the form of the expected output, and see an example of reasoning from perception to conclusion. This systematic design lets SocialGPT achieve competitive zero-shot results without additional training on dedicated data.
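A minimal sketch of assembling the four segments into one prompt follows. The segment wording is invented for illustration; only the system/expectation/context/guidance structure follows the paper's description.

```python
# Illustrative four-segment SocialPrompt assembly; segment texts are made up.
SEGMENTS = {
    "system": "You are an expert at inferring social relations from text.",
    "expectation": "Answer with exactly one relation label, e.g. 'friends'.",
    "context": "The social story below describes people tagged P1, P2, ...",
    "guidance": ("Example: if P1 and P2 wear matching uniforms in a workplace, "
                 "conclude 'colleagues'."),
}

def assemble_prompt(segments, story):
    # Fixed segment order: task framing first, then output format,
    # then input description, then a worked example.
    order = ("system", "expectation", "context", "guidance")
    body = "\n\n".join(segments[k] for k in order)
    return f"{body}\n\nSocial story:\n{story}"

prompt = assemble_prompt(SEGMENTS, "Scene: two adults at a desk.")
print(prompt)
```

Keeping the segments as separate named units (rather than one monolithic string) is what makes the segment-level optimization in the next section possible.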

To address the challenge of optimizing lengthy prompts, the authors propose the Greedy Segment Prompt Optimization (GSPO) algorithm. GSPO maintains a candidate set for each segment and performs a greedy search, ranking candidates with gradient information computed at the segment level. This segment-wise optimization significantly improves performance and helps SocialGPT adapt to different image styles and tasks, yielding more precise and contextually relevant outputs.
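The greedy loop at the core of this idea can be sketched as follows. This is a simplified stand-in: the real GSPO ranks candidates using segment-level gradient information, whereas here an arbitrary `score` callable plays that role.

```python
# Minimal greedy segment search in the spirit of GSPO: sweep the segments in
# turn, try each candidate replacement, and keep whichever scores best while
# the other segments stay fixed. score() is a toy proxy for the gradient-based
# ranking signal used in the paper.

def greedy_segment_search(segments, candidates, score):
    """segments: dict name -> current text; candidates: dict name -> list of
    alternatives; score: callable(dict) -> float, higher is better."""
    best = dict(segments)
    for name in segments:
        for cand in candidates.get(name, []):
            trial = dict(best)
            trial[name] = cand
            if score(trial) > score(best):
                best = trial
    return best

# Toy objective: prefer prompts that spell out the expected output format.
score = lambda segs: sum(text.count("label") for text in segs.values())
result = greedy_segment_search(
    {"system": "You analyze stories.", "expectation": "Answer briefly."},
    {"expectation": ["Answer with one relation label.",
                     "Reply with a single relation, please."]},
    score,
)
print(result["expectation"])  # → "Answer with one relation label."
```

Searching one segment at a time keeps the candidate space tractable, which is exactly why decomposing the long prompt into segments matters for optimization.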

SocialGPT is validated on two benchmarks, the PIPA and PISC datasets. It achieves competitive accuracy without training on either dataset, demonstrating strong zero-shot capability. Ablation studies show the contribution of each component of the social story generation and reaffirm the efficacy of the modular framework. The framework also generalizes to diverse image styles, such as sketches and cartoons, which traditional models fail to handle.

The practical implications of SocialGPT are manifold. It opens avenues for deployment in applications requiring human-like interpretative abilities on visual content, ranging from security systems to social media analytics. Theoretically, this research paves the way for more integrated approaches leveraging multi-modal AI, marrying vision and language to address complex cognitive tasks.

From a forward-looking perspective, advancements in foundation models, both visual and linguistic, will likely continue to enhance SocialGPT's efficiency and accuracy. Addressing foundational model biases and ensuring ethical use of AI technologies are identified as critical areas for future exploration, aligning with broader AI safety and fairness objectives.
