SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization
Abstract: Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. Current methods typically train a dedicated network end-to-end on labeled image data, which limits their generalizability and interpretability. To address these issues, we first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework, providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then use LLMs for text-based reasoning. SocialGPT introduces systematic design principles to adapt VFMs and LLMs separately and to bridge the gap between them. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, since the LLMs can generate language-based explanations for their decisions. However, manually designing prompts for the LLM reasoning phase is tedious, so an automated prompt optimization method is desirable. Because we essentially convert a visual classification task into a generative task for LLMs, automatic prompt optimization faces a unique long-prompt optimization problem. To address it, we further propose Greedy Segment Prompt Optimization (GSPO), which performs a greedy search using gradient information at the segment level. Experiments show that GSPO significantly improves performance, and our method also generalizes to different image styles. The code is available at https://github.com/Mengzibin/SocialGPT.
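To make the two-stage design concrete, below is a minimal sketch of a SocialGPT-style pipeline. The helper names (`caption_image`, `segment_people`, `build_social_story`, `query_llm`) are hypothetical stand-ins for the VFM and LLM calls described in the abstract, not the repository's actual API; the stubs return canned strings so the sketch runs end to end.

```python
# Minimal sketch of a SocialGPT-style pipeline: VFMs perceive, an LLM reasons.
# All helpers below are hypothetical stand-ins, not the paper's real interface.

def caption_image(image_path: str) -> str:
    """Stub for a captioning VFM (e.g., BLIP-2); assumed interface."""
    return "Two people in office attire shake hands across a desk."

def segment_people(image_path: str) -> list[str]:
    """Stub for a segmentation VFM (e.g., SAM) plus per-person descriptions."""
    return ["P1: person in a suit, smiling", "P2: person holding a folder"]

def build_social_story(caption: str, people: list[str]) -> str:
    """Fuse the visual descriptions into a textual 'social story'."""
    return caption + "\n" + "\n".join(people)

def query_llm(prompt: str) -> str:
    """Stub for an LLM call (e.g., GPT-4 or Vicuna); returns a relation label."""
    return "colleagues"

def classify_relation(image_path: str) -> str:
    """Convert the image to text, then let the LLM do the relation reasoning."""
    story = build_social_story(caption_image(image_path), segment_people(image_path))
    prompt = ("Given the following social story, classify the relation between "
              "P1 and P2 (friends, spouses, colleagues, ...) and explain why.\n"
              + story)
    return query_llm(prompt)

print(classify_relation("example.jpg"))
```

Similarly, the greedy segment-level search behind GSPO can be sketched as below. This is a toy approximation under stated assumptions: `score` stands in for a dev-set task metric, `propose` for a candidate-rewrite generator, and `grad_signal` for the paper's segment-level gradient information (here just a random segment ordering); the actual GSPO uses real gradients to guide the search.

```python
import random

# Toy sketch of greedy segment-level prompt optimization in the spirit of GSPO.
# score(), propose(), and grad_signal() are placeholder hooks, not the method.

def score(segments: list[str]) -> float:
    """Hypothetical task metric for the assembled prompt (higher is better)."""
    prompt = " ".join(segments)
    return -abs(len(prompt) - 120) / 120.0  # toy stand-in for dev-set accuracy

def grad_signal(segments: list[str]) -> list[int]:
    """Rank segment indices by estimated importance (toy: random ordering)."""
    idx = list(range(len(segments)))
    random.shuffle(idx)
    return idx

def propose(segment: str, k: int = 3) -> list[str]:
    """Hypothetical candidate rewrites of a single segment."""
    return [segment, segment.strip() + ".", segment.upper()][:k]

def gspo(segments: list[str], rounds: int = 2) -> list[str]:
    """Greedily replace one segment at a time, keeping only improvements."""
    best, best_score = list(segments), score(segments)
    for _ in range(rounds):
        for i in grad_signal(best):          # visit segments by importance
            for cand in propose(best[i]):    # try candidate rewrites
                trial = best[:i] + [cand] + best[i + 1:]
                s = score(trial)
                if s > best_score:           # greedy: keep the best so far
                    best, best_score = trial, s
    return best

seed_prompt = ["You are a social relation expert", "read the story",
               "answer with one relation label"]
print(" ".join(gspo(seed_prompt)))
```

The key design point the sketch illustrates is that optimizing whole long prompts token by token is intractable, so the search operates at the coarser granularity of segments and accepts changes greedily, one segment at a time.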