Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations
Abstract: Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks that model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations, synchronizing visual features with their corresponding utterances. This enables the model to capture verbal and non-verbal cues pertinent to social reasoning concurrently. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions. Project website: https://sangmin-git.github.io/projects/MMSI.
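The core idea of "densely aligned" representations — synchronizing visual features with their corresponding utterances — can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name, the mean-pooling choice, and the timestamp format are all assumptions. It pools per-frame visual features over each utterance's time span so that every utterance gets a matching visual vector.

```python
import numpy as np

def align_visual_to_utterances(frame_feats, frame_times, utterance_spans):
    """Pool per-frame visual features over each utterance's time span.

    frame_feats:     (num_frames, feat_dim) array of visual features.
    frame_times:     (num_frames,) array of frame timestamps in seconds.
    utterance_spans: list of (start, end) times, one per utterance.

    Returns an (num_utterances, feat_dim) array: one pooled visual
    feature per utterance, i.e. a densely aligned representation.
    """
    aligned = []
    for start, end in utterance_spans:
        # Select the frames that fall inside this utterance's interval.
        mask = (frame_times >= start) & (frame_times <= end)
        if mask.any():
            aligned.append(frame_feats[mask].mean(axis=0))
        else:
            # No frames overlap this utterance: fall back to zeros.
            aligned.append(np.zeros(frame_feats.shape[1]))
    return np.stack(aligned)
```

The aligned output can then be concatenated or cross-attended with the utterance's language embedding, so verbal and non-verbal cues are processed jointly rather than via one holistic video-level feature.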