UGotMe: An Embodied System for Affective Human-Robot Interaction
Abstract: Equipping humanoid robots with the ability to understand the emotional states of human interactants and to express emotions appropriately for the situation is essential for affective human-robot interaction. However, deploying current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real world raises two embodiment challenges: handling environmental noise and meeting real-time requirements. First, in multiparty conversation scenarios, noise inherent in the robot's visual observations, stemming from either 1) distracting objects in the scene or 2) inactive speakers appearing in the robot's field of view, hinders the models from extracting emotional cues from vision inputs. Second, real-time response, a desirable feature of any interactive system, is also challenging to achieve. To tackle both challenges, we introduce UGotMe, an affective human-robot interaction system designed specifically for multiparty conversations. To address the first issue, two denoising strategies are proposed and incorporated into the system: to filter out distracting objects in the scene, we extract face images of the speakers from the raw images, and we introduce a customized active face extraction strategy to rule out inactive speakers. For the second issue, we employ efficient data transmission from the robot to the local server to improve real-time responsiveness. We deploy UGotMe on the humanoid robot Ameca to validate its real-time inference capability in practical scenarios. Videos demonstrating real-world deployment are available at https://pi3-141592653.github.io/UGotMe/.
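The denoising idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes MTCNN-style face detection (via the facenet-pytorch package) and a precomputed bounding box for the active speaker (e.g., from sound-source localization or lip-motion cues); the function name extract_active_face and the IoU threshold are illustrative.

```python
# Minimal sketch of active face extraction: detect all faces in the robot's
# view, then keep only the face overlapping the active-speaker region,
# ruling out inactive speakers and background objects.
from facenet_pytorch import MTCNN
from PIL import Image

detector = MTCNN(keep_all=True)  # detect every face in the frame

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def extract_active_face(frame: Image.Image, speaker_box, iou_thresh=0.3):
    """Crop the face closest to the active speaker; discard all others."""
    boxes, _ = detector.detect(frame)
    if boxes is None:
        return None  # no face found; fall back to text/audio modalities
    best = max(boxes, key=lambda b: iou(b, speaker_box))
    if iou(best, speaker_box) < iou_thresh:
        return None  # only inactive speakers or background in view
    x1, y1, x2, y2 = map(int, best)
    return frame.crop((x1, y1, x2, y2))
```

In this sketch, the cropped face image (rather than the full raw frame) would be forwarded to the vision branch of the emotion recognition model, which is the intent of the filtering strategies described in the abstract.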