UGotMe: An Embodied System for Affective Human-Robot Interaction

Published 24 Oct 2024 in cs.RO and cs.HC | (2410.18373v2)

Abstract: Equipping humanoid robots with the capability to understand the emotional states of human interactants and to express emotions appropriately according to the situation is essential for affective human-robot interaction. However, enabling current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real world raises embodiment challenges: addressing environmental noise and meeting real-time requirements. First, in multiparty conversation scenarios, the noise inherent in the robot's visual observations, which may come from either 1) distracting objects in the scene or 2) inactive speakers appearing in the robot's field of view, hinders the models from extracting emotional cues from vision inputs. Second, real-time response, a desirable feature of an interactive system, is also challenging to achieve. To tackle both challenges, we introduce an affective human-robot interaction system called UGotMe designed specifically for multiparty conversations. Two denoising strategies are proposed and incorporated into the system to address the first issue. Specifically, to filter out distracting objects in the scene, we propose extracting face images of the speakers from the raw images and introduce a customized active face extraction strategy to rule out inactive speakers. As for the second issue, we employ efficient data transmission from the robot to the local server to improve real-time response capability. We deploy UGotMe on a humanoid robot named Ameca to validate its real-time inference capabilities in practical scenarios. Videos demonstrating real-world deployment are available at https://pi3-141592653.github.io/UGotMe/.

References (34)
  1. T. Baltrušaitis, M. Mahmoud, and P. Robinson, “Cross-dataset learning and person-specific normalisation for automatic action unit detection,” in 2015 11th IEEE international conference and workshops on automatic face and gesture recognition (FG), 2015.
  2. A. Mehrabian, “Communication without words,” in Communication theory, 2017.
  3. S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “Meld: A multimodal multi-party dataset for emotion recognition in conversations,” arXiv preprint arXiv:1810.02508, 2018.
  4. H. Zhu, C. Yu, and A. Cangelosi, “Affective human-robot interaction with multimodal explanations,” in International Conference on Social Robotics, 2022.
  5. M. K. Chowdary, T. N. Nguyen, and D. J. Hemanth, “Deep learning-based facial emotion recognition for human–computer interaction applications,” Neural Computing and Applications, 2023.
  6. Y. Maeda, T. Sakai, K. Kamei, and E. W. Cooper, “Human-robot interaction based on facial expression recognition using deep learning,” in 2020 Joint 11th International Conference on Soft Computing and Intelligent Systems and 21st International Symposium on Advanced Intelligent Systems, 2020.
  7. Y. Maeda and S. Geshi, “Human-robot interaction using markovian emotional model based on facial recognition,” in 2018 Joint 10th International Conference on Soft Computing and Intelligent Systems (SCIS) and 19th International Symposium on Advanced Intelligent Systems, 2018.
  8. A. Esfandbod, Z. Rokhi, A. Taheri, M. Alemi, and A. Meghdari, “Human-robot interaction based on facial expression imitation,” in 2019 7th international conference on robotics and Mechatronics, 2019.
  9. A. Ghorbandaei Pour, A. Taheri, M. Alemi, and A. Meghdari, “Human–robot facial expression reciprocal interaction platform: case studies on children with autism,” International Journal of Social Robotics, 2018.
  10. Z. Liu, M. Wu, W. Cao, L. Chen, J. Xu, R. Zhang, M. Zhou, and J. Mao, “A facial expression emotion recognition based human-robot interaction system,” IEEE/CAA J. Autom. Sinica, 2017.
  11. L. Chen, K. Wang, M. Li, M. Wu, W. Pedrycz, and K. Hirota, “K-means clustering-based kernel canonical correlation analysis for multimodal emotion recognition in human–robot interaction,” IEEE Transactions on Industrial Electronics, 2022.
  12. F. Cid, L. J. Manso, and P. Núñez, “A novel multimodal emotion recognition approach for affective human robot interaction,” in Proceedings of FinE, 2015.
  13. H.-W. Jung, Y.-H. Seo, M. S. Ryoo, and H. S. Yang, “Affective communication system with multimodality for a humanoid robot, ami,” in 4th IEEE/RAS International Conference on Humanoid Robots, 2004., 2004.
  14. Y. Ma, K. L. Nguyen, F. Z. Xing, and E. Cambria, “A survey on empathetic dialogue systems,” Information Fusion, 2020.
  15. T. Yun, H. Lim, J. Lee, and M. Song, “Telme: Teacher-leading multimodal fusion network for emotion recognition in conversation,” arXiv preprint arXiv:2401.12987, 2024.
  16. G. Hu, T.-E. Lin, Y. Zhao, G. Lu, Y. Wu, and Y. Li, “Unimse: Towards unified multimodal sentiment analysis and emotion recognition,” arXiv preprint arXiv:2211.11256, 2022.
  17. Z. Li, F. Tang, M. Zhao, and Y. Zhu, “Emocaps: Emotion capsule based model for conversational emotion recognition,” arXiv preprint arXiv:2203.13504, 2022.
  18. W. Zheng, J. Yu, R. Xia, and S. Wang, “A facial expression-aware multimodal multi-task learning framework for emotion recognition in multi-party conversations,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
  19. V. Chudasama, P. Kar, A. Gudmalwar, N. Shah, P. Wasnik, and N. Onoe, “M2fnet: Multi-modal fusion network for emotion recognition in conversation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  20. K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE signal processing letters, 2016.
  21. T. Baltrušaitis, P. Robinson, and L.-P. Morency, “Openface: an open source facial behavior analysis toolkit,” in 2016 IEEE winter conference on applications of computer vision, 2016.
  22. S. Zafeiriou and M. Petrou, “Sparse representations for facial expressions recognition via l1 optimization,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, 2010.
  23. C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Proceedings of the AAAI conference on artificial intelligence, 2017.
  24. D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923, 2014.
  25. X. Song, L. Huang, H. Xue, and S. Hu, “Supervised prototypical contrastive learning for emotion recognition in conversation,” arXiv preprint arXiv:2210.08713, 2022.
  26. P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, 2023.
  27. T. Gao, X. Yao, and D. Chen, “Simcse: Simple contrastive learning of sentence embeddings,” arXiv preprint arXiv:2104.08821, 2021.
  28. Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2019.
  29. N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria, “Dialoguernn: An attentive rnn for emotion detection in conversations,” in Proceedings of the AAAI conference on artificial intelligence, 2019.
  30. D. Zhang, L. Wu, C. Sun, S. Li, Q. Zhu, and G. Zhou, “Modeling both context-and speaker-sensitive dependence for emotion detection in multi-speaker conversations.” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019.
  31. J. Hu, Y. Liu, J. Zhao, and Q. Jin, “Mmgcn: Multimodal fusion via deep graph convolution network for emotion recognition in conversation,” arXiv preprint arXiv:2107.06779, 2021.
  32. W. Shen, S. Wu, Y. Yang, and X. Quan, “Directed acyclic graph network for conversational emotion recognition,” arXiv preprint arXiv:2105.12907, 2021.
  33. D. Hu, X. Hou, L. Wei, L. Jiang, and Y. Mo, “Mm-dfn: Multimodal dynamic fusion network for emotion recognition in conversations,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022.
  34. J. Li, X. Wang, G. Lv, and Z. Zeng, “Ga2mif: graph and attention based two-stage multi-source information fusion for conversational emotion detection,” IEEE Transactions on affective computing, 2023.

Summary

  • The paper presents UGotMe, an embodied system addressing practical challenges like visual noise and real-time response in affective human-robot interaction (HRI) within multiparty conversational settings.
  • UGotMe employs a novel customized active face extraction strategy that uses sound direction and robot orientation to identify and focus on the active speaker's face, effectively filtering irrelevant visual input from bystanders.
  • Experimental validation on the Ameca robot shows UGotMe achieved 77.29% emotion response accuracy and a 7.89 user experience score, significantly outperforming baselines and demonstrating the critical impact of its active speaker identification.

UGotMe is a system for affective human-robot interaction (HRI) in multiparty conversational settings, focused on the practical embodiment challenges encountered when deploying vision-aware multimodal emotion recognition models on physical robots (2410.18373). The system integrates perception, modeling, and execution phases so that a humanoid robot, specifically the Ameca platform used for validation, can perceive human emotional states and respond with appropriate facial expressions in near real time.

System Architecture and Workflow

UGotMe operates through a distributed architecture involving the humanoid robot (Ameca) and a local server ("on-edge").

  1. Perception (Robot): The robot utilizes its onboard sensors, primarily its camera and microphone array. The visual stream captures the interaction scene. Audio is captured and transcribed into text using a cloud-based speech-to-text service.
  2. Data Transmission: Raw RGB video frames are continuously streamed from the robot's camera to the local server via TCP using the ZMQ library. This occurs in a background thread to minimize interference with other robot functions. Transcribed text corresponding to utterances is also sent to the server.
  3. Processing (Server):
    • Buffering: The server maintains a temporal buffer of the most recent T video frames (e.g., 640 frames at 25 FPS, representing 25.6 seconds).
    • Denoising: Upon receiving transcribed text indicating the end of an utterance, the system applies visual denoising strategies to the buffered frames corresponding to that utterance. This involves initial face extraction followed by a customized active face extraction technique to isolate the current speaker's face.
    • Emotion Recognition: The denoised face sequence and the transcribed text (along with conversational history) are fed into the Vision-Language to Emotion (VL2E) model. This model performs multimodal emotion recognition, classifying the speaker's emotion into one of seven categories (neutral, joy, surprise, sadness, anger, disgust, fear).
  4. Execution (Robot): The recognized emotion label is transmitted back to the robot. UGotMe maps this label directly to a predefined facial expression on the Ameca robot. The robot executes this expression, aiming for parallel empathy (mimicking the detected emotion). A minimal sketch of this utterance-handling flow appears after this list.
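The following Python sketch ties the workflow together on the server side: frames accumulate in a rolling buffer, and the arrival of a transcribed utterance triggers denoising, VL2E inference, and the reply to the robot. It is only an illustration of the flow described above; `denoise_faces`, `vl2e_predict`, and `send_expression` are hypothetical placeholder names, not the paper's actual API.

```python
from collections import deque

FPS = 25
BUFFER_FRAMES = 640            # ~25.6 s of video at 25 FPS, as described above
EMOTIONS = ["neutral", "joy", "surprise", "sadness", "anger", "disgust", "fear"]

frame_buffer = deque(maxlen=BUFFER_FRAMES)   # rolling buffer of the most recent frames

# --- placeholder helpers (hypothetical names, not the paper's API) -------------
def denoise_faces(frames):
    """Face extraction + active-face selection; see the sketches further below."""
    return frames

def vl2e_predict(faces, text, history):
    """Stand-in for VL2E inference; returns one label from EMOTIONS."""
    return "neutral"

def send_expression(emotion):
    """Stand-in for the reply channel that triggers the robot's facial expression."""
    print("robot expression:", emotion)

# --- event handlers ------------------------------------------------------------
def on_frame(frame):
    """Called by the receiving thread for every frame streamed from the robot."""
    frame_buffer.append(frame)

def on_utterance(text, duration_s, history):
    """Called when the speech-to-text service delivers a finished utterance."""
    n = min(len(frame_buffer), int(duration_s * FPS))
    frames = list(frame_buffer)[-n:]          # frames covering the utterance
    faces = denoise_faces(frames)             # two-stage visual denoising
    emotion = vl2e_predict(faces, text, history)
    send_expression(emotion)                  # robot mirrors the detected emotion
    history.append((text, emotion))
```

In the deployed system the frame receiver and the speech-to-text callback would run in separate threads; the stubs above simply keep the sketch self-contained.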

Embodiment Challenges in Multiparty HRI

The core contribution of UGotMe lies in addressing two specific challenges that arise when transitioning multimodal emotion recognition from controlled datasets to real-world, embodied HRI, particularly in multiparty scenarios:

  1. Environmental Noise in Visual Input: The robot's visual perception is often contaminated by elements irrelevant to the active speaker's emotional state. UGotMe identifies two primary noise sources:
    • Distracting Objects: Background clutter or dynamic elements in the scene unrelated to the interactants.
    • Inactive Speakers/Irrelevant Persons: The presence of multiple individuals within the robot's field of view, where only one is speaking at any given time. The facial expressions of non-speakers can introduce conflicting or irrelevant visual cues to the emotion recognition model.
  2. Real-Time Response Requirement: Natural and engaging HRI necessitates that the robot perceives, processes, and responds to emotional cues with minimal latency. Achieving this is hindered by:
    • Computational Latency: The inference time of potentially large deep learning models for emotion recognition.
    • Communication Latency: Delays in transmitting sensor data (especially high-bandwidth video) from the robot to the processing server and sending commands back.

Visual Denoising Strategies

To mitigate the impact of visual noise, UGotMe incorporates a two-stage denoising pipeline applied on the server-side:

  1. Face Extraction: As a preliminary step, standard face detection algorithms are applied to the raw image frames to extract facial regions. This eliminates non-facial background clutter from subsequent processing.
  2. Customized Active Face Extraction: This novel strategy addresses the challenge of identifying the active speaker among multiple individuals in the scene. It leverages the robot's physical embodiment and multimodal sensing:
    • The robot uses its microphone array to determine the direction of arrival (DoA) of the sound (speech).
    • The robot's control system actuates its head and left-eye camera to orient towards the estimated sound source, attempting to center the active speaker horizontally within the camera's field of view.
    • On the server, the system analyzes the face detections within the frames corresponding to the utterance and selects the face whose horizontal position (x-coordinate) is closest to the center of the image frame as the active speaker's face. Faces deviating significantly from the center are discarded as likely belonging to inactive speakers or bystanders (a sketch of this selection rule appears after this list).
    • This exploits the assumption that the robot's orienting behavior successfully centralizes the speaker.
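Since the implementation is not given here, the following is only a sketch of the center-nearest selection rule under assumed inputs: each detected face arrives as a bounding box `(x, y, w, h)` in pixels, and `max_offset_ratio` is a hypothetical threshold for discarding clearly off-center faces.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (x, y, w, h) face bounding box in pixels

def select_active_face(face_boxes: List[Box],
                       frame_width: int,
                       max_offset_ratio: float = 0.25) -> Optional[Box]:
    """Pick the face closest to the horizontal image center.

    Relies on the robot having oriented its camera toward the sound source,
    so the active speaker is assumed to be roughly centered.  Faces whose
    horizontal offset exceeds max_offset_ratio * frame_width are treated as
    inactive speakers or bystanders and discarded.
    """
    center_x = frame_width / 2
    best, best_offset = None, float("inf")
    for (x, y, w, h) in face_boxes:
        offset = abs((x + w / 2) - center_x)
        if offset < best_offset:
            best, best_offset = (x, y, w, h), offset
    if best is None or best_offset > max_offset_ratio * frame_width:
        return None   # no sufficiently centered face in this frame
    return best
```

Applied per frame, this yields at most one face crop per frame of the utterance; frames with no sufficiently centered detection can simply be skipped.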

Additionally, the system employs a form of person-specific neutral normalization, inspired by prior work, by creating "delta images" – subtracting a reference neutral face image from the current face image. In the deployment, where pre-recorded neutral faces are unavailable, the first frame of the interaction sequence serves as the reference.
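A minimal sketch of the delta-image step, assuming the face crops are equally sized NumPy arrays and, as described above, the first crop of the sequence stands in for the missing neutral reference:

```python
import numpy as np

def to_delta_images(face_crops: list[np.ndarray]) -> list[np.ndarray]:
    """Subtract a neutral reference face from every crop in the sequence.

    Without a pre-recorded neutral face, the first crop of the interaction
    serves as the reference, so the first delta image is all zeros.
    """
    reference = face_crops[0].astype(np.int16)   # signed type so negative deltas survive
    return [crop.astype(np.int16) - reference for crop in face_crops]
```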

Real-Time Performance Mechanisms

UGotMe employs several techniques to optimize for low-latency interaction:

  1. Continuous Data Streaming: Instead of intermittent capture or batch processing, RGB frames are streamed continuously as byte data from the robot to the server in a dedicated background thread.
  2. Server-Side Buffering: Maintaining a rolling buffer of the most recent frames on the server ensures that the relevant visual context for a just-completed utterance is immediately available for processing without waiting for data transfer.
  3. Offloaded Computation: The computationally intensive components, particularly the VL2E model inference and visual denoising, are executed on the more powerful local server, freeing up the robot's onboard resources.
  4. Efficient Communication Protocol: The use of ZMQ over TCP provides a robust and relatively efficient messaging mechanism between the robot and the server.

This combination allows the system to begin processing the visual and textual data associated with an utterance almost immediately after the utterance concludes (signaled by the reception of the transcribed text), minimizing the delay between human expression and robot response.
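As a concrete illustration of this transport path, the sketch below streams frames from the robot to the server over a ZeroMQ PUSH/PULL pair and decodes them into a rolling buffer on the server. The socket pattern, port, endpoint address, and JPEG compression are illustrative assumptions; the paper only states that frames are streamed as byte data over TCP using ZMQ.

```python
from collections import deque

import cv2                     # assumed here only for camera capture and JPEG encoding/decoding
import numpy as np
import zmq

SERVER_ENDPOINT = "tcp://192.168.1.100:5555"   # hypothetical local-server address

def robot_stream_frames(camera_index: int = 0) -> None:
    """Robot side: push encoded camera frames to the server (run in a background thread)."""
    sock = zmq.Context.instance().socket(zmq.PUSH)
    sock.connect(SERVER_ENDPOINT)
    cap = cv2.VideoCapture(camera_index)
    while True:
        ok, frame = cap.read()
        if not ok:
            continue
        ok, jpeg = cv2.imencode(".jpg", frame)   # compress to keep bandwidth manageable
        if ok:
            sock.send(jpeg.tobytes())

def server_receive_frames(buffer: deque) -> None:
    """Server side: decode incoming byte data and append frames to the rolling buffer."""
    sock = zmq.Context.instance().socket(zmq.PULL)
    sock.bind("tcp://*:5555")
    while True:
        data = sock.recv()
        frame = cv2.imdecode(np.frombuffer(data, dtype=np.uint8), cv2.IMREAD_COLOR)
        buffer.append(frame)
```

On the robot, `robot_stream_frames` would be started in a daemon thread so streaming never blocks other robot functions; on the server, `server_receive_frames` feeds the 640-frame rolling buffer consumed by the utterance handler sketched earlier.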

Vision-Language to Emotion (VL2E) Model

The core emotion recognition component is the VL2E model, designed specifically for conversational emotion recognition from visual and textual modalities. It accepts sequences of face images (processed by the denoising strategies) and the corresponding transcribed text. The paper reports that VL2E was evaluated on the MELD dataset, achieving a state-of-the-art weighted average F1 score of 67.29% and outperforming several baseline models, including models that use audio, a modality VL2E omits in its final deployment configuration. Ablation studies on MELD confirmed the utility of both modalities and the benefit of the face extraction and neutral normalization techniques.
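The summary does not detail VL2E's internal architecture, so the snippet below is purely an interface illustration: a generic late-fusion classifier over pooled face features and text features producing logits over the seven emotion classes. All dimensions and layer choices are placeholders, not the paper's design.

```python
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "joy", "surprise", "sadness", "anger", "disgust", "fear"]

class VisionTextEmotionClassifier(nn.Module):
    """Illustrative fusion head only; not the actual VL2E architecture."""

    def __init__(self, vision_dim: int = 512, text_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vision_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(EMOTIONS)),
        )

    def forward(self, face_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # face_feats: (batch, vision_dim) features pooled over the face-image sequence
        # text_feats: (batch, text_dim) embedding of the utterance (plus context)
        return self.fuse(torch.cat([face_feats, text_feats], dim=-1))   # (batch, 7) logits
```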

Experimental Validation on Ameca

The UGotMe system, integrating the VL2E model and the proposed denoising strategies, was deployed on the Ameca humanoid robot for real-world validation in scripted multiparty interactions. Human participants rated the interactions based on:

  • Emotion Response Accuracy: Whether the robot's facial expression correctly matched the perceived emotion of the human speaker (%).
  • User Experience (UX): A subjective rating on a 0-10 scale.

Three system configurations were compared:

  1. UGotMe-VL2E: The full proposed system.
  2. UGotMe-TelME*: Using a reproduced baseline model (TelME, without its original audio component) instead of VL2E.
  3. UGotMe-VL2E^1: Using VL2E but disabling the customized active face extraction strategy (relying only on basic face extraction).

The results demonstrated the effectiveness of the full UGotMe system:

System         | Avg. Emotion Response Accuracy (%) | Avg. User Experience (0-10)
UGotMe-VL2E    | 77.29                              | 7.89
UGotMe-TelME*  | 50.63                              | 6.15
UGotMe-VL2E^1  | 59.08                              | 6.63

The significantly higher accuracy (77.29%) and UX score (7.89) of the full UGotMe-VL2E system compared to the baselines strongly validate the practical benefit of both the VL2E model and, critically, the customized active face extraction strategy for handling noise from inactive speakers in real-world multiparty scenarios. The difference between UGotMe-VL2E and UGotMe-VL2E^1 highlights the substantial impact (+18.21% accuracy) of the active speaker identification mechanism.

In conclusion, the UGotMe system provides a practical framework for implementing affective HRI in complex multiparty settings. Its primary contributions lie in identifying and proposing concrete solutions—specifically the active face extraction strategy coupled with efficient data handling—to the often-overlooked embodiment challenges of visual noise and real-time responsiveness, demonstrating marked improvements in interaction quality through real-world robotic deployment.
