UGotMe: An Embodied System for Affective Human-Robot Interaction

Published 24 Oct 2024 in cs.RO and cs.HC | (2410.18373v2)

Abstract: Equipping humanoid robots with the capability to understand the emotional states of human interactants and to express emotions appropriately according to the situation is essential for affective human-robot interaction. However, enabling current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real world raises embodiment challenges: addressing environmental noise and meeting real-time requirements. First, in multiparty conversation scenarios, the noise inherent in the robot's visual observations, which may come from either 1) distracting objects in the scene or 2) inactive speakers appearing in the robot's field of view, hinders the models from extracting emotional cues from vision inputs. Second, real-time response, a desirable feature of an interactive system, is also challenging to achieve. To tackle both challenges, we introduce an affective human-robot interaction system called UGotMe designed specifically for multiparty conversations. Two denoising strategies are proposed and incorporated into the system to address the first issue. Specifically, to filter out distracting objects in the scene, we propose extracting face images of the speakers from the raw images and introduce a customized active face extraction strategy to rule out inactive speakers. As for the second issue, we employ efficient data transmission from the robot to the local server to improve real-time response capability. We deploy UGotMe on a humanoid robot named Ameca to validate its real-time inference capabilities in practical scenarios. Videos demonstrating real-world deployment are available at https://pi3-141592653.github.io/UGotMe/.

References (34)
  1. T. Baltrušaitis, M. Mahmoud, and P. Robinson, “Cross-dataset learning and person-specific normalisation for automatic action unit detection,” in 2015 11th IEEE international conference and workshops on automatic face and gesture recognition (FG), 2015.
  2. A. Mehrabian, “Communication without words,” in Communication theory, 2017.
  3. S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “Meld: A multimodal multi-party dataset for emotion recognition in conversations,” arXiv preprint arXiv:1810.02508, 2018.
  4. H. Zhu, C. Yu, and A. Cangelosi, “Affective human-robot interaction with multimodal explanations,” in International Conference on Social Robotics, 2022.
  5. M. K. Chowdary, T. N. Nguyen, and D. J. Hemanth, “Deep learning-based facial emotion recognition for human–computer interaction applications,” Neural Computing and Applications, 2023.
  6. Y. Maeda, T. Sakai, K. Kamei, and E. W. Cooper, “Human-robot interaction based on facial expression recognition using deep learning,” in 2020 Joint 11th International Conference on Soft Computing and Intelligent Systems and 21st International Symposium on Advanced Intelligent Systems, 2020.
  7. Y. Maeda and S. Geshi, “Human-robot interaction using markovian emotional model based on facial recognition,” in 2018 Joint 10th International Conference on Soft Computing and Intelligent Systems (SCIS) and 19th International Symposium on Advanced Intelligent Systems, 2018.
  8. A. Esfandbod, Z. Rokhi, A. Taheri, M. Alemi, and A. Meghdari, “Human-robot interaction based on facial expression imitation,” in 2019 7th international conference on robotics and Mechatronics, 2019.
  9. A. Ghorbandaei Pour, A. Taheri, M. Alemi, and A. Meghdari, “Human–robot facial expression reciprocal interaction platform: case studies on children with autism,” International Journal of Social Robotics, 2018.
  10. Z. Liu, M. Wu, W. Cao, L. Chen, J. Xu, R. Zhang, M. Zhou, and J. Mao, “A facial expression emotion recognition based human-robot interaction system,” IEEE/CAA J. Autom. Sinica, 2017.
  11. L. Chen, K. Wang, M. Li, M. Wu, W. Pedrycz, and K. Hirota, “K-means clustering-based kernel canonical correlation analysis for multimodal emotion recognition in human–robot interaction,” IEEE Transactions on Industrial Electronics, 2022.
  12. F. Cid, L. J. Manso, and P. Núñez, “A novel multimodal emotion recognition approach for affective human robot interaction,” in Proceedings of FinE, 2015.
  13. H.-W. Jung, Y.-H. Seo, M. S. Ryoo, and H. S. Yang, “Affective communication system with multimodality for a humanoid robot, ami,” in 4th IEEE/RAS International Conference on Humanoid Robots, 2004., 2004.
  14. Y. Ma, K. L. Nguyen, F. Z. Xing, and E. Cambria, “A survey on empathetic dialogue systems,” Information Fusion, 2020.
  15. T. Yun, H. Lim, J. Lee, and M. Song, “Telme: Teacher-leading multimodal fusion network for emotion recognition in conversation,” arXiv preprint arXiv:2401.12987, 2024.
  16. G. Hu, T.-E. Lin, Y. Zhao, G. Lu, Y. Wu, and Y. Li, “Unimse: Towards unified multimodal sentiment analysis and emotion recognition,” arXiv preprint arXiv:2211.11256, 2022.
  17. Z. Li, F. Tang, M. Zhao, and Y. Zhu, “Emocaps: Emotion capsule based model for conversational emotion recognition,” arXiv preprint arXiv:2203.13504, 2022.
  18. W. Zheng, J. Yu, R. Xia, and S. Wang, “A facial expression-aware multimodal multi-task learning framework for emotion recognition in multi-party conversations,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
  19. V. Chudasama, P. Kar, A. Gudmalwar, N. Shah, P. Wasnik, and N. Onoe, “M2fnet: Multi-modal fusion network for emotion recognition in conversation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  20. K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE signal processing letters, 2016.
  21. T. Baltrušaitis, P. Robinson, and L.-P. Morency, “Openface: an open source facial behavior analysis toolkit,” in 2016 IEEE winter conference on applications of computer vision, 2016.
  22. S. Zafeiriou and M. Petrou, “Sparse representations for facial expressions recognition via l1 optimization,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, 2010.
  23. C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Proceedings of the AAAI conference on artificial intelligence, 2017.
  24. D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923, 2014.
  25. X. Song, L. Huang, H. Xue, and S. Hu, “Supervised prototypical contrastive learning for emotion recognition in conversation,” arXiv preprint arXiv:2210.08713, 2022.
  26. P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, 2023.
  27. T. Gao, X. Yao, and D. Chen, “Simcse: Simple contrastive learning of sentence embeddings,” arXiv preprint arXiv:2104.08821, 2021.
  28. Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2019.
  29. N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria, “Dialoguernn: An attentive rnn for emotion detection in conversations,” in Proceedings of the AAAI conference on artificial intelligence, 2019.
  30. D. Zhang, L. Wu, C. Sun, S. Li, Q. Zhu, and G. Zhou, “Modeling both context-and speaker-sensitive dependence for emotion detection in multi-speaker conversations.” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019.
  31. J. Hu, Y. Liu, J. Zhao, and Q. Jin, “Mmgcn: Multimodal fusion via deep graph convolution network for emotion recognition in conversation,” arXiv preprint arXiv:2107.06779, 2021.
  32. W. Shen, S. Wu, Y. Yang, and X. Quan, “Directed acyclic graph network for conversational emotion recognition,” arXiv preprint arXiv:2105.12907, 2021.
  33. D. Hu, X. Hou, L. Wei, L. Jiang, and Y. Mo, “Mm-dfn: Multimodal dynamic fusion network for emotion recognition in conversations,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022.
  34. J. Li, X. Wang, G. Lv, and Z. Zeng, “Ga2mif: graph and attention based two-stage multi-source information fusion for conversational emotion detection,” IEEE Transactions on affective computing, 2023.

Summary

  • The paper presents UGotMe, an embodied system addressing practical challenges like visual noise and real-time response in affective human-robot interaction (HRI) within multiparty conversational settings.
  • UGotMe employs a novel customized active face extraction strategy that uses sound direction and robot orientation to identify and focus on the active speaker's face, effectively filtering irrelevant visual input from bystanders.
  • Experimental validation on the Ameca robot shows UGotMe achieved 77.29% emotion response accuracy and a 7.89 user experience score, significantly outperforming baselines and demonstrating the critical impact of its active speaker identification.

UGotMe is a system for affective human-robot interaction (HRI) in multiparty conversational settings, focused on the practical embodiment challenges encountered when deploying vision-aware multimodal emotion recognition models on physical robots (2410.18373). The system integrates perception, modeling, and execution phases so that a humanoid robot, specifically the Ameca platform used for validation, can perceive human emotional states and respond with appropriate facial expressions in near real time.

System Architecture and Workflow

UGotMe operates through a distributed architecture involving the humanoid robot (Ameca) and a local server ("on-edge").

  1. Perception (Robot): The robot utilizes its onboard sensors, primarily its camera and microphone array. The visual stream captures the interaction scene. Audio is captured and transcribed into text using a cloud-based speech-to-text service.
  2. Data Transmission: Raw RGB video frames are continuously streamed from the robot's camera to the local server via TCP using the ZMQ library. This occurs in a background thread to minimize interference with other robot functions. Transcribed text corresponding to utterances is also sent to the server.
  3. Processing (Server):
    • Buffering: The server maintains a temporal buffer of the most recent T video frames (e.g., 640 frames at 25 FPS, representing 25.6 seconds).
    • Denoising: Upon receiving transcribed text indicating the end of an utterance, the system applies visual denoising strategies to the buffered frames corresponding to that utterance. This involves initial face extraction followed by a customized active face extraction technique to isolate the current speaker's face.
    • Emotion Recognition: The denoised face sequence and the transcribed text (along with conversational history) are fed into the Vision-Language to Emotion (VL2E) model. This model performs multimodal emotion recognition, classifying the speaker's emotion into one of seven categories (neutral, joy, surprise, sadness, anger, disgust, fear).
  4. Execution (Robot): The recognized emotion label is transmitted back to the robot. UGotMe maps this label directly to a predefined facial expression on the Ameca robot. The robot executes this expression, aiming for parallel empathy (mimicking the detected emotion). A minimal sketch of this utterance-handling flow appears after this list.
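The following Python sketch ties the workflow together on the server side: frames accumulate in a rolling buffer, and the arrival of a transcribed utterance triggers denoising, VL2E inference, and the reply to the robot. It is only an illustration of the flow described above; `denoise_faces`, `vl2e_predict`, and `send_expression` are hypothetical placeholder names, not the paper's actual API.

```python
from collections import deque

FPS = 25
BUFFER_FRAMES = 640            # ~25.6 s of video at 25 FPS, as described above
EMOTIONS = ["neutral", "joy", "surprise", "sadness", "anger", "disgust", "fear"]

frame_buffer = deque(maxlen=BUFFER_FRAMES)   # rolling buffer of the most recent frames

# --- placeholder helpers (hypothetical names, not the paper's API) -------------
def denoise_faces(frames):
    """Face extraction + active-face selection; see the sketches further below."""
    return frames

def vl2e_predict(faces, text, history):
    """Stand-in for VL2E inference; returns one label from EMOTIONS."""
    return "neutral"

def send_expression(emotion):
    """Stand-in for the reply channel that triggers the robot's facial expression."""
    print("robot expression:", emotion)

# --- event handlers ------------------------------------------------------------
def on_frame(frame):
    """Called by the receiving thread for every frame streamed from the robot."""
    frame_buffer.append(frame)

def on_utterance(text, duration_s, history):
    """Called when the speech-to-text service delivers a finished utterance."""
    n = min(len(frame_buffer), int(duration_s * FPS))
    frames = list(frame_buffer)[-n:]          # frames covering the utterance
    faces = denoise_faces(frames)             # two-stage visual denoising
    emotion = vl2e_predict(faces, text, history)
    send_expression(emotion)                  # robot mirrors the detected emotion
    history.append((text, emotion))
```

In the deployed system the frame receiver and the speech-to-text callback would run in separate threads; the stubs above simply keep the sketch self-contained.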

Embodiment Challenges in Multiparty HRI

The core contribution of UGotMe lies in addressing two specific challenges that arise when transitioning multimodal emotion recognition from controlled datasets to real-world, embodied HRI, particularly in multiparty scenarios:

  1. Environmental Noise in Visual Input: The robot's visual perception is often contaminated by elements irrelevant to the active speaker's emotional state. UGotMe identifies two primary noise sources:
    • Distracting Objects: Background clutter or dynamic elements in the scene unrelated to the interactants.
    • Inactive Speakers/Irrelevant Persons: The presence of multiple individuals within the robot's field of view, where only one is speaking at any given time. The facial expressions of non-speakers can introduce conflicting or irrelevant visual cues to the emotion recognition model.
  2. Real-Time Response Requirement: Natural and engaging HRI necessitates that the robot perceives, processes, and responds to emotional cues with minimal latency. Achieving this is hindered by:
    • Computational Latency: The inference time of potentially large deep learning models for emotion recognition.
    • Communication Latency: Delays in transmitting sensor data (especially high-bandwidth video) from the robot to the processing server and sending commands back.

Visual Denoising Strategies

To mitigate the impact of visual noise, UGotMe incorporates a two-stage denoising pipeline applied on the server-side:

  1. Face Extraction: As a preliminary step, standard face detection algorithms are applied to the raw image frames to extract facial regions. This eliminates non-facial background clutter from subsequent processing.
  2. Customized Active Face Extraction: This novel strategy addresses the challenge of identifying the active speaker among multiple individuals in the scene. It leverages the robot's physical embodiment and multimodal sensing:
    • The robot uses its microphone array to determine the direction of arrival (DoA) of the sound (speech).
    • The robot's control system actuates its head and left-eye camera to orient towards the estimated sound source, attempting to center the active speaker horizontally within the camera's field of view.
    • On the server, the system analyzes the face detections within the frames corresponding to the utterance and selects the face whose horizontal position (x-coordinate) is closest to the center of the image frame as the active speaker's face. Faces deviating significantly from the center are discarded as likely belonging to inactive speakers or bystanders (a sketch of this selection rule appears after this list).
    • This exploits the assumption that the robot's orienting behavior successfully centralizes the speaker.
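Since the implementation is not given here, the following is only a sketch of the center-nearest selection rule under assumed inputs: each detected face arrives as a bounding box `(x, y, w, h)` in pixels, and `max_offset_ratio` is a hypothetical threshold for discarding clearly off-center faces.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (x, y, w, h) face bounding box in pixels

def select_active_face(face_boxes: List[Box],
                       frame_width: int,
                       max_offset_ratio: float = 0.25) -> Optional[Box]:
    """Pick the face closest to the horizontal image center.

    Relies on the robot having oriented its camera toward the sound source,
    so the active speaker is assumed to be roughly centered.  Faces whose
    horizontal offset exceeds max_offset_ratio * frame_width are treated as
    inactive speakers or bystanders and discarded.
    """
    center_x = frame_width / 2
    best, best_offset = None, float("inf")
    for (x, y, w, h) in face_boxes:
        offset = abs((x + w / 2) - center_x)
        if offset < best_offset:
            best, best_offset = (x, y, w, h), offset
    if best is None or best_offset > max_offset_ratio * frame_width:
        return None   # no sufficiently centered face in this frame
    return best
```

Applied per frame, this yields at most one face crop per frame of the utterance; frames with no sufficiently centered detection can simply be skipped.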

Additionally, the system employs a form of person-specific neutral normalization, inspired by prior work, by creating "delta images" – subtracting a reference neutral face image from the current face image. In the deployment, where pre-recorded neutral faces are unavailable, the first frame of the interaction sequence serves as the reference.
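A minimal sketch of the delta-image step, assuming the face crops are equally sized NumPy arrays and, as described above, the first crop of the sequence stands in for the missing neutral reference:

```python
import numpy as np

def to_delta_images(face_crops: list[np.ndarray]) -> list[np.ndarray]:
    """Subtract a neutral reference face from every crop in the sequence.

    Without a pre-recorded neutral face, the first crop of the interaction
    serves as the reference, so the first delta image is all zeros.
    """
    reference = face_crops[0].astype(np.int16)   # signed type so negative deltas survive
    return [crop.astype(np.int16) - reference for crop in face_crops]
```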

Real-Time Performance Mechanisms

UGotMe employs several techniques to optimize for low-latency interaction:

  1. Continuous Data Streaming: Instead of intermittent capture or batch processing, RGB frames are streamed continuously as byte data from the robot to the server in a dedicated background thread.
  2. Server-Side Buffering: Maintaining a rolling buffer of the most recent frames on the server ensures that the relevant visual context for a just-completed utterance is immediately available for processing without waiting for data transfer.
  3. Offloaded Computation: The computationally intensive components, particularly the VL2E model inference and visual denoising, are executed on the more powerful local server, freeing up the robot's onboard resources.
  4. Efficient Communication Protocol: The use of ZMQ over TCP provides a robust and relatively efficient messaging mechanism between the robot and the server.

This combination allows the system to begin processing the visual and textual data associated with an utterance almost immediately after the utterance concludes (signaled by the reception of the transcribed text), minimizing the delay between human expression and robot response.
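As a concrete illustration of this transport path, the sketch below streams frames from the robot to the server over a ZeroMQ PUSH/PULL pair and decodes them into a rolling buffer on the server. The socket pattern, port, endpoint address, and JPEG compression are illustrative assumptions; the paper only states that frames are streamed as byte data over TCP using ZMQ.

```python
from collections import deque

import cv2                     # assumed here only for camera capture and JPEG encoding/decoding
import numpy as np
import zmq

SERVER_ENDPOINT = "tcp://192.168.1.100:5555"   # hypothetical local-server address

def robot_stream_frames(camera_index: int = 0) -> None:
    """Robot side: push encoded camera frames to the server (run in a background thread)."""
    sock = zmq.Context.instance().socket(zmq.PUSH)
    sock.connect(SERVER_ENDPOINT)
    cap = cv2.VideoCapture(camera_index)
    while True:
        ok, frame = cap.read()
        if not ok:
            continue
        ok, jpeg = cv2.imencode(".jpg", frame)   # compress to keep bandwidth manageable
        if ok:
            sock.send(jpeg.tobytes())

def server_receive_frames(buffer: deque) -> None:
    """Server side: decode incoming byte data and append frames to the rolling buffer."""
    sock = zmq.Context.instance().socket(zmq.PULL)
    sock.bind("tcp://*:5555")
    while True:
        data = sock.recv()
        frame = cv2.imdecode(np.frombuffer(data, dtype=np.uint8), cv2.IMREAD_COLOR)
        buffer.append(frame)
```

On the robot, `robot_stream_frames` would be started in a daemon thread so streaming never blocks other robot functions; on the server, `server_receive_frames` feeds the 640-frame rolling buffer consumed by the utterance handler sketched earlier.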

Vision-Language to Emotion (VL2E) Model

The core emotion recognition component is the VL2E model, designed specifically for conversational emotion recognition from visual and textual modalities. It accepts sequences of face images (processed by the denoising strategies) and the corresponding transcribed text. The paper reports that VL2E was evaluated on the MELD dataset, achieving a state-of-the-art weighted average F1 score of 67.29% and outperforming several baseline models, including models that use audio, a modality VL2E omits in its final deployment configuration. Ablation studies on MELD confirmed the utility of both modalities and the benefit of the face extraction and neutral normalization techniques.
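The summary does not detail VL2E's internal architecture, so the snippet below is purely an interface illustration: a generic late-fusion classifier over pooled face features and text features producing logits over the seven emotion classes. All dimensions and layer choices are placeholders, not the paper's design.

```python
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "joy", "surprise", "sadness", "anger", "disgust", "fear"]

class VisionTextEmotionClassifier(nn.Module):
    """Illustrative fusion head only; not the actual VL2E architecture."""

    def __init__(self, vision_dim: int = 512, text_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vision_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(EMOTIONS)),
        )

    def forward(self, face_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # face_feats: (batch, vision_dim) features pooled over the face-image sequence
        # text_feats: (batch, text_dim) embedding of the utterance (plus context)
        return self.fuse(torch.cat([face_feats, text_feats], dim=-1))   # (batch, 7) logits
```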

Experimental Validation on Ameca

The UGotMe system, integrating the VL2E model and the proposed denoising strategies, was deployed on the Ameca humanoid robot for real-world validation in scripted multiparty interactions. Human participants rated the interactions based on:

  • Emotion Response Accuracy: Whether the robot's facial expression correctly matched the perceived emotion of the human speaker (%).
  • User Experience (UX): A subjective rating on a 0-10 scale.

Three system configurations were compared:

  1. UGotMe-VL2E: The full proposed system.
  2. UGotMe-TelME*: Using a reproduced baseline model (TelME, without its original audio component) instead of VL2E.
  3. UGotMe-VL2E^1: Using VL2E but disabling the customized active face extraction strategy (relying only on basic face extraction).

The results demonstrated the effectiveness of the full UGotMe system:

System         | Avg. Emotion Response Accuracy (%) | Avg. User Experience (0-10)
UGotMe-VL2E    | 77.29                              | 7.89
UGotMe-TelME*  | 50.63                              | 6.15
UGotMe-VL2E^1  | 59.08                              | 6.63

The significantly higher accuracy (77.29%) and UX score (7.89) of the full UGotMe-VL2E system compared to the baselines strongly validate the practical benefit of both the VL2E model and, critically, the customized active face extraction strategy for handling noise from inactive speakers in real-world multiparty scenarios. The difference between UGotMe-VL2E and UGotMe-VL2E^1 highlights the substantial impact (+18.21% accuracy) of the active speaker identification mechanism.

In conclusion, the UGotMe system provides a practical framework for implementing affective HRI in complex multiparty settings. Its primary contributions lie in identifying and proposing concrete solutions—specifically the active face extraction strategy coupled with efficient data handling—to the often-overlooked embodiment challenges of visual noise and real-time responsiveness, demonstrating marked improvements in interaction quality through real-world robotic deployment.
