- The paper introduces GSON, a framework that integrates visual reasoning from large multimodal models to estimate social group structures for effective robotic navigation.
- It employs a hierarchical planning system combining global, mid-level, and local motion planners to create socially compliant paths in dynamic environments.
- Experimental results in simulation and real-world trials show improved social awareness and reduced disturbances compared to traditional navigation methods.
Insights into "GSON: A Group-based Social Navigation Framework with Large Multimodal Model"
The paper "GSON: A Group-based Social Navigation Framework with Large Multimodal Model" presents an advanced system designed to enhance the social navigation capabilities of mobile robots. This research is particularly relevant in the context of the growing deployment of service robots and autonomous vehicles in environments populated by humans, where understanding social dynamics is crucial for navigation.
The core innovation of this work is the integration of a group-based social navigation framework, termed GSON, that leverages the visual reasoning capabilities of Large Multimodal Models (LMMs). The framework addresses the dual challenges of perception and planning in dynamic social environments. Key components of the system include a social group estimation module and a socially aware planning module, each employing sophisticated AI techniques to achieve enhanced performance and robustness.
Social Group Estimation
The system begins with a robust pedestrian detection and tracking pipeline that fuses 2D LiDAR and RGB camera data. This ensures accurate tracking of individuals over time, which the downstream grouping step depends on. The social group estimation module uses visual prompting with LMMs to perform zero-shot reasoning about social structures. By analyzing interactions in RGB images, the LMM predicts the social grouping of individuals, interpreting complex scenarios such as queues and conversational gatherings. Inference cost is kept low by maintaining a keyframe buffer, so the model is queried only when the scene changes significantly.
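The keyframe-gating idea described above can be sketched as follows. This is an illustrative assumption of how such a buffer might work, not the authors' implementation; the class name, thresholds, and the `query_lmm` callable are all hypothetical.

```python
import math

class KeyframeBuffer:
    """Trigger a (costly) LMM grouping query only when the tracked scene
    has changed significantly since the last stored keyframe.

    Hypothetical sketch: names and thresholds are illustrative assumptions.
    """

    def __init__(self, position_threshold=1.0):
        self.position_threshold = position_threshold  # metres of motion that counts as "significant"
        self.last_tracks = None   # {pedestrian_id: (x, y)} at the last keyframe
        self.last_groups = None   # cached grouping result from the last LMM query

    def _scene_changed(self, tracks):
        if self.last_tracks is None:
            return True  # first frame is always a keyframe
        # Newly appeared or vanished pedestrians count as a significant change.
        if set(tracks) != set(self.last_tracks):
            return True
        # Otherwise, check whether anyone moved beyond the threshold.
        for pid, (x, y) in tracks.items():
            lx, ly = self.last_tracks[pid]
            if math.hypot(x - lx, y - ly) > self.position_threshold:
                return True
        return False

    def groups(self, tracks, query_lmm):
        """Return social groups, re-querying the LMM only on keyframes.

        tracks: {pedestrian_id: (x, y)} from the detection/tracking pipeline
        query_lmm: callable that performs the zero-shot visual-prompt query
        """
        if self._scene_changed(tracks):
            self.last_groups = query_lmm(tracks)
            self.last_tracks = dict(tracks)
        return self.last_groups
```

The design choice is simple: small pedestrian motions reuse the cached grouping, while new arrivals, departures, or large displacements invalidate the keyframe and trigger a fresh query.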
Socially Aware Planning
Planning is handled in a three-tiered architecture comprising a global path planner, a novel mid-level planner, and a local motion planner. The mid-level planner acts as an intermediary, bridging the gap between global intent and local adaptability. By incorporating the estimated social structures into the cost map, it generates paths that respect social conventions and minimize disruption to group activities. A nonlinear model predictive controller (NMPC) augmented with Control Barrier Functions (CBFs) then ensures the local planner produces safe, real-time trajectories that track the mid-level path.
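One way to picture how social groups enter the cost map is the sketch below. It is a minimal illustration under assumed parameters (radii, penalty values, and function names are not from the paper): each group penalizes points near individual members and along the segments connecting members, so candidate paths that would cut through a conversation or queue become expensive.

```python
import itertools
import math

def _dist_to_segment(px, py, ax, ay, bx, by):
    """Euclidean distance from point (px, py) to segment (a, b)."""
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def social_cost(point, groups, member_radius=0.6, link_radius=0.5, penalty=100.0):
    """Illustrative group-aware cost for a candidate (x, y) point.

    groups: list of groups, each a list of member positions [(x, y), ...].
    Hypothetical sketch, not the authors' cost function.
    """
    px, py = point
    cost = 0.0
    for group in groups:
        # Personal space around each group member.
        for (mx, my) in group:
            if math.hypot(px - mx, py - my) < member_radius:
                cost += penalty
        # Interaction space along each pair of members
        # (e.g. the line of sight between two people talking).
        for (ax, ay), (bx, by) in itertools.combinations(group, 2):
            if _dist_to_segment(px, py, ax, ay, bx, by) < link_radius:
                cost += penalty
    return cost
```

A mid-level planner could then query this cost over candidate waypoints, steering the path around whole groups rather than merely around individual bodies.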
Experimental Validation
The framework demonstrates its efficacy through both simulated and real-world experiments. In well-controlled simulation environments featuring common social scenarios, GSON exhibits superior social awareness and reduced disturbance to humans. Real-world trials further corroborate these findings, showing GSON's ability to navigate human-populated spaces more gracefully compared to several baseline methods, including the Timed Elastic Band planner and Dynamic Window Approach.
Theoretical and Practical Implications
The implications of this work are twofold. Theoretically, it provides insights into effective strategies for integrating large-scale multimodal reasoning models into real-time robotic systems. Practically, the framework offers a scalable solution for deploying socially adept robots in settings ranging from service industries to urban transport. However, the reliance on LMMs introduces a computational trade-off: model queries must be managed carefully, as with the keyframe buffer, to maintain real-time performance.
Future Research Directions
Given the promising results, future research may explore scaling the approach to even denser environments, incorporating more nuanced human behaviors such as group merging and splitting. Additionally, techniques for distilling LMM capabilities into more efficient models could further enhance the system’s applicability in resource-constrained scenarios.
In summary, this paper makes a significant contribution to social navigation in robotics, offering a well-rounded approach by bridging complex perception tasks with practical navigation solutions. The integration of LMMs opens doors for more profound contextual understanding, paving the way for robots that coexist harmoniously in human-centered environments.