Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations

Published 9 May 2024 in cs.RO (arXiv:2405.06039v2)

Abstract: This research introduces the Bi-VLA (Vision-Language-Action) model, a novel system designed for bimanual robotic dexterous manipulation that seamlessly integrates vision for scene understanding, language comprehension for translating human instructions into executable code, and physical action generation. We evaluated the system's functionality through a series of household tasks, including the preparation of a desired salad upon human request. Bi-VLA demonstrates the ability to interpret complex human instructions, perceive and understand the visual context of ingredients, and execute precise bimanual actions to prepare the requested salad. We assessed the system's performance in terms of accuracy, efficiency, and adaptability to different salad recipes and human preferences through a series of experiments. Our results show a 100% success rate in generating the correct executable code by the Language Module, a 96.06% success rate in detecting specific ingredients by the Vision Module, and an overall success rate of 83.4% in correctly executing user-requested tasks.

References (29)
  1. C. Y. Kim, C. P. Lee, and B. Mutlu, “Understanding large language model (LLM)-powered human-robot interaction,” in Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, pp. 371–380, 2024.
  2. C. D. Vo, D. A. Dang, and P. H. Le, “Development of multi-robotic arm system for sorting system using computer vision,” Journal of Robotics and Control (JRC), vol. 3, no. 5, pp. 690–698, 2022.
  3. A. Edsinger and C. C. Kemp, “Two arms are better than one: A behavior based control system for assistive bimanual manipulation,” in Recent Progress in Robotics: Viable Robotic Service to Human: An Edition of the Selected Papers from the 13th International Conference on Advanced Robotics, pp. 345–355, Springer, 2007.
  4. D. Rakita, B. Mutlu, M. Gleicher, and L. M. Hiatt, “Shared control–based bimanual robot manipulation,” Science Robotics, vol. 4, no. 30, p. eaaw0955, 2019.
  5. C. Bersch, B. Pitzer, and S. Kammel, “Bimanual robotic cloth manipulation for laundry folding,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1413–1419, IEEE, 2011.
  6. S. S. Mirrazavi Salehian, N. B. Figueroa Fernandez, and A. Billard, “Dynamical system-based motion planning for multi-arm systems: Reaching for moving objects,” in IJCAI’17: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 4914–4918, 2017.
  7. Y. Chen, T. Wu, S. Wang, X. Feng, J. Jiang, Z. Lu, S. McAleer, H. Dong, S.-C. Zhu, and Y. Yang, “Towards human-level bimanual dexterous manipulation with reinforcement learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 5150–5163, 2022.
  8. G. Franzese, L. de Souza Rosa, T. Verburg, L. Peternel, and J. Kober, “Interactive imitation learning of bimanual movement primitives,” IEEE/ASME Transactions on Mechatronics, 2023.
  9. M. Shridhar, L. Manuelli, and D. Fox, “CLIPort: What and where pathways for robotic manipulation,” in Conference on Robot Learning, pp. 894–906, PMLR, 2022.
  10. I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, “ProgPrompt: Generating situated robot task plans using large language models,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11523–11530, IEEE, 2023.
  11. J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as policies: Language model programs for embodied control,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9493–9500, IEEE, 2023.
  12. A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al., “RT-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022.
  13. A. Stone, T. Xiao, Y. Lu, K. Gopalakrishnan, K.-H. Lee, Q. Vuong, P. Wohlhart, S. Kirmani, B. Zitkovich, F. Xia, et al., “Open-world object manipulation using pre-trained vision-language models,” arXiv preprint arXiv:2303.00905, 2023.
  14. S. Huang, Z. Jiang, H. Dong, Y. Qiao, P. Gao, and H. Li, “Instruct2Act: Mapping multi-modality instructions to robotic actions with large language model,” arXiv preprint arXiv:2305.11176, 2023.
  15. W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. G. Arenas, H.-T. L. Chiang, T. Erez, L. Hasenclever, J. Humplik, et al., “Language to rewards for robotic skill synthesis,” arXiv preprint arXiv:2306.08647, 2023.
  16. A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Pertsch, K. Rao, K. Reymann, M. Ryoo, G. Salazar, P. Sanketi, P. Sermanet, J. Singh, A. Singh, R. Soricut, H. Tran, V. Vanhoucke, Q. Vuong, A. Wahid, S. Welker, P. Wohlhart, J. Wu, F. Xia, T. Xiao, P. Xu, S. Xu, T. Yu, and B. Zitkovich, “RT-2: Vision-language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818, 2023.
  17. F. Krebs and T. Asfour, “A bimanual manipulation taxonomy,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 11031–11038, 2022.
  18. F. Xie, A. Chowdhury, M. De Paolis Kaluza, L. Zhao, L. Wong, and R. Yu, “Deep imitation learning for bimanual robotic manipulation,” Advances in Neural Information Processing Systems, vol. 33, pp. 2327–2337, 2020.
  19. S. Stepputtis, M. Bandari, S. Schaal, and H. B. Amor, “A system for imitation learning of contact-rich bimanual manipulation policies,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 11810–11817, IEEE, 2022.
  20. J. Grannen, Y. Wu, B. Vu, and D. Sadigh, “Stabilize to act: Learning to coordinate for bimanual manipulation,” in Conference on Robot Learning, pp. 563–576, PMLR, 2023.
  21. B. Thach, B. Y. Cho, T. Hermans, and A. Kuntz, “DeformerNet: Learning bimanual manipulation of 3D deformable objects,” arXiv preprint arXiv:2305.04449, 2023.
  22. Z. Fu, T. Z. Zhao, and C. Finn, “Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation,” arXiv preprint arXiv:2401.02117, 2024.
  23. Y. Lin, A. Church, M. Yang, H. Li, J. Lloyd, D. Zhang, and N. F. Lepora, “Bi-touch: Bimanual tactile manipulation with sim-to-real deep reinforcement learning,” IEEE Robotics and Automation Letters, 2023.
  24. B. Zhu, E. Frick, T. Wu, H. Zhu, and J. Jiao, “Starling-7B: Improving LLM helpfulness & harmlessness with RLAIF,” November 2023.
  25. G. B. et al., “Introducing ChatGPT,” 2022.
  26. J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,” arXiv preprint arXiv:2308.12966, 2023.
  27. J. P. De Villiers, F. W. Leuschner, and R. Geldenhuys, “Centi-pixel accurate real-time inverse distortion correction,” in Optomechatronic Technologies 2008, vol. 7266, pp. 320–327, SPIE, 2008.
  28. A. E. Conrady, “Decentred lens-systems,” Monthly Notices of the Royal Astronomical Society, vol. 79, no. 5, pp. 384–390, 1919.
  29. P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive NLP tasks,” 2021.

Summary

  • The paper introduces Bi-VLA, which integrates vision, language, and action modules to plan and execute bimanual robotic tasks.
  • It details a novel use of LLMs and VLMs to translate human instructions into executable Python code, achieving a 100% success rate in code generation.
  • Experimental results demonstrate a 96.06% ingredient-detection accuracy and an 83.4% overall task success rate, underscoring both the system’s strengths and the need for enhanced visual perception.

Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations

The presented paper details the development and evaluation of Bi-VLA, a vision-language-action-based system designed to enhance bimanual robotic dexterous manipulations. Targeting household tasks, Bi-VLA seamlessly integrates visual perception, language understanding, and action execution to accomplish complex manipulations, exemplified in salad preparation tasks.

System Architecture

Vision-Language-Action Integration

Bi-VLA employs a novel integration of vision, language, and action components to achieve dexterous robotic manipulation. The system leverages LLMs and Vision-Language Models (VLMs) to interpret human requests and plan coordinated actions for two robotic manipulators. The LLM acts as a semantic planner that generates detailed action plans, while the VLM processes visual inputs to ensure accurate object detection and context understanding (Figure 1).

Figure 1: System Overview of Bi-VLA.
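
At a high level, the three modules form a simple pipeline: the Vision Module grounds the scene, the Language Module plans and emits code, and the Motion Controller executes it. The sketch below illustrates this wiring; the class and method names (SemanticPlanner, VisionModule, MotionController, detect, plan, generate_code, execute) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass

# Minimal sketch of the Bi-VLA pipeline wiring.  All module interfaces here
# are hypothetical stand-ins for the paper's Vision, Language, and Action
# components.
@dataclass
class BiVLAPipeline:
    planner: "SemanticPlanner"      # LLM: instruction -> step-by-step plan -> code
    vision: "VisionModule"          # VLM: image -> detected objects with locations
    controller: "MotionController"  # executes motion commands on both arms

    def run(self, instruction: str, image) -> None:
        objects = self.vision.detect(image)             # ground the scene in detected objects
        plan = self.planner.plan(instruction, objects)  # semantic plan for both manipulators
        code = self.planner.generate_code(plan)         # Python composed of motion-API calls
        self.controller.execute(code)                   # run the generated code on the robots
```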

Semantic Planning and Code Generation

The semantic planner, built on Starling-LM-7B-alpha, translates human instructions into actionable plans for the bimanual robotic system. An LLM-based Code Generator then converts the generated plans into Python code, producing coherent motion-API calls that drive the system's Motion Controller (Figure 2).

Figure 2: The LLM-based semantic planner receives user instructions and generates a detailed plan outlining the movements of two robot manipulators. The generated plan is translated into a set of motion API call functions inside a Python function through the LLM-based Code Generator. The execution of the generated code triggers the movement of the two robot manipulators through the Motion Controller.
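
The summary does not reproduce the motion API itself, so the snippet below is only a hypothetical illustration of the kind of Python function the Code Generator is described as emitting; every call name (move_to, grasp, cut, release, home) is an assumption.

```python
# Hypothetical generated code for one plan step ("cut the pepper and place it
# in the bowl"), mirroring the execution shown in Figure 4.  The motion-API
# names are invented for illustration only.
def cut_pepper_into_bowl(left_arm, right_arm, scene):
    pepper = scene["bell pepper"]                        # 3D pose provided by the Vision Module
    left_arm.move_to(pepper.position)                    # gripper arm approaches the pepper
    left_arm.grasp(pepper)
    left_arm.move_to(scene["cutting board"].position)    # bring it to the cutting board
    left_arm.release()
    right_arm.move_to(scene["cutting board"].position)   # tool arm cuts the pepper
    right_arm.cut()
    left_arm.grasp(pepper)
    left_arm.move_to(scene["bowl"].position)             # place the pieces in the bowl
    left_arm.release()
    left_arm.home()                                      # both arms return to their initial poses
    right_arm.home()
```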

Vision LLM and Coordination

Qwen-VL, a state-of-the-art VLM, serves as the vision component and handles the object recognition and localization essential for precise manipulation. The model maps pixel coordinates to 3D space, enabling accurate object grasping by the manipulators; this spatial awareness supports the intricate tasks typical of household activities.
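
The calibration details are not given in this summary; a standard pinhole-camera back-projection, assuming known intrinsics, a depth estimate, and a camera-to-world transform, would look roughly like this:

```python
import numpy as np

def pixel_to_world(u: float, v: float, depth: float,
                   K: np.ndarray, T_world_cam: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) with a depth estimate (in meters) into world coordinates.

    K is the 3x3 camera intrinsic matrix and T_world_cam the 4x4 camera-to-world
    transform; both are assumed known from calibration (not detailed in the paper).
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Ray through the pixel, scaled by depth -> homogeneous point in the camera frame
    p_cam = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth, 1.0])
    return (T_world_cam @ p_cam)[:3]  # 3D point in the world (robot base) frame
```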

Experimental Setup and Workflow

Setup

The experimental setup features two UR3 collaborative robots: one arm carries a camera and a two-finger Robotiq gripper, while the other carries a tool-mounted end-effector. Ingredients are visually captured, processed, and manipulated to demonstrate salad-assembly tasks (Figure 3).

Figure 3: Cooking Experiment Workflow.
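
For concreteness, the described hardware can be summarized as a configuration record; the dictionary below only restates what the text says, and the field names (and any detail such as camera mounting) are assumptions made for readability.

```python
# Illustrative configuration mirroring the described setup; the keys and
# structure are chosen here for clarity and are not from the paper.
BI_VLA_SETUP = {
    "gripper_arm": {"robot": "UR3", "end_effector": "Robotiq two-finger gripper",
                    "sensors": ["RGB camera"]},
    "tool_arm":    {"robot": "UR3", "end_effector": "cutting tool"},
    "primitives":  ["grasp", "cut", "place", "mix"],  # salad-assembly actions
}
```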

Workflow

Bi-VLA processes user requests through a Retrieval-Augmented Generation (RAG) framework: it verifies ingredient availability with the vision component and then generates a sequence of manipulation tasks. Upon successful verification, the manipulators execute the plan through coordinated motion, performing actions such as cutting, placing, and mixing, as sketched below.
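
As a rough sketch of this workflow (with hypothetical helpers retrieve_recipe, vision.detect, and plan_tasks; none of these names come from the paper):

```python
# Hedged sketch of the request-handling flow: retrieve a matching recipe (RAG),
# verify ingredient availability with the Vision Module, then produce the
# manipulation sequence for the two arms.
def handle_request(request, image, vision, retrieve_recipe, plan_tasks):
    recipe = retrieve_recipe(request)                    # RAG lookup of the salad recipe
    visible = {obj.label for obj in vision.detect(image)}
    missing = [item for item in recipe.ingredients if item not in visible]
    if missing:
        return f"Cannot prepare the request; missing ingredients: {missing}"
    return plan_tasks(recipe)                            # ordered cut / place / mix steps
```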

Evaluation and Results

The evaluation focused on the semantic planner's code-generation capabilities and the VLM's ingredient-detection accuracy. The system achieved a 100% success rate in generating correct executable code from user requests. The vision module reached a 96.06% success rate in detecting ingredients from images, but accuracy dropped to 71.22% when some of the requested ingredients were absent from the scene (Figure 4).

Figure 4: Step-by-step execution of the Semantic Plan: (a)-(b) the manipulator with the gripper moves to grasp the pepper and bring it to the cutting board; (c)-(d) the manipulator with the tool moves to the cutting board, cuts the pepper, and places it in the bowl; (e)-(f) the two manipulators return to their initial position.

The system's dependence on visual accuracy underscores the need for stronger visual perception to further improve task adaptability and robustness.

Conclusion

Bi-VLA successfully integrates vision, language understanding, and robotic action to enable complex bimanual manipulations in real-world scenarios. The high success rates highlight its potential in household and similar applications, albeit with areas for refinement, particularly in the vision module. Future work will focus on enhancing visual perception capabilities to bolster system adaptability and response reliability in diverse environments, paving the way for more autonomous robotic assistants.
