Shake-VLA: Robotic Cocktail System
- Shake-VLA is a vision-language-action robotic system that integrates object detection, OCR, ASR, and retrieval-augmented generation to automate cocktail preparation.
- The system employs a modular pipeline with YOLOv8, Whisper-1, and dual-arm coordination to achieve high accuracy in cluttered, noisy environments.
- Force-torque sensor feedback and symbolic anomaly detection ensure precise liquid dispensing and effective error handling during bimanual manipulation.
Shake-VLA is a Vision-Language-Action (VLA) model-based robotic system engineered for bimanual manipulation in automated cocktail preparation. It couples vision and language understanding with precise action generation, leveraging YOLOv8-based object detection, Optical Character Recognition (OCR), speech-to-text (ASR) conversion, Retrieval-Augmented Generation (RAG), anomaly detection, force-torque sensor feedback, and coordinated dual-arm manipulation. Experimental results demonstrate operational robustness in cluttered and noisy environments and seamless integration of perception, language, and physical interaction (Khan et al., 12 Jan 2025).
1. System Architecture and Core Modules
Shake-VLA employs a modular pipeline that transforms spoken natural-language instructions into sequenced, bimanual robotic actions with feedback-driven liquid dispensing. The principal modules include:
- Vision: YOLOv8 single-stage detector locates bottles and objects; EasyOCR extracts label text, with outputs rendered as structured JSON records.
- Speech-to-Text (ASR): The OpenAI Whisper-1 model transcribes voice commands, robust to noisy environments and diverse accents; token-level post-processing yields an utterance-confidence score.
- RAG + LLM: Recipe retrieval is performed via a FAISS index with OpenAI text-embedding-ada-002; GPT-4o generates robot API calls (e.g., take_glass(), pour_liquid(q, tol)), incorporating the retrieved recipe, user input, and anomaly signals.
- Anomaly Detection: Symbolic recipe-ingredient matching identifies discrepancies, triggering user prompts or rule-based fallbacks if necessary.
- Force-Torque Sensor Integration: ATI Nano17 force-torque sensor on the UR3e arm provides real-time feedback for volumetric accuracy in pouring.
- Bimanual Manipulation: UR3 and UR3e arms, equipped with Robotiq 2F parallel grippers, execute dexterous, collision-free motions coordinated via MoveIt RRT* planning and joint-space PD controllers.
Information flow proceeds as: user audio → speech-to-text → recipe retrieval/generation (with anomaly resolution) → object/label detection → robot API generation → sequencing of actions on hardware → FT sensor for adaptive pouring feedback.
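The vision module's "structured JSON records" can be pictured as below; the field names and values are illustrative assumptions, not the paper's actual schema.

```python
import json

# Hypothetical structured detection record, as the vision module might
# emit after YOLOv8 detection plus EasyOCR label reading.
# (All field names and values here are illustrative, not from the paper.)
detection = {
    "object": "bottle",
    "bbox": [412, 188, 560, 630],   # x1, y1, x2, y2 in pixels
    "confidence": 0.94,             # YOLOv8 detection score
    "label_text": "Triple Sec",     # EasyOCR output
    "ocr_language": "en",
}

# Serialize for downstream modules (RAG / API-call generation),
# then parse it back as a consumer would.
record = json.dumps(detection)
parsed = json.loads(record)
```

Passing a serialized record rather than raw tensors keeps the perception stage decoupled from the language and action stages.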
2. Perception, Recognition, and Confidence Scoring
The vision module leverages YOLOv8 for detection, retaining only detections that clear a confidence threshold:

$$c_i \geq \tau,$$

where $c_i$ is the detection confidence for object $i$ and $\tau$ is the detection threshold. EasyOCR then extracts text from label bounding boxes, capable of reading both English and Russian in cluttered environments, but with observable failure modes on multi-line or low-contrast labels.

For ASR, OpenAI Whisper-1 outputs per-token probabilities $p_t$, synthesized into a final utterance confidence (geometric mean over the $T$ tokens):

$$C_{\text{utt}} = \exp\!\left(\frac{1}{T}\sum_{t=1}^{T}\log p_t\right).$$
Cluttered visual backgrounds and ambient audio noise are mitigated through robust model training, yielding recognition accuracies of 91% and 93% for vision and speech, respectively, across cluttered and noisy conditions.
3. Retrieval-Augmented Generation and Language-Guided Action
Shake-VLA's RAG module retrieves structured recipe embeddings via FAISS, indexed with OpenAI text-embedding-ada-002. The generative component (GPT-4o) synthesizes task instructions by conditioning on:
- User speech transcription
- Retrieved recipe content
- Anomaly detection module outputs
Generated actions are cast as natural language API calls to the robot (for instance, take_bottle(label), pour_liquid(volume, tol)), with system response latency ranging from 10 ms to 2 s depending on recipe complexity.
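The retrieval step can be sketched with toy vectors; a flat FAISS inner-product index performs the same brute-force search shown here, but the recipes, embeddings, and dimensionality below are all fabricated for illustration (real ada-002 embeddings are 1536-dimensional).

```python
import numpy as np

# Toy 3-d "embeddings" standing in for text-embedding-ada-002 vectors;
# with normalized rows, an inner product equals cosine similarity,
# which is what a FAISS IndexFlatIP computes over the whole index.
recipes = ["margarita", "mojito", "negroni"]
index = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.9],
])
index /= np.linalg.norm(index, axis=1, keepdims=True)

def retrieve(query_vec, k=1):
    """Return the k recipes most similar to the (toy) query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q
    top = np.argsort(scores)[::-1][:k]
    return [recipes[i] for i in top]
```

The retrieved recipe text is then placed in the GPT-4o prompt alongside the transcription and any anomaly signals.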
Recipe anomalies (ingredient mismatches or absences) are resolved symbolically. Missing items prompt user clarification via text-to-speech; absent user responses invoke rule-based fallbacks (e.g., garnish omission).
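The symbolic matching and fallback logic can be sketched as follows; the matching rule, the garnish set, and the return strings are illustrative assumptions, not the system's implementation.

```python
# Assumed set of ingredients that the rule-based fallback may omit.
OPTIONAL = {"mint sprig", "lime wedge"}

def detect_anomalies(recipe_ingredients, detected_labels):
    """Symbolic recipe-ingredient matching: flag recipe items with no
    matching detected bottle label (illustrative exact-match rule)."""
    detected = {label.lower() for label in detected_labels}
    return [ing for ing in recipe_ingredients if ing.lower() not in detected]

def resolve(missing, user_reply=None):
    """Prompt-then-fallback policy: prefer the user's substitution;
    otherwise omit optional garnishes; otherwise abort."""
    if not missing:
        return "proceed"
    if user_reply:
        return f"substitute per user: {user_reply}"
    if all(m.lower() in OPTIONAL for m in missing):
        return "proceed (garnish omitted)"   # rule-based fallback
    return "abort: essential ingredient missing"
```

Keeping this stage symbolic (rather than LLM-driven) makes its failure modes auditable, at the cost of brittleness on near-synonyms.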
4. Force-Torque Feedback, Calibration, and Bimanual Control
The ATI Nano17 FT sensor, affixed to the UR3e wrist, enables feedback-driven liquid dispensing. Sensor readings map vertical force $F_z$ to liquid volume by linear calibration:

$$V = \frac{F_z}{\rho g},$$

where $\rho$ is the liquid density and $g = 9.81\ \text{m/s}^2$. Real-time control ensures

$$\lvert V - V_{\text{target}} \rvert \leq 10\ \text{ml},$$

yielding ±10 ml volumetric precision. For bimanual manipulation, both UR3/UR3e arms adhere to standard kinematic parameterization, using MoveIt IK and RRT* planners to coordinate non-colliding grasps: one arm secures the glass while the other manipulates a bottle. A PD torque law for the joint vector $q$ is applied:

$$\tau = K_p (q_d - q) + K_d (\dot{q}_d - \dot{q}),$$

with gains $K_p$ and $K_d$ tuned per joint.
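The linear force-to-volume calibration and the ±10 ml tolerance check can be sketched as below; the water density is an assumed default (real liquids vary, which is exactly why the calibration takes $\rho$ as a parameter).

```python
G = 9.81             # gravitational acceleration, m/s^2
RHO_WATER = 1000.0   # kg/m^3 (assumed default density)

def volume_ml(delta_fz_newtons, rho=RHO_WATER):
    """Convert the measured change in vertical force F_z into dispensed
    volume via the linear calibration V = F_z / (rho * g); returns ml."""
    cubic_m = delta_fz_newtons / (rho * G)
    return cubic_m * 1e6

def within_tolerance(measured_ml, target_ml, tol_ml=10.0):
    """The ±10 ml volumetric tolerance reported for the system."""
    return abs(measured_ml - target_ml) <= tol_ml
```

For example, 50 ml of water weighs 0.05 kg, so the wrist sensor should register a force change of about 0.49 N once the pour completes.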
5. Experimental Evaluation and Performance
Shake-VLA was experimentally validated on cocktail preparation tasks:
- Vision Testing: 20 cluttered tabletop scenes (5–8 bottles, bilingual labels); 91% accuracy (σ ≈ 3%).
- Speech Testing: 30 unique recipe/instruction queries, moderate noise (office, 50 dB); 93% accurate recognition.
- Anomaly Detection: 20 recipes with intentional ingredient issues; 95% detection rate. 90% acceptance rate for user-proposed substitutions.
- System Integration: 10 end-to-end cocktail preparations, bimanual UR3 + UR3e arms, FT sensor. Achieved 100% task completion, with correct recipe formulation, manipulation, and volumetric dispensing.
Table: Module-wise Success Rates
| Module | Success Rate | Test Regime |
|---|---|---|
| Speech-to-Text (ASR) | 93% (28/30) | Noisy environment, accented speech |
| Vision (YOLOv8+OCR) | 91% (20 scenes) | Clutter, multi-lingual labels |
| Anomaly Detection | 95% (19/20) | Mismatch/omission, symbolic thresholding |
| Full System | 100% (10 tasks) | Recipe to action (bimanual mixing) |
6. Limitations and Prospective Enhancements
Several limitations were identified:
- The vision system is robust to clutter but sensitive to long, multi-line, or low-contrast text; multi-lingual OCR extension is recommended.
- Whisper-1 ASR is reliable at 93% accuracy; explicit beam-forming or denoising could push performance beyond 95%.
- The current rule-based anomaly handler may conflate ambiguous ingredient substitutions (e.g., “dark rum” versus “spiced rum”).
- The robot control stack is limited to standard UR PD controllers; impedance or admittance control may offer enhanced safety, particularly in human-robot collaborative contexts.
Proposed future directions include generalization to laboratory liquid handling (variable chemical densities, automation beyond cocktails), adaptive learning from user preferences, dynamic recipe adaptation, and extension of the RAG module for nutritional or on-the-fly recipe generation (Khan et al., 12 Jan 2025).
7. Significance and Broader Implications
Shake-VLA demonstrates the potential for tightly integrated perception-language-action pipelines in stochastic, cluttered, and multimodal environments. Its architecture highlights the utility of modularity—with distinct perception, reasoning, and action stages—as well as the advantages of on-the-fly anomaly handling. The system achieves reliable operational metrics across modes, indicating viability for broader automation scenarios requiring dexterous, context-aware, and multimodal robotic manipulation. A plausible implication is the feasibility of extending VLA architectures to general-purpose laboratory automation and adaptive assistive robotics.