
Shake-VLA: Robotic Cocktail System

Updated 8 February 2026
  • Shake-VLA is a vision-language-action robotic system that integrates object detection, OCR, ASR, and retrieval-augmented generation to automate cocktail preparation.
  • The system employs a modular pipeline with YOLOv8, Whisper-1, and dual-arm coordination to achieve high accuracy in cluttered, noisy environments.
  • Force-torque sensor feedback and symbolic anomaly detection ensure precise liquid dispensing and effective error handling during bimanual manipulation.

Shake-VLA is a Vision-Language-Action (VLA) model-based robotic system engineered for bimanual manipulation in automated cocktail preparation. It couples vision and language understanding with precise action generation, leveraging YOLOv8-based object detection, Optical Character Recognition (OCR), speech-to-text (ASR) conversion, Retrieval-Augmented Generation (RAG), anomaly detection, force-torque sensor feedback, and coordinated dual-arm manipulation. Experimental results demonstrate operational robustness in cluttered and noisy environments and seamless integration of perception, language, and physical interaction (Khan et al., 12 Jan 2025).

1. System Architecture and Core Modules

Shake-VLA employs a modular pipeline that transforms spoken natural-language instructions into sequenced, bimanual robotic actions with feedback-driven liquid dispensing. The principal modules include:

  • Vision: YOLOv8 single-stage detector locates bottles and objects; EasyOCR extracts label text, with outputs rendered as structured JSON records.
  • Speech-to-Text (ASR): OpenAI's Whisper-1 model transcribes voice commands, remaining robust to noisy environments and diverse accents; token-level post-processing yields an utterance confidence score.
  • RAG + LLM: Recipe retrieval is performed via FAISS index with OpenAI text-embedding-ada-002; GPT-4o generates robot API calls (e.g., take_glass(), pour_liquid(q, tol)), incorporating retrieved recipe, user input, and anomaly signals.
  • Anomaly Detection: Symbolic recipe-ingredient matching identifies discrepancies, triggering user prompts or rule-based fallbacks if necessary.
  • Force-Torque Sensor Integration: ATI Nano17 force-torque sensor on the UR3e arm provides real-time feedback for volumetric accuracy in pouring.
  • Bimanual Manipulation: UR3 and UR3e arms, equipped with Robotiq 2F parallel grippers, execute dexterous, collision-free motions coordinated via MoveIt RRT* planning and joint-space PD controllers.

Information flow proceeds as: user audio → speech-to-text → recipe retrieval/generation (with anomaly resolution) → object/label detection → robot API generation → sequencing of actions on hardware → FT sensor for adaptive pouring feedback.

2. Perception, Recognition, and Confidence Scoring

The vision module leverages YOLOv8 for detection, setting a confidence threshold:

s_i \geq \tau_{\rm vision}, \quad \tau_{\rm vision} = 0.5,

where s_i is the detection confidence for object i. EasyOCR then extracts text from label bounding boxes, capable of reading both English and Russian in cluttered environments, but with observable failure modes on multi-line or low-contrast labels.
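The threshold test above amounts to a simple filter over detection records before they are serialized to JSON. A minimal sketch (the detection dicts and field names are illustrative, not the system's actual schema):

```python
import json

TAU_VISION = 0.5  # detection confidence threshold from the vision module

def filter_detections(detections, tau=TAU_VISION):
    """Keep only detections whose confidence s_i meets the threshold."""
    return [d for d in detections if d["confidence"] >= tau]

# Illustrative YOLOv8-style detections (labels would come from EasyOCR).
detections = [
    {"label": "vodka", "confidence": 0.92, "bbox": [10, 20, 110, 220]},
    {"label": "lime juice", "confidence": 0.41, "bbox": [130, 25, 210, 230]},
]
kept = filter_detections(detections)
print(json.dumps(kept))  # only the 0.92-confidence bottle survives
```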

For ASR, OpenAI Whisper-1 outputs per-token probabilities p_j, synthesizing a final utterance confidence:

C_{\rm ASR} = \exp\!\left(\frac{1}{N} \sum_{j=1}^{N} \log p_j \right), \quad \text{accept if } C_{\rm ASR} \ge 0.6.

Cluttered visual backgrounds and ambient audio noise are mitigated through robust model training, yielding recognition accuracies of 91% and 93% for vision and speech, respectively, across cluttered and noisy conditions.
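The utterance-confidence formula is the geometric mean of the per-token probabilities, computed in log space for numerical stability. A minimal sketch (the token probabilities are invented for illustration):

```python
import math

def utterance_confidence(token_logprobs):
    """C_ASR = exp(mean of per-token log-probabilities)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def accept(token_logprobs, threshold=0.6):
    """Accept the transcription only if C_ASR meets the threshold."""
    return utterance_confidence(token_logprobs) >= threshold

# Illustrative per-token probabilities for a short utterance.
logps = [math.log(p) for p in (0.9, 0.8, 0.95)]
c = utterance_confidence(logps)  # geometric mean of 0.9, 0.8, 0.95
print(c, accept(logps))
```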

3. Retrieval-Augmented Generation and Language-Guided Action

Shake-VLA's RAG module retrieves structured recipe embeddings via FAISS, indexed with OpenAI text-embedding-ada-002. The generative component (GPT-4o) synthesizes task instructions by conditioning on:

  • User speech transcription
  • Retrieved recipe content
  • Anomaly detection module outputs

Generated actions are cast as natural language API calls to the robot (for instance, take_bottle(label), pour_liquid(volume, tol)), with system response latency ranging from 10 ms to 2 s depending on recipe complexity.
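The retrieval step reduces to nearest-neighbor search over recipe embeddings. The paper uses a FAISS index over text-embedding-ada-002 vectors; the sketch below substitutes toy 3-dimensional vectors and plain cosine similarity to illustrate the lookup:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy stand-ins for recipe embeddings (real ones are 1536-dim ada-002 vectors).
recipes = {
    "margarita": [0.9, 0.1, 0.0],
    "mojito":    [0.1, 0.8, 0.3],
}

def retrieve(query_vec):
    """Return the recipe whose embedding is most similar to the query."""
    return max(recipes, key=lambda name: cosine(query_vec, recipes[name]))

print(retrieve([0.85, 0.2, 0.05]))  # "margarita"
```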

Recipe anomalies (ingredient mismatches or absences) are resolved symbolically. Missing items prompt user clarification via text-to-speech; absent user responses invoke rule-based fallbacks (e.g., garnish omission).
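The symbolic matching step can be sketched as a set difference between the recipe's required ingredients and the labels detected on the table; any remainder would trigger a user prompt or a rule-based fallback (ingredient names here are illustrative):

```python
def detect_anomalies(recipe_ingredients, detected_labels):
    """Return recipe ingredients with no matching detected bottle label."""
    missing = set(recipe_ingredients) - set(detected_labels)
    return sorted(missing)

recipe = ["white rum", "lime juice", "mint"]
labels = ["white rum", "lime juice"]  # as read by OCR from the scene
print(detect_anomalies(recipe, labels))  # ['mint'] -> prompt user or omit garnish
```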

4. Force-Torque Feedback, Calibration, and Bimanual Control

The ATI Nano17 FT sensor, affixed to the UR3e wrist, enables feedback-driven liquid dispensing. Sensor readings map vertical force F_z to liquid volume by linear calibration:

V(F_z) = \alpha\, F_z + \beta, \quad \alpha \approx \frac{1}{\rho g}, \; \beta \approx 0,

where ρ is the liquid density and g = 9.81 m/s². Real-time control ensures

|V_{\rm poured} - V_{\rm target}| \leq \Delta V_{\rm tol}, \quad \Delta V_{\rm tol} = 0.01\,\text{L},

yielding ±10 ml volumetric precision. For bimanual manipulation, both UR3/UR3e arms adhere to standard kinematic parameterization, using MoveIt IK and RRT* planners to coordinate non-colliding grasps—one arm securing the glass, the other manipulating a bottle. A PD torque law for joint vector q is applied:

\tau = K_p (q_d - q) + K_d (\dot{q}_d - \dot{q}),

with typical gains K_p = 50 and K_d = 5.
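Under the linear calibration above (taking β ≈ 0), the force-to-volume conversion, the tolerance check, and the PD law each reduce to a few lines. A minimal sketch assuming water density; the force reading is a simulated value, not an actual Nano17 sample:

```python
RHO_WATER = 1000.0  # kg/m^3, assumed liquid density
G = 9.81            # m/s^2
V_TOL = 0.01        # L, i.e. the ±10 ml dispensing tolerance

def volume_from_force(f_z, rho=RHO_WATER):
    """Linear calibration V(F_z) = F_z / (rho * g), converted m^3 -> L."""
    return f_z / (rho * G) * 1000.0

def within_tolerance(f_z, v_target, tol=V_TOL):
    """Stop-pouring condition: |V_poured - V_target| <= tolerance."""
    return abs(volume_from_force(f_z) - v_target) <= tol

def pd_torque(q_d, q, qd_d, qd, kp=50.0, kd=5.0):
    """Joint-space PD law tau = Kp*(q_d - q) + Kd*(qd_d - qd), per joint."""
    return kp * (q_d - q) + kd * (qd_d - qd)

# 50 ml of water weighs about 0.4905 N, so the target is reached.
print(within_tolerance(0.4905, 0.050))  # True
```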

5. Experimental Evaluation and Performance

Shake-VLA was experimentally validated on cocktail preparation tasks:

  • Vision Testing: 20 cluttered tabletop scenes (5–8 bottles, bilingual labels); 91% accuracy (σ ≈ 3%).
  • Speech Testing: 30 unique recipe/instruction queries, moderate noise (office, 50 dB); 93% accurate recognition.
  • Anomaly Detection: 20 recipes with intentional ingredient issues; 95% detection rate. 90% acceptance rate for user-proposed substitutions.
  • System Integration: 10 end-to-end cocktail preparations, bimanual UR3 + UR3e arms, FT sensor. Achieved 100% task completion, with correct recipe formulation, manipulation, and volumetric dispensing.

Table: Module-wise Success Rates

| Module | Success Rate | Test Regime |
|---|---|---|
| Speech-to-Text (ASR) | 93% (28/30) | Noisy, accented voice, troubleshooting |
| Vision (YOLOv8 + OCR) | 91% (20 scenes) | Clutter, multi-lingual labels |
| Anomaly Detection | 95% (19/20) | Mismatch/omission, symbolic thresholding |
| Full System | 100% (10 tasks) | Recipe to action (bimanual mixing) |

6. Limitations and Prospective Enhancements

Several limitations were identified:

  • The vision system is robust to clutter but sensitive to long, multi-line, or low-contrast text; multi-lingual OCR extension is recommended.
  • Whisper-1 ASR achieves 93% reliability; explicit beamforming or denoising could push performance beyond 95%.
  • The current rule-based anomaly handler may conflate ambiguous ingredient substitutions (e.g., “dark rum” versus “spiced rum”).
  • The robot control stack is limited to standard UR PD controllers; impedance or admittance control may offer enhanced safety, particularly in human-robot collaborative contexts.

Proposed future directions include generalization to laboratory liquid handling (variable chemical densities, automation beyond cocktails), adaptive learning from user preferences, dynamic recipe adaptation, and extension of the RAG module for nutritional or on-the-fly recipe generation (Khan et al., 12 Jan 2025).

7. Significance and Broader Implications

Shake-VLA demonstrates the potential for tightly integrated perception-language-action pipelines in stochastic, cluttered, and multimodal environments. Its architecture highlights the utility of modularity—with distinct perception, reasoning, and action stages—as well as the advantages of on-the-fly anomaly handling. The system achieves reliable operational metrics across modes, indicating viability for broader automation scenarios requiring dexterous, context-aware, and multimodal robotic manipulation. A plausible implication is the feasibility of extending VLA architectures to general-purpose laboratory automation and adaptive assistive robotics.

