Shake-VLA: Robotic Cocktail System
- Shake-VLA is a vision-language-action robotic system that integrates object detection, OCR, ASR, and retrieval-augmented generation to automate cocktail preparation.
- The system employs a modular pipeline with YOLOv8, Whisper-1, and dual-arm coordination to achieve high accuracy in cluttered, noisy environments.
- Force-torque sensor feedback and symbolic anomaly detection ensure precise liquid dispensing and effective error handling during bimanual manipulation.
Shake-VLA is a Vision-Language-Action (VLA) model-based robotic system engineered for bimanual manipulation in automated cocktail preparation. It couples vision and language understanding with precise action generation, leveraging YOLOv8-based object detection, Optical Character Recognition (OCR), speech-to-text (ASR) conversion, Retrieval-Augmented Generation (RAG), anomaly detection, force-torque sensor feedback, and coordinated dual-arm manipulation. Experimental results demonstrate operational robustness in cluttered and noisy environments and seamless integration of perception, language, and physical interaction (Khan et al., 12 Jan 2025).
1. System Architecture and Core Modules
Shake-VLA employs a modular pipeline that transforms spoken natural-language instructions into sequenced, bimanual robotic actions with feedback-driven liquid dispensing. The principal modules include:
- Vision: YOLOv8 single-stage detector locates bottles and objects; EasyOCR extracts label text, with outputs rendered as structured JSON records.
- Speech-to-Text (ASR): The OpenAI Whisper-1 model transcribes voice commands, robust to noisy environments and diverse accents; token-level post-processing yields an utterance-confidence score.
- RAG + LLM: Recipe retrieval is performed via a FAISS index with OpenAI text-embedding-ada-002; GPT-4o generates robot API calls (e.g., take_glass(), pour_liquid(q, tol)), incorporating the retrieved recipe, user input, and anomaly signals.
- Anomaly Detection: Symbolic recipe-ingredient matching identifies discrepancies, triggering user prompts or rule-based fallbacks if necessary.
- Force-Torque Sensor Integration: ATI Nano17 force-torque sensor on the UR3e arm provides real-time feedback for volumetric accuracy in pouring.
- Bimanual Manipulation: UR3 and UR3e arms, equipped with Robotiq 2F parallel grippers, execute dexterous, collision-free motions coordinated via MoveIt RRT* planning and joint-space PD controllers.
Information flow proceeds as: user audio → speech-to-text → recipe retrieval/generation (with anomaly resolution) → object/label detection → robot API generation → sequencing of actions on hardware → FT sensor for adaptive pouring feedback.
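The vision module's "structured JSON records" can be pictured as below; the field names and values are illustrative assumptions, not the paper's actual schema.

```python
import json

# Hypothetical structured detection record, as the vision module might
# emit after YOLOv8 detection plus EasyOCR label reading.
# (All field names and values here are illustrative, not from the paper.)
detection = {
    "object": "bottle",
    "bbox": [412, 188, 560, 630],   # x1, y1, x2, y2 in pixels
    "confidence": 0.94,             # YOLOv8 detection score
    "label_text": "Triple Sec",     # EasyOCR output
    "ocr_language": "en",
}

# Serialize for downstream modules (RAG / API-call generation),
# then parse it back as a consumer would.
record = json.dumps(detection)
parsed = json.loads(record)
```

Passing a serialized record rather than raw tensors keeps the perception stage decoupled from the language and action stages.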
2. Perception, Recognition, and Confidence Scoring
The vision module leverages YOLOv8 for detection, retaining only detections that clear a confidence threshold:

$$c_i \geq \tau,$$

where $c_i$ is the detection confidence for object $i$ and $\tau$ is the detection threshold. EasyOCR then extracts text from label bounding boxes, capable of reading both English and Russian in cluttered environments, but with observable failure modes on multi-line or low-contrast labels.

For ASR, OpenAI Whisper-1 outputs per-token probabilities $p_t$, synthesized into a final utterance confidence (geometric mean over the $T$ tokens):

$$C_{\text{utt}} = \exp\!\left(\frac{1}{T}\sum_{t=1}^{T}\log p_t\right).$$
Cluttered visual backgrounds and ambient audio noise are mitigated through robust model training, yielding recognition accuracies of 91% and 93% for vision and speech, respectively, across cluttered and noisy conditions.
3. Retrieval-Augmented Generation and Language-Guided Action
Shake-VLA's RAG module retrieves structured recipe embeddings via FAISS, indexed with OpenAI text-embedding-ada-002. The generative component (GPT-4o) synthesizes task instructions by conditioning on:
- User speech transcription
- Retrieved recipe content
- Anomaly detection module outputs
Generated actions are cast as natural language API calls to the robot (for instance, take_bottle(label), pour_liquid(volume, tol)), with system response latency ranging from 10 ms to 2 s depending on recipe complexity.
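The retrieval step can be sketched with toy vectors; a flat FAISS inner-product index performs the same brute-force search shown here, but the recipes, embeddings, and dimensionality below are all fabricated for illustration (real ada-002 embeddings are 1536-dimensional).

```python
import numpy as np

# Toy 3-d "embeddings" standing in for text-embedding-ada-002 vectors;
# with normalized rows, an inner product equals cosine similarity,
# which is what a FAISS IndexFlatIP computes over the whole index.
recipes = ["margarita", "mojito", "negroni"]
index = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.9],
])
index /= np.linalg.norm(index, axis=1, keepdims=True)

def retrieve(query_vec, k=1):
    """Return the k recipes most similar to the (toy) query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q
    top = np.argsort(scores)[::-1][:k]
    return [recipes[i] for i in top]
```

The retrieved recipe text is then placed in the GPT-4o prompt alongside the transcription and any anomaly signals.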
Recipe anomalies (ingredient mismatches or absences) are resolved symbolically. Missing items prompt user clarification via text-to-speech; absent user responses invoke rule-based fallbacks (e.g., garnish omission).
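The symbolic matching and fallback logic can be sketched as follows; the matching rule, the garnish set, and the return strings are illustrative assumptions, not the system's implementation.

```python
# Assumed set of ingredients that the rule-based fallback may omit.
OPTIONAL = {"mint sprig", "lime wedge"}

def detect_anomalies(recipe_ingredients, detected_labels):
    """Symbolic recipe-ingredient matching: flag recipe items with no
    matching detected bottle label (illustrative exact-match rule)."""
    detected = {label.lower() for label in detected_labels}
    return [ing for ing in recipe_ingredients if ing.lower() not in detected]

def resolve(missing, user_reply=None):
    """Prompt-then-fallback policy: prefer the user's substitution;
    otherwise omit optional garnishes; otherwise abort."""
    if not missing:
        return "proceed"
    if user_reply:
        return f"substitute per user: {user_reply}"
    if all(m.lower() in OPTIONAL for m in missing):
        return "proceed (garnish omitted)"   # rule-based fallback
    return "abort: essential ingredient missing"
```

Keeping this stage symbolic (rather than LLM-driven) makes its failure modes auditable, at the cost of brittleness on near-synonyms.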
4. Force-Torque Feedback, Calibration, and Bimanual Control
The ATI Nano17 FT sensor, affixed to the UR3e wrist, enables feedback-driven liquid dispensing. Sensor readings map vertical force $F_z$ to liquid volume by linear calibration:

$$V = \frac{F_z}{\rho g},$$

where $\rho$ is the liquid density and $g = 9.81\ \text{m/s}^2$. Real-time control ensures

$$\lvert V - V_{\text{target}} \rvert \leq 10\ \text{ml},$$

yielding ±10 ml volumetric precision. For bimanual manipulation, both UR3/UR3e arms adhere to standard kinematic parameterization, using MoveIt IK and RRT* planners to coordinate non-colliding grasps: one arm secures the glass while the other manipulates a bottle. A PD torque law for the joint vector $q$ is applied:

$$\tau = K_p (q_d - q) + K_d (\dot{q}_d - \dot{q}),$$

with gains $K_p$ and $K_d$ tuned per joint.
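The linear force-to-volume calibration and the ±10 ml tolerance check can be sketched as below; the water density is an assumed default (real liquids vary, which is exactly why the calibration takes $\rho$ as a parameter).

```python
G = 9.81             # gravitational acceleration, m/s^2
RHO_WATER = 1000.0   # kg/m^3 (assumed default density)

def volume_ml(delta_fz_newtons, rho=RHO_WATER):
    """Convert the measured change in vertical force F_z into dispensed
    volume via the linear calibration V = F_z / (rho * g); returns ml."""
    cubic_m = delta_fz_newtons / (rho * G)
    return cubic_m * 1e6

def within_tolerance(measured_ml, target_ml, tol_ml=10.0):
    """The ±10 ml volumetric tolerance reported for the system."""
    return abs(measured_ml - target_ml) <= tol_ml
```

For example, 50 ml of water weighs 0.05 kg, so the wrist sensor should register a force change of about 0.49 N once the pour completes.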
5. Experimental Evaluation and Performance
Shake-VLA was experimentally validated on cocktail preparation tasks:
- Vision Testing: 20 cluttered tabletop scenes (5–8 bottles, bilingual labels); 91% accuracy (σ ≈ 3%).
- Speech Testing: 30 unique recipe/instruction queries, moderate noise (office, 50 dB); 93% accurate recognition.
- Anomaly Detection: 20 recipes with intentional ingredient issues; 95% detection rate. 90% acceptance rate for user-proposed substitutions.
- System Integration: 10 end-to-end cocktail preparations, bimanual UR3 + UR3e arms, FT sensor. Achieved 100% task completion, with correct recipe formulation, manipulation, and volumetric dispensing.
Table: Module-wise Success Rates
| Module | Success Rate | Test Regime |
|---|---|---|
| Speech-to-Text (ASR) | 93% (28/30) | Noisy environment, accented speech |
| Vision (YOLOv8+OCR) | 91% (20 scenes) | Clutter, multi-lingual labels |
| Anomaly Detection | 95% (19/20) | Mismatch/omission, symbolic thresholding |
| Full System | 100% (10 tasks) | Recipe to action (bimanual mixing) |
6. Limitations and Prospective Enhancements
Several limitations were identified:
- The vision system is robust to clutter but sensitive to long, multi-line, or low-contrast text; multi-lingual OCR extension is recommended.
- Whisper-1 ASR is reliable at 93% accuracy; explicit beam-forming or denoising could push performance beyond 95%.
- The current rule-based anomaly handler may conflate ambiguous ingredient substitutions (e.g., “dark rum” versus “spiced rum”).
- The robot control stack is limited to standard UR PD controllers; impedance or admittance control may offer enhanced safety, particularly in human-robot collaborative contexts.
Proposed future directions include generalization to laboratory liquid handling (variable chemical densities, automation beyond cocktails), adaptive learning from user preferences, dynamic recipe adaptation, and extension of the RAG module for nutritional or on-the-fly recipe generation (Khan et al., 12 Jan 2025).
7. Significance and Broader Implications
Shake-VLA demonstrates the potential for tightly integrated perception-language-action pipelines in stochastic, cluttered, and multimodal environments. Its architecture highlights the utility of modularity—with distinct perception, reasoning, and action stages—as well as the advantages of on-the-fly anomaly handling. The system achieves reliable operational metrics across modes, indicating viability for broader automation scenarios requiring dexterous, context-aware, and multimodal robotic manipulation. A plausible implication is the feasibility of extending VLA architectures to general-purpose laboratory automation and adaptive assistive robotics.