Cosmos-Surg-dVRK: Video-Driven Surgical Simulation

Updated 24 January 2026

Cosmos-Surg-dVRK is a WFM-based platform that enables high-fidelity, vision-driven simulation of surgical robotics for automated policy evaluation.
It integrates a fine-tuned video diffusion model with a V-JEPA 2 classifier to predict surgical scenes and benchmark policies against real dVRK performance.
The platform advances reproducible policy development by combining data-driven simulation with objective, scalable outcome classification.

Cosmos-Surg-dVRK is a world foundation model (WFM)-based simulation and automated policy evaluation platform for surgical robotics, augmenting the da Vinci Research Kit (dVRK) research ecosystem with high-fidelity, vision-driven simulations and scalable benchmarking pipelines. Leveraging a fine-tuned, large video diffusion model (Cosmos WFM), Cosmos-Surg-dVRK enables data-driven prediction of complex surgical scenes—incorporating soft-tissue deformation and tool–tissue interactions—conditioned on real dVRK actions. Coupled with the V-JEPA 2 video classifier, this infrastructure provides fully automated online evaluation of surgical robot policies and strong correlation with physical hardware performance, thereby accelerating policy development, hyperparameter search, and reproducibility for advanced vision-language-action robotics in medicine (Zbinden et al., 17 Oct 2025).

1. Motivation and Theoretical Foundation

Evaluating autonomous or assistive control policies in surgical robotics requires extensive, reproducible trials across diverse surgical scenarios. Direct deployment on the dVRK presents significant barriers: high per-trial cost, labor-intensive resets, calibration burdens, and variability arising from cable-actuated mechanisms, platform differences, and institutional safety protocols. Furthermore, ex-vivo and in-vivo procedures require ethics review and careful logistics for biological samples. These constraints slow evaluation cycles and limit reliable comparison between algorithms (Zbinden et al., 17 Oct 2025).

World Foundation Models (WFMs), exemplified by the Cosmos architecture, address these limitations by learning video-conditional, action-driven generative dynamics directly from paired surgical video and kinematic data. Rather than specifying explicit physical models (e.g., FEM, MPM) or tuning synthetic simulations, WFMs are trained to autoregressively predict future visual states ( $s_{t+1}$ ) conditioned on observed frames ( $s_t$ ) and control actions ( $a_t$ ), thereby capturing both rigid-body and deformable-tissue dynamics in a unified, data-driven framework.

2. Cosmos-Surg-dVRK Model Architecture

Cosmos-Surg-dVRK is implemented as a fine-tuned variant of the base Cosmos WFM (Cosmos-Predict2-2B-Video2World), a transformer-based latent diffusion model for multi-modal video generation. At each inference step $i$ , the model receives the current RGB frame $s_i$ and a window of future control actions $a_{i:i+K-1}$ (relative end-effector translations, rotations, and jaw angles; $a_t\in\mathbb{R}^7$ ), predicting the next $K$ camera frames $\hat{s}_{i+1:i+K}$ via latent denoising diffusion (Zbinden et al., 17 Oct 2025).
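The chunked, autoregressive inference described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `rollout`, `predict_chunk`, and `toy_model` are hypothetical names, frames are represented as flat lists of floats, and the stand-in model only shifts pixel values so that the loop can run without the diffusion model.

```python
from typing import Callable, List, Sequence

Frame = List[float]        # hypothetical stand-in for an RGB frame
Action = Sequence[float]   # 7-D action vector, a_t in R^7


def rollout(
    predict_chunk: Callable[[Frame, Sequence[Action]], List[Frame]],
    first_frame: Frame,
    actions: Sequence[Action],
    k: int,
) -> List[Frame]:
    """Generate frames K at a time, conditioning each step on the last frame."""
    frames = [first_frame]
    for i in range(0, len(actions), k):
        window = actions[i:i + k]                       # a_{i:i+K-1}
        predicted = predict_chunk(frames[-1], window)   # \hat{s}_{i+1:i+K}
        if len(predicted) != len(window):
            raise ValueError("model must return one frame per action")
        frames.extend(predicted)
    return frames


def toy_model(frame: Frame, window: Sequence[Action]) -> List[Frame]:
    """Assumed placeholder for the world model: adds a[0] to every pixel."""
    out, current = [], frame
    for a in window:
        current = [p + a[0] for p in current]
        out.append(current)
    return out


actions = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]] * 5
video = rollout(toy_model, [0.0, 0.0], actions, k=2)
print(len(video), video[-1])
```

In a real setup, `predict_chunk` would wrap the fine-tuned diffusion model, and the last predicted frame of each chunk becomes the conditioning frame for the next.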

The fine-tuning process utilizes paired video and action sequences from 3,036 tabletop suturing episodes (~13 h) and 16,506 ex-vivo porcine cholecystectomy episodes (~18 h), all sampled at 10 Hz. The optimization objective combines pixel-level visual loss, standard latent diffusion denoising loss, and KL-regularization to align posterior and prior latent distributions:

$\mathcal{L}_{\mathrm{sim}} = \sum_{t=0}^{T-1} \lVert s_{t+1} - \hat{s}_{t+1} \rVert_2^2 + \lambda_{\mathrm{diff}}\,\mathcal{L}_{\mathrm{diffusion}}(s_{t+1}, \hat{s}_{t+1}) + \lambda_{\mathrm{KL}}\,\mathrm{KL}(q(z_{t+1}|s_{t+1})\, ||\, p(z_{t+1}|s_t,a_t))$

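where $\mathcal{L}_{\mathrm{diffusion}}$ is the denoising loss and $\lambda_{\mathrm{diff}}$ and $\lambda_{\mathrm{KL}}$ weight the diffusion and KL terms, respectively (Zbinden et al., 17 Oct 2025).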
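To make the structure of this objective concrete, the following sketch evaluates it on toy values. It rests on several assumptions not stated in the source: the sum over $t$ is taken to cover all three terms, the denoising loss is passed in as a precomputed scalar per step, the latent posterior and prior are treated as diagonal Gaussians so the KL term has a closed form, and the weights `lam_diff` and `lam_kl` are arbitrary placeholders.

```python
import math


def squared_l2(x, y):
    """Pixel-level term: squared L2 distance between target and prediction."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kl_diag_gaussian(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for diagonal Gaussians (an assumed latent parameterization)."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


def sim_loss(targets, preds, diffusion_terms, kl_terms, lam_diff, lam_kl):
    """Sum of pixel, weighted denoising, and weighted KL terms over time steps."""
    total = 0.0
    for s, s_hat, l_diff, kl in zip(targets, preds, diffusion_terms, kl_terms):
        total += squared_l2(s, s_hat) + lam_diff * l_diff + lam_kl * kl
    return total


kl = kl_diag_gaussian([1.0], [1.0], [0.0], [1.0])
loss = sim_loss(
    targets=[[1.0, 2.0]],
    preds=[[1.0, 4.0]],
    diffusion_terms=[0.5],   # placeholder denoising loss value
    kl_terms=[kl],
    lam_diff=0.1,            # placeholder weight
    lam_kl=0.01,             # placeholder weight
)
print(kl, loss)
```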

Implicitly, Cosmos-Surg-dVRK models complex, high-DOF interactions by learning the black-box mapping $(s_t, a_t) \mapsto \hat{s}_{t+1}$, with the learned model weights parameterizing both geometric and deformable scene evolution, including tissue compliance, suture tension, needle penetration, and tool trajectories.

3. Automated Policy Evaluation Pipeline

To provide objective and scalable benchmarking, Cosmos-Surg-dVRK integrates an automated video classification pipeline using a fine-tuned V-JEPA 2 model. Video rollouts generated by executing policy-driven action sequences in the learned simulator are segmented into 32-frame overlapping chunks (6-frame overlap). Each chunk receives one of three annotations: success (task completed), anomaly (physics inconsistency or hallucination), or default (neither), yielding a per-trial outcome of "success" if any segment is labeled as such before anomaly.
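The chunking and per-trial decision rule above can be sketched as follows. This is an illustrative reading of the description, with two assumptions: a 6-frame overlap implies a stride of 26 frames, and trailing frames that do not fill a complete chunk are dropped. The function names are hypothetical.

```python
def chunk_starts(num_frames, chunk_len=32, overlap=6):
    """Start indices of overlapping chunks; incomplete tail chunks are dropped."""
    stride = chunk_len - overlap
    if num_frames < chunk_len:
        return []
    return list(range(0, num_frames - chunk_len + 1, stride))


def trial_succeeded(chunk_labels):
    """True if a 'success' chunk occurs before any 'anomaly' chunk."""
    for label in chunk_labels:
        if label == "success":
            return True
        if label == "anomaly":
            return False
    return False  # only 'default' chunks: the task was never completed


starts = chunk_starts(100)
print(starts)
print(trial_succeeded(["default", "success", "anomaly"]))
print(trial_succeeded(["default", "anomaly", "success"]))
```

Scanning the labels in temporal order means a later hallucination does not invalidate an earlier success, while an anomaly that appears first ends the trial as a failure.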

The classifier employs a frozen V-JEPA 2 ViT-Huge backbone (632M parameters), followed by a four-block transformer probe with attentive pooling and a three-way softmax. The cross-entropy loss is class-weighted with a strong penalty on the anomaly class, and training hyperparameters are selected by BOHB search (learning rate, batch size 8, 200 epochs) (Zbinden et al., 17 Oct 2025).
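A class-weighted cross-entropy over the three outputs might look like the sketch below. The weight values are placeholders chosen for illustration, not the values used in the paper, and the per-sample form shown here omits any batch normalization a training framework might apply.

```python
import math

CLASSES = ("success", "anomaly", "default")


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, target_idx, class_weights):
    """Per-sample cross-entropy scaled by the weight of the true class."""
    probs = softmax(logits)
    return -class_weights[target_idx] * math.log(probs[target_idx])


# Placeholder weights: misclassified anomaly chunks cost more than the others.
weights = [1.0, 5.0, 1.0]
loss_anomaly = weighted_cross_entropy([0.0, 0.0, 0.0], CLASSES.index("anomaly"), weights)
loss_default = weighted_cross_entropy([0.0, 0.0, 0.0], CLASSES.index("default"), weights)
print(loss_anomaly, loss_default)
```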

Evaluation proceeds by (1) initializing rollouts from a real dVRK observation $s_0$, (2) iteratively predicting $K$ future frames at 10 Hz, (3) segmenting video outputs for classifier prediction, and (4) comparing pipeline success rates to those measured on physical dVRK Si hardware using identical policy checkpoints.

4. Experimental Results and Metrics

Performance benchmarking of Cosmos-Surg-dVRK uses six metrics: success rate (SR), Pearson correlation ($r$), Cohen's kappa ($\kappa$), intraclass correlation (ICC(2,1)), mean maximum rank violation (MMRV), and mean bias error (MBE).

Task	Manual $r$	Automated $r$	Automated MMRV	Automated MBE
Handover	0.656	0.656	0.133	—
Throw	0.639	0.639	0.117	—
Knot Tie	0.729	0.729	0.033	—
Pickup	0.639	0.639	0.100	—
Average	0.666	0.666	0.096	0.153

Classifier–human agreement reached an ICC of 0.836 and a statistically significant Pearson correlation (p<0.001). For dVRK-to-simulator policy outcome correlation on tabletop tasks, V-JEPA 2-based labeling was on par with manual labeling (see the per-task table above), with low ranking violations (MMRV = 0.096) and minor bias (MBE = 0.153). For a representative ex-vivo cholecystectomy policy, Cosmos-Surg-dVRK closely matched real platform performance (Cosmos-Surg: 8/9 tasks; dVRK Si: 7/9) (Zbinden et al., 17 Oct 2025).
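Three of these sim-to-real metrics can be computed from paired per-policy success rates, as sketched below. The definitions are assumptions rather than quotations from the paper: MBE is taken as the mean of simulated minus real success rates, and MMRV follows the commonly used form in which each policy contributes the largest real-performance gap among the pairs whose ranking the simulator gets wrong. The input values are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_bias_error(sim, real):
    """Mean of (simulated - real); positive values mean the simulator is optimistic."""
    return sum(s - r for s, r in zip(sim, real)) / len(sim)


def mmrv(sim, real):
    """Mean over policies of the worst real-performance gap among mis-ranked pairs."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            mis_ranked = (sim[i] < sim[j]) != (real[i] < real[j])
            if mis_ranked:
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n


real_sr = [0.2, 0.5, 0.9]   # hypothetical success rates on hardware
sim_sr = [0.4, 0.3, 1.0]    # hypothetical success rates in the simulator
print(pearson_r(sim_sr, real_sr), mean_bias_error(sim_sr, real_sr), mmrv(sim_sr, real_sr))
```

In this toy example the simulator swaps the order of the first two policies, so both contribute their real gap of 0.3 and the third contributes nothing.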

5. Comparison to Prior Approaches

Conventional evaluation frameworks in dVRK robotics relied on direct hardware trials (costly, low throughput) or rigid-body simulators with limited support for deformable tissue, realistic camera views, or automated success classification. Real-time physical simulators such as SurRoL provide extensible task libraries and robust contact interactions but cannot model soft-tissue deformation or hallucinated failure events without explicit physics or vision models (Xu et al., 2021). Cosmos-Surg-dVRK uniquely enables:

Learned, action-conditioned frame prediction with implicit, video-grounded soft-tissue and tool–tissue modeling (removing manual parameter tuning).
Automated outcome classification, aligning closely with human rater outcomes and real-robot ground truth.
Integration of policy-driven rollouts with visual feedback, enabling interactive rates (~10 Hz) on standard GPU hardware and high-throughput experimentation.
Reproducible digital twin benchmarking, isolating platform and software variability while maintaining realistic vision-based observation spaces (Zbinden et al., 17 Oct 2025).

6. Limitations and Simulation-to-Real Gaps

Certain limitations constrain Cosmos-Surg-dVRK’s domain of validity:

The model currently operates with monocular, endoscopic RGB inputs—lacking additional perspectives (e.g., wrist cameras) and proprioceptive/force feedback.
Boundary artifacts and fine instrument structures (e.g., needles, suture thread) may exhibit blurring, complicating fine-grained success/failure calls.
The diffusion model’s "positive success bias" inflates perceived task completion rates in datasets with limited failure annotation; inclusion of negative examples in training reduces mean bias (MBE) from 0.325 to 0.140.
Physics anomalies, including object interpenetration, arise due to the absence of explicit geometric/contact constraints.
The simulator does not enforce real-time feedback control loops or complex multi-modal inputs required by certain state-of-the-art RL policies (Zbinden et al., 17 Oct 2025).

Mitigating these factors involves training with richer negative trajectories, incorporating multi-view and state-token augmentations, and hybridizing data-driven diffusion with explicit physics priors.

7. Future Directions

Planned advancements encompass:

Extension to multi-view and multi-modal state spaces (e.g., addition of wrist/endoscope data, force/torque signals).
Scaling to complex, long-horizon surgical workflows (e.g., tumor resection, anastomosis) through fine-tuning on broader procedural datasets.
Integrating explicit differentiable physical constraints or contact/collision models to curtail hallucinations and enforce feasible tool–tissue interactions.
Enabling model-based RL and world-model-guided policy improvement inside Cosmos-Surg, leveraging the simulator capability for closed-loop controller optimization.
Inference speedups via model distillation or quantization for real-time, in situ surgical assistance or online feedback loops (Zbinden et al., 17 Oct 2025).

Cosmos-Surg-dVRK provides a high-fidelity, reproducible, and automated research infrastructure that bridges policy training between real-world surgical platforms and scalable simulation, with strong empirical alignment to hardware-based evaluations. It is positioned as a foundational tool for accelerating and standardizing autonomous surgical robotics research.


References

Zbinden et al. (2025). Cosmos-Surg-dVRK: World Foundation Model-based Automated Online Evaluation of Surgical Robot Policy Learning.

Xu et al. (2021). SurRoL: An Open-source Reinforcement Learning Centered and dVRK Compatible Platform for Surgical Robot Learning.
