- The paper introduces UltraDP, a force-aware diffusion policy that uses multi-modal data to autonomously perform carotid ultrasound scanning with a 95% success rate on unseen subjects.
- The methodology combines expert demonstrations, guided DDPM diffusion, and hybrid force-impedance control to maintain accurate anatomical centering and safe probe interaction.
- Experimental results validate UltraDP’s robustness with low tracking errors and minimal contact forces, indicating its potential for reliable clinical deployment.
UltraDP: Generalizable Carotid Ultrasound Scanning with Force-Aware Diffusion Policy
Autonomous robotic ultrasound scanning addresses a pronounced demand gap caused by the cognitive and operational burden placed on trained sonographers and the global shortage of experienced practitioners. Generalization across patient anatomies is a primary unmet technical challenge: anatomical variation, tissue compliance diversity, and changing probe-tissue interaction properties demand navigation and control policies that adapt with minimal manual retuning. Prior solutions—dominated by rule-based approaches or supervised imitation—have repeatedly shown limited transfer outside the training distribution or failed to capture multi-modal corrective action strategies inherent to expert practice.
Figure 1: Demonstration of the carotid scanning task, contrasting traditional human-guided scanning and the UltraDP-enabled robotic workflow, which leverages multi-sensory fusion for closed-loop pose/wrench output.
Methodology: System Architecture and Diffusion Policy Design
UltraDP advances autonomous carotid artery scanning through a system that assimilates multi-modal perceptual data (ultrasound images, RGBD images, contact wrench, and probe pose) and generates real-valued actions in a high-dimensional space—specifically, target wrench and pose adjustments suitable for the trajectory and contact profile required by the task.
The end-to-end pipeline is structured into four functional modules: (1) real-world data collection from expert sonographers, instrumented for ground-truth pose, force, and imaging data; (2) pretraining of the ultrasound encoder via a classification-regression task to localize artery landmarks, improving sensitivity to key anatomical features; (3) navigation via a diffusion policy employing DDPM, augmented with a novel centering guidance term for stable artery alignment; and (4) hybrid force-impedance control ensuring compliant, safe probe-patient interaction across variable neck anatomies.
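The pretraining objective in module (2) is described only as a classification-regression task, so the following is a minimal sketch of one plausible form rather than the authors' loss. It assumes a binary "artery visible" label and a normalized 2D artery-center target, with the regression term masked when no artery is present. The function name, argument names, and the weighting `reg_weight` are all hypothetical.

```python
import numpy as np

def pretrain_loss(presence_logit, center_pred, presence_label, center_label,
                  reg_weight=1.0):
    """Joint classification-regression loss (illustrative assumption).

    presence_logit : (B,)   raw logit for "artery visible in this frame"
    center_pred    : (B, 2) predicted artery center, normalized image coords
    presence_label : (B,)   1.0 if an artery is annotated, else 0.0
    center_label   : (B, 2) annotated artery center (ignored when label is 0)
    """
    z, y = presence_logit, presence_label
    # Numerically stable binary cross-entropy computed from logits.
    bce = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    # Squared localization error, counted only for frames with an artery.
    sq_err = np.sum((center_pred - center_label) ** 2, axis=-1)
    reg = np.sum(y * sq_err) / max(float(np.sum(y)), 1.0)
    return float(np.mean(bce) + reg_weight * reg)
```

Masking the regression term keeps frames without a visible artery from pulling the localization head toward an arbitrary target.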
Figure 2: The UltraDP system spans demonstration collection, targeted visual encoder pretraining, multi-modal navigation via diffusion policy, and real-time closed-loop control.
By concatenating ultrasound image features (output of a pretrained ResNet), wrist camera imagery, end-effector pose (in 3D position + 6D orientation), and contact force/torque, UltraDP yields a rich time series observation. Output actions comprise probe pose increments (relative SE(3) transformations) and target wrench values, interpreted by the low-level controller. Notably, data augmentation is performed through spatial transformations to enhance coverage in the demonstration distribution.
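To make the observation and action encoding concrete, the sketch below shows the standard 6D rotation representation (the first two columns of a rotation matrix, recovered by Gram-Schmidt orthonormalization), a simple feature concatenation, and the composition of a relative SE(3) increment with the current pose. The feature dimensions, the function names, and the choice to apply increments in the probe frame (right-multiplication) are assumptions for illustration, not details confirmed by the source.

```python
import numpy as np

def rotmat_to_6d(R):
    """Encode a 3x3 rotation matrix as its first two columns (6 values)."""
    return np.concatenate([R[:, 0], R[:, 1]])

def sixd_to_rotmat(d6):
    """Recover a rotation matrix from the 6D encoding via Gram-Schmidt."""
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2_orth = a2 - np.dot(b1, a2) * b1      # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)                   # third axis completes the frame
    return np.stack([b1, b2, b3], axis=1)

def build_observation(us_feat, cam_feat, position, R, wrench):
    """Concatenate per-timestep features into one observation vector.

    us_feat, cam_feat : encoder outputs (dimensions are hypothetical)
    position          : (3,) end-effector position
    R                 : (3, 3) end-effector orientation
    wrench            : (6,) contact force and torque
    """
    return np.concatenate([us_feat, cam_feat, position, rotmat_to_6d(R), wrench])

def apply_pose_increment(T_current, T_delta):
    """Apply a relative SE(3) increment, assumed to be expressed in the probe frame."""
    return T_current @ T_delta
```

The 6D encoding avoids the discontinuities of Euler angles and quaternions, which is a common reason to prefer it for learned policies.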
Guidance-Augmented Diffusion Process
An explicit guidance term is integrated into the DDPM sampling step. It is derived from a regression that maps lateral artery pixel displacement to a spatial probe adjustment, a domain-specific imaging model that encourages centering of the carotid artery. The resulting action update, at each denoising iteration, is:
$$a^{k-1} = \alpha \left( a^{k} + \rho \nabla_{a^{k}} u - \gamma\, \epsilon_\theta(o, a^{k}, k) + \mathcal{N}(0, \sigma^{2} I) \right)$$
where u encodes the pixel-to-probe mapping, enforcing anatomically valid scanning trajectories.
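A minimal sketch of one guided denoising iteration, following the update above term by term, may help. The noise prediction is passed in as an array, and the guidance objective is a toy stand-in: it assumes u is a negative squared centering error on the lateral action component, with a hypothetical linear pixel-to-metre coefficient `px_to_m`. The actual form of u and all coefficient values are not given in the source.

```python
import numpy as np

def centering_guidance_grad(a, pixel_offset, px_to_m):
    """Gradient of a toy centering objective u with respect to the action.

    Assumes u(a) = -0.5 * (a[0] - px_to_m * pixel_offset)**2, i.e. the lateral
    action component a[0] should match the probe shift that would re-center
    the artery. Both the objective and px_to_m are illustrative assumptions.
    """
    grad = np.zeros_like(a)
    grad[0] = -(a[0] - px_to_m * pixel_offset)
    return grad

def guided_denoise_step(a_k, eps_pred, grad_u, alpha, gamma, rho, sigma, rng):
    """One guided DDPM step: alpha * (a_k + rho*grad_u - gamma*eps + noise)."""
    noise = sigma * rng.standard_normal(a_k.shape)
    return alpha * (a_k + rho * grad_u - gamma * eps_pred + noise)
```

Because the guidance term enters with a positive sign, u acts as a score to be increased, which is why the toy objective is written as a negative error.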
Interaction Control
The hybrid force-impedance controller operates at 1 kHz, with a target force profile along the probe’s z-axis (perpendicular to the skin), derived from UltraDP outputs. This approach decouples lateral compliance for safe sliding from normal force regulation, using online wrench feedback for robust pose/force tracking. The architecture is thus designed to preserve both safety and scanning quality, even under model uncertainty or minor pose estimation errors.
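The decoupling described above can be illustrated with a simplified translational control law: a spring-damper impedance on the lateral axes and a PI force loop with feedforward on the probe z-axis. This is a generic textbook-style sketch under assumed sign conventions (the measured force is the contact force along the probe's +z pressing direction). The gains, the function name, and the PI structure are assumptions, not the controller reported in the source.

```python
import numpy as np

def hybrid_force_impedance(x, x_des, v, f_meas_z, f_des_z, f_int,
                           K, D, kp_f, ki_f, dt=1e-3):
    """Compute a commanded Cartesian force in the probe frame.

    x, x_des : (3,) current and desired probe position
    v        : (3,) current probe velocity
    f_meas_z : measured contact force along probe z
    f_des_z  : target contact force along probe z (from the policy)
    f_int    : running integral of the force error
    K, D     : (3, 3) stiffness and damping matrices
    dt       : control period; 1e-3 s corresponds to a 1 kHz loop
    """
    # Impedance behaviour: compliant spring-damper toward the target pose.
    f_cmd = K @ (x_des - x) - D @ v
    # Normal direction: override with feedforward plus PI force regulation.
    f_err = f_des_z - f_meas_z
    f_int = f_int + f_err * dt
    f_cmd[2] = f_des_z + kp_f * f_err + ki_f * f_int
    return f_cmd, f_int
```

In a real system the commanded force would be mapped to joint torques through the manipulator Jacobian, and orientation would be handled analogously; both are omitted here for brevity.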
Figure 3: Hardware setup showing the manipulator, force/torque sensor, camera, and ultrasound probe integration for real patient experiments.
A training dataset of 210 expert-guided scans (460k observation-action pairs, 21 diverse volunteers) and an evaluation corpus of 54k pairs from further unseen subjects underpin rigorous validation.
Comparative and Real-World Trials
UltraDP achieves a 95% success rate in transverse carotid scanning on previously unseen subjects. This is markedly superior to both behavior cloning (BC) and a tuned visual-servoing (VS) baseline, particularly in anatomical centering and low contact force/torque variance. Specifically, UltraDP maintains a mean tracking error of 0.0135 m on unseen subjects and keeps the maximum contact force below 4 N, outperforming the VS baseline in both image quality retention and force safety.
Figure 4: Snapshots and ultrasound outputs across policies. UltraDP consistently recenters the artery and recognizes bifurcations, unlike BC (which loses the anatomical landmark) or VS (which induces probe detachment or patient discomfort).
Figure 5: Probe contact force evolution during scanning: UltraDP stabilizes normal force, while VS and BC either spike force dangerously or lose scanning contact.
Ablations: Sensory Modalities and Policy Robustness
Ablating force or pose leads to dramatic policy degradation. Absence of wrench data results in probe detachment (failure to maintain contact), confirming that force information is essential for interaction stability. Removing pose data degrades landmark centering, validating the multi-modal input hypothesis.
Figure 6: Ablation of force modality: UltraDP with full input maintains predictable force and safe contact, while the no-force variant fails to track/compress the neck surface.
Subjective Usability Outcomes
Patient questionnaire data strongly favor UltraDP with respect to comfort and scan efficiency, with minimal perceived discomfort or abnormal contact, in sharp contrast to the complaints reported under BC or VS operation.
Figure 7: Aggregated subjective feedback highlights UltraDP’s significant comfort and usability lead.
Implications and Future Directions
UltraDP demonstrates that diffusion models, when guided with domain-specific anatomical priors and multimodal feedback, can act as robust visuomotor policies for challenging human-interactive robotic procedures. The critical role of force-aware action outputs and multisensory inputs is empirically affirmed, especially for tasks characterized by high interindividual variance and safety-critical physical interaction.
Practically, this indicates the feasibility of deploying learning-based ultrasound robots in general clinical settings—without per-patient retuning—provided sufficient human-expert demonstration corpora and careful data augmentation. Theoretically, the approach opens avenues for multi-modal, guidance-augmented diffusion control in broader contact-rich, human-in-the-loop or patient-specific medical robotics.
Immediate future work should scale the system to larger, more demographically variable populations to further validate its generalization, potentially incorporating real-time uncertainty quantification and adaptive guidance mechanisms. Beyond ultrasound scanning, this paradigm may extend to other robotic medical imaging or interventional tasks where spatial accuracy and compliant-force management are essential.
Conclusion
UltraDP establishes a new standard for generalizable, force-safe, and anatomically accurate robotic ultrasound scanning by leveraging multimodal data, explicit anatomical guidance, and diffusion policy learning. Its superior success rates and force regulation on unseen patients, robust ablation results, and positive user studies underline its suitability for clinical deployment and provide a blueprint for similar applications in medical AI and human-robot interaction (2511.15550).