CUPID: Curating Data your Robot Loves with Influence Functions

Published 23 Jun 2025 in cs.RO, cs.AI, and cs.LG | (2506.19121v2)

Abstract: In robot imitation learning, policy performance is tightly coupled with the quality and composition of the demonstration data. Yet, developing a precise understanding of how individual demonstrations contribute to downstream outcomes - such as closed-loop task success or failure - remains a persistent challenge. We propose CUPID, a robot data curation method based on a novel influence function-theoretic formulation for imitation learning policies. Given a set of evaluation rollouts, CUPID estimates the influence of each training demonstration on the policy's expected return. This enables ranking and selection of demonstrations according to their impact on the policy's closed-loop performance. We use CUPID to curate data by 1) filtering out training demonstrations that harm policy performance and 2) subselecting newly collected trajectories that will most improve the policy. Extensive simulated and hardware experiments show that our approach consistently identifies which data drives test-time performance. For example, training with less than 33% of curated data can yield state-of-the-art diffusion policies on the simulated RoboMimic benchmark, with similar gains observed in hardware. Furthermore, hardware experiments show that our method can identify robust strategies under distribution shift, isolate spurious correlations, and even enhance the post-training of generalist robot policies. Videos and code are made available at: https://cupid-curation.github.io.

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper’s main contribution is a scalable method that uses influence functions to causally attribute demonstration impact on closed-loop policy returns.
It employs action-level attributions to filter harmful and select beneficial training samples, outperforming traditional heuristic-based approaches.
Empirical evaluations demonstrate improved policy robustness under distribution shifts and enhanced data efficiency in various robotic tasks.

Influence-Function Data Curation for Robot Imitation Learning: The CUPID Approach

The paper “CUPID: Curating Data your Robot Loves with Influence Functions” (2506.19121) presents a formal and scalable methodology for robot demonstration dataset curation based on influence functions. The approach leverages a precise, action-level attribution mechanism to relate each demonstration’s causal effect on the downstream policy’s closed-loop expected return, facilitating both filtering harmful or redundant samples and selecting trajectories that maximize policy performance.

Motivation and Problem Statement

Policy performance in behavior cloning (BC), and more broadly in imitation learning (IL), depends intricately on not just the overall quality but the exact composition of demonstration data. Current strategies for robot data curation typically rely on task-agnostic heuristics (e.g., mutual information, classifier-based success state metrics), which do not explicitly link training data to closed-loop policy evaluation metrics and can result in misattribution—especially where spurious data correlations exist.

CUPID formalizes robot data curation as the identification of which demonstrations maximally contribute to the policy’s expected return:

Filter-$k$ Demonstrations: Remove $k$ samples from training set that degrade expected return.
Select-$k$ Demonstrations: Add $k$ demonstrations into training set to maximize policy return, given a budget.

Evaluation policies are tested via rollouts, aligning estimation with practical robot testing protocols.

Figure 1: CUPID leverages influence functions to answer counterfactual data attribution questions by precisely measuring impact on policy performance.

Influence Function Formulation for IL Policies

The core methodological advance is adapting influence function theory (originally for supervised learning, e.g. [koh2017understanding]) to attributing sequential policy decisions. For demonstration $\xi$ and a trained BC policy $\pi_\theta$, the performance influence is defined:

$I(\xi) := \frac{dJ(\pi_\theta)}{d\epsilon}\bigg|_{\epsilon=0} = -\nabla_\theta J(\pi_\theta)^\top H_{\mathrm{bc}}^{-1} \nabla_\theta \ell_{\mathrm{traj}}(\xi; \pi_\theta)$

where $J(\pi_\theta)$ is expected return from policy rollouts, $H_{\mathrm{bc}}$ the Hessian of the BC loss, and $\ell_{\mathrm{traj}}$ is trajectory-level BC loss.

Critically, this influence decomposes into a weighted sum of action influences (at the level of individual state-action pairs) over test rollouts and demonstration transitions:

$I(\xi) = \mathbb{E}_{\tau\sim p(\tau|\pi_\theta)}\left[\frac{R(\tau)}{H} \sum_{(s',a')\in\tau} \sum_{(s,a)\in\xi} a\left((s',a'),(s,a)\right)\right]$

where $a\left((s',a'),(s,a)\right) := -\nabla_\theta \log \pi_\theta(a'|s')^\top H_{\mathrm{bc}}^{-1} \nabla_\theta \ell(s,a; \pi_\theta)$.

Practical Implementation: Diffusion Policy Attribution

In modern robot policy architectures (e.g., diffusion policies [chi2023diffusion]), directly computing log-likelihood gradients is nontrivial due to the denoising generative process. CUPID approximates the log-likelihood gradient with the denoising loss and leverages the Gauss-Newton Hessian and random projection ensemble (TRAK [park2023trak]) for scalable influence estimation. Surrogate loss functions further improve correlation with actual leave-one-out changes in policy behavior.

Implementation specifics:

For each rollout (success/failure trajectory), compute $I(\xi)$ via action-level influences between all demonstration samples and rollout transitions.
The top-$k$ removal/select sets are thus chosen by ranking demonstrations by $I(\xi)$.
The estimator’s accuracy and ranking fidelity are robust for $m \geq 25$ rollouts for most tasks, scaling up to $m \geq 100$ for high-precision, complex tasks (see Figure 2 and Figure 3).

Contrast with Baselines and Ablative Benchmarks

The method is compared against several state-of-the-art and practical baselines:

DemInf [hejna2025robotdatacurationmutual]: Mutual information-based curation.
Demo-SCORE [chen2025curating]: Classifier-based state success probability.
Success similarity (custom): Embedding-based state similarity to successful rollouts.
Oracle: Ground-truth label-based curation.
Figure 4: CUPID curates demonstrations deterministically based on measured influence, strongly correlating curation with actual closed-loop evaluation.

CUPID exhibits several strong empirical findings:

Contradictory to common intuition, highest human-quality demonstrations do not always yield the best policy performance (see Figure 5, Figure 6, Figure 7).
CUPID’s causal attribution allows it to outperform quality-based methods in identifying robust strategies, especially under test-time distribution shift, and in mitigating spurious correlations (Figure 8, Figure 9).
On tasks with highly redundant data (e.g., RoboMimic LiftMH), aggressive filtering increases performance; on precision-critical tasks (e.g., Figure-8, TuckBox), it avoids the low-variance failures induced by spurious correlations and strategy mode collapse.

Scaling Properties and Computational Considerations

Influence computation is tractable for datasets up to hundreds of demonstrations (matching policy training cost), and can be batched via random projection and Hessian approximation.
For large-scale generalist models (e.g., Vision-Language-Action (VLA) policies [black2410pi0]), CUPID-curated single-task datasets improve post-training success, as shown in transfer experiments (Figure 10).
Estimator variance is suppressed for curation (ranking) compared to reinforcement learning (absolute value) tasks; a relatively small number of rollouts suffices for reliable ranking of helpful/harmful demonstrations.
Joint demonstration effects (beyond greedy top-$k$ selection) represent a future direction for optimizing set diversity and group interactions.

Flexibility: Hybrid Metrics and Policy Generalization

CUPID optionally hybridizes its influence signal with heuristic quality metrics for tasks where low variance and predictability are desirable, preserving the ability to ablate or blend signals based on problem specifics (see Figure 11, Figure 12).
The influence-function framework is architecturally agnostic, supporting attribution in diffusion models, autoregressive policies, or others supporting computation of trajectory-level or action-level gradients.

Implications and Future Directions

Theoretically, the influence-function viewpoint provides a controllable lever for counterfactual analysis (“what-if” removal/addition) of training samples, allowing precise diagnosis and repair of dataset pathologies. Practically, CUPID’s curation can improve robot data efficiency, policy robustness under distribution shift, and generalization, especially as demonstration datasets continue to scale in diversity and volume.

Further advances will focus on:

Joint set optimization for diversity.
Reducing computational overhead for large robot data corpora.
Extending beyond passive curation to active dataset explanation and collection guidance.

Conclusion

CUPID advances data curation for robot imitation learning by directly causally attributing closed-loop policy performance to individual demonstrations via scalable influence function methods. Empirical results show strong improvements over baselines, robustness to distribution shift, and effective mitigation of spurious correlations, even when using a fractional subset of the original data. This methodology is broadly applicable to curation in both single-task and generalist robotic policy training, and represents a foundational step toward trustworthy, systematic robot data valuation and selection.

Figure 5: CUPID-curated policies outperform baselines on RoboMimic mixed-quality curation benchmarks, demonstrating that perceived demonstration quality does not always maximize success.

Figure 8: Franka real-world robot policy performance demonstrates CUPID’s ability to filter brittle or spurious strategies while increasing overall robustness.