
UniPrototype Framework: Skill Transfer

Updated 6 February 2026
  • UniPrototype is a framework for human-robot skill learning that leverages unified, compositional prototypes to enable efficient knowledge transfer from human demonstrations.
  • The architecture employs a three-stage process—temporal encoding, compositional prototype discovery, and skill alignment—to robustly align human and robot motion features.
  • Experimental results in simulation and real-world scenarios demonstrate significant improvements in cross-embodiment success rates and robust performance under varied conditions.

UniPrototype is a framework for human-robot skill learning that leverages a unified, compositional prototype representation to facilitate efficient knowledge transfer from human demonstrations to robotic embodiments. Addressing the persistent issue of data scarcity in robotic manipulation, UniPrototype enables shared motion primitives and compositional skill representations, supporting robust policy learning and cross-embodiment generalization. Its primary contributions are a compositional prototype discovery mechanism with soft multi-prototype assignments, an adaptive prototype selection strategy using assignment entropy, and the demonstration of effective human-to-robot manipulation knowledge transfer across both simulation and real-world settings (Hu et al., 27 Sep 2025).

1. Framework Architecture and Knowledge Transfer Pipeline

The UniPrototype pipeline is structured into three principal stages that process unpaired human demonstration videos and robot demonstration datasets:

  • Stage 1: Temporal Skill Encoding
    • Each demonstration video is divided into overlapping clips $v_{ij}$ of length $L$.
    • A shared transformer-based encoder $f_{\text{temp}}$ maps each clip to a temporal embedding $z_{ij} = f_{\text{temp}}(v_{ij}) \in \mathbb{R}^d$.
    • Data augmentations (cropping, color, geometric changes) enforce invariance to embodiment, improving the alignment of human and robot motion features.
  • Stage 2: Compositional Prototype Discovery
    • A learnable prototype matrix $C \in \mathbb{R}^{d \times K}$ is maintained.
    • For a batch of embeddings $Z = [z_1, \ldots, z_B]$, similarity is computed as $S = C^\top Z$, and soft assignments are produced:

    $$Q_{i,k} = \frac{\exp(S_{k,i}/\tau)}{\sum_{k'} \exp(S_{k',i}/\tau)}$$

    where $\tau$ is a temperature parameter and row-wise normalization of $S^\top$ enables compositional representations (multiple co-active prototypes).
    • The encoder and prototypes are trained jointly using:
      • Prototype consistency loss $\mathcal{L}_{\text{proto}}$ (contrastive over augmentations).
      • Temporal coherence loss $\mathcal{L}_{\text{temp}}$ (time-contrastive).
    • The number of prototypes $K$ is selected via entropy-based monitoring (see Section 3).

  • Stage 3: Skill Alignment and Policy Learning

    • An attention-based Skill Alignment Module (SAM) aligns current robot observations $o_t^{\text{rob}}$ to the next prototype in the human-extracted sequence.
    • A diffusion policy $\pi(a_t \mid s_t, z_t)$ is trained, conditioned on the compositional embedding $z_t$, to produce robot actions via iterative denoising.
    • At inference, human demonstration encodings are aligned and rolled out by the diffusion policy to execute the corresponding robotic behavior.
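Stage 1's sliding-window clip extraction can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the stride value and the toy single-pixel "video" are assumptions for demonstration (the paper specifies only overlapping clips of length $L$).

```python
import numpy as np

def extract_clips(video: np.ndarray, clip_len: int, stride: int) -> list:
    """Slice a (T, H, W, C) video into overlapping clips of length clip_len."""
    n_frames = video.shape[0]
    clips = []
    for start in range(0, n_frames - clip_len + 1, stride):
        clips.append(video[start:start + clip_len])
    return clips

# Toy 10-frame "video" of 1x1 single-channel frames, frame t holding value t.
video = np.arange(10).reshape(10, 1, 1, 1)
clips = extract_clips(video, clip_len=4, stride=2)
print(len(clips))            # -> 4 overlapping clips (starts 0, 2, 4, 6)
print(clips[0].flatten())    # -> [0 1 2 3], frames of the first clip
```

With stride smaller than `clip_len`, consecutive clips share frames, which is what lets the temporal encoder see smooth transitions between skill segments.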

2. Compositional Prototype Discovery Mechanism

The compositional prototype module enables flexible, multi-prototype activation for each encoded skill segment:

  • Soft Assignment
    • For a batch $Z \in \mathbb{R}^{d \times B}$ and prototype matrix $C \in \mathbb{R}^{d \times K}$:
    • Similarity: $S = C^\top Z$, with $S_{k,i} = c_k \cdot z_i$.
    • Assignment: $Q_{i,k} = \frac{\exp(S_{k,i}/\tau)}{\sum_{k'} \exp(S_{k',i}/\tau)}$ (row-normalized over $S^\top$).
    • This mechanism contrasts with hard clustering (e.g., Sinkhorn assignment), where only one prototype is active per embedding. Soft assignment allows hierarchical and blended skill representations.
  • Losses

    • Prototype Consistency:

    $$\mathcal{L}_{\text{proto}} = - \sum_{i=1}^{B} \sum_{k=1}^{K} q_{i,k}^{(1)} \log p_{i,k}^{(2)}$$

    where $q^{(1)}$ and $p^{(2)}$ are assignments from two augmentations of the same clip.

    • Temporal Coherence:

    $$\mathcal{L}_{\text{temp}} = - \sum_{i,j} \mathbb{1}[j = i+\delta] \log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau_t)}{\sum_{\ell} \exp(\mathrm{sim}(z_i, z_\ell)/\tau_t)}$$

    enforcing temporal smoothness in skill representations.

3. Adaptive Prototype Selection via Entropy

To ensure scalability and alignment with task complexity, UniPrototype employs an entropy-based strategy for adaptive prototype selection:

  • The average prototype activation is:

$$\bar{p}_k = \frac{1}{N} \sum_{i} Q_{i,k}$$

  • Assignment entropy for $K$ prototypes:

$$H(K) = - \frac{1}{K} \sum_{k=1}^{K} \bar{p}_k \log \bar{p}_k$$

  • $K$ is increased until the entropy increment $\Delta H(K) = |H(K+\Delta K) - H(K)|$ falls below a threshold $\theta$; the smallest $K^*$ achieving this is selected, avoiding unnecessary overparameterization.
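The entropy criterion can be sketched as follows. This is a minimal illustration, not the paper's code: the uniform toy assignments and the `Q_for_K` callback are assumptions made so the stopping rule is easy to check by hand (for uniform assignments, $\bar{p}_k = 1/K$ and $H(K) = \log(K)/K$).

```python
import numpy as np

def assignment_entropy(Q: np.ndarray) -> float:
    """H(K) = -(1/K) * sum_k p_bar_k * log p_bar_k,
    with p_bar_k the mean activation of prototype k over N assignments."""
    p_bar = Q.mean(axis=0)
    K = Q.shape[1]
    return float(-(p_bar * np.log(p_bar + 1e-12)).sum() / K)

def select_K(Q_for_K, K_vals, theta: float) -> int:
    """Return the smallest K whose entropy increment falls below theta.
    Q_for_K(K) yields an assignment matrix for a vocabulary of size K."""
    prev_H = None
    for K in K_vals:
        H = assignment_entropy(Q_for_K(K))
        if prev_H is not None and abs(H - prev_H) < theta:
            return K
        prev_H = H
    return K_vals[-1]  # fall back to the largest candidate

def uniform_Q(K: int) -> np.ndarray:
    """Toy stand-in for training at size K: perfectly uniform assignments."""
    return np.full((100, K), 1.0 / K)

assert np.isclose(assignment_entropy(uniform_Q(4)), np.log(4) / 4)
print(select_K(uniform_Q, K_vals=[4, 8, 16, 32], theta=0.09))  # -> 8
```

With these uniform assignments the successive entropies are log(4)/4 ≈ 0.347 and log(8)/8 ≈ 0.260, so the first increment (≈ 0.087) already falls below θ = 0.09 and the search stops at K = 8; in practice `Q_for_K` would involve retraining the prototypes at each candidate size, which is the cost noted in Section 7.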

Adaptive $K^*$ analysis in experiments showed task-dependent prototype counts: simple tasks ($K^* \approx 48$–$72$), tool use ($K^* \approx 84$–$108$), multi-step tasks ($K^* \approx 96$–$132$), and complex tasks ($K^* \approx 120$–$156$), with higher entropies corresponding to more complex behaviors.

4. Training and Inference Workflow

The training process comprises prototype discovery and policy learning, as summarized below:

Inputs:
  Human demos 𝒟ʰ, Robot demos 𝒟ʳ
  K_vals = [K1, K2, ...], threshold θ, ΔK

Preprocessing:
  For each video in 𝒟ʰ ∪ 𝒟ʳ:
    sample M clips vᵢⱼ (sliding window, length L)
    apply augmentations on the fly

// Prototype discovery
for K in K_vals:
  Initialize prototypes C ∈ ℝᵈˣᴷ
  for epoch = 1..E:
    for batch {vᵢ}:
      zᵢ = fₜₑₘₚ(vᵢ)
      S = Cᵀ Z
      Q = row_norm(exp(S/τ))
      L_proto ← contrastive assignment loss using Q
      L_temp ← temporal coherence over zᵢ
      L_total = L_proto + λ L_temp
      backprop(L_total)
  evaluate average assignment p̄_k, compute H(K)
  if |H(K) − H(prev K)| < θ:
    K* = K; break

// Diffusion policy training
Extract {zₜ} from robot demos
Train π(a_{1:H} | s, z) via denoising score matching

Return: fₜₑₘₚ, C with K*, diffusion policy π, SAM

At test time, human demonstration encodings are mapped via $f_{\text{temp}}$ and $Q$, aligned with SAM, and executed by the trained diffusion policy.

5. Experimental Setups and Evaluation Metrics

Experiments were conducted in both simulated and real-world conditions:

| Setting | Task Types | Metrics | Baselines |
| --- | --- | --- | --- |
| RLBench | 100 manipulation tasks | Success rate (%) | GCD Policy, GCD+TCN, XSkill |
| Real-world | Table wiping, grasp/place, drawer, spatula flipping | Success rate (%) & robustness (clutter, lighting, position shifts) | GCD Policy, XSkill |

Simulated manipulations covered tasks such as emptying dishwashers, closing boxes, and peg insertion, with evaluation at multiple robot execution speeds (×1.0, ×2.0). Real-world tests used a Franka Emika Panda arm over varied object identities and spatial conditions, measuring both task success and robustness.

6. Quantitative and Qualitative Performance

UniPrototype demonstrated state-of-the-art transfer and execution robustness:

Simulation (RLBench) Cross-Embodiment Success Rates (%):

| Method | Same-speed | Cross-speed ×1.0 | Cross-speed ×2.0 |
| --- | --- | --- | --- |
| GCD Policy | 68.3 ± 2.1 | 12.4 ± 1.8 | 4.1 ± 0.9 |
| GCD+TCN | 71.2 ± 1.9 | 24.7 ± 2.3 | 11.6 ± 1.7 |
| XSkill | 84.6 ± 1.5 | 78.2 ± 1.8 | 52.3 ± 2.4 |
| UniPrototype | 91.3 ± 1.2 | 87.5 ± 1.4 | 71.2 ± 2.0 |

Real-World Success Rates (%):

| Task | GCD Policy | XSkill | UniPrototype |
| --- | --- | --- | --- |
| Table Wiping | 20.8 ± 4.2 | 45.8 ± 5.1 | 70.8 ± 4.5 |
| Cup Grasping | 41.7 ± 4.8 | 66.7 ± 4.7 | 83.3 ± 3.7 |
| Drawer Retrieval | 16.7 ± 3.8 | 50.0 ± 5.1 | 75.0 ± 4.4 |
| Tool Use (Spatula) | 25.0 ± 4.4 | 54.2 ± 5.1 | 79.2 ± 4.1 |
| Average | 26.1 ± 4.3 | 54.2 ± 5.0 | 77.1 ± 4.2 |

Ablation studies confirm the importance of each contribution: replacing soft row normalization with hard Sinkhorn assignment reduces cross-embodiment success by approximately 11 points; using a fixed $K = 128$ decreases performance by 4.5 points; and omitting temporal coherence or compositional alignment each degrades performance by 3–5 points.

Qualitative analyses (t-SNE, prototype-activation timelines) corroborate the effective compositional encoding of skill segments and the overlap of human-robot embedding trajectories, including during transition intervals (e.g., "lift + rotate" in pouring tasks).

7. Limitations and Prospects

Key limitations include increased computational and memory overhead, relative to single-assignment clustering models, due to the entropy-based $K$ search and multi-prototype assignment. Experiments are limited to semi-structured laboratory environments; empirical validation in unstructured, real-world scenarios remains open.

Proposed directions for future research are:

  • Online refinement of the prototype vocabulary as new demonstrations become available.
  • Human-interpretable semantic grounding of discovered motion prototypes.
  • Extension of the framework to multi-agent coordination and deformable-object manipulation scenarios.
  • Improved computational efficiency for the adaptive prototype selection process.

UniPrototype establishes a compositional, entropy-adapted prototype vocabulary that demonstrably bridges human and robotic skill domains, yielding marked improvements in sample efficiency and robustness across a range of manipulation tasks (Hu et al., 27 Sep 2025).
