
AtomSkill Framework for Robotic Manipulation

Updated 3 January 2026
  • AtomSkill is a multi-task imitation learning framework that uses semantically grounded, variable-length atomic skills to address noisy, multi-modal robotic demonstrations.
  • It employs contrastive clustering and vision–language annotation for precise segmentation and semantic alignment of skills from demonstration trajectories.
  • A diffusion-based keypose imagination module enables robust long-horizon planning and composable action generation for versatile robotic manipulation tasks.

AtomSkill is a multi-task imitation learning framework developed to address the challenges of scalable robot manipulation across diverse tasks. Conventional imitation learning approaches excel in single-task domains but encounter performance degradation in multi-task settings due to demonstration noise, behavioral multi-modality, suboptimal skill segmentations, and limited abstraction for long-horizon planning. AtomSkill advances the domain by learning a structured Atomic Skill Space, enabling compositionality and semantic skill reuse without reliance on fixed-length segmentation or environment-specific priors. Its core contributions include the discovery of semantically grounded variable-length atomic skills, contrastive clustering for temporal and semantic coherence, and a Keypose Imagination module for robust chaining and action generation (Zhu et al., 20 Dec 2025).

1. Framework Architecture and Motivation

AtomSkill processes robot demonstrations comprising observation-action sequences and natural language instructions. The architecture is motivated by three challenges: noisy and multi-modal demonstrations, ambiguity from fixed-length skill segmentation, and a lack of long-horizon planning abstractions in prior skill-based methods. To address these issues, key components of the workflow are:

  • Skill segmentation: Demonstrations are partitioned into non-overlapping segments \{\tau_1,\dots,\tau_n\} at timesteps where the binary gripper state changes, ensuring segments align with meaningful contact events and atomic skill boundaries.
  • Semantic annotation: A large pre-trained vision–language model provides natural-language labels (e.g., “grasp,” “place”) for each segment, facilitating semantic grounding.
  • Skill encoding and compression: Variable-length action sequences are encoded by a VQ-VAE-style latent encoder \phi_\theta into n latent tokens, which are quantized against a discrete codebook E=\{e_k\}_{k=1}^K.
  • Skill sampling for planning: A diffusion-based prior \rho_\theta enables high-level skill planning by sampling from the atomic skill space during inference.
  • Action decoding with keypose imagination: The action decoder \psi_\theta incorporates current observations, the skill embedding, and a “keypose” token to jointly predict short-term action sequences and long-horizon terminal keyposes.

This design facilitates efficient skill chaining and composability, with skill boundaries and transitions mediated by spatial proximity to predicted keyposes, rather than handcrafted heuristics.

2. Semantically Grounded Atomic Skill Library

The construction of a skill library within AtomSkill involves a multi-stage pipeline:

2.1 Gripper-State Keyframe Detection

A demonstration trajectory \tau = \{(O_t, a_t)\}_{t=1}^T with instruction L is segmented at time points where the gripper state flips (open/close), yielding variable-length segments \tau_i. This mechanism aligns abstract skill representations with interaction events, supporting robust skill discovery.
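Assuming actions are stored as arrays whose last channel carries the binary gripper command (a hypothetical layout; the function name is illustrative), the keyframe rule above can be sketched as:

```python
import numpy as np

def segment_at_gripper_flips(actions, gripper_dim=-1):
    """Split a demonstration into variable-length segments at the
    timesteps where the binary gripper channel changes state.

    actions: (T, D) array; by assumption the last column holds the
    gripper bit. Returns a list of (start, end) index pairs
    covering [0, T).
    """
    gripper = actions[:, gripper_dim] > 0.5                 # binarize channel
    flips = np.flatnonzero(gripper[1:] != gripper[:-1]) + 1  # state-change indices
    bounds = [0, *flips.tolist(), len(actions)]
    return list(zip(bounds[:-1], bounds[1:]))

# Toy trajectory: the gripper closes at t=3 and reopens at t=7.
acts = np.zeros((10, 4))
acts[3:7, -1] = 1.0
print(segment_at_gripper_flips(acts))  # [(0, 3), (3, 7), (7, 10)]
```

Each returned segment is one candidate atomic skill; its length varies with the demonstration rather than being fixed in advance.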

2.2 Vision–Language Annotation

Each segment \tau_i is annotated by querying a vision–language model (e.g., Qwen2–VL) with segment images and the global instruction L. The model generates segment descriptions L_{s_i} and discrete semantic labels s_i \in \mathcal{S}, yielding a labeled skill dataset \{(\tau_i, L_{s_i}, s_i)\}.

2.3 Contrastive Clustering in Skill Embedding Space

Continuous skill embeddings z_e = (z_e^1, \dots, z_e^n) are quantized against the codebook. Quantization is trained with the standard VQ objective, where sg(\cdot) denotes the stop-gradient operator and \lambda weights the commitment term:

\mathcal{L}_{VQ} = \|sg(z_e) - z_q\|_2^2 + \lambda \|z_e - sg(z_q)\|_2^2

  • Supervised contrastive objectives: Embeddings are clustered with respect to both intra-skill temporal coherence and inter-task semantic alignment:
    • Temporal contrastive loss \mathcal{L}_{temp} for consistency across token positions within a skill.
    • Semantic contrastive loss \mathcal{L}_{skill} for clustering across tasks sharing semantic labels.
    • Total contrastive loss: \mathcal{L}_{contrastive} = \mathcal{L}_{temp} + \mathcal{L}_{skill}.

This dual-objective structure yields a compact, discrete codebook facilitating generalization and skill reuse across tasks.
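A minimal numpy sketch of the codebook lookup and the value of \mathcal{L}_{VQ}. The stop-gradients only matter under autodiff, so this illustrates the numerics rather than the training graph; K=32 and n=8 follow the hyperparameters reported below, and the function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(z_e, codebook):
    """Nearest-neighbour lookup: map each continuous token in z_e (n, d)
    to its closest entry in the codebook (K, d)."""
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, K)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

def vq_loss(z_e, z_q, lam=0.25):
    """Value of L_VQ. Under autodiff, sg(.) routes gradients so the
    first term updates the codebook and the second (commitment) term
    updates the encoder; numerically the value is
    (1 + lam) * ||z_e - z_q||^2."""
    codebook_term = ((z_e - z_q) ** 2).sum()
    commit_term = ((z_e - z_q) ** 2).sum()
    return codebook_term + lam * commit_term

codebook = rng.normal(size=(32, 8))   # K = 32 entries
z_e = rng.normal(size=(8, 8))         # n = 8 latent tokens per skill
z_q, idx = quantize(z_e, codebook)
print(vq_loss(z_e, z_q))
```

The contrastive terms are then computed over batches of such token embeddings, grouped by skill label and token position.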

3. Action Generation and Skill Chaining via Keypose Imagination

The action decoder \psi_\theta is designed to handle multi-modal inputs:

  • Inputs: Multi-view images I_t, proprioceptive state p_t, task instruction L, and the current discrete skill token sequence z_q.
  • Architecture: Cross-attention modules and temporal self-attention enhance context-integration and sequential dependency modeling.
  • Outputs:
    • Action chunk head: Predicts actions (\hat{a}_t, \dots, \hat{a}_{t+H-1}) for an action horizon H, with reconstruction loss \mathcal{L}_a.
    • Keypose head: Predicts terminal action \hat{a}_{keypose}, providing long-horizon intent, trained with keypose loss \mathcal{L}_{keypose}.

During inference, skill embeddings sampled from \rho_\theta determine the next skill to execute, and action chunk execution proceeds until the predicted action is within \epsilon of the keypose, capturing the natural skill transition boundary.
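The \epsilon-based transition test can be written as a small helper; the Euclidean metric and the value of eps here are illustrative assumptions:

```python
import numpy as np

def reached_keypose(action, keypose, eps=0.02):
    """Return True when the executed action is within eps of the
    predicted terminal keypose (Euclidean distance), signalling a
    transition to the next skill in the chain."""
    dist = float(np.linalg.norm(np.asarray(action) - np.asarray(keypose)))
    return dist < eps

# Far from the keypose: keep executing the current skill's action chunk.
print(reached_keypose([0.10, 0.20, 0.30], [0.50, 0.20, 0.30]))   # False
# Within eps: hand control to the next sampled skill.
print(reached_keypose([0.499, 0.20, 0.30], [0.50, 0.20, 0.30]))  # True
```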

4. Training and Inference Algorithms

4.1 Training Procedure

AtomSkill training proceeds in two stages according to the following algorithmic outline:

for epoch in 1…N:
  for each demonstration τ with instruction L:
    segment at gripper-state changes → {τᵢ, sᵢ}
    for each segment τᵢ:
      a_seg ← actions in τᵢ
      a_resampled ← Resample(a_seg, length H)
      zₑ ← φθ(a_resampled)
      z_q ← quantize(zₑ, codebook E)
      ŷ, ŷ_key ← ψθ(Oₜ_start, z_q)
      compute ℒ_VQ(zₑ, z_q)
      compute ℒ_a(a_resampled, ŷ)
      compute ℒ_keypose(a_seg_end, ŷ_key)
      collect zₑ embeddings for contrastive sets
    accumulate ℒ_contrastive over batch
    backprop ℒ = ℒ_VQ + β₁ℒ_a + β₂ℒ_contrastive + αℒ_keypose
// Train diffusion sampler:
for k in 1…K_steps:
  sample skill embeddings z from codebook
  add noise → zᵏ
  predict εθ(zᵏ, k, o, s)
  backprop ℒ_sampler

Full objective:

\mathcal{L} = \mathcal{L}_{VQ} + \beta_1 \mathcal{L}_a + \beta_2 \mathcal{L}_{contrastive} + \alpha \mathcal{L}_{keypose} + \gamma \mathcal{L}_{sampler}
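The objective is a plain weighted sum; the sketch below uses the weights reported in Section 5.1 (\gamma is not reported, so it defaults to 1 here purely for illustration):

```python
def total_loss(l_vq, l_a, l_contrastive, l_keypose, l_sampler,
               beta1=1.0, beta2=1e-2, alpha=1.0, gamma=1.0):
    """Weighted sum of the five training terms. beta1, beta2, and
    alpha follow the reported hyperparameters; gamma is an assumed
    placeholder value."""
    return (l_vq + beta1 * l_a + beta2 * l_contrastive
            + alpha * l_keypose + gamma * l_sampler)

# Example: the contrastive term is heavily down-weighted (beta2 = 0.01).
print(total_loss(0.5, 0.4, 2.0, 0.1, 0.3))  # 1.32
```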

4.2 Inference Workflow

Inference proceeds in iterative skill selection and execution:

t = 1; read initial observation O
while t < T_max:
  sample z_h ~ ρθ(Noise, s, Oₜ)
  z_q = quantize(z_h, E)
  repeat:
    ŷ, ŷ_key = ψθ(Oₜ, z_q)
    execute ŷ[1] on robot; shift ŷ window; t += 1
    if dist(ŷ[1], ŷ_key) < ϵ:
      break  // transition to next skill
    Oₜ = read new observation

Skill transitions are governed by proximity of executed actions to predicted keyposes, supporting robust chaining.

5. Implementation Specifications and Experimental Results

5.1 Network Design and Hyperparameters

  • Encoder \phi_\theta: 1D-CNN layers with 6-layer self-attention, outputting n=8 tokens.
  • Codebook: K=32 entries, commitment weight \lambda=0.25.
  • Decoder \psi_\theta: Cross-attention module with 7 layers, 8 heads; action horizon H=32.
  • Diffusion sampler \epsilon_\theta: CNN-based U-Net with FiLM conditioning.
  • Optimization: Learning rate 10^{-4}, weight decay 10^{-5}, batch size 256, loss weights \beta_1=1, \beta_2=10^{-2}, \alpha=1.

5.2 Empirical Evaluation

AtomSkill demonstrates superior quantitative performance in multi-task robotic manipulation:

Setting                         | ATP  | SR (%) | Comparison Benchmarks
RLBench (6 tasks)               | 0.68 | 67.2   | DP: 0.54/37.2, ACT: 0.55/46.7, VQ-BeT: 0.10/5.0, QueST: 0.39/30.0
Real-world bimanual (300 demos) | 0.60 | —      | ACT: 0.34, RDT: 0.28

Ablation studies reveal the necessity of contrastive losses (ATP drops to 0.33 without \mathcal{L}_{temp} and \mathcal{L}_{skill}), and keypose imagination yields significant performance gains, especially on spatially localized tasks (ATP up from 0.61 to 0.68, SR up from 53.9% to 67.2%).

This suggests that semantically grounded, temporally coherent atomic skills coupled with keypose-conditioned action decoding materially improve composability and robustness in multi-task manipulation.
