AtomSkill Framework for Robotic Manipulation
- AtomSkill is a multi-task imitation learning framework that uses semantically grounded, variable-length atomic skills to address noisy, multi-modal robotic demonstrations.
- It employs contrastive clustering and vision–language annotation for precise segmentation and semantic alignment of skills from demonstration trajectories.
- A diffusion-based keypose imagination module enables robust long-horizon planning and composable action generation for versatile robotic manipulation tasks.
AtomSkill is a multi-task imitation learning framework developed to address the challenges of scalable robot manipulation across diverse tasks. Conventional imitation learning approaches excel in single-task domains but encounter performance degradation in multi-task settings due to demonstration noise, behavioral multi-modality, suboptimal skill segmentations, and limited abstraction for long-horizon planning. AtomSkill advances the domain by learning a structured Atomic Skill Space, enabling compositionality and semantic skill reuse without reliance on fixed-length segmentation or environment-specific priors. Its core contributions include the discovery of semantically grounded variable-length atomic skills, contrastive clustering for temporal and semantic coherence, and a Keypose Imagination module for robust chaining and action generation (Zhu et al., 20 Dec 2025).
1. Framework Architecture and Motivation
AtomSkill processes robot demonstrations comprising observation-action sequences and natural language instructions. The architecture is motivated by three challenges: noisy and multi-modal demonstrations, ambiguity from fixed-length skill segmentation, and a lack of long-horizon planning abstractions in prior skill-based methods. To address these issues, key components of the workflow are:
- Skill segmentation: Demonstrations are partitioned into non-overlapping segments at the time points where the binary gripper state changes, ensuring segments align with meaningful contact events and atomic skill boundaries.
- Semantic annotation: A large pre-trained vision–language model provides natural-language labels (e.g., “grasp,” “place”) for each segment, facilitating semantic grounding.
- Skill encoding and compression: Variable-length action sequences are encoded by a VQ-VAE-style latent encoder φθ into latent tokens zₑ, which are quantized against a discrete codebook E.
- Skill sampling for planning: A diffusion-based prior enables high-level skill planning by sampling from the atomic skill space during inference.
- Action decoding with keypose imagination: The action decoder incorporates current observations, skill embedding, and a “keypose” token to jointly predict short-term action sequences and long-horizon terminal keyposes.
This design facilitates efficient skill chaining and composability, with skill boundaries and transitions mediated by spatial proximity to predicted keyposes, rather than handcrafted heuristics.
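The quantization step in the skill-encoding component amounts to a nearest-neighbor lookup against the codebook. A minimal numpy sketch (the actual encoder and codebook are trained with the VQ-VAE machinery described below):

```python
import numpy as np

def quantize(z_e: np.ndarray, codebook: np.ndarray):
    """Map each continuous latent token to its nearest codebook entry.

    z_e: (num_tokens, dim) continuous skill tokens from the encoder.
    codebook: (num_entries, dim) discrete atomic-skill codebook E.
    Returns the quantized tokens and their codebook indices.
    """
    # Squared Euclidean distance between every token and every code.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z_q, idx = quantize(np.array([[0.1, -0.1], [0.9, 1.2]]), codebook)
# idx → [0, 1]
```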
2. Semantically Grounded Atomic Skill Library
The construction of a skill library within AtomSkill involves a multi-stage pipeline:
2.1 Gripper-State Keyframe Detection
The demonstration trajectory τ with instruction L is segmented at the time points where the gripper state flips (open/close), yielding variable-length segments {τᵢ}. This mechanism aligns abstract skill representations with interaction events, supporting robust skill discovery.
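A minimal sketch of the gripper-state keyframe detection, assuming a binary open/close signal per timestep:

```python
def segment_at_gripper_flips(gripper_states):
    """Split a demonstration into variable-length segments at every
    open/close transition of the binary gripper state.

    gripper_states: sequence of 0/1 values, one per timestep.
    Returns (start, end) index pairs, end exclusive.
    """
    segments, start = [], 0
    for t in range(1, len(gripper_states)):
        if gripper_states[t] != gripper_states[t - 1]:  # keyframe: state flip
            segments.append((start, t))
            start = t
    segments.append((start, len(gripper_states)))
    return segments

# A reach-grasp-move-place trajectory: open, open, close, close, close, open
print(segment_at_gripper_flips([0, 0, 1, 1, 1, 0]))  # → [(0, 2), (2, 5), (5, 6)]
```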
2.2 Vision–Language Annotation
Each segment τᵢ is annotated by querying a vision–language model (e.g., Qwen2-VL) with the segment's images and the global instruction L. The model generates segment descriptions and discrete semantic labels sᵢ, yielding a labeled skill dataset {(τᵢ, sᵢ)}.
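The annotation step might be wrapped as follows; `query_vlm`, the prompt wording, and the label vocabulary are illustrative assumptions, not the paper's exact interface:

```python
def annotate_segment(query_vlm, segment_images, instruction,
                     label_set=("grasp", "place", "move", "open", "close")):
    """Ask a vision-language model to describe a segment and pick a
    discrete semantic label from a fixed vocabulary.

    query_vlm is a hypothetical callable wrapping the VLM API
    (e.g. Qwen2-VL); prompt format and label set are assumptions.
    """
    prompt = (
        f"Task instruction: {instruction}\n"
        f"Describe what the robot does in these frames, then answer with "
        f"exactly one label from: {', '.join(label_set)}."
    )
    response = query_vlm(images=segment_images, prompt=prompt)
    # Take the last vocabulary word that appears in the response.
    label = next((w.strip(".,") for w in reversed(response.lower().split())
                  if w.strip(".,") in label_set), None)
    return response, label

# Stubbed VLM for illustration only.
desc, label = annotate_segment(
    lambda images, prompt: "The gripper closes around the red block: grasp.",
    segment_images=[], instruction="put the red block in the bowl")
# label → "grasp"
```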
2.3 Contrastive Clustering in Skill Embedding Space
Continuous skill embeddings zₑ are quantized into the codebook E. To enforce the desired properties:
- VQ-VAE loss: ℒ_VQ = ‖sg[zₑ] − z_q‖₂² + β‖zₑ − sg[z_q]‖₂², where sg[·] denotes the stop-gradient operator and β weights the commitment term.
- Supervised contrastive objectives: Embeddings are clustered with respect to both intra-skill temporal coherence and inter-task semantic alignment:
- Temporal contrastive loss ℒ_temporal for consistency across token positions within a skill.
- Semantic contrastive loss ℒ_semantic for clustering across tasks sharing semantic labels.
- Total contrastive loss: ℒ_contrastive = ℒ_temporal + ℒ_semantic.
This dual-objective structure yields a compact, discrete codebook facilitating generalization and skill reuse across tasks.
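The semantic contrastive term can be illustrated with a SupCon-style objective over labeled embeddings. This is a numpy sketch under the assumption of a standard supervised contrastive formulation; the paper's exact loss may differ:

```python
import numpy as np

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Pull together skill embeddings that share a semantic label across
    tasks; push apart the rest.

    z: (n, d) embeddings; labels: (n,) integer semantic labels.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    eye = np.eye(n, dtype=bool)
    # Exclude self-similarity, then log-softmax over the remaining samples.
    logits = np.where(eye, -np.inf, sim)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = (np.asarray(labels)[:, None] == np.asarray(labels)[None, :]) & ~eye
    # Mean log-probability of positives per anchor, negated.
    per_anchor = np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return -per_anchor.mean()

z = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]])
loss = supervised_contrastive_loss(z, np.array([0, 0, 1, 1]))
```

Embeddings with matching labels that already lie close produce a small loss; mismatched label assignments drive it up, which is the clustering pressure described above.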
3. Action Generation and Skill Chaining via Keypose Imagination
The action decoder is designed to handle multi-modal inputs:
- Inputs: Multi-view images Oₜ, proprioceptive state s, task instruction L, and the current discrete skill token sequence z_q.
- Architecture: Cross-attention modules and temporal self-attention enhance context integration and sequential-dependency modeling.
- Outputs:
- Action chunk head: Predicts an action chunk ŷ over an action horizon H, with reconstruction loss ℒ_a.
- Keypose head: Predicts the terminal action ŷ_key of the current skill, providing long-horizon intent, trained with keypose loss ℒ_keypose.
During inference, skill embeddings sampled from the diffusion prior ρθ determine the next skill to execute, and action chunk execution proceeds until the predicted action is within ϵ of the keypose, capturing the natural skill transition boundary.
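The transition rule reduces to a simple distance test; which pose dimensions are compared and the threshold value are illustrative assumptions here:

```python
import numpy as np

def keypose_reached(action, keypose, eps=0.02, pos_dims=slice(0, 3)):
    """Skill-transition test: the current skill terminates once the
    executed action comes within eps of the predicted keypose.

    Only the positional part of the action is compared in this sketch;
    the full comparison and the eps value are assumptions.
    """
    a = np.asarray(action)[pos_dims]
    k = np.asarray(keypose)[pos_dims]
    return bool(np.linalg.norm(a - k) < eps)

print(keypose_reached([0.30, 0.10, 0.20], [0.31, 0.10, 0.20]))  # → True
```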
4. Training and Inference Algorithms
4.1 Training Procedure
AtomSkill training proceeds in two stages according to the following algorithmic outline:
```
for epoch in 1…N:
  for each demonstration τ with instruction L:
    segment at gripper-state changes → {τᵢ, sᵢ}
    for each segment τᵢ:
      a_seg ← actions in τᵢ
      a_resampled ← Resample(a_seg to length H)
      zₑ ← φθ(a_resampled)
      z_q ← quantize(zₑ, codebook E)
      ŷ, ŷ_key ← ψθ(Oₜ_start, z_q)
      compute ℒ_VQ(zₑ, z_q)
      compute ℒ_a(a_resampled, ŷ)
      compute ℒ_keypose(a_seg_end, ŷ_key)
      collect z embeddings for contrastive sets
    accumulate ℒ_contrastive over batch
    backprop ℒ = ℒ_VQ + β₁ℒ_a + β₂ℒ_contrastive + αℒ_keypose

# Train diffusion sampler:
for k in 1…K_steps:
  sample skill embeddings z⁰ from codebook
  add noise → zᵏ
  predict εθ(zᵏ, k, o, s)
  backprop ℒ_sampler
```
Full objective: ℒ = ℒ_VQ + β₁ℒ_a + β₂ℒ_contrastive + αℒ_keypose, with the diffusion sampler loss ℒ_sampler optimized in a separate second stage.
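The `Resample(a_seg to length H)` step can be realized, for example, by per-dimension linear interpolation; this is an assumption, as the resampling scheme is not specified here:

```python
import numpy as np

def resample_actions(a_seg: np.ndarray, H: int) -> np.ndarray:
    """Resample a variable-length action segment to a fixed length H so
    the VQ-VAE encoder sees a uniform input size.

    a_seg: (T, d) action sequence; returns an (H, d) sequence.
    """
    T, d = a_seg.shape
    src = np.linspace(0.0, 1.0, T)   # original timestamps, normalized
    dst = np.linspace(0.0, 1.0, H)   # target timestamps
    return np.stack([np.interp(dst, src, a_seg[:, j]) for j in range(d)], axis=1)

a = np.array([[0.0, 1.0], [1.0, 3.0]])   # 2-step segment, 2-D actions
print(resample_actions(a, 3))            # midpoint row [0.5, 2.0] inserted
```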
4.2 Inference Workflow
Inference proceeds in iterative skill selection and execution:
```
t = 1; read initial observation O₁
while t < T_max:
  sample z_h ~ ρθ(Noise, s, Oₜ)
  z_q = quantize(z_h, E)
  repeat:
    ŷ, ŷ_key = ψθ(Oₜ, z_q)
    execute ŷ[1] on robot; shift ŷ window; t += 1
    if dist(ŷ[1], ŷ_key) < ϵ:
      break  # transition to next skill
  Oₜ = read new observation
```
Skill transitions are governed by proximity of executed actions to predicted keyposes, supporting robust chaining.
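The `sample z_h ~ ρθ(Noise, s, Oₜ)` step follows a standard diffusion reverse process. A minimal DDPM-style loop with a stand-in for the conditioned noise predictor εθ(zᵏ, k, o, s):

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Minimal DDPM reverse loop sketching how a diffusion prior draws a
    skill embedding from noise; eps_model(z, k) stands in for the
    observation-conditioned noise predictor.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    z = rng.standard_normal(shape)  # z^K ~ N(0, I)
    for k in reversed(range(len(betas))):
        eps = eps_model(z, k)
        # Posterior mean of z^{k-1} given z^k and the predicted noise.
        z = (z - betas[k] / np.sqrt(1.0 - alpha_bar[k]) * eps) / np.sqrt(alphas[k])
        if k > 0:  # no noise is added at the final step
            z += np.sqrt(betas[k]) * rng.standard_normal(shape)
    return z

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)
# Zero noise predictor, purely to exercise the loop shape.
z0 = ddpm_sample(lambda z, k: np.zeros_like(z), shape=(2,), betas=betas, rng=rng)
```

In AtomSkill the sampled embedding would then be quantized against the codebook before being handed to the action decoder, as in the inference pseudocode above.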
5. Implementation Specifications and Experimental Results
5.1 Network Design and Hyperparameters
- Encoder φθ: 1D-CNN layers followed by 6-layer self-attention, producing the latent skill tokens zₑ.
- Codebook: discrete codebook E with a fixed number of entries and a commitment term weighted by β.
- Decoder ψθ: cross-attention module with 7 layers and 8 heads; action horizon H.
- Diffusion sampler ρθ: CNN-based U-Net with FiLM conditioning.
- Optimization: batch size 256; learning rate, weight decay, and the loss weights β₁, β₂, α as used in the full training objective.
5.2 Empirical Evaluation
AtomSkill demonstrates superior quantitative performance in multi-task robotic manipulation:
| Setting | ATP | SR (%) | Baselines (ATP/SR) |
|---|---|---|---|
| RLBench (6 tasks) | 0.68 | 67.2 | DP: 0.54/37.2, ACT: 0.55/46.7, VQ-BeT: 0.10/5.0, QueST: 0.39/30.0 |
| Real-world bimanual (300 demos) | 0.60 | — | ACT: 0.34, RDT: 0.28 |
Ablation studies reveal the necessity of the contrastive losses (ATP drops to 0.33 without ℒ_temporal and ℒ_semantic), and keypose imagination yields significant performance gains, especially on spatially localized tasks (ATP up from 0.61 to 0.68, SR up from 53.9% to 67.2%).
This suggests that semantically grounded, temporally coherent atomic skills coupled with keypose-conditioned action decoding materially improve composability and robustness in multi-task manipulation.