SPECI: Hierarchical Continual Imitation Learning
- The paper introduces a hierarchical policy architecture that decomposes imitation learning into perception, skill inference, and action execution.
- It employs an expandable skill codebook with attention-based selection to transfer knowledge across tasks and prevent forgetting.
- Empirical results demonstrate improved forward and backward transfer with state-of-the-art metrics on the LIBERO benchmark.
Skill Prompts-based Hierarchical Continual Imitation Learning (SPECI) is a hierarchical policy architecture developed to address lifelong robot manipulation in dynamic and unstructured environments, where adaptability to evolving objects, scenes, and tasks is essential. SPECI overcomes limitations of traditional and existing continual imitation learning (CIL) methods, which either overlook intrinsic manipulation skills or use manually specified and rigid skills, thereby constraining cross-task knowledge transfer. The framework integrates multimodal sensory fusion, dynamic skill inference, attention-driven skill reuse, and task-aware parameter isolation through mode approximation, providing state-of-the-art bidirectional knowledge transfer in continual robot manipulation learning (Xu et al., 22 Apr 2025).
1. Hierarchical Policy Architecture
SPECI decomposes the imitation-learning policy into three sequential modules:
- Multimodal Perception and Fusion: This module ingests wrist and workspace RGB images, joint angles, gripper states, and a natural-language task description. A frozen CLIP text encoder processes the language input, which is then projected through an MLP. Two ResNet-18 encoders (one per camera view) extract visual representations, modulated by the language tokens using FiLM. Proprioceptive data are encoded with a compact MLP. These outputs are temporally concatenated and passed through a transformer encoder, yielding a fused state embedding $s_t$.
- High-level Skill Inference: An expandable skill codebook maintains skill vectors $\{v_i\}$, with a corresponding key matrix $\{k_i\}$ and attention vectors $\{a_i\}$. At each timestep, cosine similarity between the fused state representation (after element-wise attention modulation) and the keys yields raw weights $w_i$, of which the top-$K$ are softmax-normalized and combined to form a weighted skill prompt $p_t$. This prompt, split into a prefix key $P_K$ and a prefix value $P_V$, is injected into every multi-head cross-attention transformer block via prefix-tuning. The decoder outputs a latent skill variable $z_t$.
- Low-level Action Execution: A temporal transformer decoder consumes a windowed sequence of latent skills $z_t$ and, via an MLP head, parameterizes an $M$-component Gaussian Mixture Model (GMM) over actions $a_t$. The action likelihood is

$$p(a_t \mid s_t, z_t) = \sum_{m=1}^{M} \pi_m \,\mathcal{N}\!\left(a_t;\, \mu_m, \Sigma_m\right),$$

and imitation proceeds by minimizing the behavioral cloning (BC) loss with respect to ground-truth actions.
This separation of perception, skill inference, and action ensures modularity and enables explicit modeling of skill acquisition and reuse.
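The GMM action head of the low-level module can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the function names, shapes, and the diagonal-covariance choice are assumptions.

```python
import numpy as np

def gmm_log_likelihood(action, log_pi, mu, log_std):
    """Log-likelihood of `action` under a diagonal-covariance GMM.

    Illustrative stand-in for the low-level action head (shapes assumed):
    action:  (D,)    ground-truth action
    log_pi:  (M,)    unnormalized mixture logits
    mu:      (M, D)  component means
    log_std: (M, D)  component log standard deviations
    """
    log_w = log_pi - np.logaddexp.reduce(log_pi)   # normalized log mixture weights
    var = np.exp(2.0 * log_std)
    # per-component diagonal Gaussian log density
    log_comp = -0.5 * np.sum(
        np.log(2.0 * np.pi) + 2.0 * log_std + (action - mu) ** 2 / var, axis=-1
    )
    # log sum_m w_m * N(a; mu_m, Sigma_m)
    return np.logaddexp.reduce(log_w + log_comp)

def bc_loss(actions, log_pi, mu, log_std):
    """Behavioral cloning loss: mean negative log-likelihood over a batch."""
    return -np.mean([gmm_log_likelihood(a, log_pi, mu, log_std) for a in actions])
```

With a single zero-mean, unit-variance component, the value reduces to the standard Gaussian log-density, which gives a quick sanity check.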
2. Problem Formulation and Continual Learning Objectives
Each manipulation task is a Markov Decision Process with a sparse reward induced by a goal predicate $g$. The learning objective is to maximize expected goal completion:

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} \mathbf{1}\big[g(s_t) = 1\big]\Big].$$
In the continual imitation learning setting, a stream of tasks $T_1, \dots, T_K$ is presented, each with a demonstration set $D_k$, under a strict replay-free protocol. The shared policy $\pi_\theta$ is optimized over the average BC loss:

$$\mathcal{L}(\theta) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{(s_t, a_t) \sim D_k}\big[-\log \pi_\theta(a_t \mid s_t)\big].$$
SPECI factorizes the policy using the latent skill variable:

$$\pi_\theta(a_t, z_t \mid s_t) = \pi_{\text{high}}(z_t \mid s_t)\; \pi_{\text{low}}(a_t \mid s_t, z_t),$$

clarifying the hierarchical decision structure.
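The factorization can be checked on a toy discrete example: marginalizing the joint over skills must yield a valid action distribution. Sizes and the uniform Dirichlet priors below are arbitrary illustrative choices.

```python
import numpy as np

# Toy discrete check of the hierarchical factorization for one fixed state:
# pi(a|s) = sum_z pi_high(z|s) * pi_low(a|s,z)
rng = np.random.default_rng(0)
n_skills, n_actions = 4, 6
pi_high = rng.dirichlet(np.ones(n_skills))                  # p(z|s), shape (4,)
pi_low = rng.dirichlet(np.ones(n_actions), size=n_skills)   # p(a|s,z), shape (4, 6)
pi = pi_high @ pi_low                                       # marginal policy p(a|s)
```

Since each factor is a proper distribution, the marginal `pi` sums to one by construction.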
3. Key Components: Expandable Skill Codebook and Attention-based Selection
SPECI utilizes an expandable codebook for dynamic and implicit skill discovery:
- Codebook Expansion: For each new task $T_k$, the previous skill slots, keys, and attention vectors are frozen, and a fixed number of new slots is initialized, so the codebook grows incrementally over the task sequence. Only the newly added slots are trainable for the new task, ensuring that prior skills are preserved (no catastrophic forgetting within the codebook).
- Attention-driven Skill Reuse: At inference, the fused state embedding $s_t$ is used to compute cosine similarities with all codebook entries (modulated element-wise by the attention vectors), followed by top-$K$ selection and softmax normalization:

$$w_i = \cos\big(s_t \odot a_i,\; k_i\big), \qquad \tilde{w}_i = \frac{\exp(w_i)}{\sum_{j \in \text{Top-}K} \exp(w_j)},$$

where $k_i$ are the keys and $a_i$ the attention vectors. The synthesized skill prompt is injected as prefix tuning in all cross-attention blocks. This facilitates both forward transfer (reusing established skills on unseen tasks) and backward transfer (improved old-task performance when new skill embeddings are learned).
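The retrieval step above admits a compact numpy sketch. All names and shapes are assumptions for illustration; in SPECI the embeddings come from the transformer encoder rather than raw vectors.

```python
import numpy as np

def select_skill_prompt(state, keys, attn, values, top_k=2):
    """Attention-driven skill retrieval (illustrative sketch, not the paper's code).

    state:  (d,)    fused state embedding
    keys:   (n, d)  one key per codebook slot
    attn:   (n, d)  element-wise attention vectors
    values: (n, p)  skill (prompt) vectors
    """
    q = state[None, :] * attn                       # attention-modulated query per slot
    cos = np.sum(q * keys, axis=1) / (
        np.linalg.norm(q, axis=1) * np.linalg.norm(keys, axis=1) + 1e-8
    )
    top = np.argsort(cos)[-top_k:]                  # indices of the top-k similarities
    w = np.exp(cos[top] - cos[top].max())
    w /= w.sum()                                    # softmax over the selected slots only
    return w @ values[top]                          # weighted skill prompt
```

With `top_k=1` and a state aligned to a single key, the function simply returns that slot's skill vector, matching the intended retrieval behavior.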
The table below summarizes the codebook process:
| Step | Frozen? | Trained? |
|---|---|---|
| New skill slots (current task) | No | Yes |
| Prior-task skill slots | Yes | No |
| Attention over all skills | N/A | Attention weights only |
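The freeze-and-expand procedure in the table can be sketched as a small container class. The class name, growth mechanics, and the boolean trainability mask are illustrative assumptions; in SPECI the codebook lives inside a transformer policy and freezing is done at the optimizer level.

```python
import numpy as np

class SkillCodebook:
    """Minimal expandable codebook sketch (freezing modeled by a boolean mask)."""

    def __init__(self, dim, prompt_dim):
        self.keys = np.empty((0, dim))            # one key per skill slot
        self.attn = np.empty((0, dim))            # element-wise attention vectors
        self.values = np.empty((0, prompt_dim))   # skill (prompt) vectors
        self.trainable = np.empty(0, dtype=bool)  # which slots receive gradients

    def expand(self, n_new, rng):
        # Freeze every slot learned on earlier tasks...
        self.trainable[:] = False
        # ...then append freshly initialized, trainable slots for the new task.
        dim, pdim = self.keys.shape[1], self.values.shape[1]
        self.keys = np.vstack([self.keys, rng.normal(size=(n_new, dim))])
        self.attn = np.vstack([self.attn, np.ones((n_new, dim))])
        self.values = np.vstack([self.values, rng.normal(size=(n_new, pdim))])
        self.trainable = np.concatenate(
            [self.trainable, np.ones(n_new, dtype=bool)]
        )
```

After two expansions of 10 slots each, only the most recent 10 slots remain trainable, mirroring the table's freeze/train split.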
4. Mode Approximation for Task-specific and Shared Parameters
To isolate task-specific representations while supporting shared knowledge transfer, SPECI introduces mode approximation through low-rank CP (CANDECOMP/PARAFAC) decomposition on transformer attention block weights:
- For each new task $T_k$, the weight tensors in the skill-inference and action-execution decoders are augmented by task-specific low-rank updates:

$$W_k = W_0 + \Delta W_k, \qquad \Delta W_k = \sum_{r=1}^{R} \lambda_{k,r}\; \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{p}_{k,r},$$

with $\mathbf{U} = [\mathbf{u}_r]$ and $\mathbf{V} = [\mathbf{v}_r]$ as global (shared) factors and $\boldsymbol{\lambda}_k$, $\mathbf{P}_k = [\mathbf{p}_{k,r}]$ as task-specific factors.
- The overall attention operation applies the augmented weights:

$$\mathrm{Attn}_k(x) = \mathrm{softmax}\!\left(\frac{x\,W^Q_k \big(x\,W^K_k\big)^{\!\top}}{\sqrt{d}}\right) x\,W^V_k, \qquad W_k = W_0 + \Delta W_k,$$

enabling simultaneous parameter sharing and isolation, and improving both forward and backward task transfer across the skill and action hierarchy.
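The key property of the update is that it is low rank: each task adds at most rank-$R$ structure on top of the frozen base weights. The sketch below collapses the three-way CP product to a single matrix update for one weight (factor names and shapes are assumptions for illustration).

```python
import numpy as np

def delta_weight(U, V, lam, P, task):
    """Task-specific low-rank update built from CP-style factors (illustrative).

    U: (d_in, R), V: (d_out, R)   shared (global) factors
    lam: (T, R), P: (T, R)        per-task factors; `task` selects a row
    Returns a (d_in, d_out) additive update Delta W_task of rank <= R.
    """
    coeff = lam[task] * P[task]            # rank-wise task-specific coefficients
    return (U * coeff[None, :]) @ V.T      # sum_r coeff_r * u_r v_r^T

# The effective per-task weight is the frozen base plus the task delta:
# W_task = W0 + delta_weight(U, V, lam, P, task)
```

Because only `lam` and `P` grow with the task count while `U` and `V` are shared, the per-task parameter overhead stays small regardless of the base model size.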
5. Training Regimen and Experimental Protocol
- Sequential Task Curriculum: For each new task, new skill codebook slots, key/attention vectors, and CP factors are initialized; all previous slots and CP factors are frozen.
- Optimization: The complete hierarchy (except frozen parameters) is trained end-to-end for up to 50 epochs using the sum of BC losses from both high-level and low-level modules; early stopping is triggered when the per-task success rate exceeds 95% and then falls twice consecutively.
- Hyperparameters: The transformer MLP hidden size is 1536; 10 new skill slots are added per task with attention top-$K = 10$; the CP rank is set to 64.
- No demonstration replay is permitted beyond the currently trained task, conforming to the strict continual learning protocol.
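The early-stopping rule described above (stop once per-task success has exceeded 95% and then fallen on two consecutive evaluations) can be written as a small predicate over the per-epoch success history. This is one plausible reading of the rule, not the paper's exact criterion.

```python
def should_stop(history, threshold=0.95):
    """Early-stopping sketch for one task.

    history: list of per-epoch evaluation success rates, oldest first.
    Triggers once success has exceeded `threshold` at some point and the
    last two evaluations each fell relative to their predecessor.
    """
    if len(history) < 3:
        return False
    peaked = max(history) > threshold               # threshold was exceeded
    falling = history[-1] < history[-2] < history[-3]  # two consecutive drops
    return peaked and falling
```

For example, a run that peaks at 0.96 and then drops twice stops, while a run that never crosses 0.95, or one still improving, keeps training.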
6. Empirical Evaluation and Performance
Experiments are conducted on the LIBERO benchmark, which comprises four 10-task suites: OBJECT (object identities), GOAL (goal predicates), SPATIAL (spatial reference learning), and LONG (long-horizon with subgoals). The evaluation uses multiple baselines: flat policies (ResNet-RNN/Transformer, ViT-Transformer), hierarchical architectures (BUDS, LOTUS), and continual learning strategies (SEQUENTIAL, MULTITASK, ER, PackNet).
- Metrics:
- Forward-Transfer (FWT): Average zero-shot/early-training success on new tasks.
- Negative Backward-Transfer (NBT): Drop in old tasks' performance after learning new tasks.
- Area-Under-Curve (AUC): Aggregate success over all tasks and epochs.
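These three metrics can be computed from a matrix of success rates. The sketch below is a simplified reading: `S[i, j]` is the success on task `j` evaluated right after training on task `i` (LIBERO's definitions additionally average over training epochs, which is omitted here).

```python
import numpy as np

def fwt_nbt_auc(S):
    """Continual-learning transfer metrics from a success matrix (simplified).

    S: (T, T) array, T >= 2; S[i, j] = success on task j after training task i.
    """
    T = S.shape[0]
    # FWT: success on each task right when it is learned
    fwt = np.mean([S[j, j] for j in range(T)])
    # NBT: average drop on task j over all later training stages
    nbt = np.mean([S[j, j] - np.mean(S[j + 1:, j]) for j in range(T - 1)])
    # AUC: aggregate success over all tasks seen so far, averaged over stages
    auc = np.mean([np.mean(S[i, : i + 1]) for i in range(T)])
    return fwt, nbt, auc
```

A matrix with no forgetting (later rows retain earlier columns) yields NBT of zero, and a negative NBT indicates backward improvement, matching the sign convention used in the results below.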
- Results:
- Under experience replay (ER), SPECI achieves FWT improvements over LOTUS of +9% (OBJECT), +10% (GOAL), +2% (SPATIAL), +10% (LONG).
- SPECI records the lowest NBT on three suites and even negative NBT under PackNet.
- AUC improvements: e.g., 0.78 vs. 0.65 (OBJECT), 0.66 vs. 0.56 (GOAL).
- Under PackNet, SPECI further reduces forgetting (NBT of −0.01, a slight net backward improvement) and increases AUC to 0.85/0.87 (OBJECT/GOAL).
- Against the MULTITASK upper bound, SPECI closes the FWT gap to within 6% on most suites and surpasses MULTITASK by 4% on LONG, attributed to its mode-approximation parameter isolation.
7. Qualitative Analysis and Insights
Skill selection visualization demonstrates that SPECI's active skill set after learning a new task includes skills originating from multiple prior tasks, indicating unsupervised forward and backward skill composition without manual subgoal specification. Zero-shot transfer, with up to 30% success on new OBJECT suite tasks, arises from retrieval of relevant prior skill prompts. Backward transfer is observed, as learning new related subskills in some tasks improves policy performance on previously learned tasks.
Rare failures relate to insufficient codebook granularity for complex, highly composite long-horizon tasks, suggesting further codebook decomposition or hierarchical extension as promising directions.
SPECI combines scalable skill codebook expansion with attention-driven retrieval and task-disentangled parameter isolation, yielding enhanced bidirectional transfer and state-of-the-art performance in lifelong robot manipulation learning (Xu et al., 22 Apr 2025).