SPECI: Hierarchical Continual Imitation Learning
- The paper introduces a hierarchical policy architecture that decomposes imitation learning into perception, skill inference, and action execution.
- It employs an expandable skill codebook with attention-based selection to transfer knowledge across tasks and prevent forgetting.
- Empirical results demonstrate improved forward and backward transfer with state-of-the-art metrics on the LIBERO benchmark.
Skill Prompts-based Hierarchical Continual Imitation Learning (SPECI) is a hierarchical policy architecture developed to address lifelong robot manipulation in dynamic and unstructured environments, where adaptability to evolving objects, scenes, and tasks is essential. SPECI overcomes limitations of traditional and existing continual imitation learning (CIL) methods, which either overlook intrinsic manipulation skills or use manually specified and rigid skills, thereby constraining cross-task knowledge transfer. The framework integrates multimodal sensory fusion, dynamic skill inference, attention-driven skill reuse, and task-aware parameter isolation through mode approximation, providing state-of-the-art bidirectional knowledge transfer in continual robot manipulation learning (Xu et al., 22 Apr 2025).
1. Hierarchical Policy Architecture
SPECI decomposes the imitation-learning policy into three sequential modules:
- Multimodal Perception and Fusion: This module ingests wrist and workspace RGB images, joint angles, gripper states, and a natural-language task description. A frozen CLIP text encoder processes the language input, which is then projected through an MLP. Two ResNet-18 encoders (one per camera view) extract visual representations, modulated by the language tokens using FiLM. Proprioceptive data are encoded with a compact MLP. These outputs are temporally concatenated and passed through a transformer encoder, yielding a fused state embedding $s_t$.
- High-level Skill Inference: An expandable skill codebook maintains skill vectors $\{v_i\}$, with a corresponding key matrix $\{k_i\}$ and attention vectors $\{a_i\}$. At each timestep, cosine similarity between the fused state representation (after element-wise attention modulation) and the keys yields raw weights $w_i$, of which the top-$K$ are softmax-normalized and combined to form a weighted skill prompt $p_t$. This prompt, split into a prefix key $P_K$ and a prefix value $P_V$, is injected into every multi-head cross-attention transformer block via prefix-tuning. The decoder outputs a latent skill variable $z_t$.
- Low-level Action Execution: A temporal transformer decoder consumes a windowed sequence of latent skills $z_t$ and, via an MLP head, parameterizes an $M$-component Gaussian Mixture Model (GMM) over actions $a_t$. The action likelihood is

$$p(a_t \mid s_t, z_t) = \sum_{m=1}^{M} \pi_m \,\mathcal{N}\!\left(a_t;\, \mu_m, \Sigma_m\right),$$

and imitation proceeds by minimizing the behavioral cloning (BC) loss with respect to ground-truth actions.
This separation of perception, skill inference, and action ensures modularity and enables explicit modeling of skill acquisition and reuse.
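The GMM action head of the low-level module can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the function names, shapes, and the diagonal-covariance choice are assumptions.

```python
import numpy as np

def gmm_log_likelihood(action, log_pi, mu, log_std):
    """Log-likelihood of `action` under a diagonal-covariance GMM.

    Illustrative stand-in for the low-level action head (shapes assumed):
    action:  (D,)    ground-truth action
    log_pi:  (M,)    unnormalized mixture logits
    mu:      (M, D)  component means
    log_std: (M, D)  component log standard deviations
    """
    log_w = log_pi - np.logaddexp.reduce(log_pi)   # normalized log mixture weights
    var = np.exp(2.0 * log_std)
    # per-component diagonal Gaussian log density
    log_comp = -0.5 * np.sum(
        np.log(2.0 * np.pi) + 2.0 * log_std + (action - mu) ** 2 / var, axis=-1
    )
    # log sum_m w_m * N(a; mu_m, Sigma_m)
    return np.logaddexp.reduce(log_w + log_comp)

def bc_loss(actions, log_pi, mu, log_std):
    """Behavioral cloning loss: mean negative log-likelihood over a batch."""
    return -np.mean([gmm_log_likelihood(a, log_pi, mu, log_std) for a in actions])
```

With a single zero-mean, unit-variance component, the value reduces to the standard Gaussian log-density, which gives a quick sanity check.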
2. Problem Formulation and Continual Learning Objectives
Each manipulation task is a Markov Decision Process with a sparse reward induced by a goal predicate $g$. The learning objective is to maximize expected goal completion:

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} \mathbf{1}\big[g(s_t) = 1\big]\Big].$$
In the continual imitation learning setting, a stream of tasks $T_1, \dots, T_K$ is presented, each with a demonstration set $D_k$, under a strict replay-free protocol. The shared policy $\pi_\theta$ is optimized over the average BC loss:

$$\mathcal{L}(\theta) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{(s_t, a_t) \sim D_k}\big[-\log \pi_\theta(a_t \mid s_t)\big].$$
SPECI factorizes the policy using the latent skill variable:

$$\pi_\theta(a_t, z_t \mid s_t) = \pi_{\text{high}}(z_t \mid s_t)\; \pi_{\text{low}}(a_t \mid s_t, z_t),$$

clarifying the hierarchical decision structure.
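The factorization can be checked on a toy discrete example: marginalizing the joint over skills must yield a valid action distribution. Sizes and the uniform Dirichlet priors below are arbitrary illustrative choices.

```python
import numpy as np

# Toy discrete check of the hierarchical factorization for one fixed state:
# pi(a|s) = sum_z pi_high(z|s) * pi_low(a|s,z)
rng = np.random.default_rng(0)
n_skills, n_actions = 4, 6
pi_high = rng.dirichlet(np.ones(n_skills))                  # p(z|s), shape (4,)
pi_low = rng.dirichlet(np.ones(n_actions), size=n_skills)   # p(a|s,z), shape (4, 6)
pi = pi_high @ pi_low                                       # marginal policy p(a|s)
```

Since each factor is a proper distribution, the marginal `pi` sums to one by construction.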
3. Key Components: Expandable Skill Codebook and Attention-based Selection
SPECI utilizes an expandable codebook for dynamic and implicit skill discovery:
- Codebook Expansion: For each new task $T_k$, the previous skill slots, keys, and attention vectors are frozen, and a fixed number of new slots is initialized, so the codebook grows incrementally over the task sequence. Only the newly added slots are trainable for the new task, ensuring that prior skills are preserved (no catastrophic forgetting within the codebook).
- Attention-driven Skill Reuse: At inference, the fused state embedding $s_t$ is used to compute cosine similarities with all codebook entries (modulated element-wise by the attention vectors), followed by top-$K$ selection and softmax normalization:

$$w_i = \cos\big(s_t \odot a_i,\; k_i\big), \qquad \tilde{w}_i = \frac{\exp(w_i)}{\sum_{j \in \text{Top-}K} \exp(w_j)},$$

where $k_i$ are the keys and $a_i$ the attention vectors. The synthesized skill prompt is injected as prefix tuning in all cross-attention blocks. This facilitates both forward transfer (reusing established skills on unseen tasks) and backward transfer (improved old-task performance when new skill embeddings are learned).
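The retrieval step above admits a compact numpy sketch. All names and shapes are assumptions for illustration; in SPECI the embeddings come from the transformer encoder rather than raw vectors.

```python
import numpy as np

def select_skill_prompt(state, keys, attn, values, top_k=2):
    """Attention-driven skill retrieval (illustrative sketch, not the paper's code).

    state:  (d,)    fused state embedding
    keys:   (n, d)  one key per codebook slot
    attn:   (n, d)  element-wise attention vectors
    values: (n, p)  skill (prompt) vectors
    """
    q = state[None, :] * attn                       # attention-modulated query per slot
    cos = np.sum(q * keys, axis=1) / (
        np.linalg.norm(q, axis=1) * np.linalg.norm(keys, axis=1) + 1e-8
    )
    top = np.argsort(cos)[-top_k:]                  # indices of the top-k similarities
    w = np.exp(cos[top] - cos[top].max())
    w /= w.sum()                                    # softmax over the selected slots only
    return w @ values[top]                          # weighted skill prompt
```

With `top_k=1` and a state aligned to a single key, the function simply returns that slot's skill vector, matching the intended retrieval behavior.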
The table below summarizes the codebook process:
| Step | Frozen? | Trained? |
|---|---|---|
| New skill slots (current task) | No | Yes |
| Prior-task skill slots | Yes | No |
| Attention over all skills | N/A | Attention weights only |
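The freeze-and-expand procedure in the table can be sketched as a small container class. The class name, growth mechanics, and the boolean trainability mask are illustrative assumptions; in SPECI the codebook lives inside a transformer policy and freezing is done at the optimizer level.

```python
import numpy as np

class SkillCodebook:
    """Minimal expandable codebook sketch (freezing modeled by a boolean mask)."""

    def __init__(self, dim, prompt_dim):
        self.keys = np.empty((0, dim))            # one key per skill slot
        self.attn = np.empty((0, dim))            # element-wise attention vectors
        self.values = np.empty((0, prompt_dim))   # skill (prompt) vectors
        self.trainable = np.empty(0, dtype=bool)  # which slots receive gradients

    def expand(self, n_new, rng):
        # Freeze every slot learned on earlier tasks...
        self.trainable[:] = False
        # ...then append freshly initialized, trainable slots for the new task.
        dim, pdim = self.keys.shape[1], self.values.shape[1]
        self.keys = np.vstack([self.keys, rng.normal(size=(n_new, dim))])
        self.attn = np.vstack([self.attn, np.ones((n_new, dim))])
        self.values = np.vstack([self.values, rng.normal(size=(n_new, pdim))])
        self.trainable = np.concatenate(
            [self.trainable, np.ones(n_new, dtype=bool)]
        )
```

After two expansions of 10 slots each, only the most recent 10 slots remain trainable, mirroring the table's freeze/train split.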
4. Mode Approximation for Task-specific and Shared Parameters
To isolate task-specific representations while supporting shared knowledge transfer, SPECI introduces mode approximation through low-rank CP (CANDECOMP/PARAFAC) decomposition on transformer attention block weights:
- For each new task $T_k$, the weight tensors in the skill-inference and action-execution decoders are augmented by task-specific low-rank updates:

$$W_k = W_0 + \Delta W_k, \qquad \Delta W_k = \sum_{r=1}^{R} \lambda_{k,r}\; \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{p}_{k,r},$$

with $\mathbf{U} = [\mathbf{u}_r]$ and $\mathbf{V} = [\mathbf{v}_r]$ as global (shared) factors and $\boldsymbol{\lambda}_k$, $\mathbf{P}_k = [\mathbf{p}_{k,r}]$ as task-specific factors.
- The overall attention operation applies the augmented weights:

$$\mathrm{Attn}_k(x) = \mathrm{softmax}\!\left(\frac{x\,W^Q_k \big(x\,W^K_k\big)^{\!\top}}{\sqrt{d}}\right) x\,W^V_k, \qquad W_k = W_0 + \Delta W_k,$$

enabling simultaneous parameter sharing and isolation, and improving both forward and backward task transfer across the skill and action hierarchy.
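The key property of the update is that it is low rank: each task adds at most rank-$R$ structure on top of the frozen base weights. The sketch below collapses the three-way CP product to a single matrix update for one weight (factor names and shapes are assumptions for illustration).

```python
import numpy as np

def delta_weight(U, V, lam, P, task):
    """Task-specific low-rank update built from CP-style factors (illustrative).

    U: (d_in, R), V: (d_out, R)   shared (global) factors
    lam: (T, R), P: (T, R)        per-task factors; `task` selects a row
    Returns a (d_in, d_out) additive update Delta W_task of rank <= R.
    """
    coeff = lam[task] * P[task]            # rank-wise task-specific coefficients
    return (U * coeff[None, :]) @ V.T      # sum_r coeff_r * u_r v_r^T

# The effective per-task weight is the frozen base plus the task delta:
# W_task = W0 + delta_weight(U, V, lam, P, task)
```

Because only `lam` and `P` grow with the task count while `U` and `V` are shared, the per-task parameter overhead stays small regardless of the base model size.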
5. Training Regimen and Experimental Protocol
- Sequential Task Curriculum: For each new task, new skill codebook slots, key/attention vectors, and CP factors are initialized; all previous slots and CP factors are frozen.
- Optimization: The complete hierarchy (except frozen parameters) is trained end-to-end for up to 50 epochs using the sum of BC losses from both high-level and low-level modules; early stopping is triggered when the per-task success rate exceeds 95% and then falls twice consecutively.
- Hyperparameters: The transformer MLP hidden size is 1536; 10 new skill slots are added per task with attention top-$K = 10$; the CP rank is set to 64.
- No demonstration replay is permitted beyond the currently trained task, conforming to the strict continual learning protocol.
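The early-stopping rule described above (stop once per-task success has exceeded 95% and then fallen on two consecutive evaluations) can be written as a small predicate over the per-epoch success history. This is one plausible reading of the rule, not the paper's exact criterion.

```python
def should_stop(history, threshold=0.95):
    """Early-stopping sketch for one task.

    history: list of per-epoch evaluation success rates, oldest first.
    Triggers once success has exceeded `threshold` at some point and the
    last two evaluations each fell relative to their predecessor.
    """
    if len(history) < 3:
        return False
    peaked = max(history) > threshold               # threshold was exceeded
    falling = history[-1] < history[-2] < history[-3]  # two consecutive drops
    return peaked and falling
```

For example, a run that peaks at 0.96 and then drops twice stops, while a run that never crosses 0.95, or one still improving, keeps training.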
6. Empirical Evaluation and Performance
Experiments are conducted on the LIBERO benchmark, which comprises four 10-task suites: OBJECT (object identities), GOAL (goal predicates), SPATIAL (spatial reference learning), and LONG (long-horizon with subgoals). The evaluation uses multiple baselines: flat policies (ResNet-RNN/Transformer, ViT-Transformer), hierarchical architectures (BUDS, LOTUS), and continual learning strategies (SEQUENTIAL, MULTITASK, ER, PackNet).
- Metrics:
- Forward-Transfer (FWT): Average zero-shot/early-training success on new tasks.
- Negative Backward-Transfer (NBT): Drop in old tasks' performance after learning new tasks.
- Area-Under-Curve (AUC): Aggregate success over all tasks and epochs.
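These three metrics can be computed from a matrix of success rates. The sketch below is a simplified reading: `S[i, j]` is the success on task `j` evaluated right after training on task `i` (LIBERO's definitions additionally average over training epochs, which is omitted here).

```python
import numpy as np

def fwt_nbt_auc(S):
    """Continual-learning transfer metrics from a success matrix (simplified).

    S: (T, T) array, T >= 2; S[i, j] = success on task j after training task i.
    """
    T = S.shape[0]
    # FWT: success on each task right when it is learned
    fwt = np.mean([S[j, j] for j in range(T)])
    # NBT: average drop on task j over all later training stages
    nbt = np.mean([S[j, j] - np.mean(S[j + 1:, j]) for j in range(T - 1)])
    # AUC: aggregate success over all tasks seen so far, averaged over stages
    auc = np.mean([np.mean(S[i, : i + 1]) for i in range(T)])
    return fwt, nbt, auc
```

A matrix with no forgetting (later rows retain earlier columns) yields NBT of zero, and a negative NBT indicates backward improvement, matching the sign convention used in the results below.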
- Results:
- Under experience replay (ER), SPECI achieves FWT improvements over LOTUS of +9% (OBJECT), +10% (GOAL), +2% (SPATIAL), +10% (LONG).
- SPECI records the lowest NBT on three suites and even negative NBT under PackNet.
- AUC improvements: e.g., 0.78 vs. 0.65 (OBJECT), 0.66 vs. 0.56 (GOAL).
- Under PackNet, SPECI further reduces forgetting (NBT of −0.01, a slight net backward improvement) and increases AUC to 0.85/0.87 (OBJECT/GOAL).
- Against the MULTITASK upper bound, SPECI closes the FWT gap to within 6% on most suites and surpasses MULTITASK by 4% on LONG, attributed to its mode-approximation parameter isolation.
7. Qualitative Analysis and Insights
Skill selection visualization demonstrates that SPECI's active skill set after learning a new task includes skills originating from multiple prior tasks, indicating unsupervised forward and backward skill composition without manual subgoal specification. Zero-shot transfer, with up to 30% success on new OBJECT suite tasks, arises from retrieval of relevant prior skill prompts. Backward transfer is observed, as learning new related subskills in some tasks improves policy performance on previously learned tasks.
Rare failures relate to insufficient codebook granularity for complex, highly composite long-horizon tasks, suggesting further codebook decomposition or hierarchical extension as promising directions.
SPECI combines scalable skill codebook expansion with attention-driven retrieval and task-disentangled parameter isolation, yielding enhanced bidirectional transfer and state-of-the-art performance in lifelong robot manipulation learning (Xu et al., 22 Apr 2025).