
Label-Free Imitation: Methods & Applications

Updated 22 January 2026
  • Label-free imitation is a framework enabling agents to mimic expert behavior using only state trajectories without explicit action labels.
  • It employs diverse methodologies such as value learning, generative models, and adversarial discriminators to extract behavioral patterns from observations.
  • Key applications in robotics include dexterous manipulation, autonomous control, and language-based interactions, demonstrating enhanced sample efficiency and robustness.

Label-free imitation refers to the family of imitation learning (IL) and learning-from-observation (LfO) frameworks in which the agent aims to match expert behavior using only state (or observation) trajectories, but never accesses action labels or explicit reward/cost annotations. This formulation is central to domains where action supervision is unavailable—e.g., raw video recordings or human demonstrations in the wild. Recent advances have produced scalable, robust, and sample-efficient algorithms for label-free imitation across continuous control, dexterous manipulation, robotic vision, and even language-based interaction, driving progress in autonomous robotics and embodied intelligence.

1. Formal Definition and Core Problem Statement

Given a dataset of expert demonstrations $\mathcal{D} = \{\tau_i\}$, with each trajectory $\tau_i = (s^*_0, \ldots, s^*_{T-1})$ comprising only state (or observation) sequences, label-free imitation seeks to learn a policy $\pi(a \mid s)$ such that the agent’s behavior closely matches the expert’s, without observation of expert actions or environment rewards. The challenge lies in extracting sufficient behavioral structure from state-only data to enable faithful reproduction, typically by leveraging state transition statistics, surrogate objectives, or generative models (Edwards et al., 2019, Jaegle et al., 2021, Han et al., 2024).

Explicitly, the label-free imitation agent must solve

$$\max_{\pi} \; \mathbb{E}_{\pi} \left[ R(s_0, \ldots, s_{T-1}) \right]$$

where $R$ is an implicit or learned return function, with no direct access to action labels $a^*_t$ or expert-generated rewards.
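As a concrete toy illustration of what the agent actually sees, the sketch below builds a state-only trajectory and the transition pairs $(s_t, s_{t+1})$ that label-free methods operate on; all shapes and values here are hypothetical:

```python
import numpy as np

# Toy illustration (all shapes hypothetical): a labeled IL dataset stores
# (state, action) pairs; a label-free dataset keeps only the states.
rng = np.random.default_rng(0)

T = 5                                   # trajectory length
states = rng.normal(size=(T, 3))        # s*_0, ..., s*_{T-1}: 3-D observations
actions = rng.normal(size=(T - 1, 2))   # what the expert did -- never observed

# Behavioral cloning would regress pi(a|s) on (states[:-1], actions).
# With only `states` available, learning must instead proceed from
# transition pairs (s_t, s_{t+1}):
transitions = list(zip(states[:-1], states[1:]))
```

The methods surveyed below differ precisely in how they turn these transition pairs into a learning signal.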

2. Principal Methodological Approaches

Several distinct algorithmic families dominate label-free imitation:

  1. Value Learning from Observation: Approaches such as Perceptual Values from Observation (PVO) directly learn a value function $V^*(s_t) = \gamma^{T-t-1}$ via supervised regression on trajectory time indices, using an assumed surrogate reward at terminal states, then employ this value prediction for reward shaping or bootstrapping within RL (Edwards et al., 2019).
  2. Observation-Conditional Generative Models: The Future Observation Reward Model (FORM) learns generative “effect” models $p^D_\omega(x_t \mid x_{t-1})$ and $p^I_\phi(x_t \mid x_{t-1})$ for expert and agent transitions, and constructs a reward as the log-likelihood ratio between these two conditional densities (Jaegle et al., 2021).
  3. Adversarial Imitation from Transition Distributions: State-of-the-art LfO algorithms train discriminators (or diffusion-based classifiers) to distinguish expert and agent state transitions $(s, s')$, using their binary “realness” predictions as dense RL rewards for policy optimization. The discrimination criterion may be parameterized by deep networks or diffusion models (e.g., DIFO) (Huang et al., 2024, Monteiro et al., 2023).
  4. Self-Supervised Action Inference: Techniques such as SAIL learn an inverse-dynamics model $f_\phi(s_t, s_{t+1})$ purely from agent rollouts, then use these pseudo-actions for behavioral cloning on the expert state pairs, bootstrapping the action space from self-exploration (Monteiro et al., 2023).
  5. Structured Representation and Goal-based Tracking: Third-person and cross-embodiment visual imitation, as in MIR and GSVI, use advanced contrastive and graph-structured representations to align timesteps or spatial entities between demonstrator and agent, defining trajectory-matching costs in learned feature spaces (Zhou et al., 2021, Sieb et al., 2019).
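The FORM-style log-likelihood-ratio reward (item 2 above) can be sketched with simple diagonal-Gaussian stand-ins for the learned conditional models. The dynamics and distributions below are invented for illustration and are not taken from the paper:

```python
import numpy as np

def fit_delta_gaussian(trajs):
    """Fit a diagonal Gaussian to state deltas s_t - s_{t-1}: a crude
    stand-in for FORM's learned conditional model p(x_t | x_{t-1})."""
    deltas = np.concatenate([t[1:] - t[:-1] for t in trajs])
    return deltas.mean(axis=0), deltas.std(axis=0) + 1e-6

def log_prob(delta, mu, sigma):
    # Diagonal-Gaussian log-density of a single delta vector.
    return -0.5 * np.sum(((delta - mu) / sigma) ** 2
                         + np.log(2 * np.pi * sigma ** 2))

def form_reward(s_prev, s_next, expert_model, agent_model):
    """FORM-style reward: log-likelihood ratio of a transition under the
    expert vs. agent conditional models (Jaegle et al., 2021)."""
    d = s_next - s_prev
    return log_prob(d, *expert_model) - log_prob(d, *agent_model)

rng = np.random.default_rng(0)
# Hypothetical demos: the expert drifts +0.1 per step; the agent wanders.
expert = [np.cumsum(rng.normal(0.1, 0.01, size=(20, 2)), axis=0) for _ in range(5)]
agent  = [np.cumsum(rng.normal(0.0, 0.05, size=(20, 2)), axis=0) for _ in range(5)]

em, am = fit_delta_gaussian(expert), fit_delta_gaussian(agent)
# An expert-like step earns a higher reward than an agent-like one.
r_good = form_reward(np.zeros(2), np.array([0.1, 0.1]), em, am)
r_bad  = form_reward(np.zeros(2), np.array([-0.1, -0.1]), em, am)
```

In the actual method both conditionals are deep generative models trained online; the Gaussian fit here only conveys the shape of the reward signal.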

The diversity of approaches reflects the ill-posedness of inferring intent and feedback from observations alone; each method operationalizes a different “proxy” of expert quality.
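A minimal sketch of the SAIL-style self-supervised action-inference loop, assuming (purely for illustration) linear dynamics so that least squares can play the role of the learned inverse-dynamics model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dynamics for illustration: s' = s + B @ a. B is known only
# to the simulator, never to the learner.
B = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])  # 3-D state, 2-D action

def step(s, a):
    return s + B @ a

# 1) Collect agent rollouts WITH the agent's own actions (self-supervision).
S, A, S2 = [], [], []
s = np.zeros(3)
for _ in range(200):
    a = rng.normal(size=2)
    s2 = step(s, a)
    S.append(s); A.append(a); S2.append(s2)
    s = s2
S, A, S2 = map(np.array, (S, A, S2))

# 2) Fit an inverse-dynamics model f(s, s') -> a by least squares on deltas.
deltas = S2 - S
W, *_ = np.linalg.lstsq(deltas, A, rcond=None)   # a_hat = (s' - s) @ W

# 3) Pseudo-label an expert state pair, then behavior-clone on the result.
a_true = np.array([0.3, -0.2])                   # expert action, unseen
s0 = rng.normal(size=3)
s1 = step(s0, a_true)
a_pseudo = (s1 - s0) @ W                         # recovered pseudo-action
```

Real SAIL uses a neural inverse-dynamics model and noisy, nonlinear dynamics; the linear case simply makes step 3 exactly recoverable.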

3. Architectures and Loss Functions

Distinct loss formulations are central to these methodologies. Selected examples:

| Method | Core Loss Function | Reference |
|--------|--------------------|-----------|
| PVO | $L_V(\theta) = \mathbb{E}[(V_\theta(s^*_t) - \gamma^{T-t-1})^2]$ | (Edwards et al., 2019) |
| FORM | $r_t = \log p^D_\omega(x_t \mid x_{t-1}) - \log p^I_\phi(x_t \mid x_{t-1})$ | (Jaegle et al., 2021) |
| DIFO | $r_\phi(s, s') = \log(1 - D_\phi(s, s'))$ via diffusion classifier | (Huang et al., 2024) |
| SAIL | $\mathcal{L}_D = -\mathbb{E}[\log D_\psi(s, s')] + \ldots$ (see details) | (Monteiro et al., 2023) |
| MIR | $\mathcal{L}_{MIR} = \mathcal{L}_{align} + \mathcal{L}_{temp} + \lambda \mathcal{L}_{act}$ | (Zhou et al., 2021) |
| GSVI | $\mathcal{C}(G^D, G^A)_t = \sum_{ij} w_{ij} \Vert \Delta^D_{ij} - \Delta^A_{ij} \Vert_2$ | (Sieb et al., 2019) |

Notably, most modern methods discard reliance on manual feature engineering, ground-truth action labels, or externally-provided rewards, instead defining learning signals through generative, adversarial, or contrastive alignment in state(-pair) space.
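The PVO regression target from the table's first row can be computed and fit in a few lines; the trajectory and the polynomial regressor below are illustrative stand-ins for the paper's states and value network:

```python
import numpy as np

gamma = 0.99
T = 10
# Hypothetical expert trajectory of scalar states, for illustration only.
states = np.linspace(0.0, 1.0, T)

# PVO regression targets: V*(s_t) = gamma^(T - t - 1), a discounted
# "proximity to the terminal (assumed-goal) state" (Edwards et al., 2019).
targets = gamma ** (T - 1 - np.arange(T))

# Supervised fit of V_theta on (state, target) pairs; a cubic polynomial
# stands in for a neural network here.
theta = np.polyfit(states, targets, deg=3)
V = np.poly1d(theta)

# States later in the demonstration get higher predicted value, so the
# learned V can shape an RL reward toward expert-like progress.
```

Note how the target depends only on the time index, which is exactly why the method needs no action labels but also why it inherits the terminal-state-as-goal assumption discussed in Section 5.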

4. Applications in Robotics and Embodied AI

Label-free imitation has demonstrated empirical effectiveness across a spectrum of robotic skills and benchmarks:

  • Dexterous Manipulation: Frameworks such as CIMER achieve >90% success in tool use and relocation tasks using only state-only demonstrations. Structured Koopman dynamical priors followed by RL-based motion refinement enable transfer and adaptation to novel objects (Han et al., 2024).
  • One-Shot Multi-Step Manipulation: Annotation-free one-shot imitation frameworks leverage pre-trained vision-LLMs and pose alignment to achieve ~82.5% average success in long-horizon tasks (block stacking, tea preparation), with no additional model training or annotation (Wichitwechkarn et al., 29 Sep 2025).
  • Visual Manipulation and Transfer: Visual Imitation Made Easy demonstrates open-loop policy learning from pure RGB videos using off-the-shelf SfM and finger-detection, achieving 87.5% success in pushing and 62.5% in stacking on unseen objects (Young et al., 2020).
  • Cross-Embodiment and Third-Person Imitation: MIR enables high-fidelity imitation from human-hand or robot-arm demonstrations across domains, using manipulator-independent embeddings, with stacking success rates exceeding all baselines for non-robot embodiment (Zhou et al., 2021).
  • Conversational Agents: Reward-free GAIL frameworks attain optimal multi-turn conversational policies in search contexts, matching or exceeding RL baselines without annotated labels or hand-designed rewards (Wang et al., 2023).

5. Comparative Advantages and Limitations

Label-free imitation methods present several advantages: they dispense with costly action annotation and hand-designed rewards, apply directly to passive data sources such as raw video, and, as the results above indicate, can match or exceed labeled baselines in sample efficiency and robustness.

Challenges and open problems include:

  • Surrogate Reward Bias and Goal Assumptions: Many pipelines rely on the assumption that the final expert state is a “goal,” which may not generalize to multi-goal or cyclic tasks (Edwards et al., 2019).
  • State-Space Alignment: Cross-embodiment methods (e.g., human to robot) require state space translation or representation learning; most existing approaches assume agent and expert share the same observation space (Huang et al., 2024, Zhou et al., 2021).
  • Sensitivity to Model Regularization and Data Coverage: Sufficient state-transition coverage by expert demos is critical; under-coverage or high-dimensional distractor features can degrade performance unless architectures explicitly focus on local conditionals or actionability (Jaegle et al., 2021, Monteiro et al., 2023).
  • Expensive Pre-training or Computation: Methods involving large generative models, diffusion processes, or pre-trained vision models can be computationally expensive compared to classic behavior cloning (Huang et al., 2024, Wichitwechkarn et al., 29 Sep 2025).

6. Future Directions and Open Research Problems

Promising ongoing or future research questions include:

  • Relaxing Terminal-State Assumptions: Flexible weighting of goal states and learned termination criteria for cyclic or open-world tasks (Edwards et al., 2019).
  • Scaling to High-Dimensional, Pixel-Based Observations: Robust generative and adversarial models for pixel-space transitions remain challenging due to capacity, regularization, and coverage constraints (Jaegle et al., 2021, Huang et al., 2024).
  • Semantic Keypoint Discovery and Closed-Loop Correction: Improving alignment and generalization in one-shot manipulation via online re-estimation, semantic correspondence, and closed-loop feedback (Wichitwechkarn et al., 29 Sep 2025).
  • Uncertainty Quantification and Out-of-Distribution Detection: Integrating calibrated uncertainty estimates into value or transition models to prevent overconfident policy shaping in novel states (Edwards et al., 2019).
  • Incorporation of Language and Symbolic Abstractions: Expanding label-free frameworks to tasks requiring human-interpretable reasoning, multi-modality, or high-level instruction following (Wichitwechkarn et al., 29 Sep 2025, Wang et al., 2023).

Label-free imitation research bridges the gap between passive observational learning and actively supervised imitation, supporting highly scalable, annotation-free skill acquisition in robotics, manipulation, and autonomous agents.
