
Unsupervised Imitation Learning from Observation

Updated 31 January 2026
  • UfO is a framework that learns control policies from expert state sequences without access to action data or reward signals.
  • It unifies distribution matching, adversarial optimization, and model-based planning to improve sample efficiency and stability.
  • UfO methods have demonstrated expert-level performance in continuous control and robotics with significantly fewer rollouts.

Unsupervised Imitation Learning from Observation (UfO) is the problem of learning a control policy solely from sequences of expert states (or high-dimensional observations) without access to expert actions or environment rewards. The goal is for the learner's behavior to match the distributional properties of the expert's trajectories, typically in terms of state or state-transition distributions. The UfO paradigm unifies algorithmic and theoretical insights from distribution-matching, model-based learning, adversarial optimization, and reinforcement learning (RL) to enable agents to acquire skills purely “by watching,” analogous to many aspects of human and animal learning.

1. Formal Problem Definition and Foundations

In UfO, the agent is provided with $N$ expert trajectories, each a sequence $\tau^E = (s_0, s_1, \dots, s_T)$ in a Markov Decision Process (MDP) $(S, A, P, r, p_0, T)$. The learner never observes expert actions or the environment's reward function. The key objective is to learn a policy $\pi(a|s)$ such that the resulting distribution over state transitions $\beta = \frac{1}{T} \sum_{i=0}^{T-1} \delta_{(s_i, s_{i+1})_\pi}$ matches the empirical expert transition distribution $\alpha = \frac{1}{T} \sum_{i=0}^{T-1} \delta_{(s_i, s_{i+1})_E}$. This is formalized as minimizing a discrepancy $D(\alpha, \beta)$, where $D$ is typically a divergence or metric on probability distributions, such as the Kullback-Leibler (KL) divergence, Jensen-Shannon divergence, or Wasserstein distance (Burnwal et al., 20 Sep 2025, Torabi et al., 2019, Chang et al., 2023).
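As a deliberately simplified illustration of this objective, the sketch below builds empirical transition distributions from discrete state trajectories and measures their discrepancy with total variation distance; the state labels and the choice of TV as the discrepancy $D$ are illustrative assumptions, not taken from any cited method.

```python
from collections import Counter

def transition_distribution(trajectory):
    """Empirical distribution over consecutive (s_i, s_{i+1}) pairs."""
    pairs = list(zip(trajectory[:-1], trajectory[1:]))
    total = len(pairs)
    return {pair: count / total for pair, count in Counter(pairs).items()}

def total_variation(alpha, beta):
    """One simple instance of the discrepancy D(alpha, beta) for discrete supports."""
    support = set(alpha) | set(beta)
    return 0.5 * sum(abs(alpha.get(x, 0.0) - beta.get(x, 0.0)) for x in support)

# Toy discrete trajectories (illustrative state labels).
expert = ["s0", "s1", "s2", "s1", "s2"]
learner = ["s0", "s1", "s1", "s2", "s1"]
alpha = transition_distribution(expert)
beta = transition_distribution(learner)
print(total_variation(alpha, beta))  # 0.25
```

In continuous state spaces the empirical measures would instead be compared through a learned discriminator or a metric such as the Wasserstein distance, as in the approaches surveyed below.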

Extensions to visual observation—where the "state" is a sequence of images, possibly in a partially observable setting—require learning representations or sufficient statistics from observation histories, but the core distribution-matching principle remains (Giammarino et al., 2023, Liu et al., 2017).

2. Taxonomy of Algorithmic Approaches

UfO algorithms can be broadly categorized according to their treatment of missing actions, use of model-based components, and choice of distribution-matching objectives:

  • Model-Free Adversarial Approaches: Directly fit a discriminator to distinguish expert from learner state-transition pairs (or raw observations) and use the discriminator output as a surrogate reward for policy optimization. Examples include Generative Adversarial Imitation from Observation (GAIfO) and its variants, which optimize

$$\min_\pi \max_D \; \mathbb{E}_{(s,s')\sim\alpha}[\log D(s,s')] + \mathbb{E}_{(s,s')\sim\beta}[\log(1-D(s,s'))].$$

(Torabi et al., 2021, Torabi et al., 2019, Burnwal et al., 20 Sep 2025)
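A minimal sketch of this min-max objective, assuming a toy logistic discriminator over scalar state pairs; the weights, data, and function names below are hypothetical stand-ins for a trained network and real rollouts.

```python
import math

def discriminator(theta, s, s_next):
    """Toy logistic discriminator D_theta(s, s') over scalar states,
    standing in for a neural network."""
    z = theta[0] * s + theta[1] * s_next + theta[2]
    return 1.0 / (1.0 + math.exp(-z))

def gaifo_objective(theta, expert_pairs, learner_pairs):
    """Monte Carlo estimate of E_alpha[log D] + E_beta[log(1 - D)],
    which the discriminator maximizes and the policy minimizes."""
    e = sum(math.log(discriminator(theta, s, sn))
            for s, sn in expert_pairs) / len(expert_pairs)
    l = sum(math.log(1.0 - discriminator(theta, s, sn))
            for s, sn in learner_pairs) / len(learner_pairs)
    return e + l

def surrogate_reward(theta, s, s_next):
    """Reward handed to the RL step: r(s, s') = log D_theta(s, s')."""
    return math.log(discriminator(theta, s, s_next))

expert_pairs = [(1.0, 1.1), (1.1, 1.2)]   # hypothetical expert transitions
learner_pairs = [(0.0, 0.3), (0.3, 0.1)]  # hypothetical learner transitions
theta = [1.0, 1.0, -1.5]                  # hand-set discriminator weights
value = gaifo_objective(theta, expert_pairs, learner_pairs)
```

In practice the discriminator parameters are updated by gradient ascent on this objective while the policy is updated by RL on the surrogate reward, alternating between the two.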

An overview taxonomy is provided in Table 1:

| Approach | Key Mechanism | Reward/Objective |
| --- | --- | --- |
| GAIfO, LAIfO, DEALIO | Adversarial matching | Discriminator-based |
| OT/score/flow matching | Metric divergence | Wasserstein/KL/energy |
| Model-based BC via IDM | Action recovery | BC on pseudo-actions |
| MPC/planning-based | Model-predictive RL | Adversarial or flow-based |

3. Representative Algorithms and Technical Workflows

Adversarial Imitation from Observation (GAIfO and DEALIO)

In GAIfO (Torabi et al., 2021, Torabi et al., 2019), a discriminator $D_\theta(s, s')$ is trained to distinguish expert from policy-induced state transitions. The agent's policy maximizes the expected reward $r(s,s') = \log D_\theta(s,s')$ (or another monotone function thereof), driving the policy's transition measure towards the expert's occupancy.

DEALIO (Torabi et al., 2021) integrates a quadratic-structured discriminator and fitted local linear dynamics with trajectory-optimization (PILQR) for substantial improvements in sample complexity. By formulating the adversarial cost in a form compatible with LQR and path integral updates, DEALIO achieves 4× faster convergence than pure model-free adversarial approaches.

Optimal Transport–Based Imitation

The OT-UfO method (Chang et al., 2023) computes the 1-Wasserstein distance between empirical measures of state transitions, deriving a per-transition reward assignment by decomposing the optimal transport cost. This reward can be seamlessly used with any off-policy RL algorithm (e.g., TD3, DDPG), yielding state-of-the-art performance even with a single expert demonstration.

The core steps are:

  1. Construct empirical transition sets for expert and learner.
  2. Solve for an optimal coupling $P^\star$ using the Sinkhorn algorithm.
  3. Assign reward $r_j = -\sum_{i} c(x_i, y_j) P^\star_{i,j}$ to each learner transition $y_j$.
  4. Train the policy with this reward in RL.
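The steps above can be sketched with a minimal entropy-regularized Sinkhorn solver (pure Python, uniform weights over transitions; the squared-distance cost and all names below are illustrative assumptions, not taken from the paper):

```python
import math

def sinkhorn(cost, eps=0.1, iters=200):
    """Entropy-regularized OT between two uniform empirical measures.
    cost[i][j] = c(x_i, y_j); returns the coupling P (step 2 above)."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):  # alternate scaling to match both marginals
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_rewards(cost, P):
    """Step 3: per-transition reward r_j = -sum_i c(x_i, y_j) P_ij."""
    n, m = len(cost), len(cost[0])
    return [-sum(cost[i][j] * P[i][j] for i in range(n)) for j in range(m)]

# Toy example: scalar expert transitions x and learner transitions y,
# with an assumed squared-distance ground cost.
x, y = [0.0, 1.0], [0.1, 0.9]
cost = [[(xi - yj) ** 2 for yj in y] for xi in x]
P = sinkhorn(cost)
rewards = ot_rewards(cost, P)
```

Learner transitions close to expert ones receive rewards near zero, while distant ones are penalized; step 4 then feeds these rewards to any off-policy RL algorithm.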

KL and Energy-Based Divergence Minimization

SOIL-TDM (Boborzi et al., 2022) fits conditional density models for transition dynamics using normalizing flows and minimizes KL divergence between policy-induced and expert transition distributions, providing an analytic stopping criterion and invariant reward assignment.

NEAR (Diwan et al., 24 Jan 2025) builds noise-conditioned energy models of expert state transitions via denoising score matching, anneals the level of smoothing during RL, and avoids adversarial optimization pathologies.

Planning and Model-Predictive Control (MPAIL)

MPAIL (Han et al., 29 Jul 2025) replaces the policy in the adversarial imitation loop with a Model Predictive Path Integral (MPPI) planner, jointly learning cost and value functions via adversarial transitions and solving a KL-regularized trajectory optimization at each episode. The method demonstrates robust out-of-distribution generalization and interpretable, constraint-compatible planning.
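A bare-bones MPPI planning step can be sketched as follows, with a hand-written cost in place of MPAIL's adversarially learned one; the integrator dynamics, horizon, and temperature are illustrative assumptions.

```python
import math
import random

def mppi_plan(state, dynamics, cost, horizon=10, samples=64, lam=1.0, sigma=0.5):
    """One Model Predictive Path Integral step: sample noisy action sequences,
    roll them through the model, weight them by exp(-cost / lam) (a
    KL-regularized soft-min), and return the first action of the weighted
    average sequence.  In MPAIL the per-state cost would come from the
    learned discriminator; here it is a hand-written stand-in."""
    random.seed(0)  # fixed seed keeps the sketch reproducible
    seqs, costs = [], []
    for _ in range(samples):
        actions = [random.gauss(0.0, sigma) for _ in range(horizon)]
        s, c = state, 0.0
        for a in actions:
            s = dynamics(s, a)
            c += cost(s)
        seqs.append(actions)
        costs.append(c)
    cmin = min(costs)
    weights = [math.exp(-(c - cmin) / lam) for c in costs]
    z = sum(weights)
    return sum(w * seq[0] for w, seq in zip(weights, seqs)) / z

# Toy problem: 1-D integrator, cost = squared distance to a target at 1.0.
first_action = mppi_plan(0.0, lambda s, a: s + a, lambda s: (s - 1.0) ** 2)
```

In a full MPAIL-style loop this planning step would be re-solved at every environment step while the cost and value functions are trained adversarially.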

4. Theoretical Guarantees and Practical Comparison

Distribution-Matching Guarantees

Under sufficient model and policy expressivity, adversarial state-occupancy matching converges to the expert's transition distribution (Torabi et al., 2019). In the state-only setting, the saddle point of the min-max objective ensures that the occupancies or transition statistics are matched, up to the capacity of the discriminator and the support of demonstration data.

Optimal transport–based approaches provide well-defined metrics even when the learner’s and expert’s supports are disjoint, unlike KL-based objectives, which are not defined for non-overlapping supports (Chang et al., 2023).
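The difference is easy to see numerically. In the sketch below (illustrative point-mass distributions), KL blows up on disjoint supports while the 1-Wasserstein distance still returns a finite, informative value:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions on a shared index set."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf  # p has mass where q has none: KL is infinite
        total += pi * math.log(pi / qi)
    return total

def w1_point_masses(x, y):
    """1-Wasserstein distance between two point masses on the real line."""
    return abs(x - y)

# Expert mass entirely at position 0, learner mass entirely at position 3:
# the supports are disjoint.
expert = [1.0, 0.0, 0.0, 0.0]
learner = [0.0, 0.0, 0.0, 1.0]
print(kl(expert, learner))        # inf -> no usable learning signal
print(w1_point_masses(0.0, 3.0))  # 3.0 -> still says how far apart they are
```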

LAIfO (Giammarino et al., 2023) shows theoretical bounds for partially observable settings, demonstrating that matching the latent transition distribution in belief-space tightly bounds the learner’s suboptimality with respect to the expert.

Sample Efficiency

Model-based and planning-based methods consistently demonstrate superior sample-efficiency compared to pure model-free adversarial RL, reducing required environment interactions by 2–4× (Torabi et al., 2021, Han et al., 29 Jul 2025, Chang et al., 2023). The use of off-policy updates, analytic divergence minimization, and analytic stopping metrics further reduces learning time and increases practical reliability (Boborzi et al., 2022, Diwan et al., 24 Jan 2025).

Generalization

Stagewise algorithms that decouple proxy transition modeling from behavioral alignment (e.g., Gavenski et al., 24 Jan 2026) exhibit improved generalization to unseen initial states, as measured by reduced variance of episodic returns and normalized performance exceeding the teacher in multiple MuJoCo continuous control domains.

5. Empirical Benchmarks and Performance

Modern UfO algorithms have been systematically evaluated on continuous control suites (MuJoCo, PyBullet, DeepMind Control Suite), simulated robotics (Ant, Minitaur, BipedalWalker), visual navigation, and complex high-dimensional tasks such as humanoid locomotion and martial arts motion imitation.

Empirical evaluation consistently uses normalized return, average episodic reward, and coefficient of variation, supplemented by distance/energy alignment metrics, and in some cases, dynamic time-warping pose error or spectral arc-length for motion imitation tasks.
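The two most common summary statistics are straightforward to compute; a minimal sketch, where the normalization convention (random = 0, expert = 1) and the numbers are illustrative:

```python
import math

def normalized_return(returns, random_score, expert_score):
    """Mean return scaled so a random policy scores 0 and the expert scores 1
    (a common, but not universal, convention)."""
    mean = sum(returns) / len(returns)
    return (mean - random_score) / (expert_score - random_score)

def coefficient_of_variation(returns):
    """Std / mean of episodic returns; lower means more consistent behavior."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    return math.sqrt(var) / mean

returns = [900.0, 1000.0, 1100.0]  # illustrative episodic returns
nr = normalized_return(returns, random_score=0.0, expert_score=1000.0)
cv = coefficient_of_variation(returns)
print(nr)  # 1.0
```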

6. Open Challenges and Research Frontiers

Despite considerable advances in algorithmic and empirical performance, the field of UfO faces several significant open problems (Burnwal et al., 20 Sep 2025, Torabi et al., 2019):

  • Partial Observability and Visual Domain Shift: Learning robust invariants for third-person demonstrations, varying viewpoints, and different embodiments requires architectural innovation (e.g., context translation, domain adaptation).
  • Scalability and Optimization: Scaling non-adversarial density models (flows/score-based) to high-dimensional and visual observation spaces remains a challenge.
  • Demonstration Sparsity and Diversity: Reliable performance with few, noisy, or multimodal demonstrations is essential for practical deployment, motivating hybrid and hierarchical policy representations.
  • Safe and Constrained Imitation: Safety guarantees under incomplete demonstration coverage and integration of constraint satisfaction into imitation planning are only beginning to be addressed.
  • Performance Measurement and Theoretical Guarantees: Beyond episode return, more nuanced metrics (e.g., Wasserstein trajectory distance, behavioral alignment scores) are needed for evaluating imitation quality and generalization.
  • Sample-Efficient Real-World Deployment: Reducing sample complexity to within practical limits for real robot learning, especially without simulator assistance, is an ongoing focus, as is sim-to-real transfer under dynamics mismatch.

Addressing these limitations will require further integration of model-based planning, foundation generative models, robust representation learning, and principled off-policy or offline RL optimization.

UfO is deeply interwoven with advances in offline RL, model-based planning, generative representation learning, energy-based modeling, and hierarchical control. Many algorithmic motifs—such as DICE-based divergence optimization, goal-conditioned imitation, and self-supervised skill acquisition—are being adapted between these communities (Burnwal et al., 20 Sep 2025, Torabi et al., 2019).

Future research directions include:

  • Foundation models for universal imitation from large, diverse state-only datasets,
  • Planning-based frameworks incorporating learned constraints and diverse expert behaviors,
  • More interpretable and stable energy- or divergence-based imitation objectives,
  • Advanced metrics and benchmarking for imitation quality, robustness, and safety,
  • Transfer and adaptation across differing environments, morphologies, or multi-agent scenarios.

The field continues to move towards enabling truly general-purpose, robust, and data-efficient policy learning from rich, imperfect observations, without reliance on reward engineering or expert action access.
