DisentanGAIL: Domain-Robust Imitation Learning
- DisentanGAIL is an adversarial framework for domain-robust visual imitation learning that leverages stochastic latent bottlenecks and mutual information regularization.
- It learns domain-invariant policy representations using only high-dimensional observational data and off-policy SAC, bypassing the need for expert action information.
- Empirical evaluations in MuJoCo environments show near-expert performance across low- and high-dimensional tasks, underlining its scalability in diverse visual domains.
Disentangling Generative Adversarial Imitation Learning (DisentanGAIL) is an adversarial framework for domain-robust visual imitation learning under observational constraints. The method enables autonomous agents to mimic expert behavior using only high-dimensional observations (e.g., image trajectories) from demonstration sources that may differ systematically in appearance, viewpoint, or embodiment. Unlike conventional imitation learning paradigms, DisentanGAIL operates without access to expert states or actions and instead learns robust, task-relevant, domain-invariant representations by regularizing the discriminator through mutual information constraints. This approach facilitates generalization across visual and morphological domain shifts, yielding competitive performance in both low- and high-dimensional continuous control settings (Cetin et al., 2021).
1. Observational Imitation Formulation
DisentanGAIL addresses observational imitation: the learning agent receives only observation sequences from an expert, with the expert demonstrations and agent experiences formally denoted as $B_E = \{o^E_t\}$ and $B_\pi = \{o^\pi_t\}$, where $o^E_t \in O_E$, $o^\pi_t \in O_\pi$. Critically, expert and agent operate within different partially observable Markov decision processes (POMDPs):
$M_E = \langle S_E, A_E, O_E, P_E, R, \gamma \rangle, \qquad M_\pi = \langle S_\pi, A_\pi, O_\pi, P_\pi, R, \gamma \rangle,$
with distinct observation and action spaces ($O_E \neq O_\pi$, $A_E \neq A_\pi$), sharing a latent true reward function $R$. The challenge is to recover an agent policy in $M_\pi$ that maximizes the true reward, relying solely on $B_E$, even under large appearance and embodiment shifts.
2. DisentanGAIL Architectural Components
DisentanGAIL generalizes the GAIL architecture by introducing a stochastic latent bottleneck and mutual information regularization in the discriminator.
- Policy Generator ($\pi_\omega$): The policy is parameterized by $\omega$ and trained using off-policy Soft Actor-Critic (SAC). The agent has access to its own state $s_t$ for control.
- Discriminator ($D_\theta$, with $\theta = (\theta_1, \theta_2)$): This comprises two components:
- Preprocessor ($P_{\theta_1}$): A convolutional encoder produces Gaussian parameters for each observation $o$, yielding stochastic latent representations $z \sim P_{\theta_1}(z \mid o)$. The latent encodes goal completion, not domain-specific features.
- Invariant Discriminator ($S_{\theta_2}$): Four consecutive latents are concatenated into $\hat z = z_{t:t-3}$ and input to an MLP, estimating the probability $S_{\theta_2}(\hat z)$ that a sequence originates from the expert, enabling multi-frame domain-invariant evaluation.
- Mutual Information Estimator ($T_\phi$): Utilizes two independent statistics networks (MINE) to estimate lower bounds on the mutual information $I(Z; D)$ between latents and domain labels, taking the maximum of the two bounds to avoid local minima ("double statistics network").
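The components above can be sketched end-to-end. The following is a minimal numpy illustration with hypothetical dimensions and untrained random weights; the actual preprocessor is a convolutional encoder and the invariant discriminator is a deeper MLP:

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM = 16      # hypothetical flattened observation-feature size
LATENT_DIM = 8    # hypothetical latent size
N_FRAMES = 4      # consecutive latents fed to the invariant discriminator

# Hypothetical (untrained) weights standing in for the learned networks.
w_mu = rng.standard_normal((OBS_DIM, LATENT_DIM)) * 0.1
w_log_sigma = rng.standard_normal((OBS_DIM, LATENT_DIM)) * 0.01
w_head = rng.standard_normal(N_FRAMES * LATENT_DIM) * 0.1

def preprocessor(obs):
    """Stochastic bottleneck: map observation features to Gaussian
    parameters and sample a latent via the reparameterization trick."""
    mu, sigma = obs @ w_mu, np.exp(obs @ w_log_sigma)
    return mu + sigma * rng.standard_normal(LATENT_DIM)

def invariant_discriminator(latents):
    """MLP head (reduced here to one linear layer + sigmoid) scoring the
    concatenation of N_FRAMES consecutive latents as expert vs. agent."""
    z_hat = np.concatenate(latents)                  # (N_FRAMES * LATENT_DIM,)
    return 1.0 / (1.0 + np.exp(-(z_hat @ w_head)))   # ~ P(expert | z_hat)

frames = [rng.standard_normal(OBS_DIM) for _ in range(N_FRAMES)]
latents = [preprocessor(o) for o in frames]
score = invariant_discriminator(latents)             # value in (0, 1)
```

The stochastic sampling in the bottleneck is what makes the latent a lossy channel that the mutual information constraints below can meaningfully restrict.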
3. Mutual Information Constraints
Two mutual information regularization mechanisms enforce domain invariance of latent representations:
- Expert Demonstration Constraint: For demonstration data $B = B_E \cup B_\pi$, the mutual information between latents and domain labels is capped below 1 bit:
$I(Z_B; D_B) \leq 1 \text{ bit},$
preventing trivial domain separation by appearance.
- Prior Data Constraint: On combined unsolved prior data $B' = B'_E \cup B'_\pi$ (agent and expert trajectories from random policies), the mutual information is pushed toward zero:
$I(Z_{B'}; D_{B'}) \approx 0,$
yielding a mapping that further reduces spurious domain-specific differences.
These constraints are imposed via adaptive Lagrangian penalties in the discriminator objective.
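Adaptive Lagrangian penalties of this kind can be maintained by simple dual ascent. In the sketch below, `update_penalties`, the step size, and the exact clipping are illustrative assumptions (the paper's precise update schedule may differ); only the 1-bit cap and the push-to-zero target come from the constraints above:

```python
def update_penalties(lmbda_demo, lmbda_prior, mi_demo, mi_prior,
                     step=0.1, cap_bits=1.0):
    """Dual ascent on the Lagrange multipliers: grow a penalty while its
    mutual-information constraint is violated, shrink it once satisfied
    (multipliers are clipped to stay non-negative)."""
    lmbda_demo = max(0.0, lmbda_demo + step * (mi_demo - cap_bits))
    lmbda_prior = max(0.0, lmbda_prior + step * mi_prior)
    return lmbda_demo, lmbda_prior

# Violated demo constraint (1.5 bits > 1-bit cap): lambda_demo increases.
l1_up, l2 = update_penalties(1.0, 1.0, mi_demo=1.5, mi_prior=0.2)
# Satisfied demo constraint (0.4 bits < 1-bit cap): lambda_demo decreases.
l1_dn, _ = update_penalties(1.0, 1.0, mi_demo=0.4, mi_prior=0.2)
```

Because the prior-data target is zero, its multiplier only relaxes when the estimated mutual information on prior data actually vanishes.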
4. Objective Functions and Optimization Process
DisentanGAIL aggregates several learning signals:
- GAIL Adversarial Loss:
$J_G(\theta) = \mathbb{E}_{B_E,\, z\sim P_{\theta_1}}[\log S_{\theta_2}(\hat z)] + \mathbb{E}_{B_\pi,\, z\sim P_{\theta_1}}[\log(1 - S_{\theta_2}(\hat z))]$
- Policy Pseudo-Reward:
$R_D(o_{t:t-3}) = \log S_{\theta_2}(\hat z), \quad z \sim P_{\theta_1}(z \mid o)$
- SAC-Style Policy Objective:
$J(\omega) = \mathbb{E}_{\tau\sim p_{\pi_\omega}}\left[ \sum_{t=0}^T \left(R_D(o_{t:t-3}) - \alpha \log \pi_\omega(a_t|s_t)\right) \right]$
- Mutual Information Lower Bound (MINE):
$I_\phi(Z; D) \geq \mathbb{E}_{P(z,d)}[T_\phi(z,d)] - \log\left( \mathbb{E}_{P(z)P(d)}[e^{T_\phi(z,d)}] \right)$
- Regularized Discriminator Objective:
$\max_\theta \; J_G(\theta) - \lambda_1 \max\left(0,\, I_\phi(Z_B; D_B) - 1\right) - \lambda_2\, I_\phi(Z_{B'}; D_{B'})$
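The MINE lower bound can be estimated directly from statistics-network outputs on joint versus shuffled (product-of-marginals) pairs. A minimal numpy sketch, using random scores to stand in for a trained $T_\phi$:

```python
import numpy as np

def dv_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan bound used by MINE:
    E_{P(z,d)}[T] - log E_{P(z)P(d)}[exp(T)], estimated from the statistics
    network's outputs on joint pairs vs. independently shuffled pairs."""
    return float(np.mean(t_joint) - np.log(np.mean(np.exp(t_marginal))))

rng = np.random.default_rng(0)

# When Z and D are independent, joint and shuffled scores share the same
# empirical distribution, so by Jensen's inequality the bound cannot
# exceed zero -- consistent with zero mutual information.
scores = rng.standard_normal(10_000)
bound_a = dv_lower_bound(scores, rng.permutation(scores))

# "Double statistics network": run two estimators and keep the larger bound.
scores_b = 0.5 * scores
bound_b = dv_lower_bound(scores_b, rng.permutation(scores_b))
mi_estimate = max(bound_a, bound_b)
```

Taking the maximum over two independently parameterized estimators is what the text calls the double statistics network; it guards against one estimator settling in a poor local optimum and under-reporting the mutual information.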
A three-stage optimization is used per epoch, alternating updates for the discriminator (with mutual information penalties and adaptive step sizes for the regularization strengths $\lambda_1, \lambda_2$), the mutual information estimators, and the agent policy (off-policy SAC). 1-Lipschitz continuity of the statistics networks $T_\phi$ is enforced by spectral normalization. The double-MINE approach prevents domain information from being disguised in the latent bottleneck.
5. Empirical Evaluation and Comparative Performance
DisentanGAIL was evaluated across six MuJoCo environment families, each featuring systematic domain shifts:
| Environment Family | Morphological/Visual Changes | DisentanGAIL (norm. return) | TPIL Baseline (norm. return) |
|---|---|---|---|
| Inverted Pendulum/Reacher (low-dim) | Link number, color, camera angle | 0.94–1.02 | 0.19–0.81 |
| Hopper/Striker/etc. (high-dim) | Joint configuration, backgrounds | 0.71–0.92 | 0.06–0.36 |
Protocols involved training experts in source domains and learning with DisentanGAIL in target domains over 20 K agent steps per epoch, measuring normalized return (0=random, 1=expert) over five seeds. DisentanGAIL consistently achieved near-expert performance (~95–100%) in low-dimensional settings and strong results (up to 92%) in high-dimensional domains, outperforming alternative methods including TPIL, domain confusion loss, and versions of GAIL lacking mutual information regularization. Even with drastic changes in embodiment or appearance, DisentanGAIL produced viable controllers (Cetin et al., 2021).
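The normalized-return metric described above is a simple rescaling of raw episode returns; the concrete return values below are hypothetical:

```python
def normalized_return(ret, ret_random, ret_expert):
    """Rescale a raw episode return so that 0 = random policy and
    1 = expert; values above 1 would indicate returns exceeding the
    expert's."""
    return (ret - ret_random) / (ret_expert - ret_random)

# Hypothetical raw returns: random policy -50, expert 950, learned agent 900.
score = normalized_return(900.0, ret_random=-50.0, ret_expert=950.0)  # 0.95
```

Under this metric, the table's 0.94-1.02 range for the low-dimensional tasks corresponds to matching or slightly exceeding expert returns.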
6. Algorithmic Insights and Ablation Findings
Empirical ablations clarify key design choices within DisentanGAIL:
- Prior Data Role: Removing prior-data constraints significantly degrades robustness under pronounced visual shifts; prior data supplies unsupervised domain alignment.
- Constraint Strength: Excessively tight mutual information bounds impede encoding of goal-progress features, stalling learning; a loose cap of roughly 1 bit strikes the best balance.
- Disguising Prevention: Eliminating either spectral normalization or the double-MINE scheme reduces stability and performance by 5–10%, indicating their necessity.
- Latent Coupling Quality: Minimal L1 matching of agent and expert latents yields interpretable, goal-progress-aligned pairings, effectively suppressing domain-specific noise. Baseline domain confusion losses produce less structured, noisy couplings.
- Background Robustness: Even under extreme background changes, performance remains high (converging to 80+% of expert), with only mild convergence slowdown.
A plausible implication is that mutual information regularization, when properly balanced, is sufficient for robust cross-domain alignment in visual imitation settings.
7. Domain Impact and Research Connections
DisentanGAIL establishes a domain-robust paradigm for observational imitation learning, enabling state-of-the-art transfer across domain shifts previously problematic for visual RL agents. The framework enriches adversarial imitation approaches by integrating stochastic latent bottlenecks and distinct mutual information constraints, yielding discriminators that focus on task progression rather than domain origin. In conjunction with off-policy entropy-regularized learning, DisentanGAIL’s methodology demonstrates scalable, robust performance in problems ranging from simple pendulum control to complex high-dimensional robotic manipulation and locomotion. The approach expands the scope of visual imitation learning to scenarios where action and state alignment with experts is infeasible, suggesting broad applicability in robotics, autonomous navigation, and embodied AI with mismatched sensorimotor configurations (Cetin et al., 2021).