
SAIL: Self-Supervised Adversarial Imitation Learning

Updated 2 February 2026
  • SAIL is a framework that combines self-supervised and adversarial learning to imitate expert behaviors without full action labels.
  • It employs techniques like inverse-dynamics for state-only data and discriminator-guided curriculum learning to reduce expert sample complexity.
  • SAIL achieves improved stability and sample efficiency, demonstrating robust performance across control and vision-based benchmarks.

Self-Supervised Adversarial Imitation Learning (SAIL) refers to a family of imitation learning algorithms that integrate self-supervised auxiliary signals with adversarial learning to efficiently acquire expert-like behaviors in sequential decision-making settings where expert action labels or reward functions are unavailable or limited. SAIL advances the Generative Adversarial Imitation Learning (GAIL) paradigm by leveraging state-only expert trajectories, exploiting structure in unlabelled data, or learning compact, invariant representations. These ingredients reduce expert sample complexity and improve the stability and scalability of imitation learning procedures (Monteiro et al., 2023, Wang et al., 2024, Jung et al., 2023, Li et al., 2022, Dharmavaram et al., 2021).

1. Problem Setting and Motivation

The canonical SAIL framework is defined over a Markov Decision Process (MDP) $M = (S, A, T, r, \gamma)$, where $S$ is the state space, $A$ is the action space, $T(s' \mid s, a)$ is the (unknown) transition kernel, $r(s, a)$ denotes the reward function (typically unobserved), and $\gamma$ is the discount factor. Unlike classical behavioral cloning, which requires access to expert $(s, a)$ pairs, SAIL settings frequently assume either state-only expert demonstrations $\mathcal{T}^e = \{\zeta^e_i = (s^e_0, \dots, s^e_{N_i})\}_{i=1}^E$ or high-dimensional observations (e.g., images) $\mathcal{D}_e = \{v^e_i\}$. The objective is to learn a policy $\pi_\theta$ such that its induced state (or observation) occupancy matches that of the expert, without access to rewards or full action supervision (Monteiro et al., 2023, Wang et al., 2024).

SAIL is motivated by two limitations of prior imitation learning approaches:

  • Local minima and instability: Iterative self-supervised schemes (e.g., purely inverse-dynamics-based learning from observation, LfO) are prone to poor local minima, e.g., taking no action until the episode terminates (Monteiro et al., 2023).
  • High sample complexity: Classical AIL requires large numbers of expert trajectories to learn discriminative policies, especially in high-dimensional or partially observed settings (Jung et al., 2023).

2. Algorithmic Structure and Optimization

SAIL approaches interleave self-supervised and adversarial learning objectives. The general SAIL algorithmic loop is characterized as follows:

Self-Supervised Action Decoding (for State-only LfO)

When actions are unavailable, SAIL employs an inverse-dynamics model $\mathcal{M}_\phi(s_t, s_{t+1}) \mapsto \hat p(a \mid s_t, s_{t+1})$, learned by minimizing the cross-entropy loss

$$\mathcal{L}_{\rm decode}(\phi) = -\frac{1}{|I^s|} \sum_{(s_t, a_t, s_{t+1}) \in I^s} \log \mathcal{M}_\phi(a_t \mid s_t, s_{t+1}),$$

where $I^s$ is a buffer of (self-supervised) labelled agent transitions. The model is periodically used to infer plausible pseudo-actions $\hat a^e_t$ for expert state pairs, supplying target labels for behavioral cloning of $\pi_\theta$ (Monteiro et al., 2023).
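The decoding loss above can be sketched in a few lines of numpy. This is a minimal illustration using a single linear-softmax layer as the inverse-dynamics model; the architecture, dimensions, and learning rate are illustrative choices, not values from the papers:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def decode_loss_and_grad(W, s_t, s_t1, a_t, n_actions):
    """Cross-entropy L_decode for a linear inverse-dynamics model.

    Maps concatenated (s_t, s_{t+1}) pairs to a distribution over
    discrete actions via a single weight matrix W (illustrative).
    """
    x = np.concatenate([s_t, s_t1], axis=1)          # (B, 2*dim_s)
    p = softmax(x @ W)                               # (B, n_actions)
    onehot = np.eye(n_actions)[a_t]                  # (B, n_actions)
    loss = -np.mean(np.log(p[np.arange(len(a_t)), a_t] + 1e-12))
    grad = x.T @ (p - onehot) / len(a_t)             # dL/dW
    return loss, grad

rng = np.random.default_rng(0)
s_t  = rng.normal(size=(32, 4))                      # toy agent transitions
s_t1 = rng.normal(size=(32, 4))
a_t  = rng.integers(0, 3, size=32)
W = np.zeros((8, 3))
loss0, g = decode_loss_and_grad(W, s_t, s_t1, a_t, 3)
W -= 0.1 * g                                         # one SGD step
loss1, _ = decode_loss_and_grad(W, s_t, s_t1, a_t, 3)
```

At zero weights the model is uniform over the three actions, so the initial loss equals $\log 3$; a single gradient step reduces it.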

Adversarial Discriminator Integration

An adversarial discriminator $D_\psi$ is trained to distinguish between expert transitions and agent-generated ones, typically on state (or observation) pairs. For state-only settings, the GAN-style loss takes the form

$$\mathcal{L}_D(\psi) = -\mathbb{E}_{(s, s') \sim \pi_E}\left[\log D_\psi(s, s')\right] - \mathbb{E}_{(s, s') \sim \pi_\theta}\left[\log(1 - D_\psi(s, s'))\right],$$

with the corresponding policy loss

$$\mathcal{L}_{\rm GAN}(\theta) = \mathbb{E}_{(s, s') \sim \pi_\theta}\left[-\log D_\psi(s, s')\right].$$

This adversarial pathway both guides the policy and prunes non-expert-like transitions from the buffer, forming an automatic curriculum and densifying reward signals (Monteiro et al., 2023).
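The two adversarial losses above can be computed directly from raw discriminator outputs; a minimal numpy sketch (function and variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_losses(logits_expert, logits_agent):
    """Compute L_D and L_GAN from raw discriminator outputs.

    D_psi(s, s') = sigmoid(logit); the discriminator pushes expert
    pairs toward 1 and agent pairs toward 0, while the policy is
    rewarded by -log D on its own transitions.
    """
    d_e = sigmoid(np.asarray(logits_expert, dtype=float))
    d_a = sigmoid(np.asarray(logits_agent, dtype=float))
    loss_d = -np.mean(np.log(d_e + 1e-12)) - np.mean(np.log(1.0 - d_a + 1e-12))
    loss_gan = np.mean(-np.log(d_a + 1e-12))
    return loss_d, loss_gan

# At logits of zero the discriminator is maximally confused (D = 0.5),
# giving L_D = 2 log 2 and L_GAN = log 2.
ld, lg = adversarial_losses(np.zeros(4), np.zeros(4))
```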

Integrated Objective and Training

The complete loss interleaves the self-supervised and adversarial paths as

$$\mathcal{L}(\theta, \phi) = \lambda_1 \mathcal{L}_{\rm decode}(\phi) + \lambda_2 \mathcal{L}_{\rm GAN}(\theta),$$

with optional forward-dynamics or other regularizers. Hyperparameter selection (e.g., grid search) maintains balance between the loss terms (Monteiro et al., 2023).

High-Level Pseudocode

The main training loop typically comprises (i) self-supervised inverse-dynamics updates, (ii) action decoding for expert state pairs, (iii) policy updates using the decoded dataset, (iv) discriminator training, and (v) pruning/fusing agent rollouts for future decoding. Enhanced variants (e.g., for high-dimensional or multi-modal data) add distributional regularization, contrastive representation learning, and latent skill or interaction graph conditioning (Wang et al., 2024, Li et al., 2022, Dharmavaram et al., 2021).
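The five-step loop can be sketched as a runnable skeleton. All components below are toy stand-ins (random rollouts, a fixed scoring function, a 0.3 pruning threshold); the actual method uses learned networks and RL/BC updates at each step:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_ACT = 4, 2

def rollout(n=16):
    """(toy) collect agent transitions (s_t, a_t, s_{t+1})."""
    return [(rng.normal(size=DIM), int(rng.integers(N_ACT)), rng.normal(size=DIM))
            for _ in range(n)]

def d_psi(s, s2):
    """(toy) discriminator score in (0, 1) for a state pair."""
    return float(1.0 / (1.0 + np.exp(-(s - s2).mean())))

expert_pairs = [(rng.normal(size=DIM), rng.normal(size=DIM)) for _ in range(16)]
buffer = []

for _ in range(3):
    buffer.extend(rollout())                        # fresh agent experience
    # (i) self-supervised inverse-dynamics update on `buffer` (omitted)
    # (ii) decode pseudo-actions for expert state pairs (stub: random labels)
    decoded = [(s, s2, int(rng.integers(N_ACT))) for (s, s2) in expert_pairs]
    # (iii) behavioral-cloning policy update on `decoded` (omitted)
    # (iv) discriminator update on expert vs. agent pairs (omitted)
    # (v) prune transitions the discriminator confidently rejects
    buffer = [(s, a, s2) for (s, a, s2) in buffer if d_psi(s, s2) > 0.3]
```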

3. Theoretical Justification and Properties

SAIL methods rely on several theoretical advantages, supported by formal and empirical analysis:

  • Curriculum via Discriminator Filtering: By only retaining transitions that are difficult for the discriminator to distinguish from expert, SAIL admits a form of adaptive curriculum, addressing stagnation and suboptimality in purely self-supervised updates (Monteiro et al., 2023).
  • Dense reward shaping: The gradient signal supplied by the adversarial discriminator is temporally dense and provides informative shaping to the policy, addressing reward sparsity and “no-action” traps (Monteiro et al., 2023).
  • Sample Efficiency via Self-Supervision: Learning distortion-invariant, predictive feature manifolds through self-supervised objectives reduces the effective VC dimension of the discriminator, thereby lowering the expert sample complexity required for successful adversarial matching (Jung et al., 2023).
  • Formal Connections to Cost-Regularized Apprenticeship Learning: SAIL variants that employ self-supervised regression analysis of interpolated expert–agent trajectories can be formulated as instances of cost-regularized apprenticeship learning with a learned regularizer that incorporates both adversarial and self-supervised shaping penalties (Dharmavaram et al., 2021).
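As a concrete illustration of the discriminator-driven curriculum, one simple filtering rule keeps only transitions the discriminator cannot confidently separate from expert data (the band width here is an illustrative hyperparameter, not a value from the papers):

```python
import numpy as np

def curriculum_filter(scores, band=0.2):
    """Return indices of transitions whose discriminator score lies
    near the decision boundary D = 0.5, i.e. transitions that are
    hard to distinguish from expert behavior."""
    scores = np.asarray(scores, dtype=float)
    return np.flatnonzero(np.abs(scores - 0.5) < band)

# Scores 0.05 and 0.97 are confidently classified and get pruned;
# the ambiguous 0.45 and 0.52 survive for the next decoding round.
keep = curriculum_filter([0.05, 0.45, 0.52, 0.97])
```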

4. Extensions for Visual, Multi-Modal, and Multi-Agent Settings

Visual and High-Dimensional Observations

In vision-based imitation settings, SAIL leverages representation learning—combining adversarial GAIL objectives with (i) unsupervised contrastive loss (InfoNCE) to shape replay-buffer representations, and (ii) supervised and calibrated contrastive losses to cluster expert images while adaptively softening separation from high-quality agent samples. The overall objective is

$$\min_{\theta, f} \max_{h_d} \; \mathcal{L}_{\rm dis} + \lambda_1 \mathcal{L}_{\rm UnSupCon} + \lambda_2 \mathcal{L}_{\rm C\text{-}SupCon}$$

and produces a highly discriminative, stable image encoder for downstream imitation policy learning (Wang et al., 2024).
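The unsupervised contrastive (InfoNCE) term can be written compactly in numpy. This is a minimal sketch: real implementations operate on learned image-encoder outputs, and the temperature `tau` is an illustrative choice:

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE over two batches of embeddings: z1[i] and z2[i] come
    from two augmentations of the same image (positives); all other
    rows in the batch serve as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))               # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_matched  = info_nce(z, z)         # perfectly aligned views: low loss
loss_shuffled = info_nce(z, z[::-1])   # positives deliberately mismatched
```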

Multi-Modal and Skill-Conditioned Imitation

In settings with unlabeled, mixed-behavior trajectories, SAIL approaches (e.g., CASSI) incorporate latent skill variables $z \in \{0, \dots, N^z - 1\}$ and augment adversarial imitation rewards with mutual-information-based skill diversity terms. The policy $\pi_\theta(a \mid s, z)$ is thereby induced to produce an expressive, diverse repertoire of behaviors, discovered in a self-supervised manner without explicit subskill labels (Li et al., 2022).
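A common way to realize such a mutual-information term is a variational reward bonus of the form $\log q(z \mid s) - \log p(z)$; a minimal sketch assuming a uniform skill prior (the exact shaping and the skill discriminator $q$ differ across papers):

```python
import numpy as np

def skill_reward(log_q, z, n_skills):
    """Variational MI bonus: log q(z | s) - log p(z), with a uniform
    prior p(z) = 1/N_z. Positive when the skill discriminator q can
    recover the active skill z from the visited state."""
    return float(log_q[z] + np.log(n_skills))

# A uniform discriminator yields zero bonus; a confident one, positive.
uniform   = np.log(np.full(4, 0.25))
confident = np.log(np.array([0.85, 0.05, 0.05, 0.05]))
r0 = skill_reward(uniform, 2, 4)
r1 = skill_reward(confident, 0, 4)
```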

Multi-Agent Imitation

Graph-based multi-agent SAIL (e.g., SS-MAIL) integrates self-supervised trajectory interpolation objectives in the discriminator, while employing dynamic interaction graphs and centralized soft actor-critic updates for agent policies. An exponential teacher-forcing curriculum, “Trajectory Forcing,” smoothly transitions from pure expert imitation to policy-generated rollouts, improving convergence, stability, and reward shaping in multi-agent coordinated tasks (Dharmavaram et al., 2021).

5. Empirical Results and Benchmarks

SAIL demonstrates consistently superior performance and sample efficiency across classic control, continuous-control, and vision-based benchmarks:

| Benchmark | Random | Expert | BCO/BC | GAIL | SAIL |
| --- | --- | --- | --- | --- | --- |
| CartPole-v1 | 21.9 ± 0.0 | 500 ± 0.0 | 218.5 ± 160.7 | 302.0 ± 158.9 | 500 ± 0.0 |
| MountainCar-v0 | –200 ± 0.0 | –98.0 ± 8.2 | –102.1 ± 4.2 | –200.0 ± 0.0 | –99.4 ± 1.8 |
| Acrobot-v1 | –499.4 ± 0.0 | –74.9 ± 8.6 | –80.2 ± 3.6 | –274.3 ± 116.9 | –78.8 ± 0.4 |
| LunarLander-v2 | –170.5 ± 0.0 | 256.8 ± 21.4 | 63.1 ± 79.5 | 120.2 ± 28.0 | 183.6 ± 5.6 |

On MuJoCo control with only 100 expert state–action pairs, SAIL achieves a 39% average improvement over standard GAIL; in complex pixel-based DMControl settings, SAIL outperforms GAIL baselines in both sample efficiency and final return (Monteiro et al., 2023, Jung et al., 2023, Wang et al., 2024).

SAIL with mutual-information skill regularization attains high skill diversity and fidelity, as measured by oracle-classifier entropy and negative cross-entropy, and its skill-conditioned policies transfer to real robot systems without fine-tuning (Li et al., 2022).

6. Implementation Details and Engineering

Successful SAIL pipelines share characteristic architectural and training choices:

  • Encoders/Policies: Shallow MLPs or convolutional encoders for tabular/visual data, often with attention, layer normalization, and Adam variants (Monteiro et al., 2023, Wang et al., 2024).
  • Discriminators: LSTM-based or MLP-based, sometimes with dropout, trained on state, observation, or (state,action) pairs (Monteiro et al., 2023).
  • Replay Buffers: Persistent buffers for self-supervised pseudo-labeling; transition selection filtered by discriminator thresholds to maintain curriculum (Monteiro et al., 2023).
  • Contrastive/Corruption Strategies: For representation learning, in-distribution swapping for non-image data, random shifts for images, and Barlow Twins decorrelation for action representations (Jung et al., 2023, Wang et al., 2024).
  • Policy Optimization: Behavioral cloning (state-only SAIL); off-policy RL (e.g., DDPG, SAC, TRPO) for adversarial policy updates; curriculum schedules or “Trajectory Forcing” to anneal from BC to RL (Dharmavaram et al., 2021).

The primary computational overhead arises from maintaining additional auxiliary networks—encoders, forward models, and skill or contrastive heads.

7. Limitations and Outlook

Limitations of SAIL approaches include:

  • Architectural/Compute Complexity: Multiple auxiliary networks increase computational demand relative to baseline IL methods (Jung et al., 2023).
  • Design Sensitivity: Performance can be sensitive to corruption rates, buffer filtering policies, horizon-length hyperparameters, and skill observation mappings (Li et al., 2022).
  • Lack of Formal Convergence Guarantees: While providing empirical justification and theoretical reductions in sample complexity or regularization, formal proofs of convergence for SAIL in general settings remain open (Monteiro et al., 2023).
  • Scope of Applicability: The discrete latent skill approach may not capture smoothly parameterized motion families; further, the requirements for informative imitation or skill mapping design are domain-dependent (Li et al., 2022).

Future research directions include lighter-weight auxiliary objectives, integration with offline datasets, continuous latent skill spaces, hierarchical reinforcement learning applications, and more principled theory for dynamics-aware invariances (Monteiro et al., 2023, Jung et al., 2023, Li et al., 2022).


SAIL thus formalizes a suite of methods that unite adversarial and self-supervised learning—across state-only, pixel-based, multi-modal, and multi-agent domains—to deliver stable, sample-efficient imitation by tightly coupling representation, reward shaping, and policy optimization (Monteiro et al., 2023, Wang et al., 2024, Jung et al., 2023, Li et al., 2022, Dharmavaram et al., 2021).
