
Sinkhorn Imitation Learning

Updated 15 February 2026
  • Sinkhorn Imitation Learning is a framework that minimizes a Sinkhorn-regularized optimal transport distance between expert and learner occupancy measures.
  • It leverages an adversarially learned feature space with cosine distance to achieve differentiable and robust alignment of trajectories.
  • Empirical results demonstrate competitive performance with methods like GAIL and AIRL, especially under limited expert demonstration scenarios.

Sinkhorn Imitation Learning (SIL) defines a tractable adversarial imitation learning (IL) framework in which the policy is optimized to minimize a Sinkhorn-regularized optimal transport (OT) distance between the occupancy measures of the expert and the learner. Rather than employing fixed metrics over state-action pairs, SIL leverages an adversarially learned feature space with cosine distance. The resulting approach unifies concepts from entropy-regularized inverse reinforcement learning, optimal transport theory, and adversarial imitation learning. SIL yields a differentiable and robust method for aligning non-overlapping occupancy measures, with competitive empirical and theoretical properties (Papagiannis et al., 2020).

1. Occupancy Measures and Imitation Learning

In the infinite-horizon, $\gamma$-discounted Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, a stochastic policy $\pi$ induces an occupancy measure over state-action pairs:

$$\mu_\pi(s,a) = (1-\gamma)\sum_{t=0}^\infty \gamma^t P_\pi(s_t = s, a_t = a)$$

where $P_\pi(s_t = s, a_t = a)$ is the probability of visiting $(s,a)$ at time $t$ under policy $\pi$. The occupancy measure provides a normalized, $\gamma$-discounted empirical distribution over the MDP's state-action space for the given policy.
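In practice the occupancy measure is estimated from finite rollouts. A minimal NumPy sketch of this estimate (not the authors' implementation; the helper name `empirical_occupancy` is hypothetical) assigns each sampled $(s_t, a_t)$ a weight proportional to $(1-\gamma)\gamma^t$ and renormalizes:

```python
import numpy as np

def empirical_occupancy(states, actions, gamma=0.99):
    """Discounted empirical occupancy weights for one finite rollout.

    Returns the stacked (s, a) samples and one weight per sample,
    proportional to (1 - gamma) * gamma^t, normalized to sum to 1.
    """
    T = len(states)
    weights = (1.0 - gamma) * gamma ** np.arange(T)
    weights /= weights.sum()  # renormalize: the horizon is finite
    samples = np.concatenate([states, actions], axis=1)
    return samples, weights

# toy rollout: 5 steps, 3-dim states, 1-dim actions
rng = np.random.default_rng(0)
s = rng.normal(size=(5, 3))
a = rng.normal(size=(5, 1))
samples, w = empirical_occupancy(s, a, gamma=0.9)
assert np.isclose(w.sum(), 1.0)
```

The weights decay geometrically in $t$, so early transitions dominate the measure, mirroring the $\gamma^t$ factor in the definition above.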

Imitation learning can be cast as minimizing a statistical distance $D(\mu_E, \mu_\pi)$ between the expert's occupancy measure $\mu_E$ and that induced by the learner $\pi$. Many classical IL algorithms, such as GAIL and AIRL, can be interpreted as variational or divergence-minimization frameworks over these measures.

2. Sinkhorn (Entropic Optimal Transport) Distances

Given discrete probability measures $\mu_E, \mu_\pi \in \mathbb{R}_+^n$ (normalized to sum to 1), a cost matrix $C \in \mathbb{R}^{n\times n}$, and regularization $\varepsilon > 0$, the entropic-regularized OT (Sinkhorn) distance is given in primal and dual forms:

  • Primal:

$$W_\varepsilon(\mu_E,\mu_\pi) = \min_{\Gamma\in\Pi(\mu_\pi,\mu_E)} \langle \Gamma, C\rangle - \varepsilon H(\Gamma)$$

where $\Pi(\mu_\pi,\mu_E)$ denotes the set of couplings with fixed marginals and $H(\Gamma) = -\sum_{ij}\Gamma_{ij}\log\Gamma_{ij}$.

  • Dual:

$$W_\varepsilon(\mu_E, \mu_\pi) = \max_{u, v \in \mathbb{R}^n} u^\top \mu_\pi + v^\top \mu_E - \varepsilon \sum_{i,j}\exp\Bigl(\frac{u_i + v_j - C_{ij}}{\varepsilon}\Bigr)$$

The Sinkhorn algorithm (Sinkhorn-Knopp iterations) provides an efficient solution to the entropic OT problem and approximates both the optimal transport plan $\Gamma^*$ and the Sinkhorn loss. These distances are differentiable and possess $C^\infty$ smoothness whenever the marginals lie in the interior of the simplex (Luise et al., 2018, Thm. 3.1).
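The Sinkhorn-Knopp iterations alternate rescalings of the Gibbs kernel $K = e^{-C/\varepsilon}$ to match the two marginals. A minimal NumPy sketch (a didactic version without the log-domain stabilization a production solver would use; the function name `sinkhorn` is an assumption):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.1, n_iters=200):
    """Sinkhorn-Knopp iterations for entropy-regularized OT.

    Returns the transport plan Gamma and the regularized cost
    <Gamma, C> - eps * H(Gamma).
    """
    K = np.exp(-C / eps)        # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)      # rescale to match the column marginal
        u = mu / (K @ v)        # rescale to match the row marginal
    Gamma = u[:, None] * K * v[None, :]
    H = -np.sum(Gamma * np.log(Gamma + 1e-30))  # entropy of the plan
    return Gamma, float(np.sum(Gamma * C) - eps * H)
```

For small $\varepsilon$ the plan concentrates on the cheapest matching; as $\varepsilon$ grows it blurs toward the independent coupling $\mu\,\nu^\top$, which is what makes the loss smooth and its gradients well behaved.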

3. Adversarial Feature Space and Ground Cost Learning

Unlike prior work using fixed ground metrics, SIL defines the cost between state-action samples using a learned feature embedding $f_\phi: \mathcal{S}\times \mathcal{A} \to \mathbb{R}^d$ with adversarial parameters $\phi$. The cost between two samples is computed as:

$$c_\phi((s,a), (s',a')) = 1 - \frac{\langle f_\phi(s,a), f_\phi(s',a')\rangle}{\|f_\phi(s,a)\|_2\, \|f_\phi(s',a')\|_2}$$

This cosine distance in embedding space yields maximal discriminability when aligning expert and learner trajectories. The feature space is updated adversarially to maximize the Sinkhorn distance, while the policy minimizes it, forming a two-player saddle-point problem.

4. Optimization Objective, Algorithmic Workflow, and Pseudocode

SIL solves the minimax game:

$$\min_{\theta} \max_{\phi} W_\varepsilon(\mu_E, \mu_{\pi_\theta})\big|_{C_\phi}$$

where $\theta$ are the policy parameters and $\phi$ the feature (cost) parameters.

  • Critic (Feature) Update:

$$\phi \leftarrow \phi + \alpha_{\mathrm{critic}} \nabla_\phi W_\varepsilon(\phi, \theta)$$

  • Policy Update:

A per-sample reward proxy is extracted from $\Gamma^*$:

$$r_\phi(s,a) = -\sum_{(s',a')\in \mathcal{D}_E} \Gamma^*((s,a),(s',a'))\, c_\phi((s,a),(s',a'))$$

The policy parameters are then updated via TRPO with $r_\phi(s,a)$ as the reward.
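Given the optimal plan and the cost matrix, the reward proxy is a row-wise weighted sum. A one-line NumPy sketch (the helper name `reward_proxy` is hypothetical):

```python
import numpy as np

def reward_proxy(Gamma, C):
    """Per-sample reward r_phi(s_i, a_i) = -sum_j Gamma[i, j] * C[i, j].

    Gamma: (n, m) optimal transport plan; C: (n, m) cost matrix.
    Row i indexes a learner sample, column j an expert sample.
    """
    return -(Gamma * C).sum(axis=1)
```

Learner samples that are transported cheaply to expert samples receive rewards near zero, while costly matches are penalized, which is exactly the signal TRPO then maximizes.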

Pseudocode for SIL:

for k in range(1, K+1):
    # 1. Collect on-policy rollouts {τ_π}
    # 2. Randomly pair each τ_π with an expert trajectory τ_E
    # 3. Build the cost matrix C_φ between all (s,a) and (s',a') via the cosine cost
    # 4. Run Sinkhorn iterations to obtain Γ* and W_ε(φ,θ)
    # 5. Critic update: φ ← φ + α_c ∇_φ W_ε(φ,θ)
    # 6. Compute the reward proxy r_φ(s,a) = -∑ Γ* c_φ
    # 7. Policy update via TRPO using r_φ(s,a): θ ← TRPO(θ, {r_φ})
This scheme ensures adversarially optimal matching between expert and learner rollouts (Papagiannis et al., 2020).

5. Theoretical Properties and Gradient Computation

The Sinkhorn distance is infinitely differentiable on the interior of the probability simplex, and the gradient with respect to the learner measure can be expressed in closed form. Letting $T^* = \arg\min_{T\in\Pi(\alpha,\beta)} \langle T, M\rangle - \varepsilon H(T)$, the gradient $\nabla_\alpha W_\varepsilon(\alpha, \beta)$ can be computed using linear solvers involving $T^*$ and $M$ (Luise et al., 2018, Thm. 4.1). Algorithmic stability is inherited from the strict convexity of the objective and the exponential convergence of Sinkhorn iterates in $\varepsilon$.

SIL is a specific instance of entropy-regularized MaxEnt IRL, with reward regularizer $\mathcal{R}_\phi(r) = -W_\varepsilon(\mu_\pi, \mu_E)\big|_{C_\phi}$, and alternates policy and feature-space updates to solve the minimax objective. Asymptotic properties inherited from smooth OT metrics ensure the loss is well behaved and allow the application of stochastic optimization techniques.

6. Experimental Setup and Empirical Analysis

SIL was benchmarked on a suite of MuJoCo continuous control tasks: Hopper-v2, HalfCheetah-v2, Walker2d-v2, Ant-v2, and Humanoid-v2. Expert policies were generated via TRPO, and a variable number of expert demonstrations ($\{2,4,8,16,32\}$; for Humanoid $\{8,16,32\}$) was used, with all demonstrations subsampled by a factor of 20. Evaluation metrics were:

  • Cumulative environment reward (withheld from SIL at train time)
  • Sinkhorn distance (with fixed cosine cost) between learned and expert occupancy

Network architecture:

  • Policy/value: 2-layer, 128-unit MLP with ReLU
  • Critic $f_\phi$: 2-layer, 128-unit MLP; output dimension $\in \{5, 10, 30\}$; learning rate $4\times 10^{-4}$ to $9\times 10^{-4}$
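The critic architecture above can be sketched in a few lines of NumPy (a forward-pass-only sketch under stated assumptions: the initializer scale, the helper names `init_critic`/`critic_forward`, and the default output dimension 5 are illustrative choices, not taken from the paper):

```python
import numpy as np

def init_critic(in_dim, hidden=128, out_dim=5, seed=0):
    """Initialize a 2-layer, 128-unit critic MLP f_phi (hypothetical scales)."""
    rng = np.random.default_rng(seed)
    scale = lambda n: 1.0 / np.sqrt(n)
    return {
        "W1": rng.normal(0.0, scale(in_dim), (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, scale(hidden), (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def critic_forward(params, x):
    """Embed state-action vectors x of shape (n, in_dim) with one ReLU layer."""
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]
```

The embeddings returned by `critic_forward` are what the cosine ground cost is computed over; in training, $\phi$ would be updated by ascent on $W_\varepsilon$.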

Baselines: Behavioral Cloning (BC), GAIL, AIRL (on-policy). Findings:

  • SIL matches or surpasses GAIL/AIRL on the Sinkhorn metric, particularly with limited demos.
  • Cumulative reward is competitive with state-of-the-art adversarial IL.
  • Fixed (non-learned) cosine costs significantly underperform, highlighting the necessity of adversarial cost learning (Papagiannis et al., 2020).

7. Limitations and Prospective Extensions

SIL currently operates in an on-policy regime, requiring full trajectory rollouts to form the occupancy and compute $W_\varepsilon$. Sensitivity to hyperparameters (critic output dimension, regularization $\varepsilon$, learning rates) is reported. Future directions include extension to off-policy RL, more expressive (sequence-aware) OT coupling, and comprehensive architecture search for the critic feature extractor.

A plausible implication is that expanding the method to off-policy or batch IL settings may require relaxation of the exact computation of occupancy and OT distances, or introduction of temporal structure into the OT couplings. This suggests strong connections to recent advances in temporally coupled OT and off-policy divergence minimization frameworks.

| Algorithm | Divergence Metric | Cost Function | Requires Full Trajectories |
|-----------|-------------------|---------------|----------------------------|
| SIL | Entropic Sinkhorn OT | Learned cosine (critic) | Yes |
| GAIL | Jensen-Shannon (GAN) | Learned discriminator | No (state-action pairs) |
| AIRL | Maximum causal entropy IRL | Learned reward function | No (local transitions) |

SIL's principal distinction lies in the use of an explicit OT geometry with entropic smoothing, adversarially learned cost, and full-trajectory alignment, as opposed to transition or state-pair matching in GAIL/AIRL (Papagiannis et al., 2020).
