
Sinkhorn Imitation Learning

Updated 15 February 2026
  • Sinkhorn Imitation Learning is a framework that minimizes a Sinkhorn-regularized optimal transport distance between expert and learner occupancy measures.
  • It leverages an adversarially learned feature space with cosine distance to achieve differentiable and robust alignment of trajectories.
  • Empirical results demonstrate competitive performance with methods like GAIL and AIRL, especially under limited expert demonstration scenarios.

Sinkhorn Imitation Learning (SIL) defines a tractable adversarial imitation learning (IL) framework in which the policy is optimized to minimize a Sinkhorn-regularized optimal transport (OT) distance between the occupancy measures of the expert and the learner. Rather than employing fixed metrics over state-action pairs, SIL leverages an adversarially learned feature space with cosine distance. The resulting approach unifies concepts from entropy-regularized inverse reinforcement learning, optimal transport theory, and adversarial imitation learning. SIL yields a differentiable and robust method for aligning non-overlapping occupancy measures, with competitive empirical and theoretical properties (Papagiannis et al., 2020).

1. Occupancy Measures and Imitation Learning

In the infinite-horizon, $\gamma$-discounted Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, a stochastic policy $\pi$ induces an occupancy measure over state-action pairs:

$$\mu_\pi(s,a) = (1-\gamma)\sum_{t=0}^\infty \gamma^t P_\pi(s_t = s, a_t = a)$$

where $P_\pi(s_t = s, a_t = a)$ is the probability of visiting $(s,a)$ at time $t$ under policy $\pi$. The occupancy measure provides a normalized, $\gamma$-discounted empirical distribution over the MDP's state-action space for the given policy.
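In practice the occupancy measure is estimated from finite rollouts. A minimal NumPy sketch of this estimate (not the authors' implementation; the helper name `empirical_occupancy` is hypothetical) assigns each sampled $(s_t, a_t)$ a weight proportional to $(1-\gamma)\gamma^t$ and renormalizes:

```python
import numpy as np

def empirical_occupancy(states, actions, gamma=0.99):
    """Discounted empirical occupancy weights for one finite rollout.

    Returns the stacked (s, a) samples and one weight per sample,
    proportional to (1 - gamma) * gamma^t, normalized to sum to 1.
    """
    T = len(states)
    weights = (1.0 - gamma) * gamma ** np.arange(T)
    weights /= weights.sum()  # renormalize: the horizon is finite
    samples = np.concatenate([states, actions], axis=1)
    return samples, weights

# toy rollout: 5 steps, 3-dim states, 1-dim actions
rng = np.random.default_rng(0)
s = rng.normal(size=(5, 3))
a = rng.normal(size=(5, 1))
samples, w = empirical_occupancy(s, a, gamma=0.9)
assert np.isclose(w.sum(), 1.0)
```

The weights decay geometrically in $t$, so early transitions dominate the measure, mirroring the $\gamma^t$ factor in the definition above.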

Imitation learning can be cast as minimizing a statistical distance $D(\mu_E, \mu_\pi)$ between the expert's occupancy measure $\mu_E$ and that induced by the learner $\pi$. Many classical IL algorithms, such as GAIL and AIRL, can be interpreted as variational or divergence-minimization frameworks over these measures.

2. Sinkhorn (Entropic Optimal Transport) Distances

Given discrete probability measures $\mu_E, \mu_\pi \in \mathbb{R}_+^n$ (normalized to sum to 1), a cost matrix $C \in \mathbb{R}^{n\times n}$, and regularization $\varepsilon > 0$, the entropic-regularized OT (Sinkhorn) distance is given in primal and dual forms:

  • Primal:

$$W_\varepsilon(\mu_E,\mu_\pi) = \min_{\Gamma\in\Pi(\mu_\pi,\mu_E)} \langle \Gamma, C\rangle - \varepsilon H(\Gamma)$$

where $\Pi(\mu_\pi,\mu_E)$ denotes the set of couplings with fixed marginals and $H(\Gamma) = -\sum_{ij}\Gamma_{ij}\log\Gamma_{ij}$.

  • Dual:

$$W_\varepsilon(\mu_E, \mu_\pi) = \max_{u, v \in \mathbb{R}^n} u^\top \mu_\pi + v^\top \mu_E - \varepsilon \sum_{i,j}\exp\Bigl(\frac{u_i + v_j - C_{ij}}{\varepsilon}\Bigr)$$

The Sinkhorn algorithm (Sinkhorn-Knopp iterations) provides an efficient solution to the entropic OT problem and approximates both the optimal transport plan $\Gamma^*$ and the Sinkhorn loss. These distances are differentiable and possess $C^\infty$ smoothness whenever the marginals lie in the interior of the simplex (Luise et al., 2018, Thm. 3.1).
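The Sinkhorn-Knopp iterations alternate rescalings of the Gibbs kernel $K = e^{-C/\varepsilon}$ to match the two marginals. A minimal NumPy sketch (a didactic version without the log-domain stabilization a production solver would use; the function name `sinkhorn` is an assumption):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.1, n_iters=200):
    """Sinkhorn-Knopp iterations for entropy-regularized OT.

    Returns the transport plan Gamma and the regularized cost
    <Gamma, C> - eps * H(Gamma).
    """
    K = np.exp(-C / eps)        # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)      # rescale to match the column marginal
        u = mu / (K @ v)        # rescale to match the row marginal
    Gamma = u[:, None] * K * v[None, :]
    H = -np.sum(Gamma * np.log(Gamma + 1e-30))  # entropy of the plan
    return Gamma, float(np.sum(Gamma * C) - eps * H)
```

For small $\varepsilon$ the plan concentrates on the cheapest matching; as $\varepsilon$ grows it blurs toward the independent coupling $\mu\,\nu^\top$, which is what makes the loss smooth and its gradients well behaved.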

3. Adversarial Feature Space and Ground Cost Learning

Unlike prior work using fixed ground metrics, SIL defines the cost between state-action samples using a learned feature embedding $f_\phi: \mathcal{S}\times \mathcal{A} \to \mathbb{R}^d$ with adversarial parameters $\phi$. The cost between two samples is computed as:

$$c_\phi((s,a), (s',a')) = 1 - \frac{\langle f_\phi(s,a), f_\phi(s',a')\rangle}{\|f_\phi(s,a)\|_2\, \|f_\phi(s',a')\|_2}$$

This cosine distance in embedding space yields maximal discriminability when aligning expert and learner trajectories. The feature space is updated adversarially to maximize the Sinkhorn distance, while the policy minimizes it, forming a two-player saddle-point problem.

4. Optimization Objective, Algorithmic Workflow, and Pseudocode

SIL solves the minimax game:

$$\min_{\theta} \max_{\phi} W_\varepsilon(\mu_E, \mu_{\pi_\theta})\big|_{C_\phi}$$

where $\theta$ are the policy parameters and $\phi$ the feature (cost) parameters.

  • Critic (Feature) Update:

$$\phi \leftarrow \phi + \alpha_{\mathrm{critic}} \nabla_\phi W_\varepsilon(\phi, \theta)$$

  • Policy Update:

A per-sample reward proxy is extracted from $\Gamma^*$:

$$r_\phi(s,a) = -\sum_{(s',a')\in \mathcal{D}_E} \Gamma^*((s,a),(s',a'))\, c_\phi((s,a),(s',a'))$$

The policy parameters are then updated via TRPO with $r_\phi(s,a)$ as the reward.
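Given the optimal plan and the cost matrix, the reward proxy is a row-wise weighted sum. A one-line NumPy sketch (the helper name `reward_proxy` is hypothetical):

```python
import numpy as np

def reward_proxy(Gamma, C):
    """Per-sample reward r_phi(s_i, a_i) = -sum_j Gamma[i, j] * C[i, j].

    Gamma: (n, m) optimal transport plan; C: (n, m) cost matrix.
    Row i indexes a learner sample, column j an expert sample.
    """
    return -(Gamma * C).sum(axis=1)
```

Learner samples that are transported cheaply to expert samples receive rewards near zero, while costly matches are penalized, which is exactly the signal TRPO then maximizes.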

Pseudocode for SIL:

for k in range(1, K+1):
    # 1. Collect on-policy rollouts {τ_π}
    # 2. Randomly pair each τ_π with an expert trajectory τ_E
    # 3. Build the cost matrix C_φ between all (s,a) and (s',a') via the cosine cost
    # 4. Run Sinkhorn iterations to obtain Γ* and W_ε(φ,θ)
    # 5. Critic update: φ ← φ + α_c ∇_φ W_ε(φ,θ)
    # 6. Compute the reward proxy r_φ(s,a) = -∑ Γ* c_φ
    # 7. Policy update via TRPO using r_φ(s,a): θ ← TRPO(θ, {r_φ})
This scheme ensures adversarially optimal matching between expert and learner rollouts (Papagiannis et al., 2020).

5. Theoretical Properties and Gradient Computation

The Sinkhorn distance is infinitely differentiable on the interior of the probability simplex, and the gradient with respect to the learner measure can be expressed in closed form. Letting $T^* = \arg\min_{T\in\Pi(\alpha,\beta)} \langle T, M\rangle - \varepsilon H(T)$, the gradient $\nabla_\alpha W_\varepsilon(\alpha, \beta)$ can be computed using linear solvers involving $T^*$ and $M$ (Luise et al., 2018, Thm. 4.1). Algorithmic stability is inherited from the strict convexity of the objective and the exponential convergence of Sinkhorn iterates in $\varepsilon$.

SIL is a specific instance of entropy-regularized MaxEnt IRL, with reward regularizer $\mathcal{R}_\phi(r) = -W_\varepsilon(\mu_\pi, \mu_E)\big|_{C_\phi}$, and alternates policy and feature-space updates to solve the minimax objective. Asymptotic properties inherited from smooth OT metrics ensure the loss is well behaved and allow the application of stochastic optimization techniques.

6. Experimental Setup and Empirical Analysis

SIL was benchmarked on a suite of MuJoCo continuous control tasks: Hopper-v2, HalfCheetah-v2, Walker2d-v2, Ant-v2, and Humanoid-v2. Expert policies were generated via TRPO, and a variable number of expert demonstrations ($\{2,4,8,16,32\}$; for Humanoid $\{8,16,32\}$) was used, with all demonstrations subsampled by a factor of 20. Evaluation metrics were:

  • Cumulative environment reward (withheld from SIL at train time)
  • Sinkhorn distance (with fixed cosine cost) between learned and expert occupancy

Network architecture:

  • Policy/value: 2-layer, 128-unit MLP with ReLU
  • Critic $f_\phi$: 2-layer, 128-unit MLP; output dimension $\in \{5, 10, 30\}$; learning rate $4\times 10^{-4}$ to $9\times 10^{-4}$
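The critic architecture above can be sketched in a few lines of NumPy (a forward-pass-only sketch under stated assumptions: the initializer scale, the helper names `init_critic`/`critic_forward`, and the default output dimension 5 are illustrative choices, not taken from the paper):

```python
import numpy as np

def init_critic(in_dim, hidden=128, out_dim=5, seed=0):
    """Initialize a 2-layer, 128-unit critic MLP f_phi (hypothetical scales)."""
    rng = np.random.default_rng(seed)
    scale = lambda n: 1.0 / np.sqrt(n)
    return {
        "W1": rng.normal(0.0, scale(in_dim), (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, scale(hidden), (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def critic_forward(params, x):
    """Embed state-action vectors x of shape (n, in_dim) with one ReLU layer."""
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]
```

The embeddings returned by `critic_forward` are what the cosine ground cost is computed over; in training, $\phi$ would be updated by ascent on $W_\varepsilon$.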

Baselines: Behavioral Cloning (BC), GAIL, AIRL (on-policy). Findings:

  • SIL matches or surpasses GAIL/AIRL on the Sinkhorn metric, particularly with limited demos.
  • Cumulative reward is competitive with state-of-the-art adversarial IL.
  • Fixed (non-learned) cosine costs significantly underperform, highlighting the necessity of adversarial cost learning (Papagiannis et al., 2020).

7. Limitations and Prospective Extensions

SIL currently operates in an on-policy regime, requiring full trajectory rollouts to form the occupancy and compute $W_\varepsilon$. Sensitivity to hyperparameters (critic output dimension, regularization $\varepsilon$, learning rates) is reported. Future directions include extension to off-policy RL, more expressive (sequence-aware) OT coupling, and comprehensive architecture search for the critic feature extractor.

A plausible implication is that expanding the method to off-policy or batch IL settings may require relaxation of the exact computation of occupancy and OT distances, or introduction of temporal structure into the OT couplings. This suggests strong connections to recent advances in temporally coupled OT and off-policy divergence minimization frameworks.

| Algorithm | Divergence Metric | Cost Function | Requires Full Trajectories |
|-----------|-------------------|---------------|----------------------------|
| SIL | Entropic Sinkhorn OT | Learned cosine (critic) | Yes |
| GAIL | Jensen-Shannon (GAN) | Learned discriminator | No (state-action pairs) |
| AIRL | Maximum causal entropy IRL | Learned reward function | No (local transitions) |

SIL's principal distinction lies in the use of an explicit OT geometry with entropic smoothing, adversarially learned cost, and full-trajectory alignment, as opposed to transition or state-pair matching in GAIL/AIRL (Papagiannis et al., 2020).
