Sinkhorn Imitation Learning
- Sinkhorn Imitation Learning is a framework that minimizes a Sinkhorn-regularized optimal transport distance between expert and learner occupancy measures.
- It leverages an adversarially learned feature space with cosine distance to achieve differentiable and robust alignment of trajectories.
- Empirical results demonstrate competitive performance with methods like GAIL and AIRL, especially under limited expert demonstration scenarios.
Sinkhorn Imitation Learning (SIL) defines a tractable adversarial imitation learning (IL) framework in which the policy is optimized to minimize a Sinkhorn-regularized optimal transport (OT) distance between the occupancy measures of expert and learner. Rather than employing fixed metrics over state-action pairs, SIL leverages an adversarially learned feature space with cosine distance. The resulting approach unifies concepts from entropy-regularized inverse reinforcement learning, optimal transport theory, and adversarial imitation learning. SIL yields a differentiable and robust method for aligning non-overlapping occupancy measures, with competitive empirical and theoretical properties (Papagiannis et al., 2020).
1. Occupancy Measures and Imitation Learning
In an infinite-horizon, $\gamma$-discounted Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, a stochastic policy $\pi$ induces an occupancy measure over state-action pairs: $\rho_\pi(s, a) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \, P(s_t = s, a_t = a \mid \pi)$, where $P(s_t = s, a_t = a \mid \pi)$ is the probability of visiting $(s, a)$ at time $t$ under policy $\pi$. The occupancy measure provides a normalized, $\gamma$-discounted empirical distribution over the MDP's state-action space for the given policy.
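The normalized discounted visitation weights above can be estimated empirically from a finite rollout. The sketch below is illustrative only (the function name `discounted_occupancy` and its interface are assumptions, not the authors' code):

```python
import numpy as np

def discounted_occupancy(states, actions, gamma=0.99):
    """Empirical gamma-discounted occupancy weights for one finite rollout.

    Each visited (s_t, a_t) pair receives weight (1 - gamma) * gamma**t, so
    the weights of an infinite rollout would sum to 1; a truncated rollout
    sums to 1 - gamma**T.
    """
    T = len(states)
    weights = (1.0 - gamma) * gamma ** np.arange(T)
    return list(zip(states, actions)), weights

# Toy rollout of length 4 with scalar states and binary actions.
pairs, w = discounted_occupancy([0.1, 0.2, 0.3, 0.4], [1, 0, 1, 0], gamma=0.5)
# Weights: [0.5, 0.25, 0.125, 0.0625], summing to 1 - 0.5**4 = 0.9375.
```

In practice SIL treats the sampled state-action pairs of a batch of rollouts as an empirical measure with these (or uniform) weights when forming the OT problem.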
Imitation learning can be cast as minimizing a statistical distance between the expert's occupancy measure $\rho_E$ and that induced by the learner, $\rho_\pi$. Many classical IL algorithms, such as GAIL and AIRL, can be interpreted as variational or divergence minimization frameworks over these measures.
2. Sinkhorn (Entropic Optimal Transport) Distances
Given discrete probability measures $\mu \in \mathbb{R}^n$ and $\nu \in \mathbb{R}^m$ (normalized to sum to 1), a cost matrix $C \in \mathbb{R}^{n \times m}$, and regularization $\varepsilon > 0$, the entropic-regularized OT (Sinkhorn distance) is given in primal and dual forms:
- Primal: $W_\varepsilon(\mu, \nu) = \min_{\Gamma \in \Pi(\mu, \nu)} \langle \Gamma, C \rangle + \varepsilon \sum_{i,j} \Gamma_{ij} \log \Gamma_{ij}$,
where $\Pi(\mu, \nu)$ denotes the set of couplings $\Gamma$ with fixed marginals $\Gamma \mathbf{1} = \mu$ and $\Gamma^\top \mathbf{1} = \nu$.
- Dual: $W_\varepsilon(\mu, \nu) = \max_{u, v} \langle u, \mu \rangle + \langle v, \nu \rangle - \varepsilon \sum_{i,j} e^{(u_i + v_j - C_{ij})/\varepsilon}$.
The Sinkhorn algorithm (Sinkhorn-Knopp iterations) provides an efficient solution to the entropic OT problem and approximates both the optimal transport plan and the Sinkhorn loss. These distances are differentiable and possess smoothness whenever the marginals are in the interior of the simplex [(Luise et al., 2018), Thm 3.1].
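The Sinkhorn-Knopp iterations can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not the authors' code; the function name `sinkhorn` and its defaults are assumptions:

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.1, n_iters=500):
    """Sinkhorn-Knopp iterations for entropic OT (illustrative sketch).

    mu, nu: marginals (each summing to 1); C: cost matrix (n x m);
    eps: entropic regularization. Returns the coupling Gamma and the
    transport cost <Gamma, C>.
    """
    K = np.exp(-C / eps)            # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)          # rescale to match the column marginal
        u = mu / (K @ v)            # rescale to match the row marginal
    Gamma = u[:, None] * K * v[None, :]
    return Gamma, float((Gamma * C).sum())

# Uniform marginals, random cost.
rng = np.random.default_rng(0)
mu, nu = np.full(5, 1 / 5), np.full(7, 1 / 7)
C = rng.random((5, 7))
Gamma, cost = sinkhorn(mu, nu, C)
```

For small $\varepsilon$ the plain iterations can underflow (since $K = e^{-C/\varepsilon}$); practical implementations switch to log-domain updates in that regime.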
3. Adversarial Feature Space and Ground Cost Learning
Unlike prior work using fixed ground metrics, SIL defines the cost between state-action samples using a learned feature embedding $f_\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ with adversarial parameters $\phi$. The cost between two samples $x$ and $y$ is the cosine distance in embedding space: $c_\phi(x, y) = 1 - \frac{f_\phi(x)^\top f_\phi(y)}{\lVert f_\phi(x) \rVert \, \lVert f_\phi(y) \rVert}$. The feature space is updated adversarially to maximize the Sinkhorn distance, making expert and learner trajectories maximally discriminable, while the policy minimizes it; together they form a two-player saddle-point problem.
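Given learner and expert features, the pairwise cosine cost matrix can be computed with one matrix product. A minimal sketch (the helper name `cosine_cost_matrix` is an assumption; in SIL the rows and columns would be $f_\phi$ outputs for learner and expert samples):

```python
import numpy as np

def cosine_cost_matrix(X, Y, tiny=1e-8):
    """Pairwise cosine cost c(x, y) = 1 - <x, y> / (||x|| ||y||).

    X: (n, d) learner features; Y: (m, d) expert features.
    Returns an (n, m) cost matrix with entries in [0, 2].
    """
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + tiny)
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + tiny)
    return 1.0 - Xn @ Yn.T

# Orthogonal unit vectors: zero cost on the diagonal, cost 1 off-diagonal.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
C = cosine_cost_matrix(X, X)
```

Because the cost depends only on feature directions, the critic cannot inflate the distance simply by scaling feature magnitudes, which keeps the adversarial game bounded.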
4. Optimization Objective, Algorithmic Workflow, and Pseudocode
SIL solves the minimax game $\min_\theta \max_\phi W_\varepsilon(\rho_{\pi_\theta}, \rho_E; c_\phi)$, where $\theta$ are the policy parameters and $\phi$ the feature (cost) parameters.
- Critic (Feature) Update: $\phi \leftarrow \phi + \alpha_c \nabla_\phi W_\varepsilon(\rho_{\pi_\theta}, \rho_E; c_\phi)$, ascending the Sinkhorn distance.
- Policy Update: a per-sample reward proxy is extracted from the optimal coupling $\Gamma^\star$: $r_\phi(s_i, a_i) = -\sum_j \Gamma^\star_{ij} \, c_\phi\big((s_i, a_i), (s^E_j, a^E_j)\big)$. The policy parameters $\theta$ are then updated via TRPO with $r_\phi$ as the reward.
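Extracting the reward proxy from the coupling is a single row-wise contraction. An illustrative sketch (the function name `reward_proxy` is an assumption):

```python
import numpy as np

def reward_proxy(Gamma, C):
    """Per-sample reward r_i = -sum_j Gamma[i, j] * C[i, j].

    Gamma: optimal coupling from Sinkhorn; C: cost matrix between learner
    samples (rows) and expert samples (columns). Maximizing sum_i r_i is
    equivalent to minimizing the transport cost <Gamma, C>.
    """
    return -(Gamma * C).sum(axis=1)

# Two learner samples matched one-to-one with two expert samples.
Gamma = np.array([[0.5, 0.0], [0.0, 0.5]])
C = np.array([[1.0, 2.0], [3.0, 4.0]])
r = reward_proxy(Gamma, C)  # -> [-0.5, -2.0]
```

Samples whose transported mass lands on low-cost (feature-similar) expert samples receive rewards near zero, while poorly matched samples are penalized, which is exactly the signal TRPO needs.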
Pseudocode for SIL:

```python
for k in range(1, K + 1):
    # 1. Collect on-policy rollouts {τ_π}
    # 2. Randomly pair each τ_π with an expert trajectory τ_E
    # 3. Build cost matrix C_φ between all (s, a) and (s', a') via cosine cost
    # 4. Run Sinkhorn to obtain Γ*, W_ε(φ, θ)
    # 5. Critic update: φ ← φ + α_c ∇_φ W_ε(φ, θ)
    # 6. Compute reward proxy r_φ(s, a) = -∑_j Γ*_ij c_φ
    # 7. Policy update via TRPO using r_φ(s, a): θ ← TRPO(θ, {r_φ})
```
5. Theoretical Properties and Gradient Computation
The Sinkhorn distance is infinitely differentiable on the interior of the probability simplex, and the gradient with respect to the learner measure can be expressed in closed form. Let $K = e^{-C/\varepsilon}$ denote the Gibbs kernel; the gradient can be computed using linear solvers involving $K$ and the optimal coupling $\Gamma^\star$ [(Luise et al., 2018), Thm 4.1]. Algorithmic stability is inherited from the strict convexity of the entropic objective and the exponential convergence of Sinkhorn iterates in the Hilbert projective metric.
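One convenient way to see the closed form, consistent with the dual stated earlier, is the envelope-theorem (Danskin) identity for entropic OT, sketched here as a standard result rather than a quotation of the cited theorem:

```latex
\nabla_{\mu} W_{\varepsilon}(\mu, \nu) = u^{\star} \quad \text{(up to an additive constant)},
```

where $u^{\star}$ is the optimal dual potential: since the dual objective is linear in $\mu$, differentiating at the maximizer leaves only the term $\langle u, \mu \rangle$. The additive constant reflects the invariance $u \mapsto u + c$, $v \mapsto v - c$ of the dual, and is harmless when gradients are used inside simplex-constrained updates.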
SIL is a specific instance of entropy-regularized MaxEnt IRL with a particular choice of reward regularizer $\psi$, and alternates policy and feature-space updates to solve the minimax objective. Smoothness properties inherited from entropic OT metrics ensure the loss is well-behaved and amenable to stochastic optimization techniques.
6. Experimental Setup and Empirical Analysis
SIL was benchmarked on a suite of MuJoCo continuous control tasks: Hopper-v2, HalfCheetah-v2, Walker2d-v2, Ant-v2, and Humanoid-v2. Expert policies were generated via TRPO, and a variable number of expert demonstrations was used (with a separate budget for Humanoid), with all demonstrations subsampled. Evaluation metrics were:
- Cumulative environment reward (withheld from SIL at train time)
- Sinkhorn distance (with fixed cosine cost) between learned and expert occupancy
Network architecture:
- Policy/value: 2-layer, 128-unit MLP with ReLU
- Critic $f_\phi$: 2-layer, 128-unit MLP; output dimension $d$; learning rate in the range 4e to 9e
Baselines: Behavioral Cloning (BC), GAIL, AIRL (on-policy). Findings:
- SIL matches or surpasses GAIL/AIRL on the Sinkhorn metric, particularly with limited demos.
- Cumulative reward is competitive with state-of-the-art adversarial IL.
- Fixed (non-learned) cosine costs significantly underperform, highlighting the necessity of adversarial cost learning (Papagiannis et al., 2020).
7. Limitations and Prospective Extensions
SIL currently operates in an on-policy regime, requiring full trajectory rollouts to form the occupancy measure and compute $W_\varepsilon$. Sensitivity to hyperparameters (critic output dimension $d$, regularization $\varepsilon$, learning rates) is reported. Future directions include extension to off-policy RL, more expressive (sequence-aware) OT couplings, and a comprehensive architecture search for the critic feature extractor.
A plausible implication is that expanding the method to off-policy or batch IL settings may require relaxation of the exact computation of occupancy and OT distances, or introduction of temporal structure into the OT couplings. This suggests strong connections to recent advances in temporally coupled OT and off-policy divergence minimization frameworks.
Table: Comparison of SIL and Related Imitation Learning Approaches
| Algorithm | Divergence Metric | Cost Function | Requires Full Trajectories |
|---|---|---|---|
| SIL | Entropic Sinkhorn OT | Learned Cosine (critic) | Yes |
| GAIL | Jensen-Shannon (GAN) | Learned Discriminator | No (state-action pairs) |
| AIRL | Maximum Causal Entropy IRL | Learned Reward Function | No (local transitions) |
SIL's principal distinction lies in the use of an explicit OT geometry with entropic smoothing, adversarially learned cost, and full-trajectory alignment, as opposed to transition or state-pair matching in GAIL/AIRL (Papagiannis et al., 2020).