
CRSAIL: Conformalized Rejection Sampling for AIL

Updated 6 December 2025
  • The paper introduces CRSAIL, a query-efficient algorithm that uses conformal prediction and novelty metrics to selectively query expert demonstrations.
  • Methodologically, CRSAIL employs a k-nearest neighbor state novelty score and a globally calibrated threshold to maintain rigorous query rate control.
  • Empirical results on MuJoCo benchmarks demonstrate up to 96% query reduction compared to DAgger while achieving expert-level performance.

Conformalized Rejection Sampling for Active Imitation Learning (CRSAIL) is a query-efficient algorithm for active imitation learning (AIL) that leverages geometric state-space novelty and conformal prediction to select which states require expert demonstration. By coupling nearest-neighbor-based novelty assessment with a globally calibrated threshold, CRSAIL enables principled, distribution-free control of expert labeling budgets, significantly reducing query costs compared to methods such as DAgger while maintaining or exceeding expert-level performance (Firouzkouhi et al., 29 Nov 2025).

1. Problem Formulation and Motivation

Imitation learning is framed within an unknown Markov Decision Process (MDP),

$$M = (\mathcal{X}, \mathcal{U}, P, r, \mathcal{X}_0, \mathcal{X}_T, T_{\max}),$$

where $\mathcal{X} \subset \mathbb{R}^d$ is the state space, $\mathcal{U}$ is the action space, $P$ the transition kernel, $r$ the reward (used for evaluation only), and episodes terminate on $\mathcal{X}_T$ or at $T_{\max}$. The expert policy $\pi_E$ provides demonstration pairs $(x, u_E)$, each incurring a unit query cost. A parametric learner $\pi_\theta$ is trained to minimize the on-policy imitation loss,

$$J(\theta) = \mathbb{E}_{x_{0:T} \sim \pi_\theta}\left[\sum_{t=0}^{T} \ell\big(\pi_\theta(x_t), \pi_E(x_t)\big)\right],$$

where $\ell$ is a per-step action loss, with trajectories generated under $\pi_\theta$. Pure behavior cloning suffers from covariate shift, leading to compounding errors when $\pi_\theta$ visits underrepresented states. AIL mitigates this by querying the expert selectively, but the total labeling cost is often dominated by the cost per demonstration, especially in GPU-intensive, human-in-the-loop, or repetitive-state settings.

Existing techniques such as DAgger query too frequently, while others require real-time interventions or rely on action uncertainty, which does not always capture state-space novelty. CRSAIL addresses this by quantifying geometric novelty and applying conformal prediction to control the query rate through a single, globally calibrated threshold.

2. Novelty Quantification: $k$-th Nearest Neighbor Distance

The core of CRSAIL is a state-space novelty score: the distance to the $k$-th nearest neighbor in the expert-labeled state set. For episode $i$, let $\mathcal{D}_i$ denote all expert-labeled state-action pairs, and $\mathcal{D}_i^{x}$ its projection onto states. For a query state $x$, the nonconformity (novelty) score is defined as:

$$s(x; \mathcal{D}_i^{x}) = \lVert x - x_{(k)} \rVert_2,$$

where $x_{(k)}$ is the $k$-th nearest neighbor of $x$ in $\mathcal{D}_i^{x}$ under the Euclidean norm. This score is the radius of the smallest Euclidean ball centered at $x$ that contains at least $k$ expert states. High values of $s$ indicate state-space sparsity, guiding the algorithm to query only in underrepresented regions.
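As a minimal sketch of the score described above, the following computes the distance from one query state to its $k$-th nearest expert state. The function name `knn_novelty` and the toy expert states are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def knn_novelty(x, expert_states, k):
    """Euclidean distance from x to its k-th nearest expert state (hypothetical helper)."""
    expert_states = np.asarray(expert_states, dtype=float)
    if not 1 <= k <= len(expert_states):
        raise ValueError("k must be between 1 and the number of expert states")
    dists = np.linalg.norm(expert_states - np.asarray(x, dtype=float), axis=1)
    # np.partition places the k-th smallest distance at index k - 1.
    return float(np.partition(dists, k - 1)[k - 1])

# Toy expert-labeled states (made up for illustration).
expert_states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])

near = knn_novelty([0.0, 0.0], expert_states, k=2)    # state inside the covered region
far = knn_novelty([10.0, 10.0], expert_states, k=2)   # state far from all expert data
print(near, far)
```

A state far from the expert data receives a much larger score than one surrounded by expert states, which is the property the query rule relies on.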

3. Conformal Calibration: Distribution-Free Threshold Selection

CRSAIL sets a single query threshold $\tau$ via conformal prediction, enabling rigorous statistical control of the expected query rate $\alpha$. Calibration proceeds as follows:

  1. Calibration Dataset: Execute $N_{\mathrm{cal}}$ episodes under the initial behavior-cloned policy $\pi_{\theta_0}$, collecting all visited states into $\mathcal{X}_{\mathrm{cal}}$.
  2. Score Computation: For each $x \in \mathcal{X}_{\mathrm{cal}}$, calculate $s(x; \mathcal{D}_0^{x})$.
  3. Threshold Selection: Define $m = \lceil (n+1)(1-\alpha) \rceil$ with $n = |\mathcal{X}_{\mathrm{cal}}|$, and set $\tau = s_{(m)}$, with $s_{(1)} \le \dots \le s_{(n)}$ the sorted scores.

Under exchangeability, the conformal guarantee ensures:

$$\mathbb{P}\big(s(x_{\mathrm{new}}; \mathcal{D}_0^{x}) > \tau\big) \le \alpha,$$

so at most an $\alpha$ fraction of new states will be queried in expectation. Larger $\alpha$ lowers $\tau$ and increases the nominal query rate; smaller $\alpha$ raises $\tau$ and decreases queries. This statistic is robust to outliers due to the high quantile selection and the $k$-th-neighbor scoring.
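The calibration rule can be illustrated with a short sketch that assumes the standard split-conformal rank $\lceil (n+1)(1-\alpha) \rceil$. The helper name `conformal_threshold` is hypothetical, and the exponential scores below are synthetic stand-ins for novelty scores, chosen only so that calibration and test scores are exchangeable.

```python
import math
import numpy as np

def conformal_threshold(scores, alpha):
    """Return the ceil((n + 1)(1 - alpha))-th smallest calibration score."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    scores = np.sort(np.asarray(scores, dtype=float))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: no finite threshold exists.
        return math.inf
    return float(scores[rank - 1])

# Synthetic check: calibration and fresh scores drawn from the same distribution.
rng = np.random.default_rng(0)
cal_scores = rng.exponential(size=999)
tau = conformal_threshold(cal_scores, alpha=0.1)
fresh_scores = rng.exponential(size=100_000)
query_rate = float(np.mean(fresh_scores > tau))
print(f"tau={tau:.3f}, empirical query rate={query_rate:.3f}")
```

The guarantee is marginal over the draw of the calibration set, so for any single calibration set the realized rate fluctuates around $\alpha$ rather than matching it exactly.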

4. The CRSAIL Algorithm and Computational Properties

CRSAIL alternates between closed-loop rollouts, batch (post hoc) expert queries, dataset aggregation, and policy updates, subject to budgets on total environment steps and on expert queries. The protocol is:

Step 1: Radius Calibration

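  • Roll out $\pi_{\theta_0}$ for $N_{\mathrm{cal}}$ episodes to collect $\mathcal{X}_{\mathrm{cal}}$.
  • Compute novelty scores $s(x; \mathcal{D}_0^{x})$ for all $x \in \mathcal{X}_{\mathrm{cal}}$.
  • Set $\tau = s_{(m)}$, where $m = \lceil (n+1)(1-\alpha) \rceil$ and $n = |\mathcal{X}_{\mathrm{cal}}|$.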

Step 2: Iterative Training

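  • At iteration $i$, roll out $\pi_{\theta_i}$, obtaining a trajectory $(x_0, \dots, x_T)$.
  • Query the expert at $x_t$ if $s(x_t; \mathcal{D}_i^{x}) > \tau$, forming $\mathcal{Q}_i = \{(x_t, \pi_E(x_t)) : s(x_t; \mathcal{D}_i^{x}) > \tau\}$.
  • Aggregate: $\mathcal{D}_{i+1} = \mathcal{D}_i \cup \mathcal{Q}_i$.
  • Update policy: $\theta_{i+1} = \arg\min_\theta \sum_{(x, u_E) \in \mathcal{D}_{i+1}} \ell(\pi_\theta(x), u_E)$, with $\ell$ a supervised action loss.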
  • Increment counters and repeat until budgets are exhausted.
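The two steps above can be sketched end to end on a toy problem. Everything environment-specific here is an assumption made for illustration: the linear dynamics, the linear expert `expert`, the least-squares learner `fit`, and all hyperparameter values. Only the control flow (calibrate once, then roll out, query post hoc, aggregate, refit) follows the protocol described in the text.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def expert(x):
    """Hypothetical expert: drive the state toward the origin."""
    return -x

def rollout(theta, horizon=30):
    """Roll out a linear policy u = x @ theta in toy linear dynamics."""
    x = rng.uniform(-1.0, 1.0, size=2)
    states = []
    for _ in range(horizon):
        states.append(x)
        x = x + 0.1 * (x @ theta) + 0.01 * rng.normal(size=2)
    return np.array(states)

def fit(states, actions):
    """Behavior cloning by least squares (stand-in for the policy update)."""
    theta, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return theta

def novelty(queries, expert_states, k):
    """k-th nearest-neighbor distance of each query state to the expert states."""
    d = np.linalg.norm(queries[:, None, :] - expert_states[None, :, :], axis=2)
    return np.partition(d, k - 1, axis=1)[:, k - 1]

k, alpha, cal_episodes, iterations, query_budget = 3, 0.2, 5, 10, 200

# Initial expert-labeled data and behavior-cloned policy.
S = rng.uniform(-1.0, 1.0, size=(10, 2))
A = expert(S)
theta = fit(S, A)

# Step 1: calibrate the threshold on rollouts of the initial policy.
cal_states = np.concatenate([rollout(theta) for _ in range(cal_episodes)])
scores = np.sort(novelty(cal_states, S, k))
rank = math.ceil((len(scores) + 1) * (1 - alpha))
tau = scores[rank - 1] if rank <= len(scores) else math.inf

# Step 2: roll out, query novel states in a batch, aggregate, refit.
queries_used = 0
for i in range(iterations):
    traj = rollout(theta)
    mask = novelty(traj, S, k) > tau          # scored against data before this episode
    new_states = traj[mask][: query_budget - queries_used]
    if len(new_states):
        S = np.concatenate([S, new_states])
        A = np.concatenate([A, expert(new_states)])
        theta = fit(S, A)
        queries_used += len(new_states)
    if queries_used >= query_budget:
        break

print(f"tau={tau:.3f}, queries used={queries_used}, expert states={len(S)}")
```

Because the toy expert is linear, the learner recovers it almost immediately; the sketch is meant to show the bookkeeping, not to reproduce the reported query savings.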

The naive per-episode cost consists of computing distances between every rollout state and every expert-labeled state, followed by $k$-nearest selection. Batch computation and a small $k$ render this overhead minor compared to environment simulation and policy optimization.
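One way to keep the naive computation cheap is to score a whole trajectory in chunks, using partial selection instead of a full sort. The sketch below is an assumption about how this could be done with NumPy, not the authors' implementation; `batch_knn_novelty` and the `chunk` parameter are hypothetical names.

```python
import numpy as np

def batch_knn_novelty(queries, expert_states, k, chunk=1024):
    """k-th nearest-neighbor distance for many query states, processed in chunks."""
    queries = np.asarray(queries, dtype=float)
    expert_states = np.asarray(expert_states, dtype=float)
    out = np.empty(len(queries))
    for start in range(0, len(queries), chunk):
        block = queries[start:start + chunk]
        # Pairwise Euclidean distances for this chunk: shape (len(block), N).
        d = np.linalg.norm(block[:, None, :] - expert_states[None, :, :], axis=2)
        # Partial selection of the k-th smallest entry per row; no full sort.
        out[start:start + chunk] = np.partition(d, k - 1, axis=1)[:, k - 1]
    return out

rng = np.random.default_rng(0)
expert_states = rng.normal(size=(500, 4))
queries = rng.normal(size=(50, 4))

fast = batch_knn_novelty(queries, expert_states, k=5, chunk=16)
# Reference: one query at a time with a full sort.
slow = np.array([np.sort(np.linalg.norm(expert_states - q, axis=1))[4] for q in queries])
print(np.allclose(fast, slow))
```

Chunking bounds peak memory by the chunk size times the number of expert states, which matters once the aggregated dataset grows large.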

5. Theoretical Guarantees and Hyperparameter Robustness

The conformally calibrated threshold $\tau$ provides finite-sample coverage: under exchangeability, at most an $\alpha$ fraction of new states should trigger queries in expectation. CRSAIL's query rate depends monotonically on $\alpha$, and performance is reported to be robust to both $\alpha$ and $k$. Empirically, $\alpha$ can be set to balance query efficiency against convergence rate. CRSAIL is less sensitive to this choice than action-uncertainty-based AIL, all tested values yielded robust convergence on the benchmarks, and the authors suggest a default setting.

6. Empirical Results on MuJoCo Robotics Benchmarks

CRSAIL was evaluated on three MuJoCo environments (Inverted Double Pendulum, Pusher, and Hopper) against DAgger, EnsembleDAgger, and ThriftyDAgger, using metrics such as convergence rate, queries to convergence, and total queries. Key findings, averaged over five offline datasets and all tested hyperparameter values, include:

  • Inverted Double Pendulum: CRSAIL reduced queries by ~96% versus DAgger, and ~65% versus the best prior method.
  • Pusher: ~72% fewer queries than DAgger, ~48% fewer than the best prior.
  • Hopper: Still competitive, outperforming ThriftyDAgger in total queries and matching or exceeding EnsembleDAgger in efficiency.

Empirical query rates closely tracked $\alpha$ across all tasks, affirming the efficacy of conformal calibration in distribution-free query rate control.

| Environment | Query Savings vs DAgger | Query Savings vs Best Prior | Notes |
|---|---|---|---|
| Inverted Double Pendulum | ~96% | ~65% | 100% convergence; robust across $\alpha$, $k$ |
| Pusher | ~72% | ~48% | 100% convergence; insensitive to hyperparameter choice |
| Hopper | Competitive | Superior to ThriftyDAgger | Task harder; state-space novelty less effective |

7. Discussion, Limitations, and Future Extensions

Advantages:

CRSAIL eliminates the need for real-time expert takeovers or action-uncertainty gating by adopting batch, post hoc querying. Its principled, distribution-free thresholding via conformal prediction exposes $\alpha$ as a global, easily interpretable control for expert query budgets. State-space novelty avoids conflating aleatoric and epistemic uncertainty and requires no auxiliary networks or complex estimators. Hyperparameter robustness to both $\alpha$ and $k$ was observed empirically across all evaluated domains.

Limitations:

Conformal guarantees require exchangeability, which may be violated as $\pi_\theta$ evolves during training, though the calibrated query-rate control remains reliable empirically. Tasks where success pivots on fine-grained action choices in narrow state regions (e.g., Hopper) can reveal failure modes for state-space novelty approaches. The use of a static threshold $\tau$ may be suboptimal as coverage increases, potentially motivating nonstationary query policies.

Potential Extensions:

Natural directions include implementing a time-varying $\alpha$ or recalibrating $\tau$ to promote query rate decay as state-space coverage improves; integrating action-space distances or learned adaptive metrics (e.g., Mahalanobis distances) into the novelty score; and extending conformal methods to nonexchangeable or online settings.

CRSAIL represents a geometric, statistically principled approach to expert query control in AIL, providing state-of-the-art query efficiency alongside interpretability and practical deployment robustness (Firouzkouhi et al., 29 Nov 2025).
