- The paper introduces FIDeL, a pipeline that integrates representation-based anomaly detection, conformal prediction thresholding, and VLM-based semantic filtering to identify failures in imitation learning.
- It demonstrates significant performance gains, with a +5.3% AUROC improvement on the BotFails benchmark and 86.72% AUROC on a real-world soldering task, outperforming baseline methods.
- The modular design decouples anomaly detection from semantic failure identification, enhancing interpretability and reducing false positives in robotic deployments.
Failure Identification in Imitation Learning via Statistical and Semantic Filtering
Introduction
The brittleness of imitation learning (IL) policies in robotic deployments arises from their inability to robustly handle out-of-distribution states, hardware faults, or human-induced disturbances during real-world execution. Existing vision-based anomaly detection (AD) techniques, while effective at detecting deviations from nominal behavior, exhibit a high false positive rate by failing to discriminate benign anomalies from true failures. The paper introduces FIDeL (Failure Identification in Demonstration Learning), a modular, policy-independent failure detection pipeline for robotic IL policies, emphasizing robust anomaly detection, adaptive thresholding, and semantic failure identification. FIDeL integrates one-class representation-based AD, dynamic conformal thresholding, and a VLM-based semantic filtering module.
Figure 1: Schematic of FIDeL, showing offline encoding, conformal calibration, online anomaly alignment, and semantic filtering.
Methodology
Representation-based Anomaly Detection
FIDeL constructs a compact statistical memory M of expert demonstrations. Visual observations are encoded via a fixed vision backbone (e.g., ResNet-18 or DINOv2) into patch-level representations. At inference, incoming observations are aligned with memory entries using optimal transport (OT) to compute anomaly scores. This OT-based permutation-invariant alignment enables robustness to variations in spatial location and temporal execution speed.
To convert anomaly scores into decisions, FIDeL leverages an extension of conformal prediction (CP), calibrating spatially- and temporally-aware thresholds from nominal data. CP maintains coverage guarantees independent of score distribution, outperforming parametric Gaussian thresholding in both accuracy and generalization—crucial for deployment over high-variance or non-Gaussian tasks.
VLM-based Semantic Filtering
Recognizing that anomaly scores alone are insufficient for accurate failure detection, FIDeL employs a post-hoc semantic filter using Qwen-2.5-7B Vision-LLM. The VLM is prompted with the candidate frame, corresponding expert reference, patch-wise anomaly heatmap, and contextual instruction. This enables discriminative reasoning, distinguishing task-critical failures (e.g., object drops, state-violating manipulations) from benign visual disturbances (e.g., scene clutter, minor background motion).
Figure 2: Patch-wise heatmaps localize anomalies detected by the representation-based AD and CP thresholding. Examples show both benign and failure-related anomalies.
BotFails Dataset
A notable contribution is the construction of BotFails, a multimodal benchmark for anomaly and failure detection in robotics. BotFails includes 10 tasks spanning domestic and industrial scenarios, each annotated with nominal, benign anomaly, and genuine failure frames, capturing a broad spectrum of out-of-distribution events relevant for real-world robotic deployment.
Figure 3: Representative tasks from BotFails – diverse domestic and industrial settings used for failure detection benchmarking.
Numerical Results
FIDeL outperforms multiple classes of AD and failure detection baselines on both the BotFails dataset and real robot execution traces. In anomaly detection, FIDeL yields a +5.3% AUROC improvement over the best alternative (logpZ0) on BotFails, and achieves 86.72% AUROC on the Real-Ï€ soldering task, demonstrating strong discriminative capacity.
Figure 4: Comparative thresholding performance across CP-time, CP-time+space, and Gaussian baselines.
The integration of semantic filtering achieves an additional +17.4% increase in end-to-end failure-detection accuracy, substantially reducing false positives attributed to benign deviations, especially compared to baseline CNF- and AE-based detectors.
Figure 5: End-to-end accuracy/weighted accuracy of FIDeL vs. other robotic failure monitoring pipelines.
Temporal/spatial conformal thresholding consistently surpasses other thresholding strategies in both balanced and weighted accuracy. Conformal thresholding ensures robustness under distributional shift, while Gaussian thresholding fails in tasks with heavy-tailed or multi-modal anomaly scores. The margin between raw anomaly detection and failure detection highlights the essential role of semantic filtering in practical deployments.
Theoretical and Practical Implications
By decoupling anomaly detection from semantic failure identification, FIDeL sets a new standard for interpretable, policy-independent robot monitoring pipelines. The use of optimal transport for representation alignment exploits structure in nominal demonstrations, while conformal prediction provides non-parametric guarantees on the false positive rate. The VLM filter showcases the capacity of large vision-LLMs to infuse task semantics and reason explicitly over local visual deviations.
Practically, this approach achieves greater operational stability for robotic systems exposed to unstructured environments, minimizing both unnecessary policy interruptions (by filtering benign events) and undetected catastrophic failures. The modular design facilitates integration with existing IL policies without requiring task- or policy-specific retraining.
Limitations and Future Directions
While FIDeL significantly improves failure detection, dependence on the diversity of demonstration data can lead to over-flagging benign state variations as anomalies, increasing VLM invocation cost. Current semantic filtering remains computationally expensive and sensitive to the prompt design and VLM calibration. Spatial and temporal invariance, though beneficial for many tasks, may limit sensitivity to non-Markovian failures where the action history is essential for correct classification.
Future research directions include explicit temporal modeling for sequential anomaly/failure detection, low-cost VLM alternatives for semantic filtering, and leveraging generative uncertainty estimation as an additional runtime supervision signal. Extending BotFails and similar benchmarks to massively multi-task and open-world domains will be key for the field.
Conclusion
FIDeL establishes a robust, interpretable, and high-accuracy approach for failure detection in imitation learning-driven robotics. By combining optimal transport–based AD, dynamic conformal thresholding, and semantic VLM filtering, the framework overcomes key limitations of existing methods—delivering strong gains both in anomaly and end-to-end failure identification. These methodological advances materially increase the reliability, explainability, and deployability of IL policies in safety-critical robotic applications.