PeSOTIF Dataset Benchmark
- PeSOTIF dataset is a visual benchmark curated to evaluate and quantify failure modes in perception systems during long-tail traffic scenarios.
- It comprises 1,126 carefully annotated RGB images split into environment and object subsets that capture rare, challenging conditions.
- The dataset supports detailed evaluation using COCO-style metrics and entropy-based uncertainty measures, bridging classical detectors and LVLMs.
The PeSOTIF (Perception Safety of the Intended Functionality) dataset is a visual benchmark explicitly curated for evaluating and quantifying the failure modes of perception systems in long-tail traffic scenarios, particularly those relevant to SOTIF risk in automated vehicles. The dataset targets algorithmic shortcomings under challenging operational conditions, rather than sensor hardware failures, by assembling 1,126 key frames exhibiting rare environmental and object-centric trigger conditions that are systematically underrepresented in conventional datasets such as KITTI, nuScenes, and BDD100K. PeSOTIF is structured to facilitate benchmarking of both classical and probabilistic object detectors, as well as emerging large vision-language models (LVLMs), under precisely labeled adverse scenarios where the intended perception functionality can break down (Zhou et al., 30 Jan 2026, Peng et al., 2022).
1. Dataset Structure and Composition
PeSOTIF comprises 1,126 individual RGB frames selected primarily from real-world dash-cam footage, supplemented by scenes synthesized through targeted computer-generated perturbations. Each image contains at least one annotated ground-truth object from 11 traffic-related classes: car, bus, truck, train, bike, motor, person, rider, traffic sign, traffic light, and traffic cone. The average per-frame object count is approximately 2.27 "key objects" and 2.47 "normal objects" (Peng et al., 2022).
The dataset is split into two principal subsets reflecting SOTIF-triggering scene complexities:
- Environment subset: Frames stressing visibility and perception via environmental effects such as weather (rain, snow), lighting (glare, darkness), and particulates (dust, smoke).
- Object subset: Frames containing rare or visually atypical obstacles, including overturned vehicles, stray cargo, and unusual road users.
A high-level summary is presented in the table below:
| Subset | Primary Focus | Scene Types Covered |
|---|---|---|
| Environment | Weather, illumination, particulates | Rain, snow, fog, glare |
| Object | Atypical obstacles, rare events | Overturned vehicles, debris |
2. Trigger Condition Taxonomy and Annotation
Trigger sources within PeSOTIF are categorized hierarchically into primary, secondary, and tertiary labels, distinguishing both environmental and object-centric challenges. Major trigger categories include rain, snow, particulate, illumination, and unusual object appearances/postures. Perturbations are further designated as "natural" (inherent to the real-world scene) or "handcraft" (synthetically superimposed on base images).
Each bounding box is annotated according to the following schema:
- class: Integer (0–10) representing the object category.
- x_center, y_center, width, height: Box geometry, each normalized to [0, 1] relative to image width and height.
- safety_critical: Binary flag, with 1 indicating the object is a "key object" (i.e., failure to detect constitutes an immediate hazard), and 0 indicating a "normal object."
Perturbation and degradation levels—including severity (mild, medium, severe) and type (natural, handcraft, object)—are included as per-image metadata, allowing stratified performance analysis by degradation strength (Zhou et al., 30 Jan 2026, Peng et al., 2022).
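The schema above maps naturally onto a small parser. The following sketch assumes a YOLO-style whitespace-separated row per bounding box in the field order listed; the exact file layout and field order are assumptions, not confirmed by the dataset documentation quoted here.

```python
from dataclasses import dataclass

# Class names in the order implied by the integer IDs 0-10 (order assumed).
CLASSES = ["car", "bus", "truck", "train", "bike", "motor",
           "person", "rider", "traffic sign", "traffic light", "traffic cone"]

@dataclass
class PeSOTIFBox:
    class_id: int
    x_center: float   # normalized to [0, 1]
    y_center: float
    width: float
    height: float
    safety_critical: bool  # 1 = "key object", 0 = "normal object"

def parse_annotation_line(line: str) -> PeSOTIFBox:
    """Parse one annotation row: class x_center y_center width height safety_critical."""
    cls, xc, yc, w, h, key = line.split()
    return PeSOTIFBox(int(cls), float(xc), float(yc),
                      float(w), float(h), key == "1")

box = parse_annotation_line("6 0.512 0.430 0.080 0.210 1")
print(CLASSES[box.class_id], box.safety_critical)  # person True
```

Keeping the safety_critical flag on the box record makes it straightforward to stratify downstream metrics by key versus normal objects.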
3. Data Acquisition and Curation
Frames were sourced through:
- Controlled experiments on rain intensity and lighting using calibrated hardware (e.g., FLIR GS3-U3-41C6C-C).
- Real-world accident videos from the China SOTIF technical alliance and online repositories.
- Selection from public datasets prioritizing adverse conditions (e.g., ExDark, Raindrop for weather; BDD100K, KITTI for onboard traffic).
Human verification ensured each frame surpassed conventional dataset difficulty, included at least one valid object, and maintained sufficient visual quality for annotation. Synthetic perturbations (e.g., lens flare, contrast reduction, various noise models) were programmatically overlaid on selected frames to systematically vary scene difficulty (Peng et al., 2022).
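As an illustration of programmatic overlays of this kind, the sketch below implements two generic degradations (contrast reduction and additive Gaussian noise) with NumPy. The specific perturbation models and parameterizations used by the PeSOTIF authors are not specified here; these functions are assumptions chosen for clarity.

```python
import numpy as np

def reduce_contrast(img: np.ndarray, severity: float) -> np.ndarray:
    """Pull pixel values toward the per-channel mean; severity in [0, 1]."""
    mean = img.astype(np.float64).mean(axis=(0, 1), keepdims=True)
    out = (1.0 - severity) * img.astype(np.float64) + severity * mean
    return np.clip(out, 0, 255).astype(np.uint8)

def add_gaussian_noise(img: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Additive Gaussian noise, clipped back to the valid 8-bit range."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Sweeping `severity` or `sigma` over a grid yields the mild/medium/severe degradation strata recorded in the per-image metadata.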
4. Evaluation Protocols and Metrics
PeSOTIF is distributed exclusively as a test corpus; no official training/validation/test splits are provided, consistent with its design as a perception stress test. Evaluations report on the entire 1,126-frame set, with separate breakdowns for environment and object subsets.
Benchmarking follows COCO-style 2D object detection conventions:
- Intersection over Union (IoU): $\mathrm{IoU}(B_p, B_{gt}) = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$ for predicted box $B_p$ and ground-truth box $B_{gt}$
- Precision/Recall at IoU threshold $\tau$: $\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{TP + FN}$
- Average Precision (AP): per class at IoU = 0.50
- Mean Average Precision (mAP): mean over classes at IoU = 0.50
- Multi-threshold mAP (mAP@[.50:.95]): AP averaged over IoU thresholds from 0.50 to 0.95 in 0.05 increments
- Mean Average Recall (mAR): mean recall over classes at IoU = 0.50
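The IoU criterion underlying all of these metrics can be sketched in a few lines; boxes here are corner-format pixel tuples, an assumed convention for illustration only.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) in pixels."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# 50 px^2 of overlap against a 150 px^2 union -> IoU = 1/3
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

A detection counts as a true positive at IoU = 0.50 when it overlaps an unmatched ground-truth box of the same class by at least that amount.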
For probabilistic object detection (POD), further metrics are proposed:
- Alert Coverage Rate (ACR): fraction of key objects flagged as uncertain.
- False Alert Rate (FAR): proportion of normal objects wrongly flagged as uncertain.
- Classification Quality Score (CQS) and Uncertainty Quality Score (UQS): aggregate consistency measures between detection accuracy and quantification of uncertainty.
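ACR and FAR follow directly from their definitions: given a per-object uncertainty flag and the safety_critical annotation, count flagged objects within each group. The helper below is a minimal sketch of that bookkeeping, not the toolkit's actual implementation.

```python
def alert_rates(flagged, safety_critical):
    """ACR: fraction of key objects flagged as uncertain.
       FAR: fraction of normal objects wrongly flagged as uncertain."""
    keys = [f for f, s in zip(flagged, safety_critical) if s]
    normals = [f for f, s in zip(flagged, safety_critical) if not s]
    acr = sum(keys) / len(keys) if keys else 0.0
    far = sum(normals) / len(normals) if normals else 0.0
    return acr, far

# Three key objects (two flagged) and one normal object (flagged):
acr, far = alert_rates([True, True, False, True], [True, True, True, False])
```

A well-calibrated uncertainty estimator drives ACR toward 1.0 while keeping FAR near 0.0.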
5. Perception SOTIF Entropy and Uncertainty Quantification
To assess SOTIF-driven perception risk, PeSOTIF supports quantification via entropy-based analysis over multiple stochastic model predictions. For an image and class $c$, given $N$ stochastic model samples with per-sample class probabilities $p_c^{(i)}$, the per-class mean output probability is

$$\bar{p}_c = \frac{1}{N} \sum_{i=1}^{N} p_c^{(i)}$$

The associated binary decision entropy for each class reads:

$$H_c = -\bar{p}_c \log \bar{p}_c - (1 - \bar{p}_c) \log (1 - \bar{p}_c)$$

The final SOTIF entropy aggregates across classes, with penalization for missing ("ghost") detections:

$$H_{\mathrm{SOTIF}} = \sum_{c} H_c + \gamma \cdot \frac{N - n}{N}$$

where $n$ is the number of samples detecting the object, and $\gamma$ is a penalty factor. When $H_{\mathrm{SOTIF}}$ exceeds a threshold, the perception system triggers an alert. Experiments with an ensemble of 5 YOLOv5 models yielded an ACR of 90.0%, an FAR of 10.8%, a CQS of 0.858, and a UQS of 4.238, with per-subset breakdowns provided in the original evaluation (Peng et al., 2022).
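The entropy computation above can be sketched for a single object as follows. The exact aggregation used by the authors (log base, summation over classes, form of the ghost-detection penalty) is an assumption here; this version uses bits (log base 2) and a linear penalty.

```python
import math

def sotif_entropy(per_class_sample_probs, n_detected, n_samples, gamma=1.0):
    """SOTIF entropy for one object: binary decision entropy of the
    ensemble-mean probability, summed over classes, plus a penalty for
    samples that missed the object (aggregation form is an assumption)."""
    h = 0.0
    for probs in per_class_sample_probs:  # one list of sample probs per class
        p = sum(probs) / len(probs)       # ensemble-mean probability
        if 0.0 < p < 1.0:                 # entropy is 0 at the endpoints
            h += -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return h + gamma * (n_samples - n_detected) / n_samples

# Five samples all agree with p = 0.5: one full bit of decision entropy.
print(sotif_entropy([[0.5] * 5], n_detected=5, n_samples=5))
```

An alert fires whenever this value crosses the chosen threshold, so the threshold directly trades ACR against FAR.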
6. Benchmarking LVLMs and Traditional Detectors
Recent studies evaluate large vision-language models (LVLMs) such as Gemini 3, Doubao, and GPT-5 on PeSOTIF. Models are prompted with visual rulers and must output JSON-formatted bounding boxes and class IDs. LVLMs surpassed the YOLOv5 baseline by over 25 percentage points in recall under natural degradations (e.g., rain, sun glare, fog), demonstrating greater robustness to semantic ambiguity. Conversely, YOLOv5 maintained a lead in geometric mAP under synthetic perturbations, highlighting a trade-off between semantic reasoning (favoring recall) and spatial regression (favoring precision).
Key benchmarking observations:
| Detector | Recall under natural degradations (Δ vs. YOLOv5) | Geometric mAP under synthetic perturbations | Precision-recall trade-off |
|---|---|---|---|
| LVLMs (e.g., Gemini 3) | +25 percentage points | Lower than YOLOv5 | Favor recall, semantic robustness |
| YOLOv5 | Baseline | Higher | Retain spatial accuracy |
This complementary behavior supports deployment of LVLMs as high-level safety validators, providing semantic redundancy alongside conventional geometric detectors (Zhou et al., 30 Jan 2026).
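Treating an LVLM as a detector requires parsing its JSON output back into boxes before scoring. The schema below ({"class_id": ..., "box": [x1, y1, x2, y2]}) is an assumed illustration; the actual prompt format and response schema used in the cited evaluation are not specified here.

```python
import json

def parse_lvlm_detections(raw: str):
    """Parse an assumed JSON detection schema:
    [{"class_id": 0, "box": [x1, y1, x2, y2]}, ...]
    Returns (class_id, box) tuples, skipping malformed entries."""
    detections = []
    for d in json.loads(raw):
        try:
            detections.append((int(d["class_id"]), tuple(float(v) for v in d["box"])))
        except (KeyError, TypeError, ValueError):
            continue  # LVLM output can be malformed; drop unusable entries
    return detections

dets = parse_lvlm_detections('[{"class_id": 0, "box": [10, 20, 110, 220]}]')
```

Once parsed, the same COCO-style IoU matching used for YOLOv5 applies unchanged, which is what makes the recall/mAP comparison in the table meaningful.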
7. Access, Extensibility, and Research Utility
PeSOTIF’s first batch is publicly available under a CC BY-NC-SA 4.0 license at https://github.com/SOTIF-AVLab/PeSOTIF. Annotations are provided in both YOLO and COCO formats, with a Python-based evaluation toolkit supporting computation of all major metrics and uncertainty analyses described above.
Ongoing expansion includes new trigger conditions, additional annotated frames suitable for training, and extended modality coverage (e.g., LiDAR). External contributors are encouraged to submit further scenes and annotations or adapt the dataset’s schema for related research. A plausible implication is that PeSOTIF will enable increasingly comprehensive evaluation of perception robustness and SOTIF compliance in future automated driving pipelines (Peng et al., 2022).