ScenePilot-4K: Driving Dataset for VLMs

Updated 3 February 2026
  • ScenePilot-4K is a large-scale driving video dataset featuring 27.7 million annotated front-view frames from 63 countries, designed to assess autonomous driving systems.
  • It employs a fully automated pipeline using VGGT, YOLO11s, and Qwen2-VL-72B-Instruct to generate multi-granularity annotations, including scene descriptions, risk scores, and object detection.
  • The dataset underpins ScenePilot-Bench’s four-axis evaluation, measuring scene understanding, spatial perception, motion planning, and GPT-Score alignment for safety-aware model assessment.

ScenePilot-4K is a large-scale driving video dataset designed to support comprehensive evaluation of vision-language models (VLMs) in autonomous driving contexts. Comprising 3,847 hours of first-person driving footage from 63 countries across 1,210 cities, sampled at 2 FPS to yield 27.7 million annotated front-view frames, ScenePilot-4K pairs global diversity and scale with multi-granularity annotations suitable for modeling scene semantics, perception, and planning. The dataset forms the empirical foundation for ScenePilot-Bench, enabling fine-grained, safety-aware model assessment along a four-axis evaluation suite covering scene understanding, spatial perception, motion planning, and GPT-Score alignment. All statistics, methodological details, and roles cited here appear in (Wang et al., 27 Jan 2026).

1. Dataset Scale, Composition, and Scene Distributions

ScenePilot-4K contains 3,847 hours of ego-centric driving video, partitioned into 27.7 million front-view frames by sampling at 2 Hz. The collection spans 63 countries/regions and 1,210 cities, yielding extensive geographic and environmental variety.

Environmental Class Distributions

Attribute       Category        Proportion (%)
Road Type       Urban           60.3
                Highway         19.0
                Rural           14.4
                Suburban         6.3
Weather         Cloudy          48.2
                Sunny           46.1
                Foggy            2.9
                Rainy            2.3
                Snowy            0.5
Time of Day     Day             92.1
                Night            7.9
Intersection    No              70.8
                Yes             29.2
Lane Count      ≤2 lanes        35.0
                ≥3 lanes        65.0
Risk Level      Low (1–3)       49.5
                Medium (4–7)    49.7
                High (8–10)      0.8
Driving Side    Right-hand      97.6
                Left-hand        2.4

Aggregated statistics over scene/environment types use the mean μ and variance σ² of the per-type fractions p_k: μ = (1/|K|) Σ_k p_k and σ² = (1/|K|) Σ_k (p_k − μ)².
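As a minimal sketch, the per-type mean and variance above can be computed directly from the reported weather proportions (the dictionary below simply restates the table's weather row):

```python
# Mean and variance of per-type fractions, as defined above.
# Values are the weather proportions reported for ScenePilot-4K.
weather = {"cloudy": 0.482, "sunny": 0.461, "foggy": 0.029,
           "rainy": 0.023, "snowy": 0.005}

fractions = list(weather.values())
mu = sum(fractions) / len(fractions)                           # (1/|K|) Σ p_k
var = sum((p - mu) ** 2 for p in fractions) / len(fractions)   # (1/|K|) Σ (p_k - μ)²

print(round(mu, 3), round(var, 4))  # 0.2 0.0492
```

Since the five weather fractions sum to one, the mean is exactly 1/5; the variance quantifies how strongly the distribution is skewed toward cloudy and sunny conditions.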

A mean of approximately 2.4 objects is detected per frame (variance ≈ 1.1), and the median ego-centric distance for detected objects is ~15 meters (5th–95th percentile range: 4–45 m). The empirical risk score r exhibits a long tail, with only 0.8% of clips scoring r ≥ 8.

2. Data Acquisition and Processing Pipeline

Videos are sourced from Bilibili and YouTube, building on OpenDV-2K, and pre-processed to yield non-overlapping 5-second clips (each with 10 frames at 2 Hz). The initial and final 180 seconds of each raw video are discarded to mitigate artifacts from vehicle start/stop behavior.
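The trimming and clip-segmentation step can be sketched as follows (a hypothetical helper, not the authors' code; the function name and signature are assumptions):

```python
# Sketch: segment a raw video into non-overlapping 5 s clips
# (10 frames at 2 Hz each), discarding the first and last 180 s.
def clip_starts(video_seconds: float, trim: float = 180.0,
                clip_len: float = 5.0) -> list[float]:
    """Start times (in seconds) of usable clips within one raw video."""
    usable_start, usable_end = trim, video_seconds - trim
    starts = []
    t = usable_start
    while t + clip_len <= usable_end:
        starts.append(t)
        t += clip_len
    return starts

# A 10-minute (600 s) video keeps 240 usable seconds -> 48 clips.
print(len(clip_starts(600.0)))  # 48
```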

All data are produced from a single front-facing monocular camera (full HD or better). No additional sensor modalities (such as lidar or RGB-D) are included, reflecting a pure vision-based pipeline.

Ego-motion is estimated with VGGT, producing per-frame rotations R_t ∈ SO(3) and translations t_t ∈ ℝ³, enabling recovery of the camera-to-world extrinsics T_{c→w} ∈ SE(3); the ego position at time t is C_t = −R_tᵀ t_t. Metric scale is calibrated using robust statistics on detected road ground-planes (median and median absolute deviation, MAD), with object priors for further correction.
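The camera-center recovery C_t = −R_tᵀ t_t is a standard identity; a minimal numpy sketch (not tied to VGGT's actual API) is:

```python
# Sketch: recovering the ego (camera-center) position C_t = -R_t^T t_t
# from a world-to-camera rotation/translation pair.
import numpy as np

def ego_position(R_t: np.ndarray, t_t: np.ndarray) -> np.ndarray:
    """Camera center in world coordinates for one frame."""
    return -R_t.T @ t_t

# Identity rotation, camera translated 2 m along z in camera coordinates:
C = ego_position(np.eye(3), np.array([0.0, 0.0, 2.0]))
print(C)  # [ 0.  0. -2.]
```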

3. Multi-Granularity Annotation Schema

Annotations are produced by an automated pipeline that integrates:

  • Scene descriptions and risk assessments: Qwen2-VL-72B-Instruct generates a natural-language scene description from the fourth frame of each clip. Example schema: "The weather is ⟨weather⟩, and it is ⟨time⟩. The road type is ⟨type⟩ and the road has ⟨lanes⟩ lanes. It is ⟨intersection/not⟩, and the risk level score is ⟨r⟩." Risk is labeled as a scalar r ∈ {1, …, 10}, bucketed into low (1–3), medium (4–7), and high (8–10) classes, which underlie the Risk-Class-Acc metric.
  • Object detection and key participant identification: YOLO11s detects and segments (via SAM, refined by morphology) five classes: vehicle, truck, bicycle, motorcycle, and pedestrian. Class-specific detection thresholds are used (e.g., T_vehicle = 0.5, T_pedestrian = 0.55). Output per detection includes class label, normalized bounding box, binary mask, and centroid in 3D.
  • Ego trajectories and camera calibration: Each clip provides a world-aligned ego trajectory T_ego ∈ ℝ^{10×3} (sampled at 2 Hz) and per-frame camera intrinsics K ∈ ℝ^{3×3} and extrinsics T_{c→w}. Calibration diagnostics include inlier counts and MAD.
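The schema above can be sketched as a per-clip record; the field names and the dataclass layout are assumptions for illustration, not the dataset's actual on-disk format:

```python
# Sketch (field names assumed): a per-clip annotation record mirroring
# the multi-granularity schema described above.
from dataclasses import dataclass

@dataclass
class ClipAnnotation:
    description: str      # natural-language scene description
    risk: int             # scalar risk score in {1, ..., 10}
    detections: list      # per-object class, bbox, mask, 3D centroid
    ego_trajectory: list  # 10 waypoints (x, y, z) sampled at 2 Hz

TEMPLATE = ("The weather is {weather}, and it is {time}. The road type is "
            "{road_type} and the road has {lanes} lanes. It is "
            "{intersection}, and the risk level score is {risk}.")

desc = TEMPLATE.format(weather="cloudy", time="day", road_type="urban",
                       lanes=2, intersection="not an intersection", risk=3)
clip = ClipAnnotation(description=desc, risk=3, detections=[],
                      ego_trajectory=[(0.0, 0.0, 0.0)] * 10)
print(clip.risk in range(1, 11))  # True
```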

The annotation process is fully deterministic and automated, employing robust statistics to control scale biases and balance recall against false positive rates in detection. Annotation quality depends on the performance of VGGT and YOLO11s; no human inter-annotator agreement is reported.

4. Dataset Partitioning, Regional Diversity, and Generalization Splits

Dataset splits are constructed with "multidimensional conditional uniformity" to ensure balanced distribution across weather, road-type, lane-count, intersection, risk, traffic density, and region. Typical experimental partitions include 200,000 VQA samples for fine-tuning and 100,000 for testing.
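One simple way to approximate "multidimensional conditional uniformity" is to stratify samples on the joint key of their scene attributes before splitting; this is an illustrative sketch under that assumption, not the authors' exact procedure:

```python
# Sketch (assumed procedure): stratify on joint attribute keys so that
# train/test splits preserve each attribute combination's proportions.
import random
from collections import defaultdict

def stratified_split(samples, keys=("weather", "road_type"),
                     test_frac=1 / 3, seed=0):
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in samples:
        strata[tuple(s[k] for k in keys)].append(s)
    train, test = [], []
    for group in strata.values():
        rng.shuffle(group)
        n_test = int(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Two strata of 6 samples each -> 8 train / 4 test.
samples = [{"weather": w, "road_type": r, "id": i}
           for i, (w, r) in enumerate([("sunny", "urban")] * 6 +
                                      [("rainy", "rural")] * 6)]
train, test = stratified_split(samples)
print(len(train), len(test))  # 8 4
```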

Cross-region generalization is supported by:

  • Leave-One-Country-Out (LOCO): For each country, a model is trained on data from the other 62 and evaluated on the held-out country.
  • Right-to-left traffic adaptation: Training uses right-hand-driving countries (e.g., China, US) and evaluation is on left-hand-driving regions (e.g., Japan, UK). Generalization loss is quantified as Δ = (Score_in-domain − Score_cross) / Score_in-domain.
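The relative generalization loss Δ reduces to a one-line computation; the example scores below are illustrative, not from the paper:

```python
# Sketch: cross-region generalization loss Δ as defined above.
def generalization_loss(score_in_domain: float, score_cross: float) -> float:
    """Relative drop from in-domain to cross-domain performance."""
    return (score_in_domain - score_cross) / score_in_domain

# An in-domain score of 80 dropping to 60 cross-domain is a 25% loss.
print(generalization_loss(80.0, 60.0))  # 0.25
```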

This design allows for empirical assessment of both spatial (regional) and cultural (traffic norm) model robustness.

5. Data Distributions and Supporting Statistics

Scene, object, and risk distributions provide context for benchmark interpretation:

  • Scene attribute percentages (per clip):
    • Cloudy 48.2%, Sunny 46.1%, Foggy 2.9%, Rainy 2.3%, Snowy 0.5%
    • Day 92.1%, Night 7.9%
    • Urban 60.3%, Highway 19.0%, Rural 14.4%, Suburban 6.3%
    • Non-intersection 70.8%, Intersection 29.2%
    • ≤2 lanes 35.0%, ≥3 lanes 65.0%
    • Low risk 49.5%, Medium risk 49.7%, High risk 0.8%
  • Object detection: Mean number of objects per frame ≈ 2.4, variance ≈ 1.1.
  • Distance statistics: Median detected object distance from ego ≈ 15 m; 5th and 95th percentiles at 4 m and 45 m.
  • Risk score: 0.8% of clips have risk r≥8r \geq 8, evidencing a highly skewed distribution.

These statistics inform dataset diversity and bias, as well as baseline and ceiling expectations for downstream models.

6. Role in Vision-LLM Evaluation

ScenePilot-4K supports a unified, multi-axis evaluation of VLMs via ScenePilot-Bench, emphasizing safety-critical autonomous driving applications.

Four-Axis Evaluation Suite

  • Scene Understanding: Evaluated with SPICE for semantic consistency in textual scene descriptions and Risk-Class-Acc for safety reasoning.
  • Spatial Perception: Assessed using object Class-Acc, ego-centric (EMRDE, EMRAE) and object-centric (OMRDE, OMRAE) distance and angle errors.
  • Motion Planning: Benchmarked by meta-action classification (accelerate, brake, turn, etc.), direction consistency (DCS-Acc), error on acceleration/heading (MRE-Acc, ARE), and trajectory displacement (ADE, FDE@T).
  • GPT-Score: Assigns a [0, 1] alignment score between model outputs and ground truth via GPT-4o.

Task examples derived from ScenePilot-4K include multi-attribute scene characterization, spatial queries (e.g., object-ego distances), hazard identification, and waypoint generation.

A weighting strategy balances semantic, spatial, planning, and GPT-Score components into a [0, 100] unified metric, facilitating direct cross-task comparison.
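A weighted aggregation of the four axis scores can be sketched as follows; the equal weights and example scores are assumptions for illustration, since the paper's exact weighting is not restated here:

```python
# Sketch (weights assumed): combining the four axis scores, each already
# on a [0, 100] scale, into a single unified metric by weighted average.
def unified_score(scores: dict, weights: dict) -> float:
    """Weighted combination of per-axis scores; result stays in [0, 100]."""
    total_w = sum(weights.values())
    return sum(weights[k] * scores[k] for k in scores) / total_w

scores = {"scene": 72.0, "spatial": 65.0, "planning": 58.0, "gpt": 80.0}
weights = {"scene": 0.25, "spatial": 0.25, "planning": 0.25, "gpt": 0.25}
print(unified_score(scores, weights))  # 68.75
```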

By providing robust and diverse annotations, from natural-language scene descriptions to per-frame 3D geometry, ScenePilot-4K enables empirical assessment and development of VLMs in autonomous driving, with emphasis on holistic, safety-aware reasoning and generalization across countries, environments, and driving norms (Wang et al., 27 Jan 2026).
