Impromptu VLA: Vision-Language-Action Data
- The paper introduces the Impromptu VLA Dataset, a large-scale, multimodal resource that integrates vision, language, and action for challenging, unstructured scenarios.
- It employs a robust curation pipeline combining automated chain-of-thought VLM annotation with human verification across eight driving and eleven robotics datasets.
- Empirical findings highlight significant improvements in collision rates and trajectory errors, demonstrating practical benefits for advancing autonomous systems.
The Impromptu VLA Dataset is a large-scale, meticulously curated resource for vision-language-action (VLA) learning, designed to overcome the deficits of conventional autonomous driving and embodied action datasets in capturing unstructured, "corner-case" scenarios. It is structured to provide robust supervision for models requiring joint perception, language understanding, and action planning, with a specific focus on cases that challenge the current state of self-driving intelligence and generalist robotic manipulation.
1. Dataset Composition, Sources, and Curation
The Impromptu VLA Dataset is constructed by mass collection and successive filtration of driving footage and embodied episodes, with an emphasis on rare, unstructured, or open-world tasks.
1.1 Autonomous Driving Domain
Over 2 million raw video clips (≈ 10 TB) were mined from eight open-source, large-scale autonomous driving datasets: Mapillary Vistas (MVD), ONCE, NAVSIM, nuScenes, Waymo Open, Argoverse V2, KITTI, and IDD. The raw clips varied in resolution and temporal frequency. All sequences were standardized to 2 Hz.
Keyclips (6.5 seconds each: 1.5 s past, 5 s future) were extracted for annotation. Stability filtering grouped adjacent keyclips into 15 s local-filter packs; labels persisted only if consistent across at least two "significant" keyclips per pack, reducing annotation of transient artifacts. An initial 10% of clips were textually described via Qwen2.5-VL 72B, then filtered to remove "conventional" scenes using a prompt-driven VLM classifier tuned against a 1,000-clip human validation set (>90% fidelity). Semantic clustering and human verification distilled the final dataset to approximately 80,000 core keyclips distributed across the eight sources. An 80/20 stratified train/val split preserved category balance across all unstructured corner-case types (Chi et al., 29 May 2025).
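The keyclip windowing implied by these numbers can be sketched as follows; the 2 Hz rate and the 1.5 s past / 5 s future split are from the source, while the function name and indexing scheme are illustrative:

```python
def keyclip_indices(center_idx: int, hz: float = 2.0,
                    past_s: float = 1.5, future_s: float = 5.0):
    """Frame indices for one 6.5 s keyclip (1.5 s past, 5 s future)
    in a sequence already standardized to 2 Hz."""
    past = int(round(past_s * hz))      # 3 past frames
    future = int(round(future_s * hz))  # 10 future frames
    return list(range(center_idx - past, center_idx + future + 1))

# A keyclip centered at frame 10 spans 14 frames: 3 past, the present, 10 future.
idx = keyclip_indices(10)
```

At 2 Hz, a 6.5 s clip therefore contains 14 frames, matching the 3-point past state and 10-waypoint future used in the trajectory labels.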
Table: Source Dataset Composition (Autonomous Driving)
| Dataset | Labeled Clips | Camera Views | Native Frequency |
|---|---|---|---|
| Mapillary | 22,062 | 1 | 2 Hz |
| ONCE | 18,093 | 7 | 2 Hz |
| NAVSIM | 18,600 | 8 | 2 Hz |
| nuScenes | 11,370 | 6 | 2 Hz |
| Waymo | 7,530 | 5 | 10 Hz |
| Argoverse V2 | 2,490 | 7 | 20 Hz |
| KITTI | 930 | 1 | 10 Hz |
| IDD | 930 | 1 | 15 Hz |
1.2 Embodied Robotic Manipulation Domain (Interleaved)
A related instantiation applies to generalist robot pipelines, where Impromptu VLA denotes a large-scale, interleaved multimodal dataset ("Open Interleaved X-Embodiment") constructed from 11 text+trajectory robotics datasets, totaling 210,000 interleaved episodes and 13 million frames. Annotation is automated via LLM-based keyphrase extraction, open-vocabulary detection (OWLv2), Qwen2.5-VL verification, episodic prompt assembly, and action sequence retention, as detailed in (Fan et al., 4 May 2025).
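One plausible record layout for such interleaved episodes is sketched below; the field names are illustrative, not taken from the source, but the structure (instruction text interleaved with grounded image crops, plus a retained action sequence) follows the pipeline described above:

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class ImagePatch:
    """Crop of a detected object (e.g. from OWLv2 open-vocabulary detection)."""
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    phrase: str                     # keyphrase the crop grounds

@dataclass
class InterleavedEpisode:
    """Instruction text interleaved with grounded image patches,
    followed by the retained low-level action sequence."""
    prompt: List[Union[str, ImagePatch]]
    actions: List[List[float]]      # per-step control vectors

ep = InterleavedEpisode(
    prompt=["pick up the", ImagePatch((40, 60, 120, 140), "red block")],
    actions=[[0.0, 0.1, 0.0], [0.0, 0.2, -0.1]],
)
```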
2. Novel Taxonomy of Unstructured Scenarios
A central contribution lies in defining a four-category taxonomy to stress-test autonomous systems with scenarios underrepresented in prior corpora (Chi et al., 29 May 2025):
- Roads with Unclear Boundaries: Indistinct or absent navigational cues; challenge for drivable-area segmentation.
- Temporary Traffic-Rule Changes: On-the-fly regulation shifts (e.g., construction, flaggers); challenges sign interpretation and behavioral adaptation.
- Unconventional Dynamic Obstacles: Rare or erratic agents (livestock, off-route cyclists); difficulties in prediction and planning.
- Challenging Road Conditions: Surface/weather-induced perceptual/actuation problems (potholes, ice, glare).
Each category is explicitly operationalized for stratified algorithmic evaluation, with representative frames and qualitative examples provided in the primary source.
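For stratified evaluation, the taxonomy can be encoded directly; the enum values below paraphrase the four categories, and the counting helper is an illustrative sketch rather than the authors' tooling:

```python
from enum import Enum

class UnstructuredScenario(Enum):
    UNCLEAR_BOUNDARIES = "roads_with_unclear_boundaries"
    TEMP_RULE_CHANGES = "temporary_traffic_rule_changes"
    DYNAMIC_OBSTACLES = "unconventional_dynamic_obstacles"
    CHALLENGING_CONDITIONS = "challenging_road_conditions"

def stratified_counts(labels):
    """Count clips per taxonomy class, e.g. to verify category-preserving splits."""
    counts = {c: 0 for c in UnstructuredScenario}
    for lab in labels:
        counts[UnstructuredScenario(lab)] += 1
    return counts

counts = stratified_counts(["temporary_traffic_rule_changes",
                            "challenging_road_conditions",
                            "temporary_traffic_rule_changes"])
```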
3. Annotation Pipelines: Question-Answering and Trajectories
Annotation is organized around two pillars: planning-centric QA supervision and action trajectory labels.
3.1 Planning-Oriented QA Pairs
Annotations employ special tokens to distinguish question types:
<V.R.U.> (vulnerable road users), <T.LIGHT> (traffic lights), <DYNAMIC_OBJECTS> (dynamic actors' intentions), <PLANNING> (meta-action), <TRAJ> (trajectory forecast).
Annotations are constructed via a chain-of-thought VLM pipeline (Qwen2.5-VL 72B) producing scene descriptions, static/movable feature parsing, category assignment with justification, and multi-task answers. Rigorous human verification (accept/reject/minor correction) yields final QA. QA accuracy is tracked per task and shows substantial post-verification improvements—for example, dynamic object QA accuracy rises from 0.20 to 0.92 and meta-planning from 0.56 to 0.84 on the held-out validation set.
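A minimal sketch of how a QA pair might be serialized with these tokens; only the special tokens themselves come from the source, and the surrounding "Q:/A:" layout is an assumption for illustration:

```python
SPECIAL_TOKENS = ["<V.R.U.>", "<T.LIGHT>", "<DYNAMIC_OBJECTS>",
                  "<PLANNING>", "<TRAJ>"]

def serialize_qa(task_token: str, question: str, answer: str) -> str:
    """Prefix a QA pair with its task token so supervision can be routed per task."""
    assert task_token in SPECIAL_TOKENS, "unknown task token"
    return f"{task_token} Q: {question} A: {answer}"

sample = serialize_qa("<PLANNING>",
                      "What meta-action should the ego vehicle take?",
                      "Slow down and yield to the flagger.")
```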
3.2 Action Trajectory Specification
Ego-centric future trajectory labels comprise 2D waypoint sequences at 2 Hz (10 waypoints over 5 s), with the coordinate frame origin at the present ego-pose, X-axis along vehicle heading. Past states (3 points) also include velocity and acceleration vectors. For embodied action, discrete low-level control actions can be specified at regular intervals.
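The ego-frame convention above (origin at the present pose, X-axis along heading) corresponds to a standard rigid 2D transform; this sketch assumes global waypoints and a yaw angle are available, with illustrative function names:

```python
import math

def to_ego_frame(waypoints_xy, ego_xy, ego_yaw):
    """Transform global 2D waypoints into the ego frame at the present pose:
    origin at the ego position, X-axis along the vehicle heading."""
    c, s = math.cos(-ego_yaw), math.sin(-ego_yaw)
    out = []
    for x, y in waypoints_xy:
        dx, dy = x - ego_xy[0], y - ego_xy[1]
        out.append((c * dx - s * dy, s * dx + c * dy))
    return out

# A point 5 m ahead along a 90-degree heading maps to (5, 0) in the ego frame.
pts = to_ego_frame([(0.0, 5.0)], (0.0, 0.0), math.pi / 2)
```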
3.3 Data Formats
Data is provided in unified structures (e.g., TFRecord, LMDB for QA; JSON for trajectories), with robust coordinate and timestamp conventions for integration into learning pipelines.
4. Evaluation Metrics and Experimental Protocols
Both open- and closed-loop metrics are used to measure perception, prediction, and planning competency:
4.1 Closed-Loop: NeuroNCAP (NNS)
Collision rates and a scenario score (NNS) quantify real-time driving robustness. The NNS metric,

$$\mathrm{NNS} = \begin{cases} 5.0, & \text{no collision} \\ 4.0 \cdot \max\!\left(0,\ 1 - \dfrac{v_i}{v_r}\right), & \text{otherwise,} \end{cases}$$

with $v_i$ the impact speed and $v_r$ the reference speed, rewards collision avoidance and mitigated crash severity.
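Under the standard NeuroNCAP definition (full credit for avoiding the collision, partial credit scaled by impact-speed reduction), the score can be computed as follows; treat this as a sketch of the published metric, not the benchmark's reference implementation:

```python
def neuroncap_score(collided: bool, v_impact: float, v_reference: float) -> float:
    """NeuroNCAP scenario score: 5.0 for avoiding the collision, otherwise
    up to 4.0 proportional to the impact-speed reduction vs. the reference."""
    if not collided:
        return 5.0
    return 4.0 * max(0.0, 1.0 - v_impact / v_reference)

# Halving the impact speed relative to the reference yields a score of 2.0.
score = neuroncap_score(True, 4.0, 8.0)
```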
4.2 Open-Loop: Trajectory Error
Per-timestep Euclidean distance between predicted and ground-truth future waypoints is averaged over 1 s, 2 s, and 3 s horizons:

$$\mathrm{L2}@T = \frac{1}{2T}\sum_{t=1}^{2T} \left\lVert \hat{\tau}_t - \tau_t \right\rVert_2,$$

where $\hat{\tau}_t$ and $\tau_t$ are the predicted and ground-truth waypoints at step $t$ (at 2 Hz, a $T$-second horizon spans $2T$ waypoints). Language annotation quality is evaluated by BLEU-4 and METEOR.
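A minimal sketch of the open-loop trajectory error, assuming 2 Hz waypoints and plain Euclidean distance (function and argument names are illustrative):

```python
def avg_l2_error(pred, gt, hz=2.0, horizon_s=3.0):
    """Mean per-waypoint L2 distance up to a horizon, with waypoints at `hz`."""
    n = int(horizon_s * hz)
    dists = [((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
             for (px, py), (gx, gy) in list(zip(pred, gt))[:n]]
    return sum(dists) / len(dists)

pred = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
gt   = [(1.0, 0.0), (2.0, 0.5), (3.0, 0.0), (4.0, 1.0)]
err = avg_l2_error(pred, gt, hz=2.0, horizon_s=2.0)  # first 4 waypoints
```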
Baselines include strong VLA models (e.g., DriveVLM, UniAD, EMMA). Two critical fine-tuning pipelines are evaluated: direct fine-tuning on nuScenes vs. Impromptu VLA pretraining followed by nuScenes fine-tuning.
5. Empirical Findings and Diagnostic Utility
Substantial performance improvements are documented for VLA models trained with Impromptu VLA:
- Closed-loop NNS improves by 21% (from 1.77 to 2.15/5.00); collision rates drop from 72.5% to 65.5%.
- Open-loop average trajectory error reduces from 0.34 m to 0.30 m (Qwen2.5-VL 3B), matching state-of-the-art.
- Pretraining with Impromptu VLA consistently reduces errors by ≈10–12% across prediction horizons and most markedly enhances side-collision avoidance (+38% NNS).
- QA diagnostics show dynamic object and meta-planning accuracy increases by 0.72 and 0.28 respectively; end-to-end predicted trajectory error drops from 6.62 m to 0.69 m.
These results underline the dataset's effectiveness for learning complex, generalizable perception and planning, especially in unstructured environments (Chi et al., 29 May 2025).
6. Integration and Practical Use
Official code, pretrained weights, and 43 GB of annotated keyclips are publicly available. Integration guidelines mandate 2 Hz, 6.5 s clip formatting, and category-preserving splits. Special QA tokens must be preserved for sequence labeling stability. The curation pipeline can be extended to novel unstructured scenarios using provided "pipeline.py" scripts, which automate VLM-based keyclip vetting, category assignment, QA generation, trajectory extraction, and human-in-the-loop verification:
```python
for raw_clip in new_dataset:
    desc = VLM.describe(raw_clip)            # chain-of-thought scene description
    if not VLM.is_unstructured(desc):
        continue                             # filter conventional scenes
    cat = VLM.categorize(desc)               # one of the four taxonomy classes
    QA = VLM.generate_QA(raw_clip, cat)      # multi-task QA
    traj = extract_ground_truth_trajectory(raw_clip)
    if human_verify(desc, QA, traj):         # human-in-the-loop gate
        append_to_ImpromptuVLA(raw_clip, desc, QA, traj)
```
7. Adaptation and Extensions
Blueprints adapted from the CoVLA dataset (Arai et al., 2024) provide detailed guidance for on-the-fly, edge-driven data acquisition, including timestamp-synchronized multi-sensor logging, real-time annotation, and human-in-the-loop corrections to reduce VLM hallucinations. This allows Impromptu VLA construction in dynamic, real-world contexts (e.g., fleet driving, heterogeneous robot deployments). The interleaved embodied version, as used in Interleave-VLA (Fan et al., 4 May 2025), can be scaled to new platforms by running the open pipeline on instruction+observation+action sources, enabling rapid augmentation of zero-shot generalization and flexible multi-modality policy training.
A plausible implication is that the Impromptu VLA approach—incorporating large-scale LLM-driven selection and automated, planning-centric annotation—establishes a new methodological standard for domain-complete, open-world VLA research and benchmark construction in both autonomous driving and generalist robotics.