Impromptu VLA: Vision-Language-Action Data
- The paper introduces the Impromptu VLA Dataset, a large-scale, multimodal resource that integrates vision, language, and action for challenging, unstructured scenarios.
- It employs a robust curation pipeline combining automated chain-of-thought VLM annotation with human verification across eight driving and eleven robotics datasets.
- Empirical findings highlight significant improvements in collision rates and trajectory errors, demonstrating practical benefits for advancing autonomous systems.
The Impromptu VLA Dataset is a large-scale, meticulously curated resource for vision-language-action (VLA) learning, designed to overcome the deficits of conventional autonomous driving and embodied action datasets in capturing unstructured, "corner-case" scenarios. It is structured to provide robust supervision for models requiring joint perception, language understanding, and action planning, with a specific focus on cases that challenge the current state of self-driving intelligence and generalist robotic manipulation.
1. Dataset Composition, Sources, and Curation
The Impromptu VLA Dataset is constructed by mass collection and successive filtration of driving footage and embodied episodes, with an emphasis on rare, unstructured, or open-world tasks.
1.1 Autonomous Driving Domain
Over 2 million raw video clips (≈ 10 TB) were mined from eight open-source, large-scale autonomous driving datasets: Mapillary Vistas (MVD), ONCE, NAVSIM, nuScenes, Waymo Open, Argoverse V2, KITTI, and IDD. The raw clips varied in resolution and temporal frequency. All sequences were standardized to 2 Hz.
Keyclips (6.5 seconds each: 1.5 s past, 5 s future) were extracted for annotation. Stability filtering grouped adjacent keyclips into 15 s local-filter packs; labels persisted only if consistent across at least two "significant" keyclips per pack, reducing annotation of transient artifacts. An initial 10% of clips were textually described via Qwen2.5-VL 72B, then filtered to remove "conventional" scenes using a prompt-driven VLM classifier tuned against a 1,000-clip human validation set (>90% fidelity). Semantic clustering and human verification distilled the final dataset to approximately 80,000 core keyclips distributed across the eight sources. An 80/20 stratified train/val split preserved category balance across all unstructured corner-case types (Chi et al., 29 May 2025).
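The keyclip windowing implied by these numbers can be sketched as follows; the 2 Hz rate and the 1.5 s past / 5 s future split are from the source, while the function name and indexing scheme are illustrative:

```python
def keyclip_indices(center_idx: int, hz: float = 2.0,
                    past_s: float = 1.5, future_s: float = 5.0):
    """Frame indices for one 6.5 s keyclip (1.5 s past, 5 s future)
    in a sequence already standardized to 2 Hz."""
    past = int(round(past_s * hz))      # 3 past frames
    future = int(round(future_s * hz))  # 10 future frames
    return list(range(center_idx - past, center_idx + future + 1))

# A keyclip centered at frame 10 spans 14 frames: 3 past, the present, 10 future.
idx = keyclip_indices(10)
```

At 2 Hz, a 6.5 s clip therefore contains 14 frames, matching the 3-point past state and 10-waypoint future used in the trajectory labels.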
Table: Source Dataset Composition (Autonomous Driving)
| Dataset | Labeled Clips | Camera Views | Native Frequency |
|---|---|---|---|
| Mapillary | 22,062 | 1 | 2 Hz |
| ONCE | 18,093 | 7 | 2 Hz |
| NAVSIM | 18,600 | 8 | 2 Hz |
| nuScenes | 11,370 | 6 | 2 Hz |
| Waymo | 7,530 | 5 | 10 Hz |
| Argoverse V2 | 2,490 | 7 | 20 Hz |
| KITTI | 930 | 1 | 10 Hz |
| IDD | 930 | 1 | 15 Hz |
1.2 Embodied Robotic Manipulation Domain (Interleaved)
A related instantiation applies to generalist robot pipelines, where Impromptu VLA denotes a large-scale, interleaved multimodal dataset ("Open Interleaved X-Embodiment") constructed from 11 text+trajectory robotics datasets, totaling 210,000 interleaved episodes and 13 million frames. Annotation is automated via LLM-based keyphrase extraction, open-vocabulary detection (OWLv2), Qwen2.5-VL verification, episodic prompt assembly, and action sequence retention, as detailed in (Fan et al., 4 May 2025).
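One plausible record layout for such interleaved episodes is sketched below; the field names are illustrative, not taken from the source, but the structure (instruction text interleaved with grounded image crops, plus a retained action sequence) follows the pipeline described above:

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class ImagePatch:
    """Crop of a detected object (e.g. from OWLv2 open-vocabulary detection)."""
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    phrase: str                     # keyphrase the crop grounds

@dataclass
class InterleavedEpisode:
    """Instruction text interleaved with grounded image patches,
    followed by the retained low-level action sequence."""
    prompt: List[Union[str, ImagePatch]]
    actions: List[List[float]]      # per-step control vectors

ep = InterleavedEpisode(
    prompt=["pick up the", ImagePatch((40, 60, 120, 140), "red block")],
    actions=[[0.0, 0.1, 0.0], [0.0, 0.2, -0.1]],
)
```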
2. Novel Taxonomy of Unstructured Scenarios
A central contribution lies in defining a four-category taxonomy to stress-test autonomous systems with scenarios underrepresented in prior corpora (Chi et al., 29 May 2025):
- Roads with Unclear Boundaries: Indistinct or absent navigational cues; challenge for drivable-area segmentation.
- Temporary Traffic-Rule Changes: On-the-fly regulation shifts (e.g., construction, flaggers); challenges sign interpretation and behavioral adaptation.
- Unconventional Dynamic Obstacles: Rare or erratic agents (livestock, off-route cyclists); difficulties in prediction and planning.
- Challenging Road Conditions: Surface/weather-induced perceptual/actuation problems (potholes, ice, glare).
Each category is explicitly operationalized for stratified algorithmic evaluation, with representative frames and qualitative examples provided in the primary source.
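For stratified evaluation, the taxonomy can be encoded directly; the enum values below paraphrase the four categories, and the counting helper is an illustrative sketch rather than the authors' tooling:

```python
from enum import Enum

class UnstructuredScenario(Enum):
    UNCLEAR_BOUNDARIES = "roads_with_unclear_boundaries"
    TEMP_RULE_CHANGES = "temporary_traffic_rule_changes"
    DYNAMIC_OBSTACLES = "unconventional_dynamic_obstacles"
    CHALLENGING_CONDITIONS = "challenging_road_conditions"

def stratified_counts(labels):
    """Count clips per taxonomy class, e.g. to verify category-preserving splits."""
    counts = {c: 0 for c in UnstructuredScenario}
    for lab in labels:
        counts[UnstructuredScenario(lab)] += 1
    return counts

counts = stratified_counts(["temporary_traffic_rule_changes",
                            "challenging_road_conditions",
                            "temporary_traffic_rule_changes"])
```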
3. Annotation Pipelines: Question-Answering and Trajectories
Annotation is organized around two pillars: planning-centric QA supervision and action trajectory labels.
3.1 Planning-Oriented QA Pairs
Annotations employ special tokens to distinguish question types:
<V.R.U.> (vulnerable road users), <T.LIGHT> (traffic lights), <DYNAMIC_OBJECTS> (dynamic actors' intentions), <PLANNING> (meta-action), <TRAJ> (trajectory forecast).
Annotations are constructed via a chain-of-thought VLM pipeline (Qwen2.5-VL 72B) producing scene descriptions, static/movable feature parsing, category assignment with justification, and multi-task answers. Rigorous human verification (accept/reject/minor correction) yields final QA. QA accuracy is tracked per task and shows substantial post-verification improvements—for example, dynamic object QA accuracy rises from 0.20 to 0.92 and meta-planning from 0.56 to 0.84 on the held-out validation set.
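A minimal sketch of how a QA pair might be serialized with these tokens; only the special tokens themselves come from the source, and the surrounding "Q:/A:" layout is an assumption for illustration:

```python
SPECIAL_TOKENS = ["<V.R.U.>", "<T.LIGHT>", "<DYNAMIC_OBJECTS>",
                  "<PLANNING>", "<TRAJ>"]

def serialize_qa(task_token: str, question: str, answer: str) -> str:
    """Prefix a QA pair with its task token so supervision can be routed per task."""
    assert task_token in SPECIAL_TOKENS, "unknown task token"
    return f"{task_token} Q: {question} A: {answer}"

sample = serialize_qa("<PLANNING>",
                      "What meta-action should the ego vehicle take?",
                      "Slow down and yield to the flagger.")
```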
3.2 Action Trajectory Specification
Ego-centric future trajectory labels comprise 2D waypoint sequences at 2 Hz (10 waypoints over 5 s), with the coordinate frame origin at the present ego-pose, X-axis along vehicle heading. Past states (3 points) also include velocity and acceleration vectors. For embodied action, discrete low-level control actions can be specified at regular intervals.
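The ego-frame convention above (origin at the present pose, X-axis along heading) corresponds to a standard rigid 2D transform; this sketch assumes global waypoints and a yaw angle are available, with illustrative function names:

```python
import math

def to_ego_frame(waypoints_xy, ego_xy, ego_yaw):
    """Transform global 2D waypoints into the ego frame at the present pose:
    origin at the ego position, X-axis along the vehicle heading."""
    c, s = math.cos(-ego_yaw), math.sin(-ego_yaw)
    out = []
    for x, y in waypoints_xy:
        dx, dy = x - ego_xy[0], y - ego_xy[1]
        out.append((c * dx - s * dy, s * dx + c * dy))
    return out

# A point 5 m ahead along a 90-degree heading maps to (5, 0) in the ego frame.
pts = to_ego_frame([(0.0, 5.0)], (0.0, 0.0), math.pi / 2)
```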
3.3 Data Formats
Data is provided in unified structures (e.g., TFRecord, LMDB for QA; JSON for trajectories), with robust coordinate and timestamp conventions for integration into learning pipelines.
4. Evaluation Metrics and Experimental Protocols
Both open- and closed-loop metrics are used to measure perception, prediction, and planning competency:
4.1 Closed-Loop: NeuroNCAP (NNS)
Collision rates and a scenario score (NNS) quantify real-time driving robustness. The NNS metric,

$$\mathrm{NNS} = \begin{cases} 5.0, & \text{no collision} \\ 4.0 \cdot \max\!\left(0,\ 1 - \dfrac{v_i}{v_r}\right), & \text{otherwise,} \end{cases}$$

with $v_i$ the impact speed and $v_r$ the reference speed, rewards collision avoidance and mitigated crash severity.
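Under the standard NeuroNCAP definition (full credit for avoiding the collision, partial credit scaled by impact-speed reduction), the score can be computed as follows; treat this as a sketch of the published metric, not the benchmark's reference implementation:

```python
def neuroncap_score(collided: bool, v_impact: float, v_reference: float) -> float:
    """NeuroNCAP scenario score: 5.0 for avoiding the collision, otherwise
    up to 4.0 proportional to the impact-speed reduction vs. the reference."""
    if not collided:
        return 5.0
    return 4.0 * max(0.0, 1.0 - v_impact / v_reference)

# Halving the impact speed relative to the reference yields a score of 2.0.
score = neuroncap_score(True, 4.0, 8.0)
```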
4.2 Open-Loop: Trajectory Error
Per-timestep Euclidean distance between predicted and ground-truth future waypoints is averaged over 1 s, 2 s, and 3 s horizons:

$$\mathrm{L2}@T = \frac{1}{2T}\sum_{t=1}^{2T} \left\lVert \hat{\tau}_t - \tau_t \right\rVert_2,$$

where $\hat{\tau}_t$ and $\tau_t$ are the predicted and ground-truth waypoints at step $t$ (at 2 Hz, a $T$-second horizon spans $2T$ waypoints). Language annotation quality is evaluated by BLEU-4 and METEOR.
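A minimal sketch of the open-loop trajectory error, assuming 2 Hz waypoints and plain Euclidean distance (function and argument names are illustrative):

```python
def avg_l2_error(pred, gt, hz=2.0, horizon_s=3.0):
    """Mean per-waypoint L2 distance up to a horizon, with waypoints at `hz`."""
    n = int(horizon_s * hz)
    dists = [((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
             for (px, py), (gx, gy) in list(zip(pred, gt))[:n]]
    return sum(dists) / len(dists)

pred = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
gt   = [(1.0, 0.0), (2.0, 0.5), (3.0, 0.0), (4.0, 1.0)]
err = avg_l2_error(pred, gt, hz=2.0, horizon_s=2.0)  # first 4 waypoints
```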
Baselines include strong VLA models (e.g., DriveVLM, UniAD, EMMA). Two critical fine-tuning pipelines are evaluated: direct fine-tuning on nuScenes vs. Impromptu VLA pretraining followed by nuScenes fine-tuning.
5. Empirical Findings and Diagnostic Utility
Substantial performance improvements are documented for VLA models trained with Impromptu VLA:
- Closed-loop NNS improves by 21% (from 1.77 to 2.15/5.00); collision rates drop from 72.5% to 65.5%.
- Open-loop average trajectory error reduces from 0.34 m to 0.30 m (Qwen2.5-VL 3B), matching state-of-the-art.
- Pretraining with Impromptu VLA consistently reduces errors by ≈10–12% across prediction horizons and most markedly enhances side-collision avoidance (+38% NNS).
- QA diagnostics show dynamic object and meta-planning accuracy increases by 0.72 and 0.28 respectively; end-to-end predicted trajectory error drops from 6.62 m to 0.69 m.
These results underline the dataset's effectiveness for learning complex, generalizable perception and planning, especially in unstructured environments (Chi et al., 29 May 2025).
6. Integration and Practical Use
Official code, pretrained weights, and 43 GB of annotated keyclips are publicly available. Integration guidelines mandate 2 Hz, 6.5 s clip formatting, and category-preserving splits. Special QA tokens must be preserved for sequence labeling stability. The curation pipeline can be extended to novel unstructured scenarios using provided "pipeline.py" scripts, which automate VLM-based keyclip vetting, category assignment, QA generation, trajectory extraction, and human-in-the-loop verification:
```python
for raw_clip in new_dataset:
    desc = VLM.describe(raw_clip)            # chain-of-thought scene description
    if not VLM.is_unstructured(desc):
        continue                             # filter conventional scenes
    cat = VLM.categorize(desc)               # one of the four taxonomy classes
    QA = VLM.generate_QA(raw_clip, cat)      # multi-task QA
    traj = extract_ground_truth_trajectory(raw_clip)
    if human_verify(desc, QA, traj):         # human-in-the-loop gate
        append_to_ImpromptuVLA(raw_clip, desc, QA, traj)
```
7. Adaptation and Extensions
Blueprints adapted from the CoVLA dataset (Arai et al., 2024) provide detailed guidance for on-the-fly, edge-driven data acquisition, including timestamp-synchronized multi-sensor logging, real-time annotation, and human-in-the-loop corrections to reduce VLM hallucinations. This allows Impromptu VLA construction in dynamic, real-world contexts (e.g., fleet driving, heterogeneous robot deployments). The interleaved embodied version, as used in Interleave-VLA (Fan et al., 4 May 2025), can be scaled to new platforms by running the open pipeline on instruction+observation+action sources, enabling rapid augmentation of zero-shot generalization and flexible multi-modality policy training.
A plausible implication is that the Impromptu VLA approach—incorporating large-scale LLM-driven selection and automated, planning-centric annotation—establishes a new methodological standard for domain-complete, open-world VLA research and benchmark construction in both autonomous driving and generalist robotics.