ProgressLM-45K: Multimodal Robot Task Dataset
- The paper introduces ProgressLM-45K, a dataset providing fine-grained, continuous progress supervision for vision-language models in robotic manipulation.
- It leverages segmented expert trajectories and both visual and textual demonstrations, using interval and boundary sampling for robust progress estimation.
- The dataset supports supervised and reinforcement learning, underpinning the training of ProgressLM-3B and advancing state-of-the-art progress reasoning in complex tasks.
ProgressLM-45K is a large-scale, multimodal dataset specifically constructed to enable the study and training of vision-language models (VLMs) on progress estimation over complex, long-horizon robotic manipulation activities. Developed to address limitations in existing evaluation protocols for progress reasoning, ProgressLM-45K provides fine-grained, continuous progress supervision and comprehensive coverage of answerable and unanswerable settings across visual and textual demonstration modalities. It is the foundational resource underlying the supervised and reinforcement learning training of ProgressLM-3B and is situated as a complement to the Progress-Bench evaluation suite (Zhang et al., 21 Jan 2026).
1. Dataset Construction and Sourcing
All raw trajectories in ProgressLM-45K originate from the RoboMind manipulation benchmark, spanning four distinct robot platforms: Franka Panda, UR5e, AgileX Cobot, and X-Humanoid. The 240 RoboMind trajectories held out for Progress-Bench evaluation are explicitly excluded, guaranteeing no overlap between training and evaluation data. Each trajectory consists of either vision-based or text-based demonstrations, systematically derived from expert manipulations segmented into semantically meaningful task steps (e.g., "grasp object," "lift object," "place object").
For vision demonstrations, a sequence of key frames $k_1, \dots, k_K$ is extracted at salient transition points, with each frame $k_i$ pre-annotated with a canonical progress value $p_i \in [0, 100]$ corresponding to task-stage completion. The text demonstrations parallel this structure, converting the same expert trajectories into stepwise action instructions $s_1, \dots, s_K$, each paired with the corresponding progress value $p_i$.
Observation sampling leverages both interval and boundary strategies: within each segment between consecutive key frames $k_i$ and $k_{i+1}$, intermediate observations are selected at fractional positions $\alpha \in (0, 1)$, coarsely at fixed intervals and densely near phase boundaries ($\alpha$ near 0 or 1) to emphasize challenging examples. The ground-truth progress for each observation is assigned by linear interpolation:

$$p(\alpha) = p_i + \alpha \, (p_{i+1} - p_i).$$

Each sampled observation is thus paired with $p(\alpha)$ and the discrete reference step index $i$.
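As a sketch, the interval-and-boundary sampling and linear interpolation described above can be written as follows (function names and the default sampling densities are illustrative, not taken from the paper):

```python
def interpolate_progress(keyframe_progress, seg_index, alpha):
    """Ground-truth progress by linear interpolation between the canonical
    progress values of consecutive key frames k_i and k_{i+1}."""
    p_lo = keyframe_progress[seg_index]
    p_hi = keyframe_progress[seg_index + 1]
    return p_lo + alpha * (p_hi - p_lo)


def sample_positions(n_coarse=3, boundary_eps=0.05):
    """Interval-plus-boundary sampling of fractional positions within one
    segment: coarse fixed-interval points, plus dense points near the two
    phase boundaries (alpha near 0 and near 1)."""
    coarse = [(j + 1) / (n_coarse + 1) for j in range(n_coarse)]
    boundary = [boundary_eps, 1.0 - boundary_eps]
    return sorted(coarse + boundary)
```

For instance, an observation 15% of the way between a keyframe at progress 60 and one at progress 100 receives a ground-truth score of 66.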
Unanswerable (“N/A”) cases are enriched as follows: for text, a Qwen2.5-VL-72B model is used to swap target objects throughout the demonstration, preserving spatial markers but breaking semantic alignment; for images, adversarial edits such as object color swaps or removals are prompted, with results manually curated for realism.
2. Data Composition, Structure, and Summary Statistics
ProgressLM-45K contains a total of 45,000 samples, partitioned into 25,000 supervised Chain-of-Thought (CoT) samples for supervised fine-tuning (SFT) and 20,000 reinforcement learning (RL) samples for policy refinement. Each of the four robot platforms is represented in every split to enforce embodiment-agnostic progress reasoning.
Each sample falls into one of five demo-observation settings: vision/same-view (20%), vision/cross-view (20%), text-only (20%), vision-unanswerable (20%), and text-unanswerable (20%). Progress labels are continuous in $[0, 100]$, yielding an approximately uniform sampling density across the progress range. The number of steps per trajectory varies by task, with each vision demo comprising one keyframe per task step and each text demo one sentence per step.
The average Chain-of-Thought annotation consists of approximately 40–80 tokens per sample. Discrete stage indices are included for alignment analysis. RL samples exclude CoT fields but retain ground-truth reference and score for reward computations.
A tabular CSV summary file accompanies the dataset, logging for each sample the demo type, demo length, ground-truth progress score, reference step index, and answerability status.
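A minimal sketch of generating such a summary file with the standard library (the column names are assumptions for illustration, not the dataset's actual schema):

```python
import csv

# Illustrative column names; the released CSV schema may differ.
SUMMARY_FIELDS = ["sample_id", "demo_type", "demo_length",
                  "score_gt", "ref_gt", "answerable"]


def write_summary(samples, path):
    """Write one CSV row per sample: demo type and length, ground-truth
    score, reference step index, and an answerability flag."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=SUMMARY_FIELDS)
        writer.writeheader()
        for i, s in enumerate(samples):
            writer.writerow({
                "sample_id": i,
                "demo_type": s["demo_type"],
                "demo_length": len(s["demo"]),
                "score_gt": s["score_gt"],
                "ref_gt": s["ref_gt"],
                "answerable": s["score_gt"] != "n/a",
            })
```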
3. Task Definition and Labeling Scheme
ProgressLM-45K frames the progress estimation task as follows: given a demonstration $D = (d_1, \dots, d_K)$ (visual or textual) containing $K$ reference task steps and a single observed frame $o$, the model must predict a normalized progress score $\hat{p} \in [0, 100]$, or "N/A" if alignment is impossible. The canonical interpolation formula defines ground-truth progress for intermediate observations:

$$p^{*} = p_i + \alpha \, (p_{i+1} - p_i),$$

where $i$ indexes the preceding reference step and $\alpha \in (0, 1)$ is the fractional position within the segment. In some analyses, the best-matching stage is defined as the reference step whose content most closely aligns with the observation:

$$i^{*} = \arg\max_{i} \; \mathrm{sim}(o, d_i).$$
Unanswerable ("N/A") labeling is strictly enforced for samples where observation and demonstration are semantically inconsistent by construction—either via textual object swaps or manipulated visual content.
Supervised samples include detailed CoT annotations:
- <ref_think>: reasoning trace for aligning the observation with a demo step,
- <ref>: discrete step index,
- <score_think>: reasoning trace for progress interpolation,
- <score>: final progress value $\hat{p} \in [0, 100]$.
During reinforcement learning, samples are presented without CoT fields, and rewards are computed based on reference and score accuracy, as well as output format.
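A reward pipeline of this kind must first recover the tagged fields from raw model output. A minimal parser for the <ref>/<score> schema might look like this (the handling of malformed and "n/a" outputs is an assumption):

```python
import re


def parse_progress_output(text):
    """Extract the discrete reference step and final progress score from a
    response using the <ref>/<score> tag schema. Returns (ref, score) with
    score a float in [0, 100] or the string "n/a" for abstention; returns
    None for malformed output (which would earn zero format reward)."""
    ref_m = re.search(r"<ref>\s*(\d+|n/a)\s*</ref>", text, re.IGNORECASE)
    score_m = re.search(r"<score>\s*([\d.]+|n/a)\s*</score>", text, re.IGNORECASE)
    if ref_m is None or score_m is None:
        return None
    ref_raw = ref_m.group(1).lower()
    score_raw = score_m.group(1).lower()
    ref = None if ref_raw == "n/a" else int(ref_raw)
    score = "n/a" if score_raw == "n/a" else float(score_raw)
    return ref, score
```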
4. Data Formats, Storage, and Examples
ProgressLM-45K is distributed in JSONL format with two primary splits: train_SFT.jsonl (25,000 CoT-labeled samples) and train_RL.jsonl (20,000 RL samples). All image assets are JPEG or PNG, organized under /images/{split}/{vision,neg-vision}/....
A typical CoT vision sample follows:
```json
{
  "demo_type": "vision",
  "demo": [
    {"frame": "step1.png", "progress": 0.0},
    ...,
    {"frame": "step4.png", "progress": 60.0},
    ...,
    {"frame": "step6.png", "progress": 100.0}
  ],
  "observation": "obs_4721.png",
  "ref_gt": 4,
  "score_gt": 66.0,
  "ref_think": "The plate and gripper position matches step 4…",
  "score_think": "Compared to step 4, there is ~30% more lift…",
  "score": 66.0
}
```
An unanswerable text-based case is represented as:
```json
{
  "demo_type": "text_neg",
  "steps": [...],
  "progress": [...],
  "observation": "obs_neg_txt_121.png",
  "ref_gt": null,
  "score_gt": "n/a",
  "ref_think": "Demo describes stacking cups, but I see bowls instead.",
  "score_think": "Cannot align any step; must abstain.",
  "score": "n/a"
}
```
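Under this JSONL layout, a split can be loaded with a few lines of standard-library code (helper names are illustrative):

```python
import json


def load_split(path):
    """Load a JSONL split (one JSON object per line), skipping blank lines."""
    samples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples


def is_answerable(sample):
    """A sample is answerable iff its ground-truth score is numeric."""
    return sample.get("score_gt") != "n/a"
```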
5. Training and Reinforcement Learning Regimens
ProgressLM-45K is employed in a two-stage learning paradigm. Supervised fine-tuning uses a Qwen2.5-VL-3B base model, with LoRA rank-8 adapters applied to all linear layers. Training runs for 2 epochs with batch size 64 (gradient-accumulated) and a cosine learning-rate schedule with 10% warmup. Mixed precision (bfloat16) and Fully Sharded Data Parallel (FSDP) are used on 4×H100 GPUs. The SFT objective is the standard autoregressive cross-entropy over the tagged CoT target sequence:

$$\mathcal{L}_{\mathrm{SFT}} = -\sum_{t} \log \pi_{\theta}\!\left(y_t \mid y_{<t}, x\right),$$

where $x$ is the multimodal prompt (demonstration plus observation) and $y$ the annotated response.
Subsequent RL fine-tuning employs Group Relative Policy Optimization (GRPO) on 20,428 samples over 2 epochs (23 hours on 16×H100), sampling a group of rollouts per prompt under a fixed generation-token budget. The composite reward combines reference accuracy, score accuracy, and output-format validity:

$$R = w_{\mathrm{ref}} R_{\mathrm{ref}} + w_{\mathrm{score}} R_{\mathrm{score}} + w_{\mathrm{fmt}} R_{\mathrm{fmt}},$$

with $R_{\mathrm{ref}} = 1$ if the predicted reference step matches the ground truth, $R_{\mathrm{score}} = 1 - |\hat{p} - p^{*}|/100$ for answerable samples, and $R_{\mathrm{fmt}} = 1$ if the output schema is valid.
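The exact reward coefficients are elided in this summary; a minimal sketch of such a composite reward, assuming equal unit weights and a linear score-closeness term, is:

```python
def composite_reward(pred_ref, pred_score, gt_ref, gt_score,
                     w_ref=1.0, w_score=1.0, w_fmt=1.0, format_ok=True):
    """Composite GRPO-style reward: reference-match term, score-closeness
    term, and format-validity term. Unit weights and the unanswerable-case
    handling are illustrative assumptions, not the paper's coefficients."""
    if gt_score == "n/a":
        # Unanswerable sample: reward correct abstention.
        r_ref = 1.0 if pred_ref is None else 0.0
        r_score = 1.0 if pred_score == "n/a" else 0.0
    else:
        r_ref = 1.0 if pred_ref == gt_ref else 0.0
        if pred_score == "n/a":
            r_score = 0.0
        else:
            r_score = max(0.0, 1.0 - abs(pred_score - gt_score) / 100.0)
    r_fmt = 1.0 if format_ok else 0.0
    return w_ref * r_ref + w_score * r_score + w_fmt * r_fmt
```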
6. Evaluation Protocols and Downstream Usage
ProgressLM-45K underpins the training of ProgressLM-3B, which is evaluated exclusively on Progress-Bench—a benchmark constructed from disjoint RoboMind trajectories. Key evaluation metrics include:
- Normalized Score Error (NSE): mean of $|\hat{p} - p^{*}|/100$ over answerable samples,
- Progress Rank Correlation (PRC): Spearman's $\rho$ between predicted and ground-truth progress across trajectory samples,
- Answerable False Rejection Rate (AFRR): rate of answerable samples incorrectly judged "N/A",
- Unanswerable Detection Accuracy (UDA): rate of unanswerable samples correctly detected as "N/A".
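The four metrics can be sketched in plain Python as follows (tie handling in the rank correlation is omitted; these are illustrative implementations, not the paper's evaluation code):

```python
def nse(pred, gt):
    """Normalized Score Error for one answerable sample: |p_hat - p*| / 100."""
    return abs(pred - gt) / 100.0


def spearman_rho(xs, ys):
    """Spearman's rho via the classic rank-difference formula (no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))


def afrr(preds, answerable):
    """Answerable False Rejection Rate: answerable samples wrongly "n/a"."""
    ans = [p for p, a in zip(preds, answerable) if a]
    return sum(1 for p in ans if p == "n/a") / len(ans)


def uda(preds, answerable):
    """Unanswerable Detection Accuracy: unanswerable samples correctly "n/a"."""
    unans = [p for p, a in zip(preds, answerable) if not a]
    return sum(1 for p in unans if p == "n/a") / len(unans)
```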
Following supervised and reinforcement learning on ProgressLM-45K, ProgressLM-3B demonstrates consistent state-of-the-art performance in fine-grained, uncertainty-calibrated progress estimation, including reliable handling of unanswerable scenarios and strong generalization to entirely unseen evaluation tasks (Zhang et al., 21 Jan 2026).
7. Context, Significance, and Prospects
ProgressLM-45K establishes a data-rich foundation for empirical study of progress reasoning in vision-language models, addressing a fundamental distinction between object-state recognition and dynamic trajectory estimation over extended horizons. Its construction enables both modality-agnostic and modality-specific calibration, adversarial testing through unanswerable instances, and interpretable analysis via chain-of-thought traces. A plausible implication is that continued scaling and structuring of datasets in this style, with expanded task diversity and harder unanswerability cases, will yield VLMs exhibiting robust temporal reasoning, compositionality, and high-confidence abstention in real-world embodied agents (Zhang et al., 21 Jan 2026).