ProAct-75 Benchmark
- The paper introduces ProAct-75, a dataset of 5,383 videos and 91,581 atomic action-step annotations structured by explicit directed acyclic graphs (DAGs).
- It evaluates proactive agents through metrics like trigger detection mF1, Saved Steps, and Parallel Action Rate, highlighting improvements over state-of-the-art systems.
- ProAct-75 is designed for assistance, maintenance, and safety monitoring, enabling research into agents that can plan interventions using serial and parallel procedural structures.
ProAct-75 is a large-scale benchmark developed to support the training and evaluation of structure-aware proactive agents—systems that, in contrast to passive agents, determine when and how to intervene in real-world processes to assist, maintain, or ensure safety. The benchmark provides a comprehensive, multimodal dataset of 75 tasks, each annotated at the atomic action-step level and formalized with explicit directed acyclic graphs (DAGs) representing procedural dependencies and opportunities for concurrent (parallel-threaded) execution. This enables quantitative assessment of agents beyond imitation, focusing on their ability to reason about task structure, initiate timely interventions, and conduct parallel actions (Zhu et al., 3 Feb 2026).
1. Dataset Structure and Composition
ProAct-75 encompasses three proactive-response domains: assistance (human-initiated objectives), maintenance (environment-triggered interventions), and safety monitoring (risk-aversion actions). The dataset covers 75 unique tasks sourced from exocentric videos (Ego-Exo4D, COIN, UCF-Crime) supplemented by 495 newly collected clips to balance coverage across activities. Statistical composition includes 5,383 videos and 91,581 atomic action-step segments, each paired with an explicit task graph per task.
The data is split approximately 3:1 into training (4,074 videos) and test (1,309 videos). For a "best-view" evaluation, one camera view per scene is selected from the training and test videos; the remaining views constitute an out-of-distribution test set. Each action-step is annotated with a timestamp span, a natural language label, and a trigger flag denoting intervention salience.
Table: Dataset Statistics
| Attribute | Value |
|---|---|
| Number of tasks | 75 |
| Total videos | 5,383 |
| Total step annotations | 91,581 |
| Domains | Assistance, Maintenance, Safety Monitoring |
| Split: Train/Test (videos) | 4,074 / 1,309 |
| Task graph per task | Yes (AND/OR DAG, multiple execution threads) |
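A single entry in this annotation scheme could be modeled as follows; the field names and class structure are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StepAnnotation:
    """One atomic action-step annotation (illustrative schema)."""
    video_id: str
    task_id: int          # one of the 75 tasks
    start_s: float        # segment start time (seconds)
    end_s: float          # segment end time (seconds)
    label: str            # natural language step label
    trigger: bool         # intervention-salience flag

@dataclass
class VideoRecord:
    """A video with its ordered step annotations (illustrative schema)."""
    video_id: str
    domain: str           # "assistance" | "maintenance" | "safety"
    steps: List[StepAnnotation] = field(default_factory=list)

rec = VideoRecord("v0001", "assistance",
                  [StepAnnotation("v0001", 12, 3.2, 7.8, "Tie the bag", True)])
```

A real loader would additionally attach the per-task AND/OR graph, which is shared by all videos of the same task.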
2. Task Graph Formalism
Each ProAct-75 task is formalized as a DAG $G = (V, E)$, where $V$ is the set of nodes and $E$ the set of directed edges encoding temporal dependencies. Nodes partition into executable steps $V_{\text{exec}}$ and structural non-executable nodes $V_{\text{struct}}$, with $V = V_{\text{exec}} \cup V_{\text{struct}}$. A directed edge $(u, v) \in E$ enforces ordering ($u$ must complete before $v$), and executability is defined recursively over predecessors.
Each node $v$ is assigned a type $\tau(v) \in \{\text{AND}, \text{OR}\}$, controlling execution dependencies:
- AND-node: executes when all predecessors have executed ($\mathrm{Pa}(v) \subseteq X_t$)
- OR-node: executes when any predecessor has executed ($\mathrm{Pa}(v) \cap X_t \neq \emptyset$)
Here, $X_t$ is the set of executed nodes at time $t$ and $\mathrm{Pa}(v)$ the set of predecessors of $v$. Legal next actions at time $t$ form the set $\mathcal{A}_t = \{v \in V_{\text{exec}} \setminus X_t : v \text{ is executable given } X_t\}$.
Mid-level start/end structural nodes induce execution threads. Branches between such nodes are mapped to threads via a mapping that assigns each executable step to a thread, allowing explicit modeling of task parallelism (i.e., actions on distinct threads proceed concurrently when dependencies allow).
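The AND/OR executability rule above can be sketched as a minimal legal-next-action computation. The `TaskGraph` class and node names are illustrative, not the benchmark's actual API:

```python
from typing import Dict, List, Set, Tuple

class TaskGraph:
    """Minimal AND/OR DAG: edges point from prerequisite to dependent node."""
    def __init__(self, edges: List[Tuple[str, str]], node_type: Dict[str, str]):
        self.parents: Dict[str, Set[str]] = {}
        self.node_type = node_type                # "AND" or "OR" per node
        self.nodes: Set[str] = set(node_type)
        for u, v in edges:
            self.parents.setdefault(v, set()).add(u)
            self.nodes.update((u, v))

    def executable(self, v: str, executed: Set[str]) -> bool:
        pa = self.parents.get(v, set())
        if not pa:
            return True                            # source nodes are always legal
        if self.node_type.get(v, "AND") == "AND":
            return pa <= executed                  # all predecessors done
        return bool(pa & executed)                 # OR: any predecessor done

    def legal_next(self, executed: Set[str]) -> Set[str]:
        """The set A_t of unexecuted nodes whose dependencies are satisfied."""
        return {v for v in self.nodes - executed if self.executable(v, executed)}

# Two parallel branches joining at an AND node:
g = TaskGraph([("start", "a"), ("start", "b"), ("a", "join"), ("b", "join")],
              {"start": "AND", "a": "AND", "b": "AND", "join": "AND"})
assert g.legal_next({"start"}) == {"a", "b"}   # both threads become legal
assert g.legal_next({"start", "a"}) == {"b"}   # join still blocked by b
```

The two branches `a` and `b` here correspond to distinct execution threads: either may be advanced next, which is exactly the freedom a proactive agent exploits.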
3. Annotation Scheme and Agent Outputs
Each atomic step receives:
- Frame span
- Natural language label (e.g., "Tie the bag")
- Trigger flag
At each agent decision window, outputs required are:
- Trigger prediction
- Task prediction
- Step prediction
- Future step sequence
- Proactive action
This annotation schema supports fine-grained evaluation of not only "what" the agent does, but also "when" and "which thread" actions are prioritized, under procedural constraints of the DAG.
4. Evaluation Metrics
ProAct-75 employs distinct metrics for key challenges in proactive response:
- Trigger Detection: Macro-averaged F1 (mF1) and accuracy, with mF1 averaged over both classes (trigger and no-trigger).
- Proactive Action Selection:
- Saved Steps (SS): For each video $i$, with $N_i$ total human steps and $R_i$ human steps remaining post-intervention, $\mathrm{SS}_i = (N_i - R_i)/N_i$, averaged over all videos. An analogous per-window variant is used for one-step online inference.
- Parallel Action Rate (PA): With $A$ the set of robot actions and $A_\parallel \subseteq A$ the subset advancing a thread distinct from the human's current thread, $\mathrm{PA} = |A_\parallel| / |A|$.
- Thread-mixing entropy: For candidate action $a$, the mixing ratio for each thread $k$ is $p_k(a) = n_k(a) / \sum_j n_j(a)$, where $n_k(a)$ normalizes the activity attributed to thread $k$ after taking $a$; the entropy is $H(a) = -\sum_k p_k(a) \log p_k(a)$. Proactive actions are selected to minimize $H(a)$, favoring candidates that keep execution concentrated on coherent threads.
These metrics align evaluation with the structural and temporal aspects inherent in proactive procedural assistance.
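Under the definitions above, Saved Steps and Parallel Action Rate reduce to simple ratio computations. The per-video bookkeeping below is an assumption about how aggregation is performed, kept deliberately minimal:

```python
from typing import List

def saved_steps(total_human_steps: List[int], remaining_after: List[int]) -> float:
    """Mean fraction of human steps saved by the agent, averaged over videos."""
    per_video = [(n - r) / n for n, r in zip(total_human_steps, remaining_after)]
    return sum(per_video) / len(per_video)

def parallel_action_rate(robot_threads: List[int], human_threads: List[int]) -> float:
    """Fraction of robot actions advancing a thread distinct from the human's."""
    parallel = sum(1 for r, h in zip(robot_threads, human_threads) if r != h)
    return parallel / len(robot_threads)

# Two videos: the agent saves 2 of 10 and 1 of 5 human steps -> SS = 0.2.
assert abs(saved_steps([10, 5], [8, 4]) - 0.2) < 1e-9
# 3 of 4 robot actions run on a different thread than the human -> PA = 0.75.
assert abs(parallel_action_rate([1, 2, 2, 1], [1, 1, 1, 2]) - 0.75) < 1e-9
```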
5. ProAct-Helper Framework and Methodology
The ProAct-Helper serves as a reference architecture based on a multimodal LLM (Qwen2.5-VL-Instruct, 3B/7B parameters), fine-tuned using LoRA and an instruction-tuning regime targeting three objectives:
- $\mathcal{L}_{\text{AR}}$: Standard autoregressive cross-entropy
- $\mathcal{L}_{\text{trig}}$: Binary classification loss for trigger tokens
- $\mathcal{L}_{\text{HBM}}$: Hierarchical Binding Module (HBM) loss
The total loss is a weighted sum, $\mathcal{L} = \mathcal{L}_{\text{AR}} + \lambda_1 \mathcal{L}_{\text{trig}} + \lambda_2 \mathcal{L}_{\text{HBM}}$. The HBM mitigates data imbalance in trigger→task→step prediction via cross-level InfoNCE contrastive binding, increasing discriminability between hierarchical outputs.
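The cross-level InfoNCE binding can be illustrated with a minimal pure-Python version. The embeddings, temperature, and similarity choice (raw dot product) are illustrative; the paper's HBM operates on hierarchical trigger/task/step representations:

```python
import math
from typing import List

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor: List[float], positive: List[float],
             negatives: List[List[float]], tau: float = 0.1) -> float:
    """-log( exp(sim(a,p)/tau) / sum over {p} and negatives of exp(sim/tau) )."""
    logits = [dot(anchor, positive) / tau] + [dot(anchor, n) / tau for n in negatives]
    m = max(logits)                                # log-sum-exp stabilization
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return lse - logits[0]

# A task embedding bound to its true step embedding vs. two distractors:
loss = info_nce([1.0, 0.0], [0.9, 0.1], [[-0.8, 0.2], [0.0, -1.0]])
assert loss > 0.0   # InfoNCE loss is non-negative
```

Pulling the correct lower-level embedding toward its parent while pushing distractors away is what sharpens the trigger→task→step hierarchy under class imbalance.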
Input at each timestep consists of a 5-frame sliding window of keyframes, processed in two prompt stages:
- Stage 1: prediction of the trigger flag
- Stage 2: if triggered, prediction of the task, current step, future step sequence, and proactive action
Entropy-driven heuristic search is then applied: candidate actions (filtered legal next steps) are ranked by minimizing thread-mixing entropy $H(a)$, with a lexicographic tie-break using predicted future step positions.
A core property is explicit support for parallel thread execution—the agent may select an action advancing a distinct procedural thread, enabling concurrent progress instead of naïvely mirroring the human’s immediate next step.
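The entropy-driven selection step can be sketched as follows; the thread-count bookkeeping and the tie-break ordering are simplified assumptions:

```python
import math
from typing import Dict, List, Tuple

def mixing_entropy(thread_counts: Dict[int, int]) -> float:
    """Shannon entropy of the per-thread activity distribution."""
    total = sum(thread_counts.values())
    ps = [c / total for c in thread_counts.values() if c > 0]
    return -sum(p * math.log(p) for p in ps)

def rank_candidates(candidates: List[Tuple[str, Dict[int, int], int]]):
    """Rank (action, resulting thread counts, predicted future position)
    by (entropy, future position) ascending: lowest-entropy action first."""
    return sorted(candidates, key=lambda c: (mixing_entropy(c[1]), c[2]))

cands = [
    ("finish thread 1", {1: 4}, 2),         # all activity on one thread: H = 0
    ("split threads",   {1: 2, 2: 2}, 1),   # even split: H = ln 2
]
best = rank_candidates(cands)[0][0]
assert best == "finish thread 1"
```

Concentrating activity on a coherent thread scores lower entropy, so the agent completes one procedural branch cleanly rather than thrashing between branches.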
6. Experimental Results
On the ProAct-75 test set, ProAct-Helper (7B) outperforms the closed-source state of the art (Gemini-2.5-Pro) on every reported metric:
- Trigger detection mF1
- Step detection F1
- Task detection F1
- Saved Steps (SS)
- Parallel Action Rate (PA)
Ablation studies indicate that incorporating the trigger loss improves both task and step mF1, while adding the HBM loss yields further gains; the full HBM produces the largest improvement.
7. Significance, Applications, and Future Directions
ProAct-75 constitutes the first large-scale, step-level video benchmark pairing AND/OR DAGs (serial and parallel dependencies) with harmonized trigger→task→step annotation, covering diverse proactive domains. This enables rigorous research into agents that reason about intervention timing and choice, grounded in explicit procedural structure.
Applications encompass household assistants (e.g., trash-bag replacement, appliance maintenance), industrial collaboration (assembly support, tool handoff), and safety monitoring (risk mitigation, procedural oversight).
Foreseeable research avenues include:
- Integration of learned graph-feasible decoding within agent generation loops
- Reinforcement-learning or search exploiting explicit DAG structure
- Expansion to open-world tasks with dynamic graph evolution and thread variability
- Cross-domain continual and few-shot adaptation leveraging ProAct-75’s generality
By uniting multimodal perception with structured procedural graphs, the benchmark and baseline delineate a principled pathway toward the design of agents capable of understanding and co-executing complex human workflows (Zhu et al., 3 Feb 2026).