ProAct-75 Benchmark

Updated 19 February 2026
  • The paper introduces ProAct-75, a dataset of 5,383 videos and 91,581 atomic action-step annotations structured by explicit directed acyclic graphs (DAGs).
  • It evaluates proactive agents through metrics like trigger detection mF1, Saved Steps, and Parallel Action Rate, highlighting improvements over state-of-the-art systems.
  • ProAct-75 is designed for assistance, maintenance, and safety monitoring, enabling research into agents that can plan interventions using serial and parallel procedural structures.

ProAct-75 is a large-scale benchmark developed to support the training and evaluation of structure-aware proactive agents—systems that, in contrast to passive agents, determine when and how to intervene in real-world processes to assist, maintain, or ensure safety. The benchmark provides a comprehensive, multimodal dataset of 75 tasks, each annotated at the atomic action-step level and formalized with explicit directed acyclic graphs (DAGs) representing procedural dependencies and opportunities for concurrent (parallel-threaded) execution. This enables quantitative assessment of agents beyond imitation, focusing on their ability to reason about task structure, initiate timely interventions, and conduct parallel actions (Zhu et al., 3 Feb 2026).

1. Dataset Structure and Composition

ProAct-75 encompasses three proactive-response domains: assistance (human-initiated objectives), maintenance (environment-triggered interventions), and safety monitoring (risk-aversion actions). The dataset covers 75 unique tasks sourced from exocentric videos (Ego-Exo4D, COIN, UCF-Crime) supplemented by 495 newly collected clips to balance coverage across activities. In total, it comprises 5,383 videos and 91,581 atomic action-step segments, with an explicit task graph for each of the 75 tasks.

The data is split approximately 3:1 into training ($N_{\text{train}} = 4{,}074$ videos) and test ($N_{\text{test}} = 1{,}309$ videos). For a "best-view" evaluation (one camera view per scene), there are $N_{\text{train}} = 1{,}905$ and $N_{\text{test}} = 516$; the remaining views constitute an out-of-distribution test set. Each action step is annotated with a timestamp span $[t_{\text{start}}, t_{\text{end}})$, a natural language label, and a trigger flag $y_t^{\text{trig}} \in \{0,1\}$ denoting intervention salience.

Table: Dataset Statistics

| Attribute | Value |
|---|---|
| Number of tasks | 75 |
| Total videos | 5,383 |
| Total step annotations | 91,581 |
| Domains | Assistance, Maintenance, Safety Monitoring |
| Split: Train / Test (videos) | 4,074 / 1,309 |
| Task graph per task | Yes (AND/OR DAG, multiple execution threads) |

2. Task Graph Formalism

Each ProAct-75 task $T$ is formalized as a DAG $T = (V, E)$, where $V$ is the set of nodes and $E$ the set of directed edges encoding temporal dependencies. Nodes partition into executable steps $V_e$ and structural non-executable nodes $V_n$, with $V_e \cap V_n = \varnothing$. A directed edge $(u \rightarrow v) \in E$ enforces ordering ($u$ must complete before $v$), and reachability is defined recursively:

$$a \in \mathrm{Reach}(b) \iff \big[\, a = b \,\vee\, \exists c : (c \rightarrow a) \in E \,\wedge\, c \in \mathrm{Reach}(b) \,\big].$$

Each node $v$ is assigned a type $\phi(v) \in \{\mathrm{AND}, \mathrm{OR}\}$, controlling execution dependencies:

  • AND-node: executes only when all predecessors have executed ($\mathrm{Pred}(v) \subseteq \mathrm{Prog}_t$)
  • OR-node: executes when any predecessor has executed ($\mathrm{Pred}(v) \cap \mathrm{Prog}_t \neq \varnothing$)

Here, $\mathrm{Prog}_t$ is the set of nodes executed by time $t$. The legal next actions at time $t$ form the set $A_t^{\text{legal}} = \{\, a \in V_e \setminus \mathrm{Prog}_t \mid \text{preconditions satisfied} \,\}$.

Mid-level start/end structural nodes induce execution threads. Branches between such nodes are mapped to threads via a mapping $\pi: V \rightarrow \mathbb{N}$, allowing explicit modeling of task parallelism (i.e., actions on distinct threads proceed concurrently when dependencies allow).
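The AND/OR semantics above can be sketched in a few lines of Python. The toy graph below is a hypothetical example constructed for illustration, not an actual ProAct-75 task graph:

```python
# Toy AND/OR task graph: node -> (type, predecessors).
# Hypothetical example, not an actual ProAct-75 task.
GRAPH = {
    "start":    ("AND", []),
    "open_bag": ("AND", ["start"]),
    "fill_bag": ("AND", ["open_bag"]),
    "get_tie":  ("AND", ["start"]),                # parallel thread
    "tie_bag":  ("AND", ["fill_bag", "get_tie"]),  # AND: needs both predecessors
    "dispose":  ("OR",  ["tie_bag", "fill_bag"]),  # OR: either predecessor suffices
}
STRUCTURAL = {"start"}  # non-executable structural nodes (V_n)

def executable(node, progress):
    """Check the AND/OR precondition of `node` against the executed set Prog_t."""
    typ, preds = GRAPH[node]
    if typ == "AND":
        return all(p in progress for p in preds)
    return any(p in progress for p in preds)  # OR-node

def legal_actions(progress):
    """A_t^legal: executable nodes not yet done whose preconditions hold."""
    return sorted(
        v for v in GRAPH
        if v not in STRUCTURAL and v not in progress and executable(v, progress)
    )

print(legal_actions({"start"}))                          # → ['get_tie', 'open_bag']
print(legal_actions({"start", "open_bag", "fill_bag"}))  # → ['dispose', 'get_tie']
```

Note how after `fill_bag`, the OR-node `dispose` already becomes legal even though `tie_bag` has not executed — exactly the flexibility the OR semantics are meant to capture.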

3. Annotation Scheme and Agent Outputs

Each atomic step $v \in V_e$ receives:

  • Frame span $[t_{\text{start}}, t_{\text{end}})$
  • Natural language label (e.g., "Tie the bag")
  • Trigger flag $y_t^{\text{trig}} \in \{0,1\}$

At each agent decision window, outputs required are:

  • Trigger prediction $y_t^{\text{trig}}$
  • Task prediction $y_t^{\text{task}} \in \{1,\dots,75\} \cup \{\text{other}\}$
  • Step prediction $y_t^{\text{step}} \in V$
  • Future step sequence $\hat{y}_{t+1:t+n}$
  • Proactive action $a_{t+1}^*$

This annotation schema supports fine-grained evaluation not only of what the agent does, but also of when it intervenes and which thread it prioritizes, under the procedural constraints of the DAG.
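In code, one decision window's outputs might be represented as follows. The field names and example values are illustrative assumptions, not the benchmark's release format:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AgentOutput:
    """One decision window's required agent outputs (illustrative schema)."""
    trigger: int                  # y_t^trig in {0, 1}
    task: str                     # task id "1".."75" or "other"
    step: Optional[str] = None    # current step node, if triggered
    future_steps: List[str] = field(default_factory=list)  # ŷ_{t+1:t+n}
    proactive_action: Optional[str] = None                 # a*_{t+1}

# Hypothetical triggered window: the agent identifies the task and step,
# forecasts upcoming steps, and proposes a proactive action.
out = AgentOutput(trigger=1, task="12", step="fill_bag",
                  future_steps=["tie_bag", "dispose"],
                  proactive_action="get_tie")
print(out.proactive_action)  # → get_tie
```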

4. Evaluation Metrics

ProAct-75 employs distinct metrics for key challenges in proactive response:

  • Trigger Detection: Macro-averaged F1 (mF1) and accuracy, with mF1 averaged over both classes: $\text{mF}_1 = \tfrac{1}{2}\big(F_1(\text{neg}) + F_1(\text{pos})\big)$.
  • Proactive Action Selection:
    • Saved Steps (SS): For each video $i$, with $B_i$ total human steps and $H_i$ human steps remaining after intervention, $S_i = B_i - H_i$ and $SS = \frac{1}{N}\sum_i S_i$. For one-step online inference, $SS = \frac{1}{N_s} \sum_j \mathbf{1}\{\text{robot } a_j \text{ matches ground truth}\}$.
    • Parallel Action Rate (PA): With $N^R$ total robot actions and $P$ the set of robot actions advancing a new thread ($\pi(a) \neq \pi(h_{\text{prev}})$), $PA = |P| / N^R$.
    • Thread-mixing entropy: For each thread $k$, the mixing ratio is $p_k = n_k^{\text{hum}} / (n_k^{\text{hum}} + n_k^{\text{rob}})$, with binary entropy $H_k(p_k) = -p_k \log p_k - (1 - p_k)\log(1 - p_k)$. The aggregate thread-mixing entropy is $H_{\text{mix}} = \sum_k w_k H_k(p_k)$, where $w_k$ normalizes by thread activity. Proactive actions are selected to minimize $H_{\text{mix}}$.
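The action-selection metrics above can be sketched directly from their definitions. The trajectories below are hypothetical, and the activity weights $w_k$ are assumed to be proportional to per-thread action counts:

```python
import math

def saved_steps(total_human_steps, remaining_human_steps):
    """SS averaged over videos: S_i = B_i - H_i."""
    pairs = list(zip(total_human_steps, remaining_human_steps))
    return sum(b - h for b, h in pairs) / len(pairs)

def parallel_action_rate(robot_threads, prev_human_threads):
    """PA = |P| / N^R: fraction of robot actions on a thread different
    from the human's preceding action."""
    parallel = sum(1 for r, h in zip(robot_threads, prev_human_threads) if r != h)
    return parallel / len(robot_threads)

def thread_mixing_entropy(counts):
    """H_mix = sum_k w_k * H(p_k) with p_k = n_hum / (n_hum + n_rob);
    counts maps thread -> (n_human, n_robot). w_k ~ thread activity (assumed)."""
    total = sum(nh + nr for nh, nr in counts.values())
    h_mix = 0.0
    for nh, nr in counts.values():
        p = nh / (nh + nr)
        h = 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)
        h_mix += (nh + nr) / total * h
    return h_mix

print(saved_steps([10, 8], [7, 6]))                   # → 2.5
print(parallel_action_rate([1, 2, 1], [1, 1, 2]))     # 2 of 3 actions parallel
print(thread_mixing_entropy({1: (4, 0), 2: (0, 3)}))  # fully separated → 0.0
```

A perfectly "unmixed" collaboration — human on one thread, robot on another — yields zero entropy, which is why minimizing $H_{\text{mix}}$ favors complementary rather than imitative actions.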

These metrics align evaluation with the structural and temporal aspects inherent in proactive procedural assistance.

5. ProAct-Helper Framework and Methodology

ProAct-Helper serves as a reference architecture based on a multimodal LLM (Qwen2.5-VL-Instruct, 3B/7B parameters), fine-tuned with LoRA under an instruction-tuning regime targeting three objectives:

  • $L_{CE}$: standard autoregressive cross-entropy
  • $L_{\text{trig}}$: binary classification loss on trigger tokens
  • $L_{\text{bind}}$: Hierarchical Binding Module (HBM) loss

The total loss is $L = L_{CE} + \lambda_{\text{trig}} L_{\text{trig}} + \lambda_{\text{bind}} L_{\text{bind}}$. The HBM mitigates data imbalance in trigger→task→step prediction via cross-level InfoNCE contrastive binding, increasing the discriminability of hierarchical outputs.

Input at each timestep $t$ consists of a 5-frame sliding window of keyframes, processed in two prompt stages:

  1. Prediction of $\{\text{is\_trigger}, \text{task}\}$
  2. If triggered, prediction of $\{\text{current\_step}, \text{5 future\_steps}, \text{priority scores}\}$

Entropy-driven heuristic search is then applied: candidate actions $A_t^{\text{cand}}$ (filtered legal next steps) are ranked by minimizing thread-mixing entropy $H_{\text{mix}}$, with a lexicographic tie-break using predicted future-step positions.

A core property is explicit support for parallel thread execution—the agent may select an action advancing a distinct procedural thread, enabling concurrent progress instead of naïvely mirroring the human’s immediate next step.
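The entropy-driven selection step can be sketched as follows. This is a minimal illustration under assumed activity weighting, not ProAct-Helper's implementation; candidate structures and the future-rank tie-break values are hypothetical:

```python
import math

def binary_entropy(p):
    """H(p) = -p log p - (1-p) log (1-p), with H(0) = H(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

def h_mix_after(counts, thread):
    """Hypothetical H_mix if the robot executes one more action on `thread`.
    counts: thread -> (n_human, n_robot); weights assumed ~ thread activity."""
    updated = dict(counts)
    nh, nr = updated.get(thread, (0, 0))
    updated[thread] = (nh, nr + 1)
    total = sum(h + r for h, r in updated.values())
    return sum((h + r) / total * binary_entropy(h / (h + r))
               for h, r in updated.values())

def select_action(candidates, counts, future_rank):
    """Rank candidates by (resulting H_mix, predicted future-step position)."""
    return min(candidates,
               key=lambda a: (h_mix_after(counts, a["thread"]),
                              future_rank.get(a["step"], float("inf"))))

# Human has been working thread 1; acting on thread 2 keeps the threads unmixed.
counts = {1: (3, 0), 2: (0, 1)}
cands = [{"step": "tie_bag", "thread": 1}, {"step": "get_tie", "thread": 2}]
best = select_action(cands, counts, future_rank={"get_tie": 0, "tie_bag": 1})
print(best["step"])  # → get_tie
```

Acting on thread 1 would interleave robot and human actions on the same thread (raising $H_{\text{mix}}$), so the search prefers the candidate that advances the robot's own thread.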

6. Experimental Results

On the ProAct-75 test set, ProAct-Helper (7B) outperforms the closed-source state of the art (Gemini-2.5-Pro):

  • Trigger detection mF1: +6.21 ppt (69.39 → 75.60)
  • Step detection F1: +11.72 ppt (17.41 → 29.13)
  • Task detection F1: +14.47 ppt (52.11 → 66.58)
  • Saved Steps (SS): +0.25 (0.111 → 0.361)
  • Parallel Action Rate (PA): +15.58 ppt (18.37% → 33.95%)

Ablation studies indicate that incorporating $L_{\text{trig2task}}$ improves task mF1 by +3.60 ppt and step mF1 by +3.15 ppt, while adding $L_{\text{task2step}}$ yields +4.10 / +3.29 ppt. The full HBM produces the largest gains.

7. Significance, Applications, and Future Directions

ProAct-75 constitutes the first large-scale, step-level video benchmark pairing AND/OR DAGs (serial and parallel dependencies) with harmonized trigger→task→step annotations, covering diverse proactive domains. This enables rigorous research into agents that reason about intervention timing and choice, grounded in explicit procedural structure.

Applications encompass household assistants (e.g., trash-bag replacement, appliance maintenance), industrial collaboration (assembly support, tool handoff), and safety monitoring (risk mitigation, procedural oversight).

Foreseeable research avenues include:

  • Integration of learned graph-feasible decoding within agent generation loops
  • Reinforcement-learning or search exploiting explicit DAG structure
  • Expansion to open-world tasks with dynamic graph evolution and thread variability
  • Cross-domain continual and few-shot adaptation leveraging ProAct-75’s generality

By uniting multimodal perception with structured procedural graphs, the benchmark and baseline delineate a principled pathway toward the design of agents capable of understanding and co-executing complex human workflows (Zhu et al., 3 Feb 2026).
