ProAct-75 Benchmark

Updated 19 February 2026
  • The paper introduces ProAct-75, a dataset of 5,383 videos and 91,581 atomic action-step annotations structured by explicit directed acyclic graphs (DAGs).
  • It evaluates proactive agents through metrics like trigger detection mF1, Saved Steps, and Parallel Action Rate, highlighting improvements over state-of-the-art systems.
  • ProAct-75 is designed for assistance, maintenance, and safety monitoring, enabling research into agents that can plan interventions using serial and parallel procedural structures.

ProAct-75 is a large-scale benchmark developed to support the training and evaluation of structure-aware proactive agents—systems that, in contrast to passive agents, determine when and how to intervene in real-world processes to assist, maintain, or ensure safety. The benchmark provides a comprehensive, multimodal dataset of 75 tasks, each annotated at the atomic action-step level and formalized with explicit directed acyclic graphs (DAGs) representing procedural dependencies and opportunities for concurrent (parallel-threaded) execution. This enables quantitative assessment of agents beyond imitation, focusing on their ability to reason about task structure, initiate timely interventions, and conduct parallel actions (Zhu et al., 3 Feb 2026).

1. Dataset Structure and Composition

ProAct-75 encompasses three proactive-response domains: assistance (human-initiated objectives), maintenance (environment-triggered interventions), and safety monitoring (risk-aversion actions). The dataset covers 75 unique tasks sourced from exocentric videos (Ego-Exo4D, COIN, UCF-Crime) supplemented by 495 newly collected clips to balance coverage across activities. In total, it comprises 5,383 videos and 91,581 atomic action-step segments, with an explicit task graph for each of the 75 tasks.

The data is split approximately 3:1 into training ($N_{\text{train}} = 4{,}074$ videos) and test ($N_{\text{test}} = 1{,}309$ videos). For a "best-view" evaluation (one camera view per scene), there are $N_{\text{train}} = 1{,}905$ and $N_{\text{test}} = 516$; the remaining views constitute an out-of-distribution test set. Each action step is annotated with a timestamp span $[t_{\text{start}}, t_{\text{end}})$, a natural language label, and a trigger flag $y_t^{\text{trig}} \in \{0,1\}$ denoting intervention salience.

Table: Dataset Statistics

| Attribute | Value |
|---|---|
| Number of tasks | 75 |
| Total videos | 5,383 |
| Total step annotations | 91,581 |
| Domains | Assistance, Maintenance, Safety Monitoring |
| Split: Train / Test (videos) | 4,074 / 1,309 |
| Task graph per task | Yes (AND/OR DAG, multiple execution threads) |

2. Task Graph Formalism

Each ProAct-75 task $T$ is formalized as a DAG $T = (V, E)$, where $V$ is the set of nodes and $E$ the set of directed edges encoding temporal dependencies. Nodes partition into executable steps $V_e$ and structural non-executable nodes $V_n$, with $V_e \cap V_n = \varnothing$. A directed edge $(u \rightarrow v) \in E$ enforces ordering ($u$ must complete before $v$), and reachability is defined recursively:

$$a \in \mathrm{Reach}(b) \iff \big[\, a = b \,\vee\, \exists c : (c \rightarrow a) \in E \,\wedge\, c \in \mathrm{Reach}(b) \,\big].$$

Each node $v$ is assigned a type $\phi(v) \in \{\mathrm{AND}, \mathrm{OR}\}$, controlling execution dependencies:

  • AND-node: executes only when all predecessors have executed ($\mathrm{Pred}(v) \subseteq \mathrm{Prog}_t$)
  • OR-node: executes when any predecessor has executed ($\mathrm{Pred}(v) \cap \mathrm{Prog}_t \neq \varnothing$)

Here, $\mathrm{Prog}_t$ is the set of nodes executed by time $t$. The legal next actions at time $t$ form the set $A_t^{\text{legal}} = \{\, a \in V_e \setminus \mathrm{Prog}_t \mid \text{preconditions satisfied} \,\}$.

Mid-level start/end structural nodes induce execution threads. Branches between such nodes are mapped to threads via a mapping $\pi: V \rightarrow \mathbb{N}$, allowing explicit modeling of task parallelism (i.e., actions on distinct threads proceed concurrently when dependencies allow).
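The AND/OR semantics above can be sketched in a few lines of Python. The toy graph below is a hypothetical example constructed for illustration, not an actual ProAct-75 task graph:

```python
# Toy AND/OR task graph: node -> (type, predecessors).
# Hypothetical example, not an actual ProAct-75 task.
GRAPH = {
    "start":    ("AND", []),
    "open_bag": ("AND", ["start"]),
    "fill_bag": ("AND", ["open_bag"]),
    "get_tie":  ("AND", ["start"]),                # parallel thread
    "tie_bag":  ("AND", ["fill_bag", "get_tie"]),  # AND: needs both predecessors
    "dispose":  ("OR",  ["tie_bag", "fill_bag"]),  # OR: either predecessor suffices
}
STRUCTURAL = {"start"}  # non-executable structural nodes (V_n)

def executable(node, progress):
    """Check the AND/OR precondition of `node` against the executed set Prog_t."""
    typ, preds = GRAPH[node]
    if typ == "AND":
        return all(p in progress for p in preds)
    return any(p in progress for p in preds)  # OR-node

def legal_actions(progress):
    """A_t^legal: executable nodes not yet done whose preconditions hold."""
    return sorted(
        v for v in GRAPH
        if v not in STRUCTURAL and v not in progress and executable(v, progress)
    )

print(legal_actions({"start"}))                          # → ['get_tie', 'open_bag']
print(legal_actions({"start", "open_bag", "fill_bag"}))  # → ['dispose', 'get_tie']
```

Note how after `fill_bag`, the OR-node `dispose` already becomes legal even though `tie_bag` has not executed — exactly the flexibility the OR semantics are meant to capture.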

3. Annotation Scheme and Agent Outputs

Each atomic step $v \in V_e$ receives:

  • Frame span $[t_{\text{start}}, t_{\text{end}})$
  • Natural language label (e.g., "Tie the bag")
  • Trigger flag $y_t^{\text{trig}} \in \{0,1\}$

At each agent decision window, outputs required are:

  • Trigger prediction $y_t^{\text{trig}}$
  • Task prediction $y_t^{\text{task}} \in \{1,\dots,75\} \cup \{\text{other}\}$
  • Step prediction $y_t^{\text{step}} \in V$
  • Future step sequence $\hat{y}_{t+1:t+n}$
  • Proactive action $a_{t+1}^*$

This annotation schema supports fine-grained evaluation not only of what the agent does, but also of when it intervenes and which thread it prioritizes, under the procedural constraints of the DAG.
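In code, one decision window's outputs might be represented as follows. The field names and example values are illustrative assumptions, not the benchmark's release format:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AgentOutput:
    """One decision window's required agent outputs (illustrative schema)."""
    trigger: int                  # y_t^trig in {0, 1}
    task: str                     # task id "1".."75" or "other"
    step: Optional[str] = None    # current step node, if triggered
    future_steps: List[str] = field(default_factory=list)  # ŷ_{t+1:t+n}
    proactive_action: Optional[str] = None                 # a*_{t+1}

# Hypothetical triggered window: the agent identifies the task and step,
# forecasts upcoming steps, and proposes a proactive action.
out = AgentOutput(trigger=1, task="12", step="fill_bag",
                  future_steps=["tie_bag", "dispose"],
                  proactive_action="get_tie")
print(out.proactive_action)  # → get_tie
```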

4. Evaluation Metrics

ProAct-75 employs distinct metrics for key challenges in proactive response:

  • Trigger Detection: Macro-averaged F1 (mF1) and accuracy, with mF1 averaged over both classes: $\text{mF}_1 = \tfrac{1}{2}\big(F_1(\text{neg}) + F_1(\text{pos})\big)$.
  • Proactive Action Selection:
    • Saved Steps (SS): For each video $i$, with $B_i$ total human steps and $H_i$ human steps remaining after intervention, $S_i = B_i - H_i$ and $SS = \frac{1}{N}\sum_i S_i$. For one-step online inference, $SS = \frac{1}{N_s} \sum_j \mathbf{1}\{\text{robot } a_j \text{ matches ground truth}\}$.
    • Parallel Action Rate (PA): With $N^R$ total robot actions and $P$ the set of robot actions advancing a new thread ($\pi(a) \neq \pi(h_{\text{prev}})$), $PA = |P| / N^R$.
    • Thread-mixing entropy: For each thread $k$, the mixing ratio is $p_k = n_k^{\text{hum}} / (n_k^{\text{hum}} + n_k^{\text{rob}})$, with binary entropy $H_k(p_k) = -p_k \log p_k - (1 - p_k)\log(1 - p_k)$. The aggregate thread-mixing entropy is $H_{\text{mix}} = \sum_k w_k H_k(p_k)$, where $w_k$ normalizes by thread activity. Proactive actions are selected to minimize $H_{\text{mix}}$.
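The action-selection metrics above can be sketched directly from their definitions. The trajectories below are hypothetical, and the activity weights $w_k$ are assumed to be proportional to per-thread action counts:

```python
import math

def saved_steps(total_human_steps, remaining_human_steps):
    """SS averaged over videos: S_i = B_i - H_i."""
    pairs = list(zip(total_human_steps, remaining_human_steps))
    return sum(b - h for b, h in pairs) / len(pairs)

def parallel_action_rate(robot_threads, prev_human_threads):
    """PA = |P| / N^R: fraction of robot actions on a thread different
    from the human's preceding action."""
    parallel = sum(1 for r, h in zip(robot_threads, prev_human_threads) if r != h)
    return parallel / len(robot_threads)

def thread_mixing_entropy(counts):
    """H_mix = sum_k w_k * H(p_k) with p_k = n_hum / (n_hum + n_rob);
    counts maps thread -> (n_human, n_robot). w_k ~ thread activity (assumed)."""
    total = sum(nh + nr for nh, nr in counts.values())
    h_mix = 0.0
    for nh, nr in counts.values():
        p = nh / (nh + nr)
        h = 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)
        h_mix += (nh + nr) / total * h
    return h_mix

print(saved_steps([10, 8], [7, 6]))                   # → 2.5
print(parallel_action_rate([1, 2, 1], [1, 1, 2]))     # 2 of 3 actions parallel
print(thread_mixing_entropy({1: (4, 0), 2: (0, 3)}))  # fully separated → 0.0
```

A perfectly "unmixed" collaboration — human on one thread, robot on another — yields zero entropy, which is why minimizing $H_{\text{mix}}$ favors complementary rather than imitative actions.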

These metrics align evaluation with the structural and temporal aspects inherent in proactive procedural assistance.

5. ProAct-Helper Framework and Methodology

ProAct-Helper serves as a reference architecture based on a multimodal LLM (Qwen2.5-VL-Instruct, 3B/7B parameters), fine-tuned with LoRA under an instruction-tuning regime targeting three objectives:

  • $L_{CE}$: standard autoregressive cross-entropy
  • $L_{\text{trig}}$: binary classification loss on trigger tokens
  • $L_{\text{bind}}$: Hierarchical Binding Module (HBM) loss

The total loss is $L = L_{CE} + \lambda_{\text{trig}} L_{\text{trig}} + \lambda_{\text{bind}} L_{\text{bind}}$. The HBM mitigates data imbalance in trigger→task→step prediction via cross-level InfoNCE contrastive binding, increasing the discriminability of hierarchical outputs.

Input at each timestep $t$ consists of a 5-frame sliding window of keyframes, processed in two prompt stages:

  1. Prediction of $\{\text{is\_trigger}, \text{task}\}$
  2. If triggered, prediction of $\{\text{current\_step}, \text{5 future\_steps}, \text{priority scores}\}$

Entropy-driven heuristic search is then applied: candidate actions $A_t^{\text{cand}}$ (filtered legal next steps) are ranked by minimizing thread-mixing entropy $H_{\text{mix}}$, with a lexicographic tie-break using predicted future-step positions.

A core property is explicit support for parallel thread execution—the agent may select an action advancing a distinct procedural thread, enabling concurrent progress instead of naïvely mirroring the human’s immediate next step.
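The entropy-driven selection step can be sketched as follows. This is a minimal illustration under assumed activity weighting, not ProAct-Helper's implementation; candidate structures and the future-rank tie-break values are hypothetical:

```python
import math

def binary_entropy(p):
    """H(p) = -p log p - (1-p) log (1-p), with H(0) = H(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

def h_mix_after(counts, thread):
    """Hypothetical H_mix if the robot executes one more action on `thread`.
    counts: thread -> (n_human, n_robot); weights assumed ~ thread activity."""
    updated = dict(counts)
    nh, nr = updated.get(thread, (0, 0))
    updated[thread] = (nh, nr + 1)
    total = sum(h + r for h, r in updated.values())
    return sum((h + r) / total * binary_entropy(h / (h + r))
               for h, r in updated.values())

def select_action(candidates, counts, future_rank):
    """Rank candidates by (resulting H_mix, predicted future-step position)."""
    return min(candidates,
               key=lambda a: (h_mix_after(counts, a["thread"]),
                              future_rank.get(a["step"], float("inf"))))

# Human has been working thread 1; acting on thread 2 keeps the threads unmixed.
counts = {1: (3, 0), 2: (0, 1)}
cands = [{"step": "tie_bag", "thread": 1}, {"step": "get_tie", "thread": 2}]
best = select_action(cands, counts, future_rank={"get_tie": 0, "tie_bag": 1})
print(best["step"])  # → get_tie
```

Acting on thread 1 would interleave robot and human actions on the same thread (raising $H_{\text{mix}}$), so the search prefers the candidate that advances the robot's own thread.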

6. Experimental Results

On the ProAct-75 test set, ProAct-Helper (7B) outperforms the closed-source state of the art (Gemini-2.5-Pro):

  • Trigger detection mF1: +6.21 ppt (69.39 → 75.60)
  • Step detection F1: +11.72 ppt (17.41 → 29.13)
  • Task detection F1: +14.47 ppt (52.11 → 66.58)
  • Saved Steps (SS): +0.25 (0.111 → 0.361)
  • Parallel Action Rate (PA): +15.58 ppt (18.37% → 33.95%)

Ablation studies indicate that incorporating $L_{\text{trig2task}}$ improves task mF1 by +3.60 ppt and step mF1 by +3.15 ppt, while adding $L_{\text{task2step}}$ yields +4.10 / +3.29 ppt. The full HBM produces the largest gains.

7. Significance, Applications, and Future Directions

ProAct-75 constitutes the first large-scale, step-level video benchmark pairing AND/OR DAGs (serial and parallel dependencies) with harmonized trigger→task→step annotations, covering diverse proactive domains. This enables rigorous research into agents that reason about intervention timing and choice, grounded in explicit procedural structure.

Applications encompass household assistants (e.g., trash-bag replacement, appliance maintenance), industrial collaboration (assembly support, tool handoff), and safety monitoring (risk mitigation, procedural oversight).

Foreseeable research avenues include:

  • Integration of learned graph-feasible decoding within agent generation loops
  • Reinforcement-learning or search exploiting explicit DAG structure
  • Expansion to open-world tasks with dynamic graph evolution and thread variability
  • Cross-domain continual and few-shot adaptation leveraging ProAct-75’s generality

By uniting multimodal perception with structured procedural graphs, the benchmark and baseline delineate a principled pathway toward the design of agents capable of understanding and co-executing complex human workflows (Zhu et al., 3 Feb 2026).
