WorfBench: Evaluating LLM Workflow Generation
- WorfBench is a benchmark that assesses LLMs’ capability to convert open-ended tasks into directed acyclic graph (DAG) workflows.
- The evaluation suite, WorfEval, uses semantic node matching, chain alignment, and graph overlap metrics to measure workflow quality.
- Experimental findings show that even state-of-the-art models such as GPT-4 exhibit notable planning gaps, highlighting the need for improved agentic reasoning.
WorfBench is a unified benchmark for evaluating the workflow generation capabilities of LLMs in agentic scenarios involving complex, multi-step planning and reasoning. Standard functional or language understanding benchmarks do not assess the ability of LLMs to decompose open-ended instructions into executable workflows, particularly those with a non-linear structure such as directed acyclic graphs (DAGs). WorfBench addresses this deficiency by providing scenario diversity, graph-structured workflow representations, and an automated evaluation suite (WorfEval) that enables fine-grained, reproducible assessment of both sequence planning and full-graph planning in LLM-generated workflows (Qiao et al., 2024, 2505.19433).
1. Motivation and Scope
The increasing adoption of LLM agents for reasoning, tool use, and real-world automation requires robust evaluation of their ability to convert complex natural-language tasks into structured workflows. Existing workflow generation benchmarks have three major limitations:
- Restricted coverage, focusing on function calls or toy examples and failing to capture the breadth of planning/reasoning tasks.
- Simplistic workflow topology, evaluating only linear chains of subtasks rather than realistic DAGs supporting parallel and join dependencies.
- Lax or inconsistent evaluation, often restricting assessment to holistic end-point accuracy, coarse human judgment, or lacking standardized, fine-grained metrics.
WorfBench directly addresses these limitations by encompassing multiple real-world agentic scenarios, including function calling, embodied planning, problem-solving, and open-domain (web-grounded) planning. Workflows are modeled as DAGs and evaluated using a principled protocol (WorfEval) capable of quantifying both sequence planning and higher-order dependency planning (Qiao et al., 2024, 2505.19433).
2. Benchmark Design and Scenario Composition
WorfBench comprises four primary workflow categories corresponding to distinct agentic task domains:
| Task Category | Example Scenario | Instance Count (train+test) |
|---|---|---|
| Function Calling | API/tool sequences | 1,803 |
| Embodied Planning | ALFWorld, robotics | 4,048 |
| Problem-Solving | Multi-step reasoning | 4,257 |
| Open-Grounded Tasks | WikiHow, web automation | 2,281 |
Each workflow instance is defined as a DAG G = (N, E), where N is the set of minimal-granularity subtasks (nodes) and E encodes execution dependencies (edges). Every scenario prompt is a natural-language task description, and LLMs must output a JSON graph specifying actionable steps and their inter-dependencies. Nodes are required to be executed in a feasible topological order from a START to an END node. Sequences of nodes forming valid chains and parallel branches allow the benchmark to probe both linear and non-linear planning competencies (Qiao et al., 2024, 2505.19433).
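The DAG representation and the topological-order requirement can be sketched concretely. The workflow below is a hypothetical example (the exact JSON schema WorfBench prescribes may differ), and the validity check is a standard Kahn's-algorithm topological sort:

```python
from collections import deque

# Hypothetical workflow for "make a cup of tea"; the schema here is an
# illustrative approximation of a JSON-encoded workflow DAG.
workflow = {
    "nodes": ["START", "boil water", "get cup", "add tea bag", "pour water", "END"],
    "edges": [
        ["START", "boil water"],
        ["START", "get cup"],           # parallel branch
        ["get cup", "add tea bag"],
        ["boil water", "pour water"],
        ["add tea bag", "pour water"],  # join dependency
        ["pour water", "END"],
    ],
}

def topological_order(nodes, edges):
    """Kahn's algorithm: return a feasible execution order, or None if cyclic."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order if len(order) == len(nodes) else None

order = topological_order(workflow["nodes"], workflow["edges"])
print(order)  # a feasible order from START to END, or None for a cyclic graph
```

Because "boil water" and "get cup" share no dependency, any order interleaving them is valid; this is exactly the non-linear structure that chain-only benchmarks cannot represent.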
3. Evaluation Protocol: WorfEval
WorfEval is a multi-stage evaluation framework designed to measure both node-level and structural fidelity between LLM-predicted workflows and annotated ground-truth workflows. The main stages are:
A. Semantic Node Matching
Predicted and reference nodes are encoded with Sentence-BERT, and a cosine-similarity threshold (β = 0.6) is used to construct a max-weight bipartite matching, producing one-to-one correspondences between predicted and gold nodes.
B. Chain (Subsequence) Evaluation
For up to 20 topological orderings of the gold DAG, the Longest Increasing Subsequence (LIS) of matched nodes in the predicted chain is computed, yielding f1_chain = 2PR / (P + R), where P and R are precision and recall over LIS-aligned nodes.
C. Graph (Subgraph) Evaluation
A Maximum Common Induced Subgraph (MCIS) is computed between the predicted and gold graphs (restricted to matched nodes), yielding f1_graph = 2PR / (P + R), where P and R are computed over the MCIS node set.
This evaluation distinguishes a model’s ability to recover correct subtask ordering (chain) from its ability to model full workflow structure and dependencies (graph). Edge-level precision, recall, and F1 from WorfEval allow for reproducible, quantitative assessment at high resolution (Qiao et al., 2024, 2505.19433).
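The node-matching and chain-evaluation stages can be sketched as follows. This is a simplified stand-in: the paper embeds node texts with Sentence-BERT and solves a max-weight bipartite matching, whereas here the similarity matrix is given as toy data and the matching is greedy:

```python
import bisect

def match_nodes(sim, beta=0.6):
    """Greedy one-to-one matching of predicted to gold nodes whose similarity
    clears the threshold (a simplified stand-in for bipartite matching)."""
    pairs = sorted(
        ((sim[i][j], i, j)
         for i in range(len(sim)) for j in range(len(sim[0]))
         if sim[i][j] >= beta),
        reverse=True,
    )
    used_p, used_g, matches = set(), set(), {}
    for _, i, j in pairs:
        if i not in used_p and j not in used_g:
            matches[i] = j  # predicted node i <-> gold node j
            used_p.add(i)
            used_g.add(j)
    return matches

def lis_length(seq):
    """Length of the longest strictly increasing subsequence, O(n log n)."""
    tails = []
    for x in seq:
        k = bisect.bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)

def chain_f1(pred_order, gold_order, matches):
    """f1_chain for one topological ordering of the gold workflow."""
    gold_pos = {g: k for k, g in enumerate(gold_order)}
    aligned = [gold_pos[matches[p]] for p in pred_order if p in matches]
    lis = lis_length(aligned)
    prec, rec = lis / len(pred_order), lis / len(gold_order)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Toy check: the prediction recovers 3 of 4 gold steps in the right order.
sim = [
    [0.95, 0.10, 0.20, 0.15],
    [0.12, 0.91, 0.18, 0.22],
    [0.25, 0.30, 0.40, 0.88],  # predicted node 2 matches gold node 3
]
m = match_nodes(sim)
print(chain_f1([0, 1, 2], [0, 1, 2, 3], m))  # ~0.857 (P = 1.0, R = 0.75)
```

The full protocol would repeat `chain_f1` over up to 20 gold topological orderings and keep the best alignment; graph evaluation replaces the LIS with an MCIS computation over the matched node set.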
4. Experimental Findings and Model Diagnostics
WorfBench has facilitated comprehensive comparative studies across 18 LLM architectures (closed- and open-source). Key results include:
- Closed-source model ceiling: GPT-4 attains the strongest chain (f1_chain) and graph (f1_graph) scores on WorfBench, yet a persistent roughly 15% gap between chain and graph planning marks a consistent structural-reasoning barrier, even for state-of-the-art models.
- Open-source model scaling: Larger open-source models (e.g., Qwen-2-72B) outperform smaller or older ones, but a fine-tuned 7B model (Qwen-2-7B+FT) can surpass GPT-4 on held-in tasks.
- Limited generalization: Fine-tuned open LLMs show strong held-in performance, but cross-scenario generalization (e.g., between function calling and embodied planning) is modest. On held-out InterCodeSQL, even leading models' scores drop sharply.
- Compression robustness: Quantization (AWQ, GPTQ) and structured pruning (Wanda, SparseGPT) preserve workflow generation with only marginal F1 loss, but naïve magnitude pruning collapses F1 completely. Activation-aware quantization may even improve F1 slightly.
- Correlation diagnostics: Effective rank (eRank) of weight matrices and Top-K token ranking consistency post-compression predict downstream workflow F1, serving as proxies for compression quality (2505.19433).
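The effective-rank diagnostic has a standard closed form (Roy & Vetterli, 2007): the exponential of the Shannon entropy of the normalized singular-value distribution of a weight matrix. A minimal sketch, using toy matrices rather than actual LLM weights:

```python
import numpy as np

def effective_rank(w, eps=1e-12):
    """Effective rank: exp of the Shannon entropy of the normalized
    singular-value distribution (Roy & Vetterli, 2007)."""
    s = np.linalg.svd(w, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero mass before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))

# Two limiting cases: an isotropic matrix uses all directions equally,
# while a rank-1 matrix concentrates everything in one direction.
print(effective_rank(np.eye(8)))        # 8.0
print(effective_rank(np.ones((8, 8))))  # 1.0
```

The reported finding is that a large post-compression drop in eRank (or in Top-K token-ranking consistency) flags models whose workflow-generation F1 is likely to degrade, which makes these quantities cheap early diagnostics before running the full benchmark.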
5. Practical Impact and Downstream Applications
WorfBench enables assessment of the practical utility of LLM-generated workflows for downstream tools and agents. Key observations:
- Embodied Planning: Supplying LLM-generated workflows to planners raises ALFWorld task success by +14–19 percentage points (seen/unseen splits) for GPT-4.
- Function Calling: Using workflow nodes as prompts for chain-of-thought (CoT) significantly increases function-call accuracy over one-shot inference on StableToolBench.
- Inference Efficiency: Exploiting DAG structure for parallel execution reduces end-to-end inference latency by 20–33% compared to linear agents.
- Strategy Efficiency: Explicit planning via workflows decreases the average number of required vertical “trial-and-error” reasoning steps.
These results illustrate that even imperfect LLM-planned workflows yield material benefits for both immediate downstream accuracy and computational efficiency (Qiao et al., 2024).
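The latency benefit of DAG-aware execution follows from critical-path scheduling: with unlimited parallelism, end-to-end latency is the longest dependency chain rather than the sum of all steps. A small sketch with hypothetical per-step latencies:

```python
import functools

# Hypothetical per-step latencies (seconds) for a four-step workflow DAG.
latency = {"A": 2.0, "B": 3.0, "C": 1.5, "D": 2.5}
deps = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}  # A and B run in parallel

@functools.lru_cache(maxsize=None)
def finish_time(node):
    """Earliest completion time assuming unlimited parallelism."""
    start = max((finish_time(d) for d in deps[node]), default=0.0)
    return start + latency[node]

sequential = sum(latency.values())            # linear agent: run every step in turn
parallel = max(finish_time(n) for n in deps)  # DAG-aware agent: critical path only
print(f"linear: {sequential:.1f}s, parallel: {parallel:.1f}s "
      f"({100 * (1 - parallel / sequential):.0f}% faster)")
# linear: 9.0s, parallel: 7.0s (22% faster)
```

In this toy graph the saving is 22%, in line with the 20–33% range reported above; the actual saving for any workflow depends on how much of its latency lies off the critical path.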
6. Open Challenges and Future Directions
While WorfBench imposes strict evaluation via topological consistency and human validation, it remains focused on natural-language workflow generation. The current protocol assumes every node must be executed (no conditionals or optional branches), and it does not support formal planning languages such as PDDL. Iterative or interactive workflow refinement by LLM agents is not yet evaluated—current workflows are generated one-pass, with no feedback from intermediate observations. Tackling these challenges would bring the evaluation closer to real-world agentic scenarios.
Further, significant sequence–graph planning gaps persist at the frontier of LLM capabilities, and cross-domain generalization remains limited. Deep integration of explicit world knowledge and commonsense reasoning is likely required to close these gaps in graph-structured agentic planning. Recent research (e.g., KoLA, World Knowledge Modeling) may offer pathways for improvement. Continuous validation of distilled agents is essential, as language modeling or natural language understanding benchmarks do not reliably indicate preservation of workflow skills after compression (2505.19433, Qiao et al., 2024).
7. Recommended Usage and Toolchains
WorfBench resources are openly available (https://github.com/zjunlp/WorfBench; https://github.com/pprp/ACBench). The standard process involves:
- Selecting scenarios and generating evaluation prompts.
- Instructing LLMs to generate JSON-encoded workflow DAGs in response.
- Parsing outputs and running WorfEval for granular F1 assessment.
- Optionally profiling sensitivity to model type (size, training, fine-tuning) and compression method (quantization/pruning), aided by eRank or Top-K consistency as early diagnostic signals.
This workflow offers a reproducible, standardized environment for benchmarking advances in agentic reasoning and workflow generation.