WorFBench: LLM Workflow Generation Benchmark
- WorFBench is a benchmark designed to assess LLMs' capability to decompose complex tasks into executable directed acyclic graph (DAG) workflows.
- It employs rigorous metrics—including node, chain, and graph matching—to quantify planning accuracy and diagnose error modes in workflow generation.
- Experimental results show that incorporating workflow planning improves agent performance and helps evaluate LLM compression impacts in real-world applications.
WorFBench is a unified benchmark and evaluation suite specifically designed to assess and quantify the workflow generation capabilities of LLMs in agentic and reasoning-intensive contexts. It provides a rigorous, multi-scenario environment for measuring how well LLMs can decompose complex tasks into executable, graph-structured workflows, emphasizing both serial and parallel dependencies. WorFBench forms a core component of the Agent Compression Benchmark (ACBench) to evaluate the impact of post-training LLM compression on agentic abilities including workflow planning, tool use, long-context understanding, and real-world task integration (2505.19433, Qiao et al., 2024).
1. Scope and Problem Definition
WorFBench addresses the limitations of prior workflow evaluation resources, which predominantly targeted linear, function-call centric, or natural language understanding tasks using simplistic metrics such as perplexity or single-turn accuracy. Unlike these benchmarks, WorFBench exercises LLMs across multi-turn planning, graph-structured decomposition, and scenarios requiring the integration of tool-use and long-horizon reasoning. Its core objective is to systematically evaluate an LLM’s ability to generate workflows as directed acyclic graphs (DAGs) that encapsulate minimum-executable subtask granularity and complex interdependencies. This multi-faceted benchmark covers:
- Function-call tasks: Generation of API or function call sequences (1,803 examples).
- Embodied tasks: Multi-step planning in simulated environments (4,048 examples).
- Problem-solving tasks: Structured decomposition of complex reasoning problems (4,257 examples).
- Open-grounded tasks: Planning that interleaves external tool use, free-form instructions, and heterogeneous constraints (2,281 examples).
WorFBench is incorporated as the workflow generation capability within ACBench, which also covers action execution (tool use/function-calling), long-context understanding, and real-world applications (2505.19433, Qiao et al., 2024).
2. Dataset Construction and Task Formulation
WorFBench data construction is based on the framework detailed in Qiao et al., 2024, with a total of approximately 18,000 training examples, 2,146 test examples, and 723 held-out tasks for generalization evaluation. Each sample consists of a natural-language instruction (optionally with situation or environment context) accompanied by a target workflow represented programmatically as a JSON-encoded DAG: nodes correspond to discrete actions or reasoning steps; edges represent control or data dependencies. Task prompts are templated (e.g., “You are an agent… plan steps to…”) and require LLMs to output workflows comprising 3–10 actions. Workflow complexity is systematically varied, spanning simple linear chains (function-calling) to multi-branching and deeply nested graphs encountered in embodied and open-grounded scenarios (2505.19433, Qiao et al., 2024).
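To make the data format concrete, the following is a minimal illustrative sample in the spirit of the description above: an instruction paired with a JSON-encoded DAG whose nodes are minimum-executable subtasks and whose edges encode dependencies. The field names and task content here are hypothetical, not the benchmark's exact schema.

```python
import json

# Hypothetical WorFBench-style sample (illustrative schema, not the
# benchmark's exact field names): a natural-language instruction paired
# with a gold workflow encoded as a DAG.
sample = {
    "instruction": "Book a flight to Paris and reserve a hotel near the venue.",
    "workflow": {
        # Nodes are minimum-executable subtasks.
        "nodes": [
            {"id": 1, "action": "search flights to Paris"},
            {"id": 2, "action": "book the cheapest direct flight"},
            {"id": 3, "action": "search hotels near the venue"},
            {"id": 4, "action": "reserve a hotel for the travel dates"},
        ],
        # Edges encode dependencies: (1,2) and (3,4) are serial chains
        # that are mutually independent, i.e., parallel branches of the DAG.
        "edges": [[1, 2], [3, 4]],
    },
}

print(json.dumps(sample, indent=2))
```

A purely linear function-call task would use a single chain of edges, while embodied and open-grounded tasks produce the multi-branching graphs described above.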
Workflows in the dataset exhibit both topological diversity and realistic task compositions, including serial and parallel subtask structures, derived from domains such as API orchestration (ToolBench, ToolAlpaca), mathematical and commonsense multi-step reasoning (Lumos-O), embodied household/web navigation (ALFWorld, WebShop, OS), and complex procedural instructions (WikiHow).
3. Evaluation Metrics and Protocols
The evaluation protocol is implemented through the WorFEval engine, which rigorously aligns predicted and gold-standard workflows via semantic and structural matching:
- Node Matching: Correspondence between predicted and gold nodes is established through bipartite maximum-weight matching with Sentence-BERT-based semantic similarity, thresholded at β=0.6.
- Subsequence (Node Chain) Matching: Multiple (≤20) topological orders of the gold workflow are considered to extract the maximal length of matched action sequences. Precision, recall, and F1 (F1_chain) are computed as P_chain = m/|W_pred|, R_chain = m/|W_gold|, and F1_chain = 2·P_chain·R_chain/(P_chain + R_chain), where m is the length of the maximal matched subsequence and |W_pred|, |W_gold| are the lengths of the predicted and gold workflows.
- Subgraph (Workflow Graph) Matching: A maximum common induced subgraph (MCIS) is computed over the matched node subgraphs, yielding analogous precision, recall, and F1 (F1_graph).
- General Compression Metrics (via ACBench): To systematically study LLM compression effects, three additional metrics are employed:
- Efficient Rank (eRank): Assesses effective dimensionality of logits/weights.
- Top-k Ranking Correlation: Measures token rank preservation post-compression, using the Spearman correlation ρ over the top-k token ranks.
- Energy-based Metric: Tracks the absolute difference in negative log-sum-exp confidence, |E_orig − E_comp| with E = −log Σ exp(logits), between original and compressed logits.
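The node and chain matching steps above can be sketched in a few lines. This is a simplified stand-in, not the WorFEval implementation: it uses string-overlap similarity in place of Sentence-BERT cosine similarity, greedy one-to-one matching in place of maximum-weight bipartite matching, and a single linear gold order rather than the ≤20 topological orders the protocol enumerates.

```python
from difflib import SequenceMatcher

def node_similarity(a: str, b: str) -> float:
    # Stand-in for WorFEval's Sentence-BERT cosine similarity;
    # a string-overlap ratio keeps this sketch dependency-free.
    return SequenceMatcher(None, a, b).ratio()

def match_nodes(pred, gold, beta=0.6):
    # Greedy one-to-one matching above threshold beta (the protocol
    # uses maximum-weight bipartite matching; greedy approximates it).
    matched, used = {}, set()
    for i, p in enumerate(pred):
        best, best_sim = None, beta
        for j, g in enumerate(gold):
            if j in used:
                continue
            sim = node_similarity(p, g)
            if sim >= best_sim:
                best, best_sim = j, sim
        if best is not None:
            matched[i] = best
            used.add(best)
    return matched

def lcs_length(a, b):
    # Longest common subsequence = maximal matched action sequence m.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def chain_f1(pred, gold):
    # F1_chain from P = m/|pred| and R = m/|gold|, using one gold order.
    matched = match_nodes(pred, gold)
    pred_ids = [matched[i] for i in range(len(pred)) if i in matched]
    m = lcs_length(pred_ids, list(range(len(gold))))
    p, r = m / len(pred), m / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

pred = ["search flights", "book flight", "reserve hotel"]
gold = ["search flights", "book flight", "search hotels", "reserve hotel"]
print(round(chain_f1(pred, gold), 3))  # → 0.857
```

Here the prediction misses one gold subtask, so recall drops to 3/4 while precision stays at 1, giving F1_chain = 6/7 ≈ 0.857.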
These protocols are geared to expose differences in sequential versus graph-structured planning, providing precise quantification of agentic workflow generation (Qiao et al., 2024, 2505.19433).
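The two logit-level compression metrics can likewise be sketched directly from a pair of logit vectors. This is a simplified illustration under stated assumptions: the Spearman-style score is computed only over the original model's top-k token ids with the classical rank-difference formula, and E = −logsumexp(logits) is taken as the energy definition.

```python
import math

def topk_spearman(logits_a, logits_b, k=5):
    # Rank all tokens by logit in each model, then apply the classical
    # Spearman formula over the original model's top-k token ids.
    # A sketch of the ranking-preservation metric, not ACBench's exact code.
    order_a = sorted(range(len(logits_a)), key=lambda i: -logits_a[i])
    order_b = sorted(range(len(logits_b)), key=lambda i: -logits_b[i])
    rank_a = {tok: r for r, tok in enumerate(order_a)}
    rank_b = {tok: r for r, tok in enumerate(order_b)}
    d2 = sum((rank_a[tok] - rank_b[tok]) ** 2 for tok in order_a[:k])
    return 1 - 6 * d2 / (k * (k ** 2 - 1))

def energy_delta(logits_a, logits_b):
    # |E_a - E_b| with E = -logsumexp(logits): the absolute difference in
    # energy-based confidence between original and compressed logits.
    def neg_lse(x):
        m = max(x)  # max-shift for numerical stability
        return -(m + math.log(sum(math.exp(v - m) for v in x)))
    return abs(neg_lse(logits_a) - neg_lse(logits_b))

orig = [2.0, 1.0, 0.5, -1.0, -2.0, 0.0]
comp = [1.9, 1.1, 0.4, -0.9, -2.1, 0.1]
print(topk_spearman(orig, comp), round(energy_delta(orig, comp), 4))
```

In this toy case compression perturbs the logits without reordering any tokens, so the top-k correlation stays at 1.0 and only the energy term registers the shift.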
4. Experimental Results
Comprehensive benchmarking across GPT-4, GPT-3.5-turbo, Claude-3.5, Qwen-2, Llama-2/3, Vicuna, Mistral, Mixtral, Phi-3, GLM-4, InternLM-2.5, WizardLM, and fine-tuned 7B–72B open models establishes characteristic performance on WorFBench:
- Overall accuracy: Across 18 models, F1_chain ranges from 60%–75%, F1_graph from 35%–62%, with chain planning consistently outperforming graph planning by 15–20 percentage points (even GPT-4: F1_chain=74.9%, F1_graph=62.1%; Δ≈13pp).
- Scaling law: Larger models trend toward higher F1 due to increased knowledge retrieval and relational reasoning capacity. Recent well-fine-tuned 7B models can surpass older 13B checkpoints.
- Generalization: While GPT-4 achieves high F1 on held-out function-calling (Seal-Tools, F1_chain=96.6%, F1_graph=80.3%), open models show limited transfer to distinct domains (e.g., InterCodeSQL), suggesting data fit dominates zero-shot generalization (Qiao et al., 2024).
- Workflow complexity: Performance declines with increasing graph size/edge count. For workflows ≥10 steps, F1_chain falls below 65%, F1_graph below 50%, indicating persisting limitations for long-horizon, multi-branching tasks.
- Error modes: Analysis of samples with F1_graph<0.5 shows 35% granularity mismatches, 30% explicitness (vague subtasks), 25% graph structure errors, and 10% formatting violations.
- Compression sensitivity: Under quantization (GPTQ-INT8, AWQ-INT4), F1 degrades by <5%; unstructured pruning via Wanda or SparseGPT causes similar mild losses. However, 2:4 and magnitude-based pruning often collapses output structure (F1→0). Smaller models (≤7B) are significantly more fragile than large models (≥32B) with aggressive compression (2505.19433).
5. Practical Applications and Downstream Impact
WorFBench enables actionable diagnosis of planning and workflow decomposition weaknesses in LLMs and demonstrates practical benefits as an inductive prior for diverse agentic tasks:
- Workflow-augmented agents: Injecting explicit workflow planning output from advanced models as prior structure improves end-to-end agent execution. For instance, GPT-4 + WorFBench workflows improve ALFWorld task success rates by 13.6–18.6pp over vanilla GPT-4; similar boosts are observed for Llama and Qwen-2 models (Qiao et al., 2024).
- Enhanced chain-of-thought (CoT): Workflow-augmented planning increases function-call accuracy by 3–7pp relative to one-shot CoT, both for large open models and GPT-4.
- Efficiency gains: Identifying critical paths and exploiting task parallelism in generated workflows yield 20–35% reductions in average downstream inference time.
- Shortened planning: Integrating workflow knowledge reduces the average number of required planning steps, further streamlining agent behavior.
- Compression for edge deployment: Best-practice compression (AWQ-INT4, GPTQ-INT8, Wanda unstructured pruning ≤50%) enables LLM workflow planning on constrained devices with ≤5% F1 loss for models ≥7B. Validation via Top-k ranking correlation and the energy-based metric is recommended prior to downstream deployment (2505.19433).
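The parallelism-based efficiency gains above come from scheduling independent subtasks concurrently. A minimal sketch, assuming the JSON DAG encoding described earlier: group nodes into topological levels, where every level can execute in parallel and the number of levels equals the critical-path length.

```python
from collections import defaultdict, deque

def parallel_schedule(nodes, edges):
    # Group DAG nodes into topological "levels": nodes within a level
    # have no mutual dependencies and can run concurrently.
    # The number of levels equals the critical-path length.
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    frontier = deque(n for n in nodes if indeg[n] == 0)
    levels = []
    while frontier:
        level = list(frontier)
        levels.append(level)
        frontier = deque()
        for u in level:
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    frontier.append(v)
    return levels

# Two independent two-step chains (e.g., flight booking and hotel booking):
nodes = [1, 2, 3, 4]
edges = [(1, 2), (3, 4)]
print(parallel_schedule(nodes, edges))  # → [[1, 3], [2, 4]]
```

Executing the four subtasks serially takes four steps; the two-level schedule halves the wall-clock depth, which is the mechanism behind the reported downstream inference-time reductions.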
6. Implementation Resources and Community Adoption
WorFBench supplies a complete suite including dataset, prompt templates, flow-graph annotation protocols, and the WorFEval scoring engine. Resources are publicly available under an MIT-style license at https://github.com/zjunlp/WorfBench and https://github.com/pprp/ACBench. The evaluation scripts support metric automation and large-scale benchmarking for both academic model evaluation and real-world agentic system deployment (Qiao et al., 2024, 2505.19433).
WorFBench has catalyzed comparative analysis of closed-source and open-source LLMs, and serves as a standard for principled evaluation of workflow generation, both for base modeling and for assessing trade-offs and strategies in LLM post-training compression. The growing gap between chain and graph planning accuracy, and the clear benefits of workflow-planning priors for end-to-end agent performance, foreground the ongoing need for advances in agentic LLM architectures and training methodologies.