Agentic Workflow Reconstruction

Updated 7 February 2026

Agentic Workflow Reconstruction (AWR) is a formal process for synthesizing explicit, interpretable workflow representations in LLM-powered agentic systems.
It employs both white-box and black-box methods to automatically generate modular operator nodes that streamline multi-step reasoning and automation.
Frameworks like A²Flow and DyFlow leverage techniques such as Monte Carlo Tree Search and operator memory to boost proxy fidelity and operational efficiency.

Agentic Workflow Reconstruction (AWR) is the formal process of synthesizing explicit, interpretable workflow representations for agentic systems. These systems, typically mediated by LLMs, execute complex, multi-step reasoning or automation by orchestrating a network of agents and tool invocations. AWR spans both the automated generation of such workflows from high-level specifications (white-box), and the reconstruction of underlying workflows from observed input–output behavior (black-box), targeting both transparency and efficient automation in domains where workflow logic is non-trivial, adaptive, or opaque (Zhao et al., 23 Nov 2025, Wang et al., 30 Sep 2025, Shi et al., 5 Feb 2026, Ye et al., 2023).

1. Theoretical Foundations and Formalization

AWR unifies several lines of work in process automation, agentic reasoning, and workflow search. Formally, an agentic workflow $\mathcal{W}$ is represented as a directed configuration of operator nodes acting on tasks $T$ , where each node $N_i = (M_i, P_i, \tau_i, F_i)$ specifies the LLM (model $M_i$ ), its prompt template $P_i$ , sampling temperature $\tau_i$ , and desired output format $F_i$ . Workflow edges define data or control dependencies (Zhao et al., 23 Nov 2025).

Given black-box input–output pairs $\mathcal{D} = \{(\tau_i, o_i^*)\}_{i=1}^N$ for a task, the AWR objective is to construct a workflow $\mathcal{W}$ such that, for any input $\tau$ , the composed execution $\Phi(\mathbf{s},\tau)$ yields outputs similar to the ground truth $o^*$ under a proxy similarity $\mathrm{Sim}$ : $\mathbf{s}^* = \arg\max_{\mathbf{s}\in\Omega^{\le L_{\max}}} \mathbb{E}_{(\tau,o^*)\sim\mathcal{D}}\Big[\mathrm{Sim}(\Phi(\mathbf{s},\tau),\,o^*)\Big]$ Here, $\mathbf{s}$ denotes a sequence of workflow primitives, each encoding an agentic operation with specified roles, models, reasoning patterns, and toolsets (Shi et al., 5 Feb 2026).

A crucial design is the abstraction of workflow nodes into reusable operators, enabling generalization and modularity. In A $^2$ Flow, operator extraction is formalized as a three-stage pipeline—case-based extraction, clustering/abstraction, and deep induction—yielding a library of domain-specific and cross-task execution operators (Zhao et al., 23 Nov 2025). DyFlow formalizes workflow evolution over discrete time steps, where each stage is a subgraph execution $G_t$ on system state $s_t$ , with iterative updates and branching conditioned on feedback (Wang et al., 30 Sep 2025).

2. Methodological Approaches

Multiple frameworks instantiate AWR using diverse methodologies:

Self-Adaptive Operator Induction and Search: A $^2$ Flow automatically discovers and abstracts operator templates from expert demonstrations, enhancing reusability and eliminating manual engineering bottlenecks. Operator memory is introduced to retain historical intermediate states, enabling nodes to leverage earlier outputs for context-aware transformation (Zhao et al., 23 Nov 2025).
Dynamic Design with Real-Time Feedback: DyFlow adopts a designer–executor split. A designer policy $\pi_\theta$ generates subgoal-directed subgraphs based on condensed summaries of state and feedback. Executors apply operator templates to realize atomic operations with persistent memory, feeding results and errors back to the designer for dynamic replanning or correction (Wang et al., 30 Sep 2025).
Combinatorial Black-Box Workflow Search: AgentXRay formulates AWR as a combinatorial optimization problem, synthesizing a stand-in workflow that matches the black-box agentic system using only I/O access. It employs Red-Black Pruned Monte Carlo Tree Search (MCTS), optimizing for high proxy output similarity while pruning suboptimal search branches, allowing efficient global optimization under tight iteration budgets (Shi et al., 5 Feb 2026).
Programmatic Prompt-Driven Construction: In ProAgent, stepwise LLM-driven planning translates natural language instructions into Pythonic workflow code, embedding LLM-queried agent nodes (e.g., DataAgent, ControlAgent) for dynamic data and control flow. All agent collaboration is encapsulated in the generated code's control and data graph (Ye et al., 2023).

3. Workflow Representation and Operator Abstraction

Key architectures represent agentic workflows as partial-ordered or chain-structured sequences or graphs of LLM-invoking operators:

Framework	Workflow Structure	Operator Abstraction Mechanism
A $^2$ Flow	DAG over operators	LLM-based multi-stage extraction
DyFlow	Sequence of subgraphs	Finite set of modular templates
AgentXRay	Chain (linear)	Unified primitives: agent role, model, reasoning, toolset
ProAgent	Python workflow code	Implicit via code-generation, JSON schema

A $^2$ Flow and DyFlow leverage modularity through operator abstraction and memory, supporting transfer and deep context dependencies. AgentXRay’s stand-in workflows use linearly composed primitives, each parameterized by agent role, model, and reasoning schema (Zhao et al., 23 Nov 2025, Wang et al., 30 Sep 2025, Shi et al., 5 Feb 2026, Ye et al., 2023).

4. Search, Learning, and Optimization Strategies

AWR is operationalized via heuristic and learning-based search mechanisms:

Monte Carlo Tree Search (MCTS): Used in both A $^2$ Flow and AgentXRay for workflow space optimization. AgentXRay’s Red-Black Pruning dynamically scores and colors nodes, prioritizing deeper, promising branches and allowing deeper search within iteration constraints. Empirical results indicate AgentXRay achieves substantially higher fidelity (proxy similarity avg. 0.426 versus 0.339 for unpruned AFlow) and explores deeper workflow structures, up to $L_{max}=6$ (Zhao et al., 23 Nov 2025, Shi et al., 5 Feb 2026).
Two-Stage Learning (SFT, Preference Optimization): DyFlow’s designer policy is initialized via supervised fine-tuning (SFT) on expert-annotated stage subgraphs, then refined via self-play preference optimization (KTO) that prefers policies yielding successful execution trajectories (Wang et al., 30 Sep 2025).
Iterative Prompting, CoT Refinement, and Clustering: A $^2$ Flow leverages long chain-of-thought (CoT) prompting and multi-path reasoning during operator extraction, clustering near-duplicate initial operators and using cross-path aggregation to distill generalizable stubs (Zhao et al., 23 Nov 2025).

5. Empirical Evaluation and Benchmarks

AWR frameworks undergo comprehensive empirical evaluation:

Diverse Domains: Code generation (HumanEval, MBPP), math reasoning (GSM8K, MATH_lc5), reading comprehension (HotpotQA, DROP), embodied tasks (ALFWorld), social reasoning (SocialMaze), biomedical QA (PubMedQA), and general scientific or industrial automation (Zhao et al., 23 Nov 2025, Wang et al., 30 Sep 2025, Shi et al., 5 Feb 2026).
Key Metrics: Proxy similarity (Static Functional Equivalence, CodeBLEU, BLEU/ROUGE/embedding cosine), accuracy/pass@1, F1, success rate, resource usage (token/cost efficiency).
Performance:
- A $^2$ Flow achieves up to 2.4% (general tasks) and 19.3% (embodied/game) absolute gains over prior automated workflow baselines, reducing resource usage by 37% (Zhao et al., 23 Nov 2025).
- DyFlow yields +8.01 pp in aggregate accuracy versus vanilla and modular prompting approaches (Wang et al., 30 Sep 2025).
- AgentXRay attains 25.7% relative gains in proxy fidelity over AFlow (avg. 0.426 vs. 0.339), and reduces token consumption by 8–22% under constrained search budgets (Shi et al., 5 Feb 2026).
Ablation Results: Removal of operator memory in A $^2$ Flow results in significant performance drops (e.g., –4.1% on MATH), confirming the necessity of context memory and operator abstraction (Zhao et al., 23 Nov 2025).

6. Limitations and Open Challenges

Several challenges and limitations have been identified:

Representation Bias and Linearity: Restricting workflows to linear or chain structures, as in AgentXRay, may fail to capture true concurrency or complex control/dataflow patterns of DAG-structured agentic systems (Shi et al., 5 Feb 2026).
Proxy Fidelity: Automated proxy metrics (SFE, AST matching) may overlook subtler effects and fail to guarantee behavioral equivalence, especially regarding side-effects or external interactions (Shi et al., 5 Feb 2026).
Operator Generalization: Manual or shallowly extracted operators limit scalability and domain transfer, an issue directly addressed in A $^2$ Flow via automated multi-stage abstraction (Zhao et al., 23 Nov 2025).
Feedback Bottlenecks: Frameworks lacking intermediate feedback integration (e.g., static plans) are more brittle to task drift or failure (Wang et al., 30 Sep 2025).
Evaluation Coverage: Large-scale, real-world benchmarks and systematic error/deviation analysis are not yet universally available; several studies remain at the proof-of-concept or moderate-scale evaluation phase (Ye et al., 2023).

7. Prospects and Extensions

Recent work proposes several directions for extending AWR:

Graph-Structured and Multi-Agent Workflows: Generalize from chains to full DAGs, allowing modeling of parallel, asynchronous, or multi-threaded agent interactions (Shi et al., 5 Feb 2026).
Rich Feedback and Learning Signals: Move beyond output-level proxy similarity to incorporate step-level or differentiable rewards and allow denser learning signals to guide workflow optimization (Shi et al., 5 Feb 2026).
Process Mining and Human-in-the-Loop Hybridization: Integrate real-world execution logs for operator grounding, and add human checkpoints or audits for critical decision branches, as advocated in ProAgent (Ye et al., 2023).
Parameter-Efficient Tuning and Robustness: Focused fine-tuning (e.g., RLHF/parameter-efficient adaptation on agent calls) to reduce branching errors and malicious behavior (Ye et al., 2023).
Benchmarking and Systematization: Develop comprehensive suites pairing task descriptions, reference workflows, and success metrics for rigorous, community-wide benchmarking (Zhao et al., 23 Nov 2025, Ye et al., 2023).

Agentic Workflow Reconstruction, as formalized by recent literature, constitutes the foundation for interpretable, modular, and adaptive agentic systems that close the transparency gap between opaque LLM-powered services and auditable, modifiable automation, with robust empirical advances in fidelity, efficiency, and task generalization (Zhao et al., 23 Nov 2025, Wang et al., 30 Sep 2025, Shi et al., 5 Feb 2026, Ye et al., 2023).