Agentic Lybic: Multi-Agent Execution System with Tiered Reasoning and Orchestration

Published 14 Sep 2025 in cs.AI, cs.HC, and cs.MA | (2509.11067v2)

Abstract: Autonomous agents for desktop automation struggle with complex multi-step tasks due to poor coordination and inadequate quality control. We introduce Agentic Lybic, a novel multi-agent system where the entire architecture operates as a finite-state machine (FSM). This core innovation enables dynamic orchestration. Our system comprises four components: a Controller, a Manager, three Workers (Technician for code-based operations, Operator for GUI interactions, and Analyst for decision support), and an Evaluator. The critical mechanism is the FSM-based routing between these components, which provides flexibility and generalization by dynamically selecting the optimal execution strategy for each subtask. This principled orchestration, combined with robust quality gating, enables adaptive replanning and error recovery. Evaluated officially on the OSWorld benchmark, Agentic Lybic achieves a state-of-the-art 57.07% success rate in 50 steps, substantially outperforming existing methods. Results demonstrate that principled multi-agent orchestration with continuous quality control provides superior reliability for generalized desktop automation in complex computing environments.

Summary

The paper introduces Agentic Lybic and its tiered FSM architecture for robust and adaptive desktop automation.
It leverages a four-tier system comprising Controller, Manager, Worker, and Evaluator to manage multi-step task execution.
Experimental results on the OSWorld benchmark show a state-of-the-art 57.07% success rate within a 50-step budget.

The paper introduces Agentic Lybic, a multi-agent execution architecture for desktop automation in which the entire system operates as a finite-state machine (FSM). FSM-based routing between components selects an execution strategy for each subtask, and continuous quality gating supports replanning and error recovery in complex multi-step tasks.

System Architecture

Agentic Lybic's architecture is built on a four-tiered framework comprising a Controller, a Manager, a Worker subsystem, and an Evaluator. This design adopts finite-state machines (FSMs) to manage dynamic transitions, allowing for adaptive strategies contingent on task complexity and execution outcomes.

Figure 1: State transition diagram of Agentic Lybic showing the tiered orchestration workflow.

Central Controller

The Controller operates as the core orchestrator, managing state transitions and ensuring coherent interaction between components. It defines six situations (REPLAN, SUPPLEMENT, GET_ACTION, QUALITY_CHECK, FINAL_CHECK, and EXECUTE_ACTION) and moves between them based on execution outcomes. State transitions are determined by a state space model $S = (S_T, S_{ST}, S_E, C)$, enabling granular process control.
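To make the routing idea concrete, here is a minimal sketch of a table-driven controller over the six situations. Only the situation names and the gate outcome names (gate_fail, gate_continue, gate_supplement) come from the paper; the other trigger strings, the specific transitions, and the helper names `next_situation` and `run` are assumptions made for illustration, not the paper's actual transition table.

```python
from enum import Enum, auto


class Situation(Enum):
    """The six controller situations named in the paper."""
    REPLAN = auto()
    SUPPLEMENT = auto()
    GET_ACTION = auto()
    EXECUTE_ACTION = auto()
    QUALITY_CHECK = auto()
    FINAL_CHECK = auto()


# Hypothetical lookup table: (current situation, trigger) -> next situation.
# The concrete transitions are illustrative assumptions.
TRANSITIONS = {
    (Situation.REPLAN, "plan_ready"): Situation.GET_ACTION,
    (Situation.GET_ACTION, "action_generated"): Situation.EXECUTE_ACTION,
    (Situation.EXECUTE_ACTION, "action_executed"): Situation.QUALITY_CHECK,
    (Situation.QUALITY_CHECK, "gate_continue"): Situation.GET_ACTION,
    (Situation.QUALITY_CHECK, "gate_fail"): Situation.REPLAN,
    (Situation.QUALITY_CHECK, "gate_supplement"): Situation.SUPPLEMENT,
    (Situation.QUALITY_CHECK, "all_subtasks_done"): Situation.FINAL_CHECK,
    (Situation.SUPPLEMENT, "info_collected"): Situation.REPLAN,
    (Situation.FINAL_CHECK, "final_fail"): Situation.REPLAN,
}


def next_situation(current, trigger):
    """Look up the next situation; reject triggers with no defined transition."""
    try:
        return TRANSITIONS[(current, trigger)]
    except KeyError:
        raise ValueError(
            f"no transition from {current.name} on {trigger!r}"
        ) from None


def run(triggers, start=Situation.REPLAN):
    """Replay a sequence of triggers and return the visited situations."""
    path = [start]
    for trigger in triggers:
        path.append(next_situation(path[-1], trigger))
    return path


path = run(["plan_ready", "action_generated", "action_executed", "gate_fail"])
print(" -> ".join(s.name for s in path))
# prints: REPLAN -> GET_ACTION -> EXECUTE_ACTION -> QUALITY_CHECK -> REPLAN
```

Keeping the transitions in a single table means every legal move is visible in one place, and an unexpected trigger fails loudly instead of silently leaving the controller in an undefined state.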

Manager and Worker Subsystem

The Manager handles task decomposition using a DAG-based representation, facilitating parallel task execution and adaptive re-planning through triggers. The Worker subsystem includes Operators for GUI tasks, Technicians for system operations, and Analysts for decision support, each utilizing role-specific capabilities to execute actions effectively.
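The DAG-based decomposition can be sketched with Python's standard `graphlib` module. This is a minimal illustration under stated assumptions: the `Subtask` fields, the example plan, and the role assignments are hypothetical, and the paper's actual DAG schema is not reproduced here.

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass(frozen=True)
class Subtask:
    """Hypothetical subtask record: a name, a Worker role, and dependencies."""
    name: str
    role: str  # "Operator", "Technician", or "Analyst"
    depends_on: tuple = ()


def execution_order(subtasks):
    """Return subtask names in an order where dependencies come first."""
    graph = {s.name: set(s.depends_on) for s in subtasks}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # args[1] holds the detected cycle as a list of node names.
        raise ValueError(f"plan is not a DAG: {exc.args[1]}") from exc


# Illustrative plan; task names and role choices are assumptions.
PLAN = [
    Subtask("download_csv", "Technician"),
    Subtask("open_spreadsheet", "Operator"),
    Subtask("import_csv", "Operator", ("download_csv", "open_spreadsheet")),
    Subtask("summarize_columns", "Analyst", ("import_csv",)),
    Subtask("export_pdf", "Operator", ("summarize_columns",)),
]

roles = {s.name: s.role for s in PLAN}
for name in execution_order(PLAN):
    print(f"{name}: {roles[name]}")
```

Subtasks with no dependency between them, such as `download_csv` and `open_spreadsheet` here, may appear in either order, which is what leaves room for parallel execution.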

Evaluator

The Evaluator provides continuous quality assessment and early error detection, implementing a Gate Decision Framework with triggers for periodic checks, stagnation, and task completion. This proactive approach to error handling through adaptive feedback enables robust and efficient execution.
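The trigger logic can be sketched as follows, using the thresholds quoted in the glossary entry for PERIODIC_CHECK (a check every 5 execution steps; stagnation when identical actions are repeated more than 3 times or a subtask exceeds 15 actions). The function names, the returned trigger strings, and the reading of "repeated" as a consecutive run at the end of the history are assumptions for illustration.

```python
PERIOD = 5        # periodic check every 5 execution steps (from the paper)
MAX_REPEATS = 3   # more than 3 identical actions counts as stagnation
MAX_ACTIONS = 15  # more than 15 actions in one subtask counts as stagnation


def trailing_repeats(actions):
    """Length of the run of identical actions at the end of the history."""
    count = 0
    for action in reversed(actions):
        if action != actions[-1]:
            break
        count += 1
    return count


def evaluation_trigger(actions, worker_reports_done=False):
    """Decide which evaluation trigger, if any, fires for a subtask.

    Returns "completion", "stagnation", "periodic", or None. These labels
    are hypothetical; the Evaluator would then issue one of the gate
    decisions (gate_done, gate_fail, gate_continue, gate_supplement).
    """
    if worker_reports_done:
        return "completion"
    if trailing_repeats(actions) > MAX_REPEATS or len(actions) > MAX_ACTIONS:
        return "stagnation"
    if actions and len(actions) % PERIOD == 0:
        return "periodic"
    return None


print(evaluation_trigger(["click"] * 4))                    # prints: stagnation
print(evaluation_trigger(["open", "type", "tab", "type", "save"]))  # prints: periodic
print(evaluation_trigger(["open", "type"]))                 # prints: None
```

Checking stagnation before the periodic schedule means a stuck subtask is flagged as soon as it crosses a threshold, rather than waiting for the next multiple of five steps.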

Experimental Evaluation

The system's capabilities were evaluated on the OSWorld benchmark, a comprehensive desktop automation testbed consisting of 361 tasks spanning various applications.

Figure 2: OSWorld benchmark task complexity analysis.

Agentic Lybic achieved a new state-of-the-art 57.07% success rate within a 50-step budget, surpassing prior systems such as CoAct-1 and Agent S2.5. The results underscore its robustness in handling complex workflows, particularly in applications requiring multi-modal coordination.

Figure 3: Successful case demonstrating multi-modal task execution.

Robustness and Limitations

Agentic Lybic's robustness is attributed to its adaptive orchestration mechanisms that manage diverse execution scenarios efficiently. However, certain benchmark evaluation limitations, such as rigid criteria leading to misclassification of successful executions as failures, were noted (Figure 4). The system's execution model accommodates shifts in task demands, ensuring effective error recovery and consistent task fulfillment.

Figure 4: Failure case analysis demonstrating evaluation standard limitations.

Conclusion

Agentic Lybic significantly contributes to advancing desktop automation by integrating an FSM-based multi-agent system that enables real-time adaptation and robust error management. Its architecture not only boosts task success rates but also highlights the potential of organized multi-agent orchestration in solving diverse automation challenges in computing environments.

Future research may explore enhanced modalities, particularly in domains demanding real-time updates such as video manipulation and dynamic user interfaces, expanding the applicability of this architecture to scenarios necessitating collaborative interactions or distributed system coordination.

Glossary

Accessibility hooks: OS-level interfaces that expose UI element metadata for assistive technologies and automation; bypassed when doing pure vision-based screen understanding. "without relying on structured representations like DOM trees or accessibility hooks"
Action repertoire: The defined set of actions an agent can perform within a given modality (e.g., mouse, keyboard, navigation). "The Operator supports a comprehensive action repertoire including fundamental mouse operations (Click, DoubleClick, Move, Drag), keyboard interactions (TypeText, Hotkey), navigation controls (Scroll, SwitchApplications), and specialized functions for different contexts (SetCellValues for spreadsheets, Open for file operations)."
Agentic frameworks: Architectures that coordinate multiple specialized agents or modules to plan, reason, and act on complex tasks. "agentic frameworks focus on orchestrating multiple specialized components to leverage complementary strengths and achieve more robust performance on complex tasks."
Analyst: A specialized worker role providing decision support and analytical reasoning in multi-agent systems. "Analyst: Provides decision support and analytical capabilities for complex reasoning tasks."
Atomic evaluators: Minimal evaluation components used to build rule-based task verification logic. "which expresses each task as Boolean expressions built from the 134 atomic evaluators."
Autonomous agents: Software entities capable of perceiving, reasoning, and acting without continuous human intervention. "Autonomous agents for desktop automation struggle with complex multi-step tasks due to poor coordination and inadequate quality control."
Batch processing: Executing sequences of operations programmatically in bulk, often via scripts or command-line tools. "The Technician is particularly effective for file system operations, environment configuration, batch processing, and any tasks that can be accomplished more reliably through programmatic interfaces than GUI manipulation."
Boolean expressions: Logical formulas composed of boolean operators used for rule-based task verification. "We employ the rule-based evaluator provided by OSWorld, which expresses each task as Boolean expressions built from the 134 atomic evaluators."
Central Controller: The coordinating component that manages global state and orchestrates transitions in a multi-agent FSM. "The Central Controller manages six core situations (REPLAN, SUPPLEMENT, GET ACTION, QUALITY CHECK, FINAL CHECK, EXECUTE ACTION) with dynamic transitions based on execution outcomes."
Deterministic VM snapshot: A precise, reproducible virtual machine state capturing the initial conditions for task evaluation. "a deterministic VM snapshot capturing the initial desktop state"
Directed acyclic graph (DAG): A graph with directed edges and no cycles, used to model subtask dependencies and ordering. "The Manager then transforms the initial plan into a directed acyclic graph (DAG) representation with explicit structure:"
Error propagation: The compounding of small errors across multiple steps in a long sequence, often degrading outcomes. "brittleness in complex scenarios due to visual grounding ambiguity and accumulated error propagation over long sequences."
Executor: The component that actually carries out generated actions at the hardware or system interface level. "Specialized Workers (Operator, Technician, Analyst) execute actions through the Executor"
Finite-state machine (FSM): A formal model of computation with discrete states and transitions used for predictable workflow control. "the entire architecture operates as a finite-state machine (FSM)."
Foundation action model: A large, general-purpose model trained to produce actions across diverse platforms and GUI contexts. "training a foundation action model that generalizes across multiple platforms (Windows, Linux, MacOS, Android, and web)"
Gate decision framework: A structured mechanism that classifies execution status (e.g., done, fail, continue, supplement) to guide orchestration. "Gate Decision Framework: The Evaluator employs a comprehensive gate decision mechanism with four possible outcomes: gate_done (subtask completed successfully), gate_fail (execution failed, requires re-planning), gate_continue (execution in progress, continue current strategy), and gate_supplement (additional information needed)."
Graphical User Interfaces (GUIs): Visual interfaces enabling human-computer interaction via windows, icons, and widgets. "executing tasks through Graphical User Interfaces (GUIs)"
GUI grounding: Mapping language instructions to precise locations or elements on the screen to enable correct interaction. "substantially improve GUI grounding performance, particularly in out-of-distribution scenarios."
Graceful degradation: A robustness property where the system maintains partial functionality when components fail. "providing graceful degradation rather than complete system breakdown."
Hardware interface: The low-level layer through which actions are physically executed on a machine (e.g., mouse/keyboard control). "coordinates actual operation execution through the hardware interface"
Incremental clarification policy: A strategy to iteratively resolve visual or instruction ambiguities during GUI-heavy tasks. "an incremental clarification policy that systematically addresses visual ambiguity in GUI-dense environments."
Intractable tasks: Problems determined to be impossible or impractical to complete under current constraints. "task_impossible (clean termination for intractable tasks)."
Long-horizon tasks: Tasks requiring many steps and sustained coordination over extended sequences. "their handling of long-horizon tasks."
Memorize function: An operation that writes contextual information to shared artifacts for cross-component access. "a unique Memorize function that enables cross-component information sharing by writing contextual memories to shared artifacts for other modules to access."
Multimodal LLM (MLLM): A model that processes and reasons over multiple modalities, such as text and images. "using multimodal LLM (MLLM) judges for selection"
Operator: A specialized worker role that performs GUI interactions using visual grounding and action generation. "Operator: Manages GUI-based interactions using vision-LLMs for visual grounding and action generation."
Orchestrator: The component in a multi-agent system that dynamically delegates subtasks to appropriate agents. "features an Orchestrator that dynamically delegates subtasks between a GUI Operator and a specialized Programmer agent"
Out-of-distribution scenarios: Cases where inputs differ significantly from training data, challenging model generalization. "particularly in out-of-distribution scenarios."
PERIODIC_CHECK: A scheduled evaluation trigger to monitor progress and detect stagnation during execution. "PERIODIC_CHECK: Regular assessment every 5 execution steps to ensure consistent progress and stagnation detection when identical actions are repeated more than 3 times or single subtask execution exceeds 15 actions."
Planner-grounder paradigm: A modular approach that separates high-level planning from low-level screen action grounding. "The modular planner-grounder paradigm explicitly separates 'what to do' from 'where and how to act on screen.'"
Programmatic execution: Performing operations via code or scripts instead of GUI interactions. "hybrid systems that combine GUI manipulation with programmatic execution."
Quality gate system: A set of checks and triggers that continuously assess execution quality and determine next steps. "a comprehensive quality gate system with multiple intervention triggers that enable proactive error handling and adaptive re-planning"
Retrieval-Augmented Generation (RAG): A technique that enhances model outputs by retrieving external knowledge to supplement context. "Retrieval-Augmented Generation (RAG) systems"
Rule Engine: A subsystem that enforces operational constraints and monitors system health with configurable thresholds. "The Rule Engine continuously monitors system health through configurable thresholds: maximum state switches (default: 100), task runtime limits, and execution step boundaries."
Rule-based evaluator: An evaluation mechanism that verifies task completion via predefined logical rules. "We employ the rule-based evaluator provided by OSWorld"
Screen parsing: The process of analyzing raw pixel data to identify and locate interactive GUI elements. "screen parsing capabilities"
Semantic element extraction: Identifying and labeling meaningful interface elements (buttons, inputs) from visual data. "semantic element extraction"
Stagnation detection: A mechanism to identify lack of progress, often by repeated identical actions. "stagnation detection when identical actions are repeated more than 3 times"
State-aware orchestration: Coordination that selects strategies based on the current system and task state. "a state-aware orchestration framework that dynamically selects optimal execution strategies based on task characteristics and current system state"
State space: The set of all possible states a system can occupy during execution. "our state-driven orchestration framework (i.e., state transition), where each component operates within a well-defined state space"
State transition function: The formal mapping that determines the next state from the current state, action, and observation. "The state transition function is defined as:"
System-2 reasoning: Deliberative, step-by-step thinking processes contrasted with fast, intuitive System-1. "System-2 reasoning with explicit thought generation"
Technician: A specialized worker role that executes system-level operations via commands and scripts. "Technician: Handles system-level operations through terminal commands and script execution."
Test-time scaling: Improving robustness by sampling multiple candidate actions and selecting among them during inference. "test-time scaling, sampling multiple candidate actions and using multimodal LLM (MLLM) judges for selection"
Topological sorting: Ordering DAG nodes so that all dependencies precede dependents, producing a valid execution sequence. "Based on the DAG structure, the Manager performs topological sorting to generate the actual execution sequence"
Trigger code system: A lookup-table of triggers that drive state transitions and coordination among components. "The Controller employs a trigger code system (i.e., a state transition look-up-table) organized into ten primary categories"
Vision-LLMs: Models that jointly process visual and textual inputs to understand and act in GUI environments. "vision-LLMs enabling increasingly sophisticated interactions with visual elements"
Visual grounding ambiguity: Uncertainty in mapping instructions to exact on-screen targets due to clutter or similarity. "they suffer from brittleness in complex scenarios due to visual grounding ambiguity"
Worker subsystem: The set of specialized agent roles (Operator, Technician, Analyst) responsible for executing actions. "The Worker subsystem represents a significant advancement over traditional single-modality approaches by implementing three specialized execution roles, each optimized for specific types of operations:"
