Agentic Lybic: Multi-Agent Execution System with Tiered Reasoning and Orchestration
Abstract: Autonomous agents for desktop automation struggle with complex multi-step tasks due to poor coordination and inadequate quality control. We introduce Agentic Lybic, a novel multi-agent system where the entire architecture operates as a finite-state machine (FSM). This core innovation enables dynamic orchestration. Our system comprises four components: a Controller, a Manager, three Workers (Technician for code-based operations, Operator for GUI interactions, and Analyst for decision support), and an Evaluator. The critical mechanism is the FSM-based routing between these components, which provides flexibility and generalization by dynamically selecting the optimal execution strategy for each subtask. This principled orchestration, combined with robust quality gating, enables adaptive replanning and error recovery. Evaluated officially on the OSWorld benchmark, Agentic Lybic achieves a state-of-the-art 57.07% success rate in 50 steps, substantially outperforming existing methods. Results demonstrate that principled multi-agent orchestration with continuous quality control provides superior reliability for generalized desktop automation in complex computing environments.
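As a rough illustration of the FSM-based routing described in the abstract, the sketch below maps an Evaluator gate outcome to the Controller's next state. The six state names and four gate outcomes are the ones quoted from the paper in the glossary. The specific transitions (in particular what follows gate_done), the `next_state` function, and its `subtasks_remaining` parameter are assumptions made for illustration, not the authors' implementation.

```python
from enum import Enum


class State(Enum):
    # The six Controller states named in the paper (underscores added for identifiers).
    REPLAN = "REPLAN"
    SUPPLEMENT = "SUPPLEMENT"
    GET_ACTION = "GET_ACTION"
    QUALITY_CHECK = "QUALITY_CHECK"
    FINAL_CHECK = "FINAL_CHECK"
    EXECUTE_ACTION = "EXECUTE_ACTION"


class Gate(Enum):
    # The four Evaluator gate outcomes named in the paper.
    DONE = "gate_done"
    FAIL = "gate_fail"
    CONTINUE = "gate_continue"
    SUPPLEMENT = "gate_supplement"


def next_state(gate: Gate, subtasks_remaining: int) -> State:
    """Hypothetical routing after a quality check; every transition here is assumed."""
    if gate is Gate.FAIL:
        return State.REPLAN        # execution failed, so re-plan
    if gate is Gate.SUPPLEMENT:
        return State.SUPPLEMENT    # additional information is needed
    if gate is Gate.CONTINUE:
        return State.GET_ACTION    # continue with the current strategy
    # gate_done: assumed to advance to the next subtask, or to a final check
    # once no subtasks remain.
    return State.GET_ACTION if subtasks_remaining > 0 else State.FINAL_CHECK
```

Keeping the routing in one pure function mirrors the idea of a state transition lookup table: the mapping can be inspected and tested without running any agent.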
Glossary
- Accessibility hooks: OS-level interfaces that expose UI element metadata for assistive technologies and automation; bypassed when doing pure vision-based screen understanding. "without relying on structured representations like DOM trees or accessibility hooks"
- Action repertoire: The defined set of actions an agent can perform within a given modality (e.g., mouse, keyboard, navigation). "The Operator supports a comprehensive action repertoire including fundamental mouse operations (Click, DoubleClick, Move, Drag), keyboard interactions (TypeText, Hotkey), navigation controls (Scroll, SwitchApplications), and specialized functions for different contexts (SetCellValues for spreadsheets, Open for file operations)."
- Agentic frameworks: Architectures that coordinate multiple specialized agents or modules to plan, reason, and act on complex tasks. "agentic frameworks focus on orchestrating multiple specialized components to leverage complementary strengths and achieve more robust performance on complex tasks."
- Analyst: A specialized worker role providing decision support and analytical reasoning in multi-agent systems. "Analyst: Provides decision support and analytical capabilities for complex reasoning tasks."
- Atomic evaluators: Minimal evaluation components used to build rule-based task verification logic. "which expresses each task as Boolean expressions built from the 134 atomic evaluators."
- Autonomous agents: Software entities capable of perceiving, reasoning, and acting without continuous human intervention. "Autonomous agents for desktop automation struggle with complex multi-step tasks due to poor coordination and inadequate quality control."
- Batch processing: Executing sequences of operations programmatically in bulk, often via scripts or command-line tools. "The Technician is particularly effective for file system operations, environment configuration, batch processing, and any tasks that can be accomplished more reliably through programmatic interfaces than GUI manipulation."
- Boolean expressions: Logical formulas composed of boolean operators used for rule-based task verification. "We employ the rule-based evaluator provided by OSWorld, which expresses each task as Boolean expressions built from the 134 atomic evaluators."
- Central Controller: The coordinating component that manages global state and orchestrates transitions in a multi-agent FSM. "The Central Controller manages six core situations (REPLAN, SUPPLEMENT, GET ACTION, QUALITY CHECK, FINAL CHECK, EXECUTE ACTION) with dynamic transitions based on execution outcomes."
- Deterministic VM snapshot: A precise, reproducible virtual machine state capturing the initial conditions for task evaluation. "a deterministic VM snapshot capturing the initial desktop state"
- Directed acyclic graph (DAG): A graph with directed edges and no cycles, used to model subtask dependencies and ordering. "The Manager then transforms the initial plan into a directed acyclic graph (DAG) representation with explicit structure:"
- Error propagation: The compounding of small errors across multiple steps in a long sequence, often degrading outcomes. "brittleness in complex scenarios due to visual grounding ambiguity and accumulated error propagation over long sequences."
- Executor: The component that actually carries out generated actions at the hardware or system interface level. "Specialized Workers (Operator, Technician, Analyst) execute actions through the Executor"
- Finite-state machine (FSM): A formal model of computation with discrete states and transitions used for predictable workflow control. "the entire architecture operates as a finite-state machine (FSM)."
- Foundation action model: A large, general-purpose model trained to produce actions across diverse platforms and GUI contexts. "training a foundation action model that generalizes across multiple platforms (Windows, Linux, MacOS, Android, and web)"
- Gate decision framework: A structured mechanism that classifies execution status (e.g., done, fail, continue, supplement) to guide orchestration. "Gate Decision Framework: The Evaluator employs a comprehensive gate decision mechanism with four possible outcomes: gate_done (subtask completed successfully), gate_fail (execution failed, requires re-planning), gate_continue (execution in progress, continue current strategy), and gate_supplement (additional information needed)."
- Graphical User Interfaces (GUIs): Visual interfaces enabling human-computer interaction via windows, icons, and widgets. "executing tasks through Graphical User Interfaces (GUIs)"
- GUI grounding: Mapping language instructions to precise locations or elements on the screen to enable correct interaction. "substantially improve GUI grounding performance, particularly in out-of-distribution scenarios."
- Graceful degradation: A robustness property where the system maintains partial functionality when components fail. "providing graceful degradation rather than complete system breakdown."
- Hardware interface: The low-level layer through which actions are physically executed on a machine (e.g., mouse/keyboard control). "coordinates actual operation execution through the hardware interface"
- Incremental clarification policy: A strategy to iteratively resolve visual or instruction ambiguities during GUI-heavy tasks. "an incremental clarification policy that systematically addresses visual ambiguity in GUI-dense environments."
- Intractable tasks: Problems determined to be impossible or impractical to complete under current constraints. "task_impossible (clean termination for intractable tasks)."
- Long-horizon tasks: Tasks requiring many steps and sustained coordination over extended sequences. "their handling of long-horizon tasks."
- Memorize function: An operation that writes contextual information to shared artifacts for cross-component access. "a unique Memorize function that enables cross-component information sharing by writing contextual memories to shared artifacts for other modules to access."
- Multimodal LLM (MLLM): A model that processes and reasons over multiple modalities, such as text and images. "using multimodal LLM (MLLM) judges for selection"
- Operator: A specialized worker role that performs GUI interactions using visual grounding and action generation. "Operator: Manages GUI-based interactions using vision-LLMs for visual grounding and action generation."
- Orchestrator: The component in a multi-agent system that dynamically delegates subtasks to appropriate agents. "features an Orchestrator that dynamically delegates subtasks between a GUI Operator and a specialized Programmer agent"
- Out-of-distribution scenarios: Cases where inputs differ significantly from training data, challenging model generalization. "particularly in out-of-distribution scenarios."
- PERIODIC_CHECK: A scheduled evaluation trigger to monitor progress and detect stagnation during execution. "PERIODIC_CHECK: Regular assessment every 5 execution steps to ensure consistent progress and stagnation detection when identical actions are repeated more than 3 times or single subtask execution exceeds 15 actions."
- Planner-grounder paradigm: A modular approach that separates high-level planning from low-level screen action grounding. "The modular planner-grounder paradigm explicitly separates 'what to do' from 'where and how to act on screen.'"
- Programmatic execution: Performing operations via code or scripts instead of GUI interactions. "hybrid systems that combine GUI manipulation with programmatic execution."
- Quality gate system: A set of checks and triggers that continuously assess execution quality and determine next steps. "a comprehensive quality gate system with multiple intervention triggers that enable proactive error handling and adaptive re-planning"
- Retrieval-Augmented Generation (RAG): A technique that enhances model outputs by retrieving external knowledge to supplement context. "Retrieval-Augmented Generation (RAG) systems"
- Rule Engine: A subsystem that enforces operational constraints and monitors system health with configurable thresholds. "The Rule Engine continuously monitors system health through configurable thresholds: maximum state switches (default: 100), task runtime limits, and execution step boundaries."
- Rule-based evaluator: An evaluation mechanism that verifies task completion via predefined logical rules. "We employ the rule-based evaluator provided by OSWorld"
- Screen parsing: The process of analyzing raw pixel data to identify and locate interactive GUI elements. "screen parsing capabilities"
- Semantic element extraction: Identifying and labeling meaningful interface elements (buttons, inputs) from visual data. "semantic element extraction"
- Stagnation detection: A mechanism that identifies a lack of progress, often by detecting repeated identical actions. "stagnation detection when identical actions are repeated more than 3 times"
- State-aware orchestration: Coordination that selects strategies based on the current system and task state. "a state-aware orchestration framework that dynamically selects optimal execution strategies based on task characteristics and current system state"
- State space: The set of all possible states a system can occupy during execution. "our state-driven orchestration framework (i.e., state transition), where each component operates within a well-defined state space"
- State transition function: The formal mapping that determines the next state from the current state, action, and observation. "The state transition function is defined as:"
- System-2 reasoning: Deliberative, step-by-step thinking processes contrasted with fast, intuitive System-1. "System-2 reasoning with explicit thought generation"
- Technician: A specialized worker role that executes system-level operations via commands and scripts. "Technician: Handles system-level operations through terminal commands and script execution."
- Test-time scaling: Improving robustness by sampling multiple candidate actions and selecting among them during inference. "test-time scaling, sampling multiple candidate actions and using multimodal LLM (MLLM) judges for selection"
- Topological sorting: Ordering DAG nodes so that all dependencies precede dependents, producing a valid execution sequence. "Based on the DAG structure, the Manager performs topological sorting to generate the actual execution sequence"
- Trigger code system: A lookup table of triggers that drive state transitions and coordination among components. "The Controller employs a trigger code system (i.e., a state transition look-up-table) organized into ten primary categories"
- Vision-LLMs: Models that jointly process visual and textual inputs to understand and act in GUI environments. "vision-LLMs enabling increasingly sophisticated interactions with visual elements"
- Visual grounding ambiguity: Uncertainty in mapping instructions to exact on-screen targets due to clutter or similarity. "they suffer from brittleness in complex scenarios due to visual grounding ambiguity"
- Worker subsystem: The set of specialized agent roles (Operator, Technician, Analyst) responsible for executing actions. "The Worker subsystem represents a significant advancement over traditional single-modality approaches by implementing three specialized execution roles, each optimized for specific types of operations:"
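
The PERIODIC_CHECK and stagnation detection entries above quote concrete thresholds: an assessment every 5 execution steps, identical actions repeated more than 3 times, and a single subtask exceeding 15 actions. A minimal sketch of how such a trigger could be computed follows. The function names, the choice to count steps per subtask, the reading of "repeated" as consecutive repeats at the end of the history, and the use of strings to represent actions are all assumptions for illustration, not the paper's code.

```python
from typing import List

CHECK_INTERVAL = 5        # periodic assessment every 5 execution steps
MAX_REPEATS = 3           # stagnation: identical action repeated more than 3 times
MAX_SUBTASK_ACTIONS = 15  # stagnation: one subtask exceeds 15 actions


def trailing_repeats(actions: List[str]) -> int:
    """Count how many times the most recent action occurs consecutively at the end."""
    if not actions:
        return 0
    count = 0
    for action in reversed(actions):
        if action != actions[-1]:
            break
        count += 1
    return count


def should_trigger_check(subtask_actions: List[str]) -> bool:
    """Return True when a quality check is due for the current subtask (assumed logic)."""
    steps = len(subtask_actions)
    if steps == 0:
        return False
    periodic = steps % CHECK_INTERVAL == 0
    repeated = trailing_repeats(subtask_actions) > MAX_REPEATS
    too_long = steps > MAX_SUBTASK_ACTIONS
    return periodic or repeated or too_long
```

For example, under these assumptions a history of four identical Click actions triggers a check through the repeat rule, while three distinct actions do not trigger any rule.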