
Terminal Agents Suffice for Enterprise Automation

Published 31 Mar 2026 in cs.SE, cs.AI, and cs.CL | (2604.00073v1)

Abstract: There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through graphical interfaces. Yet, it remains unclear whether such complex agentic systems are necessary given their cost and operational overhead. We argue that a coding agent equipped only with a terminal and a filesystem can solve many enterprise tasks more effectively by interacting directly with platform APIs. We evaluate this hypothesis across diverse real-world systems and show that these low-level terminal agents match or outperform more complex agent architectures. Our findings suggest that simple programmatic interfaces, combined with strong foundation models, are sufficient for practical enterprise automation.

Summary

  • The paper demonstrates that minimal terminal-based agents can match or exceed complex UI and tool-based agents in enterprise automation.
  • It employs direct API calls via StarShell in controlled experiments on ServiceNow, GitLab, and ERPNext, highlighting high success rates and cost efficiency.
  • The study reveals that persistent filesystem memory and accumulated skills significantly improve task performance while reducing computational overhead.

Terminal Agents for Enterprise Automation: An Evaluation of Direct Programmatic Interfaces

Introduction

The paper "Terminal Agents Suffice for Enterprise Automation" (2604.00073) provides a comprehensive empirical study of the efficacy of minimal terminal-based coding agents for automating enterprise-level workflows. Challenging the trend toward increasingly complex agentic stacks, namely GUI-driven web agents and tool-augmented systems like MCP, the authors posit that agents operating solely over a terminal and filesystem, directly invoking APIs, can match or even surpass these more elaborate paradigms across practical, production-grade enterprise environments.

Problem Setting and Motivation

Enterprise automation tasks demand agents capable of robust, long-horizon reasoning and reliable state manipulation in mission-critical platforms. Traditional approaches fall into two major paradigms: GUI agents employing web navigation and DOM interaction, and tool-augmented agents accessing curated, high-level tool registries (e.g., MCP). Both introduce additional abstraction layers, entailing costs in brittleness, operational overhead, and potential loss of expressivity—the latter due to restricted tool catalogues. The key research question is whether the availability of stable, expressive platform APIs is sufficient to render these abstractions unnecessary if leveraged via minimal programmatic (terminal-based) interfaces.

StarShell: Terminal-Based Agentic Framework

The proposed system, StarShell, exemplifies the terminal-agent paradigm. StarShell exposes:

  • A bash terminal interface—enabling arbitrary code/script execution and direct API interaction (predominantly via cURL);
  • Persistent filesystem for managing artifacts (e.g. intermediate outputs), accessing documentation, and accumulating agent-generated skills/procedures;
  • Optional read access to structured documentation and persistent skill memory.

    Figure 1: Overview of StarShell: a minimal terminal agent architecture leveraging direct API calls and persistent memory for enterprise automation.

The design is agnostic to the foundation model; experiments are controlled so that the only variable is the interaction paradigm.
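The core interaction loop amounts to the agent emitting shell commands, predominantly curl invocations against platform APIs. The sketch below illustrates how one such action might be composed; the endpoint, token, and helper name are illustrative assumptions, not StarShell's published internals:

```python
import json
import shlex

def build_api_call(method, url, token, payload=None):
    """Compose the curl command a terminal agent could emit for one API action."""
    parts = [
        "curl", "-sS", "-X", method, url,
        "-H", f"Authorization: Bearer {token}",
        "-H", "Content-Type: application/json",
    ]
    if payload is not None:
        parts += ["-d", json.dumps(payload)]  # JSON body for write operations
    return " ".join(shlex.quote(p) for p in parts)

# Example: update an incident's priority on a hypothetical ServiceNow-style API.
cmd = build_api_call(
    "PATCH",
    "https://example.service-now.com/api/now/table/incident/abc123",
    "TOKEN",
    {"priority": "1"},
)
print(cmd)
```

Quoting each argument with `shlex.quote` matters here: the paper notes that shell-quoting and JSON-escaping errors are a recurring runtime hazard that agents otherwise have to recover from.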

Benchmarks and Evaluation Framework

Benchmarks span three enterprise platforms: ServiceNow (ITSM), GitLab (SDLC), and ERPNext (ERP), each deployed in isolated, containerized environments with verified endpoints and datasets of realistic, compositional tasks. Tasks involve record manipulation, multi-step state updates, navigation, and conditional logic, closely reflecting real-world administrative and operational workflows.
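Task scoring in such a harness reduces to a programmatic post-condition check over the final platform state. A minimal sketch (the field names are illustrative, not the benchmark's actual schema):

```python
def verify_postcondition(record, expected):
    """A task counts as solved iff every expected field matches the final state."""
    return all(record.get(k) == v for k, v in expected.items())

# Final state of a (mock) incident record, as fetched back through the API.
final_state = {"state": "closed", "priority": "1", "assigned_to": "agent"}

print(verify_postcondition(final_state, {"state": "closed", "priority": "1"}))
print(verify_postcondition(final_state, {"state": "open"}))
```

Because verification reads the platform's own state rather than the agent's transcript, it is agnostic to which interaction paradigm produced the change.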

Agents are evaluated using several metrics:

  • Success Rate (SR): Percentage of tasks solved, per verified post-condition.
  • Cost: Model inference cost (a proxy for computational and monetary efficiency).
  • Auxiliary metrics (reported in the appendix): tool-call count and wall-clock time.
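As a toy illustration of how these metrics aggregate over episodes (the numbers below are invented for the example, not the paper's results):

```python
# Each episode records whether the verifier accepted the outcome and the
# model-inference cost; SR and mean cost summarize a paradigm's performance.
episodes = [
    {"solved": True,  "cost_usd": 0.04},
    {"solved": True,  "cost_usd": 0.06},
    {"solved": False, "cost_usd": 0.11},
    {"solved": True,  "cost_usd": 0.05},
]

success_rate = 100 * sum(e["solved"] for e in episodes) / len(episodes)
mean_cost = sum(e["cost_usd"] for e in episodes) / len(episodes)
print(f"SR = {success_rate:.0f}%, mean cost = ${mean_cost:.3f}")
```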

Experimental Results

Agent Paradigm Comparison

Terminal agents consistently achieve equivalent or superior SR relative to both web and MCP agents, frequently at 5–10x lower inference cost. Across all tested LLMs (Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.4 Thinking, Gemini 3.1 Pro), MCP agents systematically underperform, with SR seldom exceeding 49%. GUI agents reach high SRs, but at significant computational expense, especially on platforms with complex UIs (e.g., ServiceNow, ERPNext). Terminal agents maintain both high SR and minimal cost.

Figure 2: Example execution traces for ordering an iPad Pro: the MCP agent cannot reach the outcome; the web agent is confounded by UI complexity; the terminal agent recovers from errors and efficiently completes the order via direct API interaction.

Error Analysis

A granular analysis on ServiceNow confirms that failure modes for terminal agents are not attributable to unreliable API interaction (tool-call errors occur at similar rates in failed and successful episodes), but rather to structural limitations such as tasks requiring session-dependent UI actions or UI-bound state that the API cannot reach.


Figure 3: (Left) Histogram of tool calls per task for successful vs. failed episodes (ServiceNow). (Right) Breakdown of API error types across successful and failed tasks.

Documentation and Skills

Ablation studies isolate the effect of documentation access. Results are mixed: for ServiceNow, documentation access increases cost and can degrade performance, as the agent is often misled towards unnecessarily complex API flows. By contrast, for ERPNext, documentation provides critical field/discovery cues, increasing SR without efficiency loss. This variance is explained by the alignment (or lack thereof) between documentation structure and agent requirements.

Introducing skills—persistent, agent-generated procedural knowledge—improves both SR and efficiency, particularly for platforms with idiosyncratic, non-standardized field mappings or workflows (notably ERPNext). Skills amortize API exploration and encode shell/quoting workarounds, reducing redundant runtime error recovery.
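A filesystem-backed skill memory of this kind can be sketched as follows; the storage layout and example skill content are assumptions for illustration, since the paper does not publish StarShell's actual format:

```python
import json
import tempfile
from pathlib import Path

class SkillStore:
    """Persist small, verified procedures the agent can reload on later tasks."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, name, procedure, notes=""):
        path = self.root / f"{name}.json"
        path.write_text(json.dumps({"procedure": procedure, "notes": notes}))

    def load(self, name):
        path = self.root / f"{name}.json"
        return json.loads(path.read_text()) if path.exists() else None

store = SkillStore(Path(tempfile.mkdtemp()) / "skills")
store.save(
    "erpnext_create_invoice",
    "POST /api/resource/Sales%20Invoice",
    "The 'customer' field expects the document name, not the display label.",
)
print(store.load("erpnext_create_invoice")["procedure"])
```

Notes of this shape capture exactly the idiosyncratic field mappings and workflow quirks that the paper finds most costly to rediscover at runtime.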

Extensions and Limits

  • Hybrid Agents: While hybrid (terminal + browser) agents offer a superset of capabilities, their efficacy depends on the model’s ability to select the correct interface per subtask. With more powerful LLMs, hybrid agents improve SR, especially where tasks span both programmatic and UI-only features.
  • Multi-Agent Orchestration: Simple planner-executor systems provide marginal gains for weaker models on complex/long-horizon tasks, but cost efficiency favors the single-agent strategy when using frontier-capability LLMs.
  • Paradigm Boundaries: Tasks truly requiring rich UI interaction (e.g., session-based impersonation, chart visual extraction, drag-and-drop configuration) remain infeasible for terminal agents—suggesting the necessity of hybridization for full coverage, but not as the default mode.

Practical and Theoretical Implications

The principal finding—direct API-level programmatic interaction can suffice for large classes of enterprise automation tasks—calls into question the prevailing tendency toward monolithic, tool-heavy agentic stacks. This has several implications:

  • Platform-side Guidance: The findings reinforce the case for platform vendors to prioritize stable, well-documented, and maximally expressive programmatic interfaces (APIs), rather than focusing solely on automating GUIs or pre-building high-level agent-specific actions.
  • Agent Simplicity/Flexibility: Minimal agentic architectures optimize for efficiency, adaptability (dynamic field exploration, logic composition), and robustness to evolving APIs or system state.
  • Human-Agent Partnership: Persistent skill accumulation and knowledge sharing can be bootstrapped by both agents and human administrators, yielding more generalizable and reusable agent behaviors.

Outlook

Future work should address more extended-horizon tasks—spanning multi-platform coordination, persistent state management, and human-in-the-loop operations—with particular focus on the implications of scaling agentic memory, reasoning, and real-time policy enforcement. Expanded vertical coverage (HR, security, finance, IT ops) will further clarify the boundaries of terminal-agent sufficiency.

Conclusion

The empirical evidence demonstrates that terminal-based coding agents, leveraging direct API interaction in enterprise environments, can match or outperform more complex agentic workflows in both effectiveness and efficiency when expressive APIs are present. While hybrid architectures remain indispensable for inherently UI-bound automation, the structural simplicity and flexibility of terminal agents should be considered the default baseline for enterprise automation architecture.

Figure 4: StarShell’s web-based trace inspector: supports fine-grained error analysis and insight into agent reasoning and trajectory discovery.

Explain it Like I'm 14

What is this paper about?

This paper asks a simple question: Do we really need complicated AI “agents” that click around websites or use lots of custom tools to automate work at companies, or is a much simpler approach enough? The authors argue that a coding agent that uses only a terminal (a place to type commands) and a filesystem (to save files) can do many real company tasks by talking directly to the software’s APIs—and often do it better, cheaper, and more reliably.

What did the researchers want to find out?

They wanted to test, in plain terms:

  • Which is better for automating work in company software: an agent that clicks through web pages like a human, an agent that uses a set of pre-made tools, or a simple coding agent that talks directly to the software’s API?
  • Does reading official documentation help the agent?
  • Can the agent “learn” over time by saving tips and small scripts it discovers (what they call “skills”)?

How did they test their ideas?

They compared three kinds of agents on real tasks inside three popular kinds of enterprise software:

  • ServiceNow (IT helpdesk and workflows)
  • GitLab (software development and issues)
  • ERPNext (business operations like invoices and orders)

They made the agents solve the same sets of tasks, like finding a record, updating something, or completing a small workflow. Each agent used the same AI "brain" (an LLM), so the only difference was how they interacted:

  • Web agent: browses and clicks in a browser like a human.
  • Tool-augmented agent: uses a fixed set of pre-built tools (buttons) provided by a framework called MCP.
  • Terminal agent (their focus): types commands and writes small scripts to talk directly to APIs; it can also save files and read local docs.

They measured:

  • Success rate: What percent of tasks did the agent complete correctly?
  • Cost: How much AI “thinking” did it use (token cost), which is a good proxy for money and time.

What did they discover?

In short: the simple terminal agent usually matched or beat the others and did it at much lower cost.

Here’s what happened across many tasks and several top AI models:

  • Tool-augmented agents (pre-made tools) struggled most. They were cheap to run but had the lowest success. Reason: the tools didn’t cover everything real tasks needed, or were too rigid.
  • Web agents (clicking through the UI) often did well, but were expensive. They had to process lots of screen and page information, which drove up cost.
  • Terminal agents (typing commands and calling APIs) hit the sweet spot. They:
    • Matched or outperformed web agents in many cases
    • Cost far less per task (often 5–9× cheaper)
    • Were flexible, because they weren’t limited by fixed tools and didn’t pay the “overhead” of browsing

They also tested two add-ons for the terminal agent:

  • Access to documentation: Reading official docs didn’t always help. On some platforms it helped a bit; on others it slowed the agent down because the docs weren’t organized in the way agents need (more like a giant reference than a task guide).
  • “Skills” memory: Letting the agent save useful tips and scripts as it worked boosted success and cut cost over time—especially on less familiar platforms. Think of it like the agent keeping a personal notebook of tricks that it reuses later.

Why does this matter?

Companies want AI agents that can safely and reliably do work inside important systems without constant hand-holding. This paper shows:

  • You don’t always need a complex setup. If the software has good APIs, a simple terminal-based coding agent can handle many tasks well.
  • Less can be more. Fewer layers (no fragile UI clicking, no limited tool menus) can mean better performance and lower cost.
  • Documentation should be agent-friendly. Manuals that are organized around real tasks (not just long reference pages) help more.
  • Letting agents build their own “playbook” (skills) makes them better and cheaper over time.

What are the limits and future ideas?

The terminal agent isn’t perfect:

  • Some actions only work in a web browser (like certain user impersonations or drag-and-drop builders).
  • Some tasks involve reading things that only appear in the visual UI (like a chart’s labels or exact formatting).

A promising direction is a hybrid approach: use direct API calls when possible (fast and flexible), and switch to the browser only when needed.

The big takeaway

When enterprise software provides solid APIs, a simple coding agent that lives in the terminal—paired with a filesystem to save notes and scripts—can automate a lot of real work. It often matches or beats more complicated approaches, and costs much less. That means the best path to reliable, affordable automation may be simpler than many people think.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper. Each item is phrased to suggest actionable next steps for future research.

  • Benchmark representativeness: Quantify how task selection favors API-amenable workflows; add a controlled subset where critical steps are UI-only (e.g., impersonation flows, drag‑and‑drop editors) to measure selection bias.
  • Coverage of enterprise verticals: Extend beyond ServiceNow, GitLab, ERPNext to HRIS (e.g., Workday), IAM, finance (e.g., SAP/Oracle), security (e.g., SIEM/SOAR) to test cross-vertical generality.
  • Long-horizon, cross-system automations: Evaluate multi-stage workflows spanning multiple platforms, approvals, SLAs, and time delays (e.g., overnight batch jobs), including state persistence and resumption.
  • RBAC and organizational constraints: Systematically vary roles, permissions, and approval gates to assess agent reliability under least privilege and constrained entitlements.
  • UI-only capability gap: Quantify frequency and impact of tasks that require browser session state or complex UI interactions (charts, WYSIWYG/flow designers); report failure rates and workarounds for terminal agents.
  • Hybrid routing policies: Develop and evaluate learned or rule-based policies that route subtasks between API and GUI modes, with cost/latency-aware switching and confidence thresholds.
  • Robustness to API evolution: Measure degradation and recovery when endpoints, schemas, or auth flows change (version bumps, deprecations); test automatic adaptation using docs, error probing, and schema discovery.
  • Authentication complexity: Assess terminal-agent viability under OAuth2/OIDC, SSO, MFA, short-lived tokens, cookie-bound sessions, and device flow constraints common in enterprise deployments.
  • Network and platform variability: Stress-test under rate limits, throttling, transient failures, pagination quirks, and eventual consistency; evaluate retry/backoff strategies and idempotency safeguards.
  • Safety and harmful-action tests: Introduce adversarial tasks (e.g., mass deletions, PII exfiltration) with defense layers (RBAC, policy checks, dry-run/approval) to quantify risk and mitigation efficacy.
  • Side-effects and rollback: Go beyond binary success to detect unintended changes, verify idempotency, and test transactionality/rollback (e.g., two-phase commit patterns or compensating actions).
  • Cost and efficiency realism: Incorporate API latency, tool execution time, infrastructure compute, and platform licensing into a total cost of ownership analysis, not just token costs.
  • Reproducibility and variance: Report multi-seed runs, run-to-run variance, and statistical significance for all paradigms; include sensitivity to decoding parameters and tool nondeterminism.
  • Fairness of MCP baseline: Compare against enhanced MCP registries with broader coverage, compositional tools, and learned argument schemas; quantify gains from realistic tool engineering.
  • Strength of web-agent baselines: Evaluate stronger GUI agents (e.g., multimodal VLMs with dense DOM grounding, memory, better planning) to ensure fair comparisons.
  • Terminal environment constraints: Test under enterprise-like shells (restricted syscalls, egress proxies, certificate pinning), and measure impacts of quoting, sandboxing, and CIS hardening.
  • Documentation grounding design: Systematically vary RAG pipelines (chunking, retrieval, summarization) and document structure; test authoring guidelines that make docs agent-friendly vs user-oriented.
  • API discovery without docs: Evaluate schema introspection, hypermedia-driven navigation, and self-probing for unknown fields/endpoints; benchmark performance when docs are incomplete or stale.
  • Memory/skills governance: Study security, provenance, versioning, validation, and revocation of self-generated skills; assess risks of stale or harmful procedures being reused.
  • Skill generalization and sharing: Measure cross-task/platform transfer of skills, the benefits/risks of organization-wide skill repositories, and mechanisms for de-duplication and quality control.
  • Failure taxonomy at scale: Provide a comprehensive, cross-platform error taxonomy (beyond one model on one platform), with proportions and targeted fixes for each failure mode.
  • Human-in-the-loop workflows: Evaluate approval gating, selective confirmation, and exception handling; measure how minimal human oversight affects safety, success, and cost.
  • Async and event-driven tasks: Include tasks requiring scheduled jobs, webhooks, or long-running processes; assess monitoring, callback handling, and timeout/retry logic.
  • Non-API or low-API platforms: Test the approach on systems with weak/closed APIs where GUI/RPA is dominant; quantify how much GUI fallback is needed for practical coverage.
  • Security adversaries: Red-team against prompt injection via API data, command/JSON injection, SSRF in curl usage, and data-leak channels; evaluate mitigations (sanitization, allowlists).
  • Data privacy and compliance: Assess handling of PII/PHI, audit logging, and regulatory constraints (e.g., SOX/GDPR); measure auditability of actions taken by terminal agents.
  • Decision criteria for modality choice: Formalize when “terminal suffices” vs “GUI required” based on task features (auth, UI semantics, aggregation fidelity); derive predictive routing heuristics.
  • Comparative agent baselines: Include open-source coding agents (e.g., SWE-agent/OpenHands variants), and state-of-the-art tool-use frameworks to strengthen external validity.
  • Localization and i18n: Evaluate agents on non-English UIs/docs and localized field labels; test robustness to region-specific formats and encodings.
  • Dataset and environment release: Ensure full public release and independent replication; include exact platform versions, seeds, and container configs to validate claims.
  • Economic impact estimates: Model ROI versus human baselines, including setup/maintenance costs for skills, docs pipelines, and security controls in enterprise settings.
  • Maintainability under UI/API drift: Measure effort to maintain web vs terminal vs MCP agents over time as platforms evolve (e.g., breakage frequency, repair time).
  • Ethical deployment playbooks: Translate the high-level safety discussion into concrete deployment patterns (permissions, policy layers, monitoring, incident response) and benchmark their overhead.

Practical Applications

Below is an overview of practical, real-world applications enabled by the paper’s findings that terminal-based, API-first coding agents (e.g., StarShell) can match or outperform more complex agent stacks for enterprise automation at a fraction of the cost. Applications are grouped into those deployable today versus those requiring further development or scaling.

Immediate Applications

The following applications can be piloted or deployed now, contingent on stable APIs, appropriate permissions, and standard enterprise controls.

  • Enterprise ITSM automation via APIs (e.g., ServiceNow)
    • Sectors: IT operations, customer support
    • Potential tools/products/workflows: StarShell-like “Agent CLI” for incident triage, catalog order automation, bulk updates (e.g., state/priority), approvals, and reporting; Slack/Teams ChatOps integration to trigger API actions; scheduled jobs for clean-up and governance tasks
    • Assumptions/dependencies: Stable REST APIs and RBAC; audit logging; sandbox environments for testing; human-in-the-loop for high-risk changes
  • DevOps/SDLC task automation (e.g., GitLab)
    • Sectors: Software engineering, DevOps
    • Potential tools/products/workflows: Agent-driven repo provisioning, branch protection updates, merge request triage/labeling, pipeline re-runs, environment cleanups; “DevOps Assistant” that operates via GitLab/GitHub APIs; integration with existing ChatOps
    • Assumptions/dependencies: API tokens and permissions; rate-limit handling; secure secret management; change approval workflows
  • ERP operations automation (e.g., ERPNext)
    • Sectors: Manufacturing, retail, finance back office
    • Potential tools/products/workflows: Automated invoice creation, purchase orders, inventory adjustments, vendor/customer master updates; “ERP CLI Agent” with reusable API recipes (“skills”) for key document types
    • Assumptions/dependencies: Clear mapping between business fields and API fields; consistent doctype schemas; verification against accounting and compliance policies
  • API-first RPA alternative for brittle GUI bots
    • Sectors: Shared services, BPO, IT
    • Potential tools/products/workflows: Replace GUI screen-scraping bots with API-centric automations for tasks with solid API coverage; “API-first RPA” portfolio for cost and reliability gains
    • Assumptions/dependencies: Adequate API surface for target tasks; exceptions when tasks require UI-only state; change-management for bot migration
  • Runbook-to-skill conversion and internal skills libraries
    • Sectors: IT operations, SRE
    • Potential tools/products/workflows: Agents persist verified, reusable “skills” (API recipes, pitfalls, field mappings) that can be recalled across tasks; internal skills registries with versioning to accelerate future automations
    • Assumptions/dependencies: Knowledge management processes; skills curation and de-duplication; governance for sharing and approval
  • Cost-optimized agent deployments that minimize doc-dependency
    • Sectors: Any enterprise with common SaaS platforms
    • Potential tools/products/workflows: Configure agents to operate efficiently without documentation retrieval on well-known platforms to reduce token cost; selective doc access for less common systems
    • Assumptions/dependencies: Strong parametric knowledge in the chosen model; monitoring to ensure no accuracy regressions
  • Audit-friendly, programmatic automations
    • Sectors: Finance, healthcare, public sector
    • Potential tools/products/workflows: Agents that interact via APIs produce structured logs; verification hooks for state changes; policy-based approvals; integration with SIEM/GRC
    • Assumptions/dependencies: Access to platform audit logs and event streams; clear segregation of duties; approval policies codified as guardrails
  • Academic benchmarking and evaluation harness
    • Sectors: Academia, industrial research
    • Potential tools/products/workflows: Use containerized environments and programmatic verification to benchmark agent paradigms; evaluate cost-performance tradeoffs under identical backbones
    • Assumptions/dependencies: Environment setup and maintenance; standardized scoring; reproducibility protocols
  • ChatOps assistants for routine SaaS actions
    • Sectors: Cross-industry
    • Potential tools/products/workflows: Slack/Teams bots that accept natural language and execute API calls (e.g., “Create a P1 incident,” “Re-run CI on project X,” “Create PO for vendor Y”)
    • Assumptions/dependencies: Role-based permissions; rate limits; approval steps for sensitive actions
  • Procurement and platform-choice guidance (API-first)
    • Sectors: Enterprise IT, procurement
    • Potential tools/products/workflows: Introduce an “API sufficiency checklist” in software selection; pilot with platforms known to have strong APIs to unlock immediate automation value
    • Assumptions/dependencies: Vendor cooperation; internal policy updates; alignment with security teams
  • Power-user personal SaaS automations
    • Sectors: Daily life, SMBs
    • Potential tools/products/workflows: Local terminal agents for personal GitHub/Notion/Trello/Google Workspace tasks (e.g., issue triage, calendar ops, doc updates)
    • Assumptions/dependencies: API keys and scopes; careful handling of personal data; throttling awareness

Long-Term Applications

These opportunities benefit from further research, scaling, hybridization, or organizational change management.

  • Hybrid API + GUI agents for UI-only tasks
    • Sectors: Enterprise software, RPA
    • Potential tools/products/workflows: Orchestrators that prefer APIs but gracefully fall back to browser control for operations requiring session/cookies, rendered charts, or drag-and-drop editors
    • Assumptions/dependencies: Reliable browser automation; secure SSO/session handling; decision policies for when to switch modalities
  • Cross-platform, long-horizon enterprise agents
    • Sectors: IT operations, HR, finance, security
    • Potential tools/products/workflows: End-to-end workflows spanning multiple systems (e.g., joiner–mover–leaver across HRIS, ITSM, IAM; incident-to-change-to-deployment across ServiceNow/GitLab/cloud)
    • Assumptions/dependencies: Identity federation and least-privilege access; robust state management; approvals and audit
  • Organization-wide “skills marketplaces” for agents
    • Sectors: Consulting, large enterprises
    • Potential tools/products/workflows: Versioned, signed, and verified skill libraries that teams can reuse; curation, ranking, and provenance tracking; internal app-store model
    • Assumptions/dependencies: Governance and trust frameworks; compatibility with different platforms; process for deprecating stale skills
  • API quality standards and scoring for agent readiness
    • Sectors: Policy, procurement, software vendors
    • Potential tools/products/workflows: Industry or government standards for API stability, completeness, and documentation clarity; “Agent-Readiness Score” used in RFPs
    • Assumptions/dependencies: Vendor and buyer adoption; standardization bodies; alignment with security and privacy regulations
  • Safety and policy enforcement layers for autonomous agents
    • Sectors: Regulated industries (healthcare, finance, public sector)
    • Potential tools/products/workflows: Policy-as-code engines; multi-step approvals; anomaly detection; agent-in-the-loop guardrails for consequential changes
    • Assumptions/dependencies: Platform APIs that expose pre-commit checks; robust auditability; threat modeling for agent actions
  • Domain-specific vertical agents
    • Sectors: Healthcare, finance, energy, education
    • Potential tools/products/workflows:
    • Healthcare: FHIR-based EHR automations (order sets, scheduling, population health reports)
    • Finance: Post-trade reconciliations, treasury ops, ledger upkeep via APIs
    • Energy: Asset/CMMS updates, work orders in maintenance platforms
    • Education: SIS/LMS integrations for roster updates, grading pipelines
    • Assumptions/dependencies: Mature domain APIs; compliance (HIPAA, SOX, GDPR, FERPA); data governance and PHI/PII protections
  • Training and fine-tuning with agent traces
    • Sectors: Academia, AI product teams
    • Potential tools/products/workflows: Use command–observation–result traces to fine-tune models for better API calling, error recovery, and shell-quoting robustness
    • Assumptions/dependencies: Data rights and privacy; robust labeling; generalization beyond specific platforms
  • RPA migration programs at enterprise scale
    • Sectors: Shared services, BPO
    • Potential tools/products/workflows: Systematic replacement of brittle GUI bots with API-first agents; migration factories with ROI calculators and regression tests
    • Assumptions/dependencies: Adequate API coverage; strong change-management; fallback strategies for UI-only gaps
  • On-prem/edge agents for sensitive environments
    • Sectors: Defense, finance, healthcare
    • Potential tools/products/workflows: Agents running in secure enclaves or VPCs with sealed secrets and attestation; local LLMs for data residency
    • Assumptions/dependencies: Model availability/licensing; hardware and latency constraints; MLOps for secure updates
  • Agent reliability research and products
    • Sectors: AI platforms, developer tooling
    • Potential tools/products/workflows: Libraries that auto-harden API calls (retry, backoff, schema probing), solve shell quoting/JSON issues by construction, and standardize verification hooks
    • Assumptions/dependencies: Broad community adoption; benchmarks to measure reliability improvements
  • Government/enterprise policy emphasizing API-first digital services
    • Sectors: Public sector, enterprise IT governance
    • Potential tools/products/workflows: Policies that require machine-consumable APIs for new systems; incentives for retrofitting legacy services with APIs to enable automation and accessibility
    • Assumptions/dependencies: Budget and timeline for modernization; coexistence plans for legacy UI flows

These applications leverage the paper’s core insight: when stable, expressive APIs are available, minimal terminal-based coding agents can deliver high success rates at significantly lower cost than GUI agents, while avoiding the rigidity of curated tool registries. Feasibility hinges on API availability and quality, robust permissions and audit, and organizational readiness for API-first automation.

Glossary

  • accessibility trees: Structured representations of UI elements used by assistive technologies; in agents, processing these trees increases perceptual overhead. "processing large accessibility trees and screenshots"
  • API-first: An approach prioritizing direct interaction with application programming interfaces over GUI manipulation. "API-first and coding-agent approaches."
  • brittle action chains: Long sequences of UI actions that are fragile and prone to breaking when interfaces change. "long, brittle action chains that are sensitive to interface changes"
  • containerized instances: Isolated, reproducible runtime environments packaged as containers for deployment and evaluation. "All environments are deployed in containerized instances."
  • CRUD operations: Create, Read, Update, Delete actions that define basic data manipulation in APIs or databases. "generic CRUD operations over all document types"
  • curated action schemas: Predefined structured specifications that constrain how models may invoke tools or operations. "curated action schemas through frameworks like Model Context Protocol (MCP)"
  • curated tool registries: Catalogs of pre-approved tools/endpoints that agents can invoke, often limiting expressivity. "curated tool registries simplify invocation at the cost of restricting expressivity"
  • documentation corpus: A collected, locally accessible body of official docs for retrieval by agents during tasks. "a local documentation corpus accessible through the filesystem"
  • doctype: A document type identifier used by platforms like ERP systems to classify records/entities. "non-obvious doctype names"
  • DOM elements: Nodes in the Document Object Model representing parts of a web page that agents can click or inspect. "issuing low-level actions over DOM elements and screenshots"
  • Flow Designer: A graphical workflow builder (e.g., in ServiceNow) for composing automations via UI rather than public APIs. "through standard UI interactions in Flow Designer."
  • foundation models: Large pre-trained models serving as general-purpose backbones for downstream tasks. "simple programmatic interfaces, combined with strong foundation models"
  • frontier LLMs: The latest, most capable LLMs at the cutting edge of capability. "We evaluate multiple frontier LLM backbones"
  • generalist code agents: Agents that solve diverse tasks by writing and executing code directly, without heavy tool curation. "Recent generalist code agents such as Claude Code and OpenClaw"
  • GUI-driven agents: Agents that operate via the graphical user interface, simulating human browsing and clicking. "GUI-driven agents operate through web interfaces"
  • hybrid agent: An agent that combines multiple interaction modes (e.g., APIs and browsing) to handle diverse subtasks. "We explore this approach with a hybrid agent in Appendix~\ref{sec:hybrid}."
  • iframe-based UI: A web interface constructed with embedded frames, often complicating DOM-based navigation for agents. "becomes confused within the iframe-based UI"
  • impersonation: A feature allowing an administrator to act as another user within a platform, often tied to browser session state. "ServiceNow's impersonation feature"
  • LLM backbone: The underlying LLM powering an agent, shared across different agent interaction paradigms. "All agents use the same LLM backbone"
  • MCP server: A server implementing the Model Context Protocol, exposing tools/actions that an agent can call. "web agents operate through the official Playwright MCP server"
  • Model Context Protocol (MCP): A framework that standardizes tool exposure so models can invoke predefined operations. "Model Context Protocol (MCP)"
  • OpenAI Agents SDK: A software development kit used to build and run the agent implementations in the experiments. "All agents are implemented using the OpenAI Agents SDK"
  • parametric knowledge: Information encoded within a model’s learned parameters, as opposed to retrieved from external sources. "rely on their parametric knowledge"
  • Playwright: A browser automation framework used here via an MCP server to provide web interaction capabilities. "Playwright MCP server"
  • programmatic verification: Automated checking of task success by inspecting the live system state via APIs, rather than UI scripts. "programmatic verification against the live platform"
  • REPL-style loop: An iterative cycle of reasoning, executing code/commands, and inspecting outputs to guide next steps. "REPL-style loop of reasoning, execution, and environment inspection"
  • role-based access: A security model where permissions are granted based on a user’s role within an organization. "permissions, role-based access, or auditing"
  • sample-proportion estimator: A statistical formula for estimating the standard error of a binary success rate. "we estimate standard errors for SR using the sample-proportion estimator"
  • sandboxed environment: A restricted execution context that isolates the agent’s actions and filesystem from external systems. "Terminal agents operate in a sandboxed environment with terminal and filesystem access."
  • success rate (SR): The primary metric measuring the percentage of tasks completed successfully. "Our primary metric is success rate (SR)"
  • token usage: The number of input/output tokens consumed by the model, used here to estimate inference cost. "computed from the token usage of the underlying LLM."
  • wall-clock time: Real elapsed time to complete tasks, influenced by infrastructure and environment latencies. "wall-clock time depends on environment and infrastructure latency."
  • web agents: Agents that interact with systems through a browser, observing pages and taking UI actions. "Web agents operate through graphical interfaces"
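The sample-proportion estimator noted in the glossary reduces to a one-line computation, SE = sqrt(p(1 - p)/n) with p the observed success rate. A minimal sketch (the function name is illustrative):

```python
import math

def success_rate_se(successes: int, n: int) -> float:
    """Standard error of a binary success rate via the sample-proportion
    estimator: SE = sqrt(p * (1 - p) / n), where p = successes / n."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)

# e.g. 50 successes out of 100 tasks gives p = 0.5 and SE = 0.05
```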

Open Problems

The paper does not explicitly enumerate open problems.
