Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale

Published 2 Mar 2026 in cs.CL | (2603.02176v1)

Abstract: The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem. In this paper, we propose AgentSkillOS, the first principled framework for skill selection, orchestration, and ecosystem-level management. AgentSkillOS comprises two stages: (i) Manage Skills, which organizes skills into a capability tree via node-level recursive categorization for efficient discovery; and (ii) Solve Tasks, which retrieves, orchestrates, and executes multiple skills through DAG-based pipelines. To evaluate the agent's ability to invoke skills, we construct a benchmark of 30 artifact-rich tasks across five categories: data computation, document creation, motion video, visual design, and web interaction. We assess the quality of task outputs using LLM-based pairwise evaluation, and the results are aggregated via a Bradley-Terry model to produce unified quality scores. Experiments across three skill ecosystem scales (200 to 200K skills) show that tree-based retrieval effectively approximates oracle skill selection, and that DAG-based orchestration substantially outperforms native flat invocation even when given the identical skill set. Our findings confirm that structured composition is the key to unlocking skill potential. Our GitHub repository is available at: https://github.com/ynulihao/AgentSkillOS.


Summary

The paper introduces AgentSkillOS, a framework that organizes and benchmarks agent skills using a hierarchical capability tree and DAG-based orchestration.
It details a three-stage task-solving process (skill retrieval, DAG-based orchestration, and multi-skill execution) designed to unlock the compositional potential of large skill ecosystems.
Experimental results demonstrate that structured orchestration, particularly the Quality-First strategy, significantly outperforms flat invocation methods across diverse tasks.

AgentSkillOS: Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale

Introduction

The rapid expansion of agent skill ecosystems driven by third-party contributions poses significant challenges for skill selection, orchestration, and overall management. The paper "Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale" (2603.02176) introduces AgentSkillOS, a framework that formalizes and systematizes the organization and orchestration of skills at scale. AgentSkillOS addresses critical limitations in prior systems, which lack principled mechanisms for effective skill discovery, structured composition, and scalable management as the ecosystem grows in size and heterogeneity.

Framework Design: Capability Tree and DAG-based Skill Orchestration

AgentSkillOS operates via two clearly separated but interdependent processes: skill management and task resolution. Skills within the ecosystem are hierarchically organized into a capability tree through node-level recursive categorization implemented using LLMs. This hierarchical structure supports efficient, coarse-to-fine retrieval and discovery, surfacing both obvious and non-obvious skills relevant to specific tasks.
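To make the coarse-to-fine idea concrete, here is a minimal sketch of a capability tree built by recursive categorization. It is not the paper's implementation: `Node`, `build_tree`, `retrieve`, and the `max_leaf`/`max_depth` parameters are hypothetical names, and the LLM categorizer is replaced by a stub that simply reads a category from a slash-separated skill identifier.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    name: str
    skills: list[str] = field(default_factory=list)   # non-empty only at leaves
    children: dict[str, "Node"] = field(default_factory=dict)


def build_tree(node: Node, categorize: Callable[[str, int], str],
               max_leaf: int = 2, depth: int = 0, max_depth: int = 3) -> Node:
    """Recursively split a node's skills into child categories."""
    if len(node.skills) <= max_leaf or depth >= max_depth:
        return node  # small enough (or deep enough) to stay a leaf
    groups: dict[str, list[str]] = {}
    for skill in node.skills:
        # In the paper this decision is made by an LLM; here it is a stub.
        groups.setdefault(categorize(skill, depth), []).append(skill)
    if len(groups) <= 1:
        return node  # the categorizer cannot split further
    for label, members in groups.items():
        child = Node(label, members)
        node.children[label] = build_tree(child, categorize, max_leaf,
                                          depth + 1, max_depth)
    node.skills = []  # skills now live in the children
    return node


def retrieve(node: Node, choose: Callable[[Node], list[str]]) -> list[str]:
    """Coarse-to-fine lookup: descend only into the children `choose` selects."""
    if not node.children:
        return list(node.skills)
    found: list[str] = []
    for label in choose(node):
        found.extend(retrieve(node.children[label], choose))
    return found


# Hypothetical skill identifiers; the path segments stand in for LLM labels.
skill_ids = ["data/csv/clean", "data/csv/plot", "data/sql/query",
             "design/svg/logo", "design/svg/icon", "web/scrape/html"]
by_path = lambda skill, depth: skill.split("/")[depth]
root = build_tree(Node("root", skill_ids), by_path)

# Stub for task-driven branch selection (an LLM or embedding scorer in practice).
pick = lambda node: [label for label in node.children if label in {"data", "csv"}]
hits = retrieve(root, pick)
print(hits)
```

The point of the structure is that retrieval only inspects the children of the branches it descends into, so the number of candidates examined at each step stays small even when the total skill pool is large.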

Figure 1: The AgentSkillOS framework enables efficient skill retrieval, structured orchestration, and scalable ecosystem management for complex user tasks.

For user-specified tasks, AgentSkillOS implements a three-stage process for skill invocation:

Task-driven skill retrieval traverses the capability tree, leveraging LLM reasoning and embedding similarity to maximize recall of relevant skills with explicit deduplication and ranking.
DAG-based skill orchestration decomposes task resolution into interdependent sub-tasks, constructing an explicit DAG of skill invocations. Multiple orchestration strategies (Quality-First, Efficiency-First, Simplicity-First) generate alternative compositional plans reflecting distinct optimization objectives.
Multi-skill task execution executes the constructed DAG, enforcing inter-skill dependencies and enabling both sequential and parallel execution pathways.
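The orchestration and execution stages above can be sketched with Python's standard `graphlib` module. This is an illustrative assumption about how such a plan might be run, not the authors' executor: `run_dag`, the node names, and the lambda "skills" are all hypothetical. Each batch returned by `get_ready()` contains nodes whose dependencies are satisfied, so nodes within a batch are independent and could be dispatched in parallel.

```python
from graphlib import TopologicalSorter


def run_dag(deps, skills, task):
    """Execute a skill DAG.

    deps   maps each node to the set of nodes it depends on.
    skills maps each node to a callable(task, inputs) -> output, where
           inputs holds the outputs of the node's dependencies.
    Returns (results, waves); each wave lists nodes that were ready together.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    results, waves = {}, []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # independent nodes: parallelizable
        waves.append(ready)
        for node in ready:  # run sequentially here for simplicity
            inputs = {d: results[d] for d in deps.get(node, ())}
            results[node] = skills[node](task, inputs)
            sorter.done(node)
    return results, waves


# Hypothetical four-node plan: one fetch step feeds two independent steps,
# whose outputs are merged into a final report.
deps = {
    "fetch": set(),
    "chart": {"fetch"},
    "summary": {"fetch"},
    "report": {"chart", "summary"},
}
skills = {
    "fetch": lambda task, inp: [3, 1, 2],
    "chart": lambda task, inp: f"chart({sorted(inp['fetch'])})",
    "summary": lambda task, inp: sum(inp["fetch"]),
    "report": lambda task, inp: f"{task}: {inp['chart']} total={inp['summary']}",
}
results, waves = run_dag(deps, skills, "sales")
print(waves)
print(results["report"])
```

In this toy plan, `chart` and `summary` land in the same wave, which is the kind of parallel pathway the execution stage is described as enabling, while `report` waits for both.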

The combination of hierarchical retrieval and explicit orchestration mechanisms is shown to unlock the compositional potential of the ecosystem, overcoming the scalability bottlenecks and flat invocation limitations inherent to prior agent SDKs.

Benchmark and Evaluation Protocol

To quantify agent performance in large, heterogeneous skill ecosystems, the authors introduce a benchmark of 30 artifact-rich tasks covering Data Computation, Document Creation, Motion Video, Visual Design, and Web Interaction. The tasks are diverse, requiring the production of complex, multi-modal user-facing artifacts and spanning a range of skill compositional complexity.

Artifacts are evaluated through LLM-based pairwise judgment protocols, mitigating position bias by dual ordering and aggregating outcomes with a Bradley-Terry model to obtain continuous, fine-grained system quality scores.
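As a rough illustration of the aggregation step, the sketch below fits Bradley-Terry strengths from pairwise win counts using the standard minorization-maximization update. The win counts are invented for illustration and are assumed to already pool both presentation orders of each pair; how the paper handles ties and which fitting procedure it uses are not specified here, so ties are simply ignored in this sketch. The name `fit_bradley_terry` is hypothetical.

```python
def fit_bradley_terry(wins, systems, iters=200):
    """Estimate Bradley-Terry strengths p with P(i beats j) = p_i / (p_i + p_j).

    wins[(i, j)] is the number of comparisons in which i was preferred to j.
    Uses the classic MM update and normalizes strengths to sum to 1.
    """
    p = {s: 1.0 for s in systems}
    for _ in range(iters):
        new = {}
        for i in systems:
            total_wins = sum(wins.get((i, j), 0) for j in systems if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in systems if j != i
            )
            new[i] = total_wins / denom if denom > 0 else p[i]
        norm = sum(new.values())
        p = {s: v / norm for s, v in new.items()}
    return p


# Illustrative (made-up) counts for three systems, 10 judgments per pair.
systems = ["A", "B", "C"]
wins = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("A", "C"): 9, ("C", "A"): 1,
    ("B", "C"): 6, ("C", "B"): 4,
}
scores = fit_bradley_terry(wins, systems)
ranking = sorted(systems, key=scores.get, reverse=True)
print(ranking)
```

Unlike raw win rates, the fitted strengths place all systems on one continuous scale, which is what allows a single quality score per system even when comparisons are made pair by pair.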

Figure 2: The benchmark evaluation pipeline spans diverse artifact formats, conversion to LLM-evaluable form, robust pairwise comparison, and Bradley-Terry model-based ranking.

A detailed breakdown of benchmark tasks by category, number of required skills, file outputs, and output formats highlights the range and complexity addressed.

Figure 3: Benchmark tasks are systematically categorized, with complexity characterized along skill, artifact, and format axes.

Experimental Results and Analysis

Experiments are conducted on skill ecosystems of three sizes: 200 (manually curated), 1K, and 200K (sourced from open marketplaces and repositories).

Key findings:

The three AgentSkillOS variants (Quality-First, Efficiency-First, Simplicity-First) achieve the top scores in all settings, with the Quality-First variant consistently leading across all scales.
The traditional approach of supplying all skills directly to the agent (w/ Full Pool) fails to benefit from larger ecosystems, showing muted performance because of discoverability and invocation limitations.
Flat invocation strategies, even with access to ground-truth skill sets, are not competitive with DAG-based orchestration, underscoring the necessity of explicit structured composition for complex task resolution.

Per-category analyses further reveal that only structured orchestration yields robust coverage and high-quality outputs across diverse task types.

Figure 4: Category-level and overall quality scores (Bradley-Terry) demonstrate the robust advantage of AgentSkillOS variants across all domains and scales.

Ablation studies isolate the contributions of capability-tree retrieval and DAG-based orchestration. Removal of the latter, even with oracle skill selections, produces a clear performance gap, indicating that orchestration strategy is an independent and critical factor.

Figure 5: Ablation studies demonstrate that both capability-tree retrieval and DAG orchestration are essential; removing either one markedly worsens the win/tie/loss record.

The orchestration strategies, while operationally distinct, also result in structurally different execution graphs, as quantified by node/edge counts, width, and depth metrics. This confirms the framework's capacity to induce and expose meaningful trade-offs in execution profiles.
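The structural metrics named above can be computed from a plan's dependency map as in the following sketch. The exact definitions used in the paper are not given here, so this assumes depth is the number of nodes on the longest dependency chain and width is the largest number of nodes sharing the same level; `dag_profile` and the two example plans are hypothetical.

```python
from graphlib import TopologicalSorter


def dag_profile(deps):
    """Summarize a DAG given as node -> set of prerequisite nodes.

    Assumed definitions: a node's level is 1 + the maximum level of its
    prerequisites; depth is the maximum level; width is the size of the
    most populated level.
    """
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    level = {}
    # static_order() yields every prerequisite before the nodes that need it.
    for node in TopologicalSorter(deps).static_order():
        level[node] = 1 + max((level[d] for d in deps.get(node, ())), default=0)
    per_level = {}
    for lv in level.values():
        per_level[lv] = per_level.get(lv, 0) + 1
    return {
        "nodes": len(nodes),
        "edges": sum(len(ds) for ds in deps.values()),
        "depth": max(level.values(), default=0),
        "width": max(per_level.values(), default=0),
    }


# Two hypothetical plans for the same task: a linear chain and a fan-out/fan-in.
chain = dag_profile({"a": set(), "b": {"a"}, "c": {"b"}})
fan = dag_profile({"a": set(), "b": {"a"}, "c": {"a"}, "d": {"a"},
                   "e": {"b", "c", "d"}})
print(chain)
print(fan)
```

A narrow, shallow plan like the chain trades parallelism for simplicity, while the fan-shaped plan exposes more concurrent work at the cost of more nodes and edges; metrics of this kind are one way to compare plans produced under different orchestration objectives.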

Figure 6: The three DAG orchestration strategies yield structurally distinct plans, reflecting their differing optimization principles.

Qualitative analysis demonstrates that AgentSkillOS produces more professional, usable, and visually compelling artifacts than native agent SDKs lacking structured skill composition.

Figure 7: Case studies underscore large qualitative improvements: AgentSkillOS yields artifacts with high visual/design quality and domain specificity, in contrast to flat/vanilla baselines.

Implications and Future Directions

The implications are both practical and theoretical. At ecosystem scale, AgentSkillOS improves the practical usability and compositional generalization of agentic systems by addressing both the discovery and the structural coordination of skills. Practically, it resolves a core bottleneck in skill ecosystem usability for real-world creative and computational tasks. Theoretically, it supports the hypothesis that explicit orchestration (beyond flat invocation) is essential for maximizing agent utility within large, heterogeneous skill ecosystems.

The framework also opens avenues for further research:

Automated skill acquisition and continuous skill integration, removing the assumption of pre-collected skill pools.
Skill auto-evolution and self-modifying agents, leveraging the machine-readability of skills for knowledge refinement and failure correction.
Robustness and security issues in open skill ecosystems, where decentralized contributions pose problems for quality and governance.

Conclusion

The AgentSkillOS framework offers a principled, scalable approach to skill ecosystem management, demonstrating that hierarchical retrieval combined with explicit DAG-based skill orchestration is essential for fully leveraging large-scale agent skill ecosystems. Empirical results show that simply increasing skill pool size without structure is insufficient; the full compositional potential is unlocked only through structured retrieval and orchestration. This work provides an extensible foundation for future research on large-scale agent frameworks and skill-centric LLM systems (2603.02176).