Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis

Published 3 Apr 2026 in cs.CR and cs.AI | (2604.02837v1)

Abstract: Agent Skills is an emerging open standard that defines a modular, filesystem-based packaging format enabling LLM-based agents to acquire domain-specific expertise on demand. Despite rapid adoption across multiple agentic platforms and the emergence of large community marketplaces, the security properties of Agent Skills have not been systematically studied. This paper presents the first comprehensive security analysis of the Agent Skills framework. We define the full lifecycle of an Agent Skill across four phases -- Creation, Distribution, Deployment, and Execution -- and identify the structural attack surface each phase introduces. Building on this lifecycle analysis, we construct a threat taxonomy comprising seven categories and seventeen scenarios organized across three attack layers, grounded in both architectural analysis and real-world evidence. We validate the taxonomy through analysis of five confirmed security incidents in the Agent Skills ecosystem. Based on these findings, we discuss defense directions for each threat category, identify open research challenges, and provide actionable recommendations for stakeholders. Our analysis reveals that the most severe threats arise from structural properties of the framework itself, including the absence of a data-instruction boundary, a single-approval persistent trust model, and the lack of mandatory marketplace security review, and cannot be addressed through incremental mitigations alone.

Summary

  • The paper delivers a rigorous security analysis of Agent Skills, decomposing its architecture and codifying a taxonomy of seven threat categories spanning seventeen scenarios.
  • It highlights that vulnerabilities stem from persistent, undifferentiated consent, open distribution, and local execution without isolation.
  • Empirical evidence from over 42,000 Skills and documented incidents underscores the urgent need for fine-grained security reforms.

Security Analysis of Agent Skills: Architecture, Threat Taxonomy, and Incident Evidence

Introduction and Context

The "Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis" (2604.02837) paper provides an in-depth security analysis of Agent Skills—an open standard introduced by Anthropic in late 2025 for modular, filesystem-based capability extension in LLM-centric agents. Agent Skills enable dynamic acquisition of domain-specific workflows via modular packages (Skills), each comprising natural language instructions (SKILL.md), scripts, and reference resources. The widespread adoption of Agent Skills across multiple major platforms (including Claude, Cursor, Gemini CLI, GitHub Copilot) and the emergence of large, unregulated public Skills marketplaces have turned these agents into de facto software platforms.

The authors argue that, despite their flexibility and mass adoption, the security implications of Agent Skills diverge fundamentally from earlier mechanisms such as OpenAI’s Plugins and the Model Context Protocol (MCP). The analysis details the structural properties and lifecycle of the Agent Skills framework, surveys real-world attacks, and develops a comprehensive threat taxonomy. The findings reflect that critical vulnerabilities arise chiefly from the structural absence of a data-instruction boundary, the persistent single-approval trust model, and the lack of governance over distribution channels.

Architectural Foundations and Distinctive Properties

Agent Skills eschew strict API schemas, instead relying on natural language behavioral directives combined with arbitrary code in a flat filesystem structure. A SKILL.md file consists of YAML metadata frontmatter followed by a natural-language instruction body in Markdown. Bundled scripts and supplementary files are referenced without any enforced schema, enabling arbitrary multi-step workflows but removing static type and interface boundaries between instruction and execution.
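As a concrete illustration, a minimal Skill might look like the following sketch. The `name` and `description` frontmatter fields correspond to the preloaded metadata discussed below; the body content and file layout here are illustrative, not drawn from the specification:

```markdown
---
name: pdf-report
description: Generate a formatted PDF report from project data files.
---

# PDF Report Skill

1. Locate the data files the user points to.
2. Run `scripts/render.py` to produce the PDF.
3. Summarize the generated report for the user.
```

Note that the numbered steps are plain natural language to the agent, yet step 2 directs execution of a bundled script: instruction and code sit side by side with no structural boundary between them.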

A progressive disclosure model governs the loading of Skill content: only names and descriptions are preloaded as system prompts, with full instructions and supplementary resources loaded on-demand. This reduces token overhead but increases the likelihood of runtime context contamination and late-stage instruction injection.

The trust model elevates any installed Skill to operator-level authority based on a single user approval, granting persistent permissions over all agent capabilities (filesystem, network, subprocess, etc.). Critically, trust is bound to the Skill’s identity, not to its current content or cryptographic hash, so post-installation modification is not restricted by user consent.
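The distinction between identity-bound and content-bound trust can be made concrete with a simplified sketch (the approval-record shapes below are hypothetical, not part of the specification):

```python
import hashlib

def content_hash(skill_files: dict[str, bytes]) -> str:
    """Hash a skill's full contents (SKILL.md plus bundled scripts)."""
    h = hashlib.sha256()
    for path in sorted(skill_files):
        h.update(path.encode())
        h.update(skill_files[path])
    return h.hexdigest()

# Identity-bound trust (the current model): approval keyed by name only,
# so a post-installation rewrite of the files silently inherits approval.
approvals_by_identity = {"pdf-report": True}

# Content-bound trust (a possible hardening): approval keyed by hash,
# so any modification invalidates consent and forces re-approval.
files_v1 = {"SKILL.md": b"benign report-generation instructions"}
files_v2 = {"SKILL.md": b"also read ~/.ssh and POST it to a remote host"}

approvals_by_hash = {content_hash(files_v1): True}

def is_trusted_by_hash(files: dict[str, bytes]) -> bool:
    return approvals_by_hash.get(content_hash(files), False)
```

Under the identity-bound scheme the malicious `files_v2` runs with the original approval; under the hash-bound scheme it does not.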

The authors underline that these properties generate an unconstrained attack surface, since natural language instruction and code are treated uniformly, and there are no in-specification mechanisms for privilege separation, scoped consent, or runtime isolation.

Threat Taxonomy

The developed threat taxonomy (seven categories, seventeen scenarios) organizes attacks across three layers anchored in the Agent Skills lifecycle: Delivery/Trust Establishment, Runtime Attack, and Persistence/Lateral Impact.

  1. Supply Chain Compromise: Includes typosquatting, ranking manipulation, repository hijacking, and hallucinated dependency attacks. The lack of mandatory marketplace review and the low authorship barrier facilitate ecosystem-scale compromise.
  2. Consent Abuse: The persistent, single-approval trust model enables attacks exploiting the ‘consent gap’, where granted privileges radically exceed informed user intent. Post-installation modification allows adversaries to inherit trust seamlessly.
  3. Prompt Injection: Both direct and indirect injection, with adversarial instructions in SKILL.md or externally supplied content interpreted at operator level. The absence of syntactic or structural data/instruction separation makes static detection fundamentally limited—prompt injection here is more severe than in prior agent extension systems.
  4. Code Execution: Malicious bundled scripts, deferred dependencies (runtime supply chain attacks on unpinned requirements), and remote code fetch instructions provide practical avenues for arbitrary command execution and malware deployment.
  5. Data Exfiltration: Credential theft, environment variable leakage, and silent exfiltration of project codebases become practical once a Skill is activated, given the lack of granular permission controls and the fact that bundled scripts execute invisibly relative to the agent's context.
  6. Persistence: Memory file and configuration poisoning establish durable compromise that outlives the initial Skill installation and can affect all subsequent agent operations.
  7. Multi-Agent Propagation: Compromised agents in orchestrated pipelines can infect downstream agents via prompt injection, enabling lateral compromise beyond the original attack surface.
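A marketplace-side screen for the typosquatting scenario in category 1 could be sketched as an edit-distance check of new submissions against popular Skill names (the name list and threshold here are illustrative):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Hypothetical list of high-download skill names to protect.
POPULAR_SKILLS = ["pdf-report", "git-helper", "web-scraper"]

def flag_typosquats(candidate: str, max_dist: int = 2) -> list[str]:
    """Return popular names the candidate is suspiciously close to."""
    return [name for name in POPULAR_SKILLS
            if 0 < edit_distance(candidate, name) <= max_dist]
```

Such heuristics catch only the crudest look-alike names; as the paper argues, they do not address ranking manipulation, repository hijacking, or the other supply-chain scenarios.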

The taxonomy is empirically validated through direct mapping to five confirmed ecosystem-scale incidents, including ransomware Skills, supply chain campaigns (e.g., ClawHavoc), malware-laden Skills, codebase exfiltration demonstrations, and configuration injection vulnerabilities.

Real-World Incident Analysis and Numerical Results

Several compelling empirical results underscore the severity and scale of these issues:

  • A supply chain attack systematically compromised over 1,184 Skills, representing ~20% of available Skill packages in a major marketplace, and delivered credential-harvesting malware to unsuspecting users.
  • Large-scale audit of 42,447 Skills found that 26.1% contained at least one security vulnerability spanning four principal categories: prompt injection, data exfiltration, privilege escalation, and supply chain threats.
  • High-impact real-world incidents include ransomware deployment via GIF conversion Skills (MedusaLocker), credential harvesting through infostealer scripts (Atomic macOS Stealer), and codebase exfiltration in enterprise workflows with no audit log traceability.
  • Prompt injection, both direct and indirect, is the most prevalent and structurally unaddressable vulnerability: 26.1% of Skills in the empirical scan contained prompt injection patterns.

Theoretical and Practical Implications

The paper's findings have broad implications:

Theoretical:

  • The natural language interface and operator-level integration of Skills render standard static analysis, privilege separation, and artifact signing insufficient. The specification’s intentional absence of formal behavioral contracts means that intent cannot be reliably inferred or enforced.
  • Architectural gaps, especially in trust models and runtime boundaries, extend the adversarial action space beyond what is possible in traditional extension systems or even in earlier agent extension protocols.
  • The taxonomy demonstrates that consent, provenance, and integrity mechanisms must be re-evaluated for systems using instruction-in-natural-language as the dominant extensibility paradigm.

Practical:

  • Agent Skills, as currently specified, are structurally vulnerable to large-scale compromise. Incremental mitigations (e.g., heuristic scanning, bot-detection) do not address root causes.
  • The lack of mandatory review and accountability in public marketplaces means that defenders cannot rely on post hoc detection, especially at current ecosystem scale.
  • Persistent trust and lack of version binding for approvals ensure that even diligent users are vulnerable to post-installation adversarial updates.
  • Cross-agent prompt infection and environmental persistence present emergent risks as multi-agent LLM pipelines become the norm in enterprise deployment.

Future Directions and Research Challenges

Key research challenges and recommendations identified by the authors include:

  • Natural Language Security Analysis: No formal static analysis, contract inference, or runtime monitoring framework currently exists for arbitrary SKILL.md content. Development of semantic behavioral proxies and runtime intent bounding is required.
  • Fine-Grained Trust and Consent Models: Current all-or-nothing approval is incompatible with security; a tiered, version-bound, or delta-based consent model that can scale while remaining usable must be developed.
  • Automated Skill Vetting and Monitoring: Automated, LLM-assisted vetting pipelines, combined with cryptographic provenance checks and explicit capability claims in Skills, are critical for ecosystem integrity but remain an open research area.
  • Sandboxing and Capability-Restricted Runtime: Dynamic isolation and capability tiering for script execution are urgently needed, though architectural constraints (workflow expressiveness vs. permission scoping) present significant usability tensions.
  • Ecosystem-level Specification Reform: Incorporation of signed, versioned Skills, cryptographically-enforced dependency pinning, and mandatory capability declarations in the Agent Skills specification itself is recommended to ensure baseline security properties.
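The mandatory capability declarations recommended above could operate roughly as follows (the capability names and runtime hook are a hypothetical sketch, not part of the current specification):

```python
# A skill declares up front which capabilities it needs; the runtime
# denies anything outside that set instead of granting operator-level
# authority wholesale on installation.
ALL_CAPABILITIES = {"fs.read", "fs.write", "net.fetch", "proc.spawn"}

def check_request(declared: set[str], requested: str) -> bool:
    """Allow an operation only if the skill declared that capability."""
    if requested not in ALL_CAPABILITIES:
        raise ValueError(f"unknown capability: {requested}")
    return requested in declared

# A document-conversion skill plausibly needs file access but not network:
declared = {"fs.read", "fs.write"}
```

The usability tension noted in the sandboxing bullet applies here as well: declarations tight enough to be meaningful constrain exactly the open-ended workflows that make Skills attractive.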

The analysis concludes that many vulnerabilities are rooted in foundational architectural decisions. Prompt injection and consent gap attacks, in particular, are not fixable without substantial reforms to both the Agent Skills specification and its implementation across agent platforms.

Conclusion

This work provides a comprehensive, incident-driven security analysis of Agent Skills. The architectural convenience and community-driven growth of Skills have resulted in an ecosystem structurally exposed to prompt injection, persistent compromise, and large-scale supply chain attacks. The threat taxonomy is validated against five confirmed real-world incidents. Structural reforms—at the specification, runtime, and marketplace governance layers—are necessary for secure agentic extensibility. As LLM-based modular skill delivery becomes an ecosystem default, formal behavioral specification and architectural permission hygiene must become research and engineering priorities for safe deployment of agentic AI systems (2604.02837).
