Papers
Topics
Authors
Recent
Search
2000 character limit reached

Agent Skills Threat Model Overview

Updated 21 January 2026
  • Agent Skills Threat Model is a framework defining modular LLM agent capabilities and their associated risks, including data exfiltration and privilege escalation.
  • It systematically categorizes attack surfaces such as skill scripts, permission models, and web-based vectors to enable comprehensive risk assessment.
  • Defense strategies like mandatory vetting, minimized permissions, and runtime sandboxing are essential to mitigate complex agent skill threats.

Agent Skills Threat Model

Agent skills are modular capabilities—implemented as instructions, scripts, or function interfaces—that are dynamically added to or invoked by LLM agents to extend their functional repertoire. They are central to enabling plug-and-play extension, continual learning, and automation of complex workflows in modern LLM agent architectures. However, the openness, composability, and implicit trust associated with agent skills introduce a significant and rapidly expanding attack surface, encompassing prompt injection, privilege escalation, data exfiltration, workflow hijacking, and cross-agent abuses. The agent skills threat model systematically characterizes attacker goals, vectors, system trust assumptions, architectural weaknesses, vulnerability prevalence, and empirical risks associated with this class of capabilities (Schmotz et al., 30 Oct 2025).

1. Attacker Goals, Capabilities, and Threat Model Definition

The core adversarial objective with respect to agent skills is to leverage their permission and execution model to exfiltrate sensitive data, execute arbitrary or unauthorized code, evade supervision, and maintain persistence within the agent’s operational context. Attackers may be malicious skill authors embedding direct or concealed payloads, supply chain adversaries compromising benign skills, or negligent developers introducing exploitable configurations. The typical threat model assumes the following:

  • The agent platform supports dynamic skill installation (e.g., importing .claude/skills/*/SKILL.md with accompanying scripts).
  • Skills, once approved, execute with full declared privileges, often including file, shell, and network access.
  • Once permission is granted—especially with "don't ask again" (blanket approval)—future skill executions are not individually reviewed or supervised.
  • Attackers may supply arbitrary skill manifests (Markdown, scripts, metadata) and exploit the user's tendency not to inspect full code or execution traces (Schmotz et al., 30 Oct 2025, Liu et al., 15 Jan 2026).

Adversarial capabilities include full control of skill content, knowledge of agent approval models, and, for web-vector exploits, the ability to craft links or data URIs that trigger malicious payloads with minimal user interaction.

2. Threat Surface: Components and Attack Classes

Agent skill threats decompose into specific exploit surfaces:

a) Skill Files and Scripts

  • Skills are stored as Markdown (SKILL.md) with YAML headers and free-form instructions, plus referenced scripts (e.g., Python) loaded interpretively.
  • Every Markdown line is treated as agent-readable or executable, and script calls are dispatched with high trust.

b) System-Level Guardrails

  • Agents request user permission once per command category (e.g., "run Python," "open network socket").
  • “Yes, and don’t ask again” disables future prompts of that type, enabling silent multi-step exploits once initial trust is established (Schmotz et al., 30 Oct 2025).

c) Web and User Interaction Vectors

  • In web interfaces, skills may embed hyperlinks with secret-leakage (e.g., URL fragments containing context variables or passwords).
  • Cross-context leakage relies on the agent's ability to output or direct the user to external content.

Broad vulnerability classes observed include prompt injection, remote code execution, stealth/persistence via permission-carryover, and data exfiltration by HTTP(S) or indirect agent outputs (Schmotz et al., 30 Oct 2025, Liu et al., 15 Jan 2026).

3. Attack Techniques: Walk-throughs and Taxonomy

Attacks on agent skills manifest through various injective and stealth approaches:

Attack Class Vector Conditions & Effects
Prompt Injection Override or hidden instructions in SKILL.md Bypass system/user constraints, arbitrary behavior
Data Exfiltration Scripts POST files/secrets to remote server File/credential theft, context leakage via network
Privilege Escalation Excessive permissions, shell/sudo, credential read Lateral movement, persistence, system compromise
Supply Chain/Obfuscation Unpinned dependencies, external script fetch, obfuscated code Malicious code injection via dependencies
  • Skills can instruct the agent to run secondary or nested scripts that, beyond their ostensible function (e.g. “backup script”), quietly perform file or credential exfiltration (Schmotz et al., 30 Oct 2025).
  • Web interface attacks echo agent-captured secrets in URLs that, when clicked by the victim, leak sensitive data to the attacker’s domain.
  • Approval bypass relies on the observed pattern: once a user selects blanket approval for a category (e.g., Python execution), all related skills run without further prompts, including subsequent malicious actions (Schmotz et al., 30 Oct 2025).
  • Empirical assessment of 31,132 skills in the wild found 26.1% contained at least one vulnerability pattern, with high-severity prompt injection or data exfiltration signatures in 5.2% (Liu et al., 15 Jan 2026).

The table below summarizes the most prevalent high-severity patterns:

Threat Vector Prevalence (%) Severity
Data Exfiltration (E1-E4) 13.3 High/Medium
Privilege Escalation (PE1-PE3) 11.8 High/Medium
Prompt Injection (P1-P4) 0.7 High
Supply Chain (SC1-SC3) 7.4 High/Medium

4. Permission Models, Persistence, and Stealth

Agent platforms employ a manifest-driven, progressive disclosure model: the skill manifest (e.g., SKILL.md YAML) lists required permissions (file access, shell execution, network), and user approval is requested just once per category. Upon approval:

  • Skills, including all their script-level components, run with the declared privileges on all subsequent invocations.
  • The “don’t ask again” option creates a persistent approval, eliminating future user oversight for any equivalent operation (Schmotz et al., 30 Oct 2025).
  • Agents do not revalidate or rescan skill logic on each execution, so once a malicious or backdoored skill is accepted, new payloads can be introduced through later updates or script-level changes (Liu et al., 15 Jan 2026).

This design creates a substantial consent gap exploited for stealthy, persistent, and multi-stage attacks, with empirical evidence showing 100% attack success in “Claude Code” (automated skill invocation after blanket approval) and context-leakage through hyperlinks in “Claude Web” (Schmotz et al., 30 Oct 2025).

5. Detection, Statistical Risk, and Empirical Findings

Comprehensive marketplace scanning (Agent Skill Scan/SkillScan) combining static analysis, LLM-based semantic filtering, and hybrid classifiers reveals:

  • Script-bundled skills are 2.12× more likely to be vulnerable than instruction-only skills (OR=2.12, p<0.001).
  • Precision and recall for the hybrid detection framework reach 86.7% and 82.5% respectively (F1=84.6%) on labeled benchmark sets.
  • Adjusted prevalence, accounting for detection uncertainty, remains between 23% and 30% with a 95% CI.
  • High-severity signature prevalence (e.g., explicit override or exfiltration) encompasses ~5.2% of all skills (Liu et al., 15 Jan 2026).

Direct evaluation of attacks demonstrates that:

  • Malicious skills execute with near-certain success contingent on user installation and at least initial blanket approval.
  • Web-based leakage requires only a single benign-seeming link to be clicked (Schmotz et al., 30 Oct 2025).

Limitations include the residual risk that remains in the absence of thorough manual audit, and defense evasion by integrating malicious instructions covertly or gradually (Schmotz et al., 30 Oct 2025, Liu et al., 15 Jan 2026).

6. Defense Strategies and Risk Mitigation

Robust risk reduction in the agent skills ecosystem requires both preventive and reactive controls:

  • Mandatory skill vetting: Enforce multi-stage review pipelines incorporating static pattern matching, LLM-guarded semantic filtering, and manual audit for high-severity patterns (P1–P3, E2, SC2–SC3) prior to publication or agent installation (Liu et al., 15 Jan 2026).
  • Capability-based permissioning: Agent platforms should require explicit, minimized manifest declarations (e.g., { "file_system": {"read":["./data"], "write":[]}, "network": [], "execute": [] }), coupled with runtime sandboxing (containerization, WebAssembly) to confine the effects of skill execution (Liu et al., 15 Jan 2026).
  • Retention of per-invocation supervision: Wherever practical, avoid blanket approval (“don’t ask again”) for privilege categories, and enable runtime monitoring of outbound data, shell, and network access.
  • Marketplace policies: Require author identity, code signing, dependency pinning (with hashes), and continuous behavioral monitoring for runtime anomalies.
  • Continuous updating of detection logic: Keep pattern and classifier rulesets in pace with evolving attack tactics, leveraging open vulnerability databases and red-team input.

A defense-in-depth roadmap includes immediate integration of SkillScan into CI/CD (Liu et al., 15 Jan 2026), medium-term operationalization of code and author attestations, and long-term goals of formal verification and sandboxed skill execution.

7. Implications and Open Challenges

The widespread, modular adoption of agent skills introduces systemic risk at both the individual agent and supply chain level. The empirical vulnerabilities identified—high likelihood of successful and stealthy exploitation given current permission models, high transition rates from approval to persistence, and substantial latent attack surface—demand rigorous, capability-focused threat modeling and forced least-privilege designs.

Notably, mandatory vetting and permission attestation significantly reduce, but do not eliminate, residual risk, given the steady evolution of adversarial obfuscation and the complexity of formally verifying LLM-agent behavior in the presence of arbitrarily composable skill logic (Liu et al., 15 Jan 2026, Schmotz et al., 30 Oct 2025).

A central conclusion is that dynamic, composable skills without enforced vetting and privilege granularity effectively render agent-centric systems fundamentally insecure by default, with 26.1% of marketplace skills demonstrating at least one actionable vulnerability, and 5.2% indicating strongly malicious intent—a pattern confirmed across both open research datasets and controlled injection experiments (Liu et al., 15 Jan 2026, Schmotz et al., 30 Oct 2025).

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Agent Skills Threat Model.