Papers
Topics
Authors
Recent
Search
2000 character limit reached

Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry

Published 12 May 2026 in cs.AI and cs.CR | (2605.11418v1)

Abstract: Autonomous AI agents increasingly extend their capabilities through Agent Skills: modular filesystem packages whose SKILL.md files describe when and how agents should use them. While this design enables scalable, on-demand capability expansion, it also introduces a semantic supply-chain risk in which natural-language metadata and instructions can affect which skills are admitted, surfaced, selected, and loaded. We study SKILL.md - only attacks across three registry-facing stages of the Agent Skill lifecycle, using real ClawHub skills and realistic registry mechanisms. In Discovery, short textual triggers can manipulate embedding-based retrieval and improve adversarial skill visibility, achieving up to 86% pairwise win rate and 80% Top-10 placement. In Selection, description-only framing biases agents toward functionally equivalent adversarial variants, which are selected in 77.6% of paired trials on average. In Governance, semantic evasion strategies cause malicious skills to avoid a blocking verdict in 36.5%-100% of cases. Overall, our results show that SKILL.md is not passive documentation but operational text that shapes which third-party capabilities agents find, trust, and use.

Summary

  • The paper presents a systematic formulation and empirical demonstration of semantic supply-chain attacks via adversarial modifications in SKILL.md files.
  • It uses beam-search and gradient-based optimization to boost embedding similarity, achieving up to 94% win rates in manipulated registry rankings.
  • The study exposes significant risks in agent governance, showing that minor linguistic framing can bias skill selection and bypass moderation controls.

Semantic Supply-Chain Attacks via SKILL.md in AI Agent Skill Registries

Introduction and Problem Formulation

The modularization of agentic capabilities into reusable "Agent Skills," predominantly distributed through SKILL.md-anchored packages, has catalyzed rapid growth in AI agent ecosystems. However, this packaging paradigm introduces a unique supply-chain attack vector: the natural-language documentation and instructions in SKILL.md that are directly consumed by agents and mediate discovery, selection, and governance stages. "Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry" (2605.11418) systematically formulates and empirically demonstrates the exploitation of SKILL.md as an operational, adversarial control surface. The core thesis is that even with immutable code and auxiliary files, minimal and targeted natural-language modification to SKILL.md is sufficient to (1) elevate adversarial skills in registry rankings, (2) bias agent-side preference during selection, and (3) evade registry governance controls. Figure 1

Figure 1: Overview of SKILL.md-only semantic supply-chain attacks highlighting discovery manipulation (ranking boosts), selection preference induction, and evasion of governance scans through adversarial text.

This work separates itself from prior agent-tooling security literature by focusing exclusively on the registry-facing lifecycle—admission, surfacing, and selection—rather than post-loading host-agent compromise or behavioral monitoring.

Discovery Manipulation: Attacking Embedding-Based Retrieval

AI agent skill registries predominantly use embedding-based retrieval for skill discovery. The paper formalizes the attack whereby an adversary appends optimized natural-language "discovery triggers" to SKILL.md with the express goal of colliding the modified skill's embedding with that of a target query, thereby increasing rank and exposure. Crucially, the attacks are engineered so that code and functional behavior remain untouched, and the adversary does not control registry or agent internals.

Methodologically, two optimization strategies are considered:

  • Beam-Search Trigger Optimization: Black-box, score-feedback driven, producing plausible trigger continuations using a generator and iteratively maximizing similarity against the registry’s embedding model.
  • Gradient-Based Trigger Optimization: White-box, direct manipulation using gradients from accessible or surrogate embedding models. Figure 2

    Figure 2: Beam-search optimization schematic for discovery triggers demonstrating iterative candidate expansion/scoring for embedding similarity boosts.

The empirical results are robust:

  • Black-box OpenAI embedding attacks yield an 86.14% win rate and 80% top-10 placement.
  • Triggers optimized for BAAI models transfer strongly to OpenAI, indicating nontrivial cross-model robustness.
  • Even in ClawHub-style popularity-aware ranking (incorporating lexical, vector, and download-count signals), SKILL.md modifications win in 74.14% of average-day and up to 94% on initial submission (0-hour). Figure 3

    Figure 3: Heatmap of win rates and score boosts against different embedding models, confirming both strong in-model and significant cross-model (transfer) attack success.

    Figure 4

    Figure 4: Retrieval score distributions; direct optimization against the registry's embedding model dominates, but transfer from surrogates is also effective across all tested skill domains.

These results establish that SKILL.md content can be adversarially tuned to weaponize embedding retrieval, and that semantic attacks propagate through to real-world, multi-factor ranking systems.

Selection Manipulation: Linguistic Framing Bias in Agent Preference

The selection phase investigates whether minimal, high-level changes in SKILL.md—specifically in the "description" field—can bias agent models toward attacker-controlled but functionally equivalent variants over benign ones. The manipulation strategies include false advertising, assertive cues, active maintenance signals, and security/trust claims.

A paired-choice evaluation setup using four contemporary agent models (GPT-4.1-mini, GPT-5, Gemma-4-31B, Qwen3-235B-A22B-Instruct) is employed, across 2,000 trials per model, 100 real-world skills, and four manipulation types. Figure 5

Figure 5: Selection preference matrix; adversarially manipulated variants consistently dominate original skills (mean 77.6% selection rate), far exceeding random-choice baseline.

Notably, assertive and capability-exaggerating triggers induce the strongest bias, but all strategies yield substantial adversarial uplift.

The findings reveal that agent selection is not guided strictly by underlying skill semantics or utility but is highly susceptible to minor linguistic framing—even when explicit deception or functional difference is absent. This reveals a significant attack vector for influencing agent orchestration.

Registry Governance Evasion: Bypassing Moderation via Semantics

The paper also examines pre-admission vetting—LLM-based analysis, static rule checks, and malware scanning as implemented in modern registries (e.g., ClawHub)—and develops SKILL.md evasion strategies without code alteration.

Evaluated evasion methods:

  • LLM Judge Jailbreaking: Attempts to manipulate the LLM review process directly.
  • Paraphrasing Malicious Instructions: Semantic rewriting to obscure intent without removing harmful effect.
  • Definition-of-Done Framing: Shifting malicious payloads to checklist/check-off phrasing.
  • Context-Window Overflow: Placement of the malicious segment beyond LLM moderation context (exploiting truncation).

Empirical results show:

  • LLM jailbreaking minimally impacts, but paraphrasing and task-list reframing cause up to 73.9% and 66.3% of variants, respectively, to evade a blocking verdict.
  • Context-window overflow effectively circumvents all blocking by ensuring malicious content is not seen by the moderation LLM.

These attacks are resilient across domains and spotlight practical gaps in semantic vetting, especially regarding LLM and scanner context handling and over-reliance on surface-level linguistic cues.

Practical and Theoretical Implications

The research confirms that SKILL.md is a capabilities-critical vector for semantic control over agent ecosystems. The main implications:

  • Registry and Agent Trust Boundaries: SKILL.md acts not as passive documentation, but as operational policy. Discovery, selection, and governance algorithms must be robust to adversarial manipulation of language, not just code.
  • Attack Composability: The showcased attacks can be easily composed—malicious SKILL.md that bypasses governance will often simultaneously game retrieval and selection, enabling end-to-end supply chain compromise.
  • Transferability and Cross-Ecosystem Risk: Semantic triggers optimized on one signature/model retain power across alternatives, suggesting that model diversity or opacity does not confer substantial defensive benefit.
  • Governance Gaps: Defensive reliance on context-limited LLM evaluation, surface-level keyword scanning, or lenient suspicious verdicts is inadequate. Full-file, chunked, or multi-pass auditing and stricter suspicious-skill quarantine policies are required.
  • Transparency and Attribution: Future registries should support detailed auditing, traceable moderation justifications, and attested provenance for each skill to enable post-facto forensics and user/agent trust recalibration.

For future AI systems, the practical lesson is clear: agent trust surfaces must treat all operational text—including natural-language manifest content—as adversarially mutable and subject to strong semantics-based attack and control.

Conclusion

This work establishes the SKILL.md file as an operational attack vector in agent skill registries and empirically demonstrates that small, targeted language-only adversarial triggers can manipulate discovery rankings, selection behavior, and registry vetting outcomes independently of code or observable functionality. These vulnerabilities pose structurally novel supply-chain risks for agentic platforms relying on third-party skills. Effective hardening will require semantic-aware retrieval/selection, robust multi-stage moderation, rigorous runtime isolation, and transparent auditing across the agent capability supply chain.


Figure 6

Figure 6: Distribution of ClawHub skill download counts, confirming real-world adversarial impact by showing skill popularity and ecosystem reach.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 9 tweets with 18 likes about this paper.