- The paper presents a systematic formulation and empirical demonstration of semantic supply-chain attacks via adversarial modifications in SKILL.md files.
- It uses beam-search and gradient-based optimization to boost embedding similarity, achieving up to 94% win rates in manipulated registry rankings.
- The study exposes significant risks in agent governance, showing that minor linguistic framing can bias skill selection and bypass moderation controls.
Semantic Supply-Chain Attacks via SKILL.md in AI Agent Skill Registries
The modularization of agentic capabilities into reusable "Agent Skills," predominantly distributed through SKILL.md-anchored packages, has catalyzed rapid growth in AI agent ecosystems. However, this packaging paradigm introduces a unique supply-chain attack vector: the natural-language documentation and instructions in SKILL.md that are directly consumed by agents and mediate discovery, selection, and governance stages. "Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry" (2605.11418) systematically formulates and empirically demonstrates the exploitation of SKILL.md as an operational, adversarial control surface. The core thesis is that even with immutable code and auxiliary files, minimal and targeted natural-language modification to SKILL.md is sufficient to (1) elevate adversarial skills in registry rankings, (2) bias agent-side preference during selection, and (3) evade registry governance controls.
Figure 1: Overview of SKILL.md-only semantic supply-chain attacks highlighting discovery manipulation (ranking boosts), selection preference induction, and evasion of governance scans through adversarial text.
This work separates itself from prior agent-tooling security literature by focusing exclusively on the registry-facing lifecycle—admission, surfacing, and selection—rather than post-loading host-agent compromise or behavioral monitoring.
Discovery Manipulation: Attacking Embedding-Based Retrieval
AI agent skill registries predominantly use embedding-based retrieval for skill discovery. The paper formalizes the attack whereby an adversary appends optimized natural-language "discovery triggers" to SKILL.md with the express goal of colliding the modified skill's embedding with that of a target query, thereby increasing rank and exposure. Crucially, the attacks are engineered so that code and functional behavior remain untouched, and the adversary does not control registry or agent internals.
Methodologically, two optimization strategies are considered:
The empirical results are robust:
- Black-box OpenAI embedding attacks yield an 86.14% win rate and 80% top-10 placement.
- Triggers optimized for BAAI models transfer strongly to OpenAI, indicating nontrivial cross-model robustness.
- Even in ClawHub-style popularity-aware ranking (incorporating lexical, vector, and download-count signals), SKILL.md modifications win in 74.14% of average-day and up to 94% on initial submission (0-hour).
Figure 3: Heatmap of win rates and score boosts against different embedding models, confirming both strong in-model and significant cross-model (transfer) attack success.
Figure 4: Retrieval score distributions; direct optimization against the registry's embedding model dominates, but transfer from surrogates is also effective across all tested skill domains.
These results establish that SKILL.md content can be adversarially tuned to weaponize embedding retrieval, and that semantic attacks propagate through to real-world, multi-factor ranking systems.
Selection Manipulation: Linguistic Framing Bias in Agent Preference
The selection phase investigates whether minimal, high-level changes in SKILL.md—specifically in the "description" field—can bias agent models toward attacker-controlled but functionally equivalent variants over benign ones. The manipulation strategies include false advertising, assertive cues, active maintenance signals, and security/trust claims.
A paired-choice evaluation setup using four contemporary agent models (GPT-4.1-mini, GPT-5, Gemma-4-31B, Qwen3-235B-A22B-Instruct) is employed, across 2,000 trials per model, 100 real-world skills, and four manipulation types.
Figure 5: Selection preference matrix; adversarially manipulated variants consistently dominate original skills (mean 77.6% selection rate), far exceeding random-choice baseline.
Notably, assertive and capability-exaggerating triggers induce the strongest bias, but all strategies yield substantial adversarial uplift.
The findings reveal that agent selection is not guided strictly by underlying skill semantics or utility but is highly susceptible to minor linguistic framing—even when explicit deception or functional difference is absent. This reveals a significant attack vector for influencing agent orchestration.
Registry Governance Evasion: Bypassing Moderation via Semantics
The paper also examines pre-admission vetting—LLM-based analysis, static rule checks, and malware scanning as implemented in modern registries (e.g., ClawHub)—and develops SKILL.md evasion strategies without code alteration.
Evaluated evasion methods:
- LLM Judge Jailbreaking: Attempts to manipulate the LLM review process directly.
- Paraphrasing Malicious Instructions: Semantic rewriting to obscure intent without removing harmful effect.
- Definition-of-Done Framing: Shifting malicious payloads to checklist/check-off phrasing.
- Context-Window Overflow: Placement of the malicious segment beyond LLM moderation context (exploiting truncation).
Empirical results show:
- LLM jailbreaking minimally impacts, but paraphrasing and task-list reframing cause up to 73.9% and 66.3% of variants, respectively, to evade a blocking verdict.
- Context-window overflow effectively circumvents all blocking by ensuring malicious content is not seen by the moderation LLM.
These attacks are resilient across domains and spotlight practical gaps in semantic vetting, especially regarding LLM and scanner context handling and over-reliance on surface-level linguistic cues.
Practical and Theoretical Implications
The research confirms that SKILL.md is a capabilities-critical vector for semantic control over agent ecosystems. The main implications:
- Registry and Agent Trust Boundaries: SKILL.md acts not as passive documentation, but as operational policy. Discovery, selection, and governance algorithms must be robust to adversarial manipulation of language, not just code.
- Attack Composability: The showcased attacks can be easily composed—malicious SKILL.md that bypasses governance will often simultaneously game retrieval and selection, enabling end-to-end supply chain compromise.
- Transferability and Cross-Ecosystem Risk: Semantic triggers optimized on one signature/model retain power across alternatives, suggesting that model diversity or opacity does not confer substantial defensive benefit.
- Governance Gaps: Defensive reliance on context-limited LLM evaluation, surface-level keyword scanning, or lenient suspicious verdicts is inadequate. Full-file, chunked, or multi-pass auditing and stricter suspicious-skill quarantine policies are required.
- Transparency and Attribution: Future registries should support detailed auditing, traceable moderation justifications, and attested provenance for each skill to enable post-facto forensics and user/agent trust recalibration.
For future AI systems, the practical lesson is clear: agent trust surfaces must treat all operational text—including natural-language manifest content—as adversarially mutable and subject to strong semantics-based attack and control.
Conclusion
This work establishes the SKILL.md file as an operational attack vector in agent skill registries and empirically demonstrates that small, targeted language-only adversarial triggers can manipulate discovery rankings, selection behavior, and registry vetting outcomes independently of code or observable functionality. These vulnerabilities pose structurally novel supply-chain risks for agentic platforms relying on third-party skills. Effective hardening will require semantic-aware retrieval/selection, robust multi-stage moderation, rigorous runtime isolation, and transparent auditing across the agent capability supply chain.
Figure 6: Distribution of ClawHub skill download counts, confirming real-world adversarial impact by showing skill popularity and ecosystem reach.