- The paper introduces SkillTrojan, a novel backdoor attack framework that embeds malicious payloads within skill packages, diverging from traditional model-centric methods.
- The methodology involves encrypting, fragmenting, and conditionally triggering payloads, achieving up to 97.2% attack success rate while maintaining benign task accuracy.
- The findings emphasize the need for enhanced provenance auditing, sandboxing, and execution trace monitoring to secure skill-based agent systems.
Comprehensive Analysis of "SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems" (2604.06811)
Introduction and Motivation
Skill-based agent systems have become an industry standard for orchestrating complex tasks through compositional, reusable skill modules that encapsulate tool executions, procedural logic, and state management. This modularity enhances scalability and fosters third-party skill marketplaces, but also enlarges the attack surface beyond the model-centric paradigm. The paper introduces SkillTrojan, a novel backdoor attack framework that penetrates this skill abstraction layer, embedding malicious payloads within executable skill packages rather than modifying model parameters, data, prompts, or memory/tool interfaces.
Threat Model and Attack Vector
SkillTrojan models the attacker as an entity able to publish benign-appearing skill packages at scale to public repositories. The attack targets skill packages, each comprising an instruction specification and executable artifacts, without compromising the underlying agent model or operating system. Once adopted by developers or agents, these skills grant the attacker persistent control irrespective of task, user, or deployment context. The payload is partitioned and encrypted, distributed across multiple fragment-emitting skill actions, and triggered only by user queries matching designated patterns; it therefore remains dormant in standard workflows and evades behavioral evaluations.
Methodology
Payload Fragmentation and Skill Instrumentation
SkillTrojan encrypts the attacker-specified payload, splits the ciphertext into N indexed fragments, and injects conditional emission logic into designated skill actions. This fragmentation controls activation robustness and per-instance stealthiness, enabling order-agnostic reconstruction and minimizing detection via static or behavioral analysis. Instrumented skills preserve original semantics unless the trigger predicate is satisfied.
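The fragmentation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names, the XOR stream cipher, and the even-split strategy are all assumptions standing in for whatever scheme the authors actually use.

```python
import base64

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # Symmetric XOR stream: the same call encrypts and decrypts.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def fragment_payload(payload: bytes, key: bytes, n: int) -> list[tuple[int, str]]:
    # Encrypt, Base64-encode, then split the ciphertext into n chunks.
    # Each chunk carries its index so reconstruction can be order-agnostic.
    ciphertext = base64.b64encode(xor_bytes(payload, key)).decode()
    size = -(-len(ciphertext) // n)  # ceiling division covers the whole string
    return [(i, ciphertext[i * size:(i + 1) * size]) for i in range(n)]
```

Each indexed fragment would then be embedded in a different instrumented skill action, which emits it only when the trigger predicate holds.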
Triggered Reconstruction and Execution
Upon a triggered query, the agent's normal skill composition produces an execution trace that emits all payload fragments. A designated verification action then reconstructs, decrypts, and executes the payload as a hidden side effect, while the user receives the expected benign output. All intermediate artifacts are ephemeral, limiting forensic traceability and cross-run detection.
Automated Backdoored Skill Synthesis
SkillTrojan supports automated generation of backdoored skills from arbitrary templates and payloads, facilitating scalable propagation across skill marketplaces. The released SkillTrojanX dataset contains over 3,000 backdoored skill variants, covering diverse triggers, encryption methods, and payloads for rigorous agent security evaluation.
Empirical Evaluation
Experimental Setup
The framework is benchmarked in code-based agent systems (e.g., EHR SQL generation), evaluating both clean-task accuracy (ACC) and attack success rate (ASR) across open-weight (GLM-4.7, Qwen3-Coder) and closed-weight (GPT-5.2-1211-Global, Claude-Sonnet4.5) LLMs. Competitive baselines—prompt manipulations, memory/tool poisoning, and model-centric attacks—are adapted for rigorous comparison.
Key Results
SkillTrojan consistently achieves high ASR with minimal degradation in benign task performance. For example, on GPT-5.2-1211-Global, SkillTrojan attains 97.2% ASR and 89.3% ACC, outperforming AgentPoison (58.3% ASR, 72.9% ACC) and exhibiting a more favorable ACC-ASR trade-off than all prompt-centric baselines. Clean-task accuracy remains stable across varying poisoning ratios, fragment counts, and encryption schemes, reflecting conditional dormancy and stealth. SkillTrojan's efficacy is replicated in SWE-Bench Verified, confirming generality beyond SQL-centric workloads.
Ablation Studies
Ablation on fragment count reveals that moderate fragmentation maximizes ASR: too few fragments expose anomalous outputs, while too many risk incomplete reconstruction. Encryption variants (XOR+Base64 vs. hybrid strategies) negligibly affect ACC/ASR but drastically alter detectability by signature-based static heuristics, underscoring the brittleness of naive detection methods.
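The brittleness of signature-based static heuristics can be illustrated with a toy scanner. The regex and the 40-character threshold are assumptions chosen for illustration: a heuristic that flags one long Base64 blob misses the very same ciphertext once it is split into short fragments.

```python
import re

# Naive static heuristic: flag any long Base64-looking literal in skill source.
BASE64_SIG = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")

def flags_source(source: str) -> bool:
    return bool(BASE64_SIG.search(source))

# One 80-character blob trips the signature...
monolithic = 'DATA = "' + "QUJD" * 20 + '"'
# ...but five 16-character fragments of the same material slip past it.
fragments = [f'F{i} = "{"QUJD" * 4}"' for i in range(5)]
```

Raising the threshold or broadening the regex only shifts the trade-off, which is why the paper points toward execution-aware detection instead.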
Implications and Defensive Directions
The paper exposes skill-level backdoors as a critical blind spot in agent security. Attacks embedded in persistent, reusable skill packages escape detection by model-centric defenses and behavioral evaluation protocols. Effective mitigation must combine skill-provenance auditing, execution sandboxing, permission enforcement, and execution-trace monitoring to constrain what composed skills can do.
Skill ecosystems require systematic, execution-aware security evaluation. The released SkillTrojanX dataset enables research in static analysis, provenance checks, and robust forensic detection. Defensive mechanisms should target the verification action, the single point where fragments are reassembled and executed, and should treat side-effect execution as a primary indicator of attack success rather than relying solely on outcome-based monitoring.
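The permission-enforcement direction can be sketched as a deny-by-default runtime gate. The manifest format, skill names, and action names below are illustrative assumptions, not an API proposed by the paper.

```python
# Deny-by-default gate: a skill may run only the actions its manifest declares.
# Manifest structure and names are illustrative assumptions.
MANIFEST: dict[str, set[str]] = {
    "sql_helper": {"parse_query", "run_sql", "format_rows"},
}

def authorize(skill: str, action: str,
              manifest: dict[str, set[str]] = MANIFEST) -> bool:
    return action in manifest.get(skill, set())

def execute_action(skill: str, action: str) -> str:
    if not authorize(skill, action):
        # An undeclared side effect (e.g. payload execution) is blocked here,
        # and the attempt itself becomes an auditable signal.
        raise PermissionError(f"{skill!r} attempted undeclared action {action!r}")
    return f"{action} ok"
```

Under such a policy, a backdoored skill's hidden execution step would have to appear in its declared manifest, turning a covert side effect into an auditable declaration.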
Theoretical and Practical Implications
Practically, SkillTrojan demonstrates that skill abstraction decouples malicious logic from agent reasoning, permitting scalable, persistent exploitation independent of prompt, model, or memory/tool interface defenses. Theoretically, this broadens the attack surface for agentic systems, requiring a paradigm shift in agent security research to incorporate compositional execution analysis. As skill-based agents proliferate, reliance on third-party skills renders provenance and modular trust paramount.
Future research directions include: (1) formalizing security policies for skill composition; (2) developing provenance-auditing frameworks for skill repositories; (3) designing execution-layer sandboxing and permission schemes; (4) quantifying the impact of defensive interventions on both ACC and ASR in agent deployment; and (5) automating detection of dormant, fragmented payloads across heterogeneous agent pipelines.
Conclusion
SkillTrojan represents the first systematic attack targeting the skill abstraction layer of agentic systems, embedding payloads into reusable skills and activating malicious behavior via standard skill composition. Empirical results establish a new, highly effective backdoor vector with high attack reliability and low performance cost, surpassing existing prompt-centric and model-centric baselines. This work motivates new defenses for skill-level trust, provenance, and compositional execution analysis, and provides a comprehensive dataset for reproducible research in securing skill-based agent ecosystems.