- The paper introduces SkillTrojan, a novel backdoor attack framework that embeds malicious payloads within skill packages, diverging from traditional model-centric methods.
- The methodology involves encrypting, fragmenting, and conditionally triggering payloads, achieving up to 97.2% attack success rate while maintaining benign task accuracy.
- The findings emphasize the need for enhanced provenance auditing, sandboxing, and execution trace monitoring to secure skill-based agent systems.
Comprehensive Analysis of "SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems" (2604.06811)
Introduction and Motivation
Skill-based agent systems have become an industry standard for orchestrating complex tasks through compositional, reusable skill modules that encapsulate tool executions, procedural logic, and state management. This modularity enhances scalability and fosters third-party skill marketplaces, but also enlarges the attack surface beyond the model-centric paradigm. The paper introduces SkillTrojan, a novel backdoor attack framework that penetrates this skill abstraction layer, embedding malicious payloads within executable skill packages rather than modifying model parameters, data, prompts, or memory/tool interfaces.
Threat Model and Attack Vector
SkillTrojan models the attacker as an entity able to publish benign-appearing skill packages at scale to public repositories. The attack targets skill packages, each comprising an instruction specification and executable artifacts, without compromising the underlying agent model or operating system. Once adopted by developers or agents, these skills grant the attacker persistent control irrespective of task, user, or deployment context. The payload is partitioned and encrypted, distributed across multiple fragment-emitting skill actions, and triggered only by user queries matching designated patterns; it therefore remains dormant in standard workflows and evades behavioral evaluations.
Methodology
Payload Fragmentation and Skill Instrumentation
SkillTrojan encrypts the attacker-specified payload, splits the ciphertext into N indexed fragments, and injects conditional emission logic into designated skill actions. This fragmentation controls activation robustness and per-instance stealthiness, enabling order-agnostic reconstruction and minimizing detection via static or behavioral analysis. Instrumented skills preserve original semantics unless the trigger predicate is satisfied.
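The fragmentation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names, the XOR stream cipher, and the even-split strategy are all assumptions standing in for whatever scheme the authors actually use.

```python
import base64

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # Symmetric XOR stream: the same call encrypts and decrypts.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def fragment_payload(payload: bytes, key: bytes, n: int) -> list[tuple[int, str]]:
    # Encrypt, Base64-encode, then split the ciphertext into n chunks.
    # Each chunk carries its index so reconstruction can be order-agnostic.
    ciphertext = base64.b64encode(xor_bytes(payload, key)).decode()
    size = -(-len(ciphertext) // n)  # ceiling division covers the whole string
    return [(i, ciphertext[i * size:(i + 1) * size]) for i in range(n)]
```

Each indexed fragment would then be embedded in a different instrumented skill action, which emits it only when the trigger predicate holds.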
Triggered Reconstruction and Execution
Upon a triggered query, the agent's normal skill composition produces an execution trace that emits all payload fragments. A designated verification action then reconstructs, decrypts, and executes the payload as a hidden side effect, while the user receives the expected benign output. All intermediate artifacts are ephemeral, limiting forensic traceability and cross-run detection.
Automated Backdoored Skill Synthesis
SkillTrojan supports automated generation of backdoored skills from arbitrary templates and payloads, facilitating scalable propagation across skill marketplaces. The released SkillTrojanX dataset contains over 3,000 backdoored skill variants, covering diverse triggers, encryption methods, and payloads for rigorous agent security evaluation.
Empirical Evaluation
Experimental Setup
The framework is benchmarked in code-based agent systems (e.g., EHR SQL generation), evaluating both clean-task accuracy (ACC) and attack success rate (ASR) across open-weight (GLM-4.7, Qwen3-Coder) and closed-weight (GPT-5.2-1211-Global, Claude-Sonnet4.5) LLMs. Competitive baselines—prompt manipulations, memory/tool poisoning, and model-centric attacks—are adapted for rigorous comparison.
Key Results
SkillTrojan consistently achieves high ASR with minimal degradation in benign task performance. For example, on GPT-5.2-1211-Global, SkillTrojan attains 97.2% ASR and 89.3% ACC, outperforming AgentPoison (58.3% ASR, 72.9% ACC) and exhibiting a more favorable ACC-ASR trade-off than all prompt-centric baselines. Clean-task accuracy remains stable across varying poisoning ratios, fragment counts, and encryption schemes, reflecting conditional dormancy and stealth. SkillTrojan's efficacy is replicated in SWE-Bench Verified, confirming generality beyond SQL-centric workloads.
Ablation Studies
Ablation on fragment count reveals that moderate fragmentation maximizes ASR: too few fragments expose anomalous outputs, while too many risk incomplete reconstruction. Encryption variants (XOR+Base64 vs. hybrid strategies) negligibly affect ACC/ASR but drastically alter detectability by signature-based static heuristics, underscoring the brittleness of naive detection methods.
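The brittleness of signature-based static heuristics can be illustrated with a toy scanner. The regex and the 40-character threshold are assumptions chosen for illustration: a heuristic that flags one long Base64 blob misses the very same ciphertext once it is split into short fragments.

```python
import re

# Naive static heuristic: flag any long Base64-looking literal in skill source.
BASE64_SIG = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")

def flags_source(source: str) -> bool:
    return bool(BASE64_SIG.search(source))

# One 80-character blob trips the signature...
monolithic = 'DATA = "' + "QUJD" * 20 + '"'
# ...but five 16-character fragments of the same material slip past it.
fragments = [f'F{i} = "{"QUJD" * 4}"' for i in range(5)]
```

Raising the threshold or broadening the regex only shifts the trade-off, which is why the paper points toward execution-aware detection instead.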
Implications and Defensive Directions
The paper exposes skill-level backdoors as a critical blind spot in agent security. Attacks embedded in persistent, reusable skill packages escape detection by model-centric defenses and behavioral evaluation protocols. Effective mitigation must combine skill-provenance auditing, execution sandboxing, permission enforcement, and execution-trace monitoring to constrain what composed skills can do.
Skill ecosystems require systematic, execution-aware security evaluation. The released SkillTrojanX dataset enables research in static analysis, provenance checks, and robust forensic detection. Defensive mechanisms should target the verification action, the single point where fragments are reassembled and executed, and should treat side-effect execution as a primary indicator of attack success rather than relying solely on outcome-based monitoring.
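The permission-enforcement direction can be sketched as a deny-by-default runtime gate. The manifest format, skill names, and action names below are illustrative assumptions, not an API proposed by the paper.

```python
# Deny-by-default gate: a skill may run only the actions its manifest declares.
# Manifest structure and names are illustrative assumptions.
MANIFEST: dict[str, set[str]] = {
    "sql_helper": {"parse_query", "run_sql", "format_rows"},
}

def authorize(skill: str, action: str,
              manifest: dict[str, set[str]] = MANIFEST) -> bool:
    return action in manifest.get(skill, set())

def execute_action(skill: str, action: str) -> str:
    if not authorize(skill, action):
        # An undeclared side effect (e.g. payload execution) is blocked here,
        # and the attempt itself becomes an auditable signal.
        raise PermissionError(f"{skill!r} attempted undeclared action {action!r}")
    return f"{action} ok"
```

Under such a policy, a backdoored skill's hidden execution step would have to appear in its declared manifest, turning a covert side effect into an auditable declaration.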
Theoretical and Practical Implications
Practically, SkillTrojan demonstrates that skill abstraction decouples malicious logic from agent reasoning, permitting scalable, persistent exploitation independent of prompt, model, or memory/tool interface defenses. Theoretically, this broadens the attack surface for agentic systems, requiring a paradigm shift in agent security research to incorporate compositional execution analysis. As skill-based agents proliferate, reliance on third-party skills renders provenance and modular trust paramount.
Future research directions include: (1) formalizing security policies for skill composition; (2) developing provenance-auditing frameworks for skill repositories; (3) designing execution-layer sandboxing and permission schemes; (4) quantifying the impact of defensive interventions on both ACC and ASR in agent deployment; and (5) automating detection of dormant, fragmented payloads across heterogeneous agent pipelines.
Conclusion
SkillTrojan represents the first systematic attack targeting the skill abstraction layer of agentic systems, embedding payloads into reusable skills and activating malicious behavior via standard skill composition. Empirical results establish a new, highly effective backdoor vector with high attack reliability and low performance cost, surpassing existing prompt-centric and model-centric baselines. This work motivates new defenses for skill-level trust, provenance, and compositional execution analysis, and provides a comprehensive dataset for reproducible research in securing skill-based agent ecosystems.