Undefined security model for handling malicious third‑party content in LLM-based agents

Develop a concrete security model and policy for large language model–based agents operating within multi-agent systems that specifies how to handle malicious third‑party content, including trust boundaries and semantics for inputs, actions, data, and metadata, so that agents can distinguish benign from adversarial inputs and avoid harming users.

Background

The authors observe that current multi-agent systems lack an explicit trust model and semantics for distinguishing data from metadata, leading agents to blindly trust outputs from other agents (confused deputies). This makes systems vulnerable to control-flow hijacking via adversarial content.

They explicitly note that LLMs do not have a security model for dealing with malicious third‑party content and convey uncertainty about what such a model or policy should be. Defining this security model is thus an unresolved foundational question necessary to safely deploy agentic systems that interact with untrusted environments.

References

LLMs do not have a security model for dealing with malicious third-party content, and it is unclear what this model or policy might look like.

Multi-Agent Systems Execute Arbitrary Malicious Code  (2503.12188 - Triedman et al., 15 Mar 2025) in Section 7 (Discussion), Blind trust in confused deputies