Policy-Invisible Violations in LLM-Based Agents

Published 14 Apr 2026 in cs.AI, cs.CL, cs.CR, and cs.LG | (2604.12177v1)

Abstract: LLM-based agents can execute actions that are syntactically valid, user-sanctioned, and semantically appropriate, yet still violate organizational policy because the facts needed for correct policy judgment are hidden at decision time. We call this failure mode policy-invisible violations: cases in which compliance depends on entity attributes, contextual state, or session history absent from the agent's visible context. We present PhantomPolicy, a benchmark spanning eight violation categories with balanced violation and safe-control cases, in which all tool responses contain clean business data without policy metadata. We manually review all 600 model traces produced by five frontier models and evaluate them using human-reviewed trace labels. Manual review changes 32 labels (5.3%) relative to the original case-level annotations, confirming the need for trace-level human review. To demonstrate what world-state-grounded enforcement can achieve under favorable conditions, we introduce Sentinel, an enforcement framework based on counterfactual graph simulation. Sentinel treats every agent action as a proposed mutation to an organizational knowledge graph, performs speculative execution to materialize the post-action world state, and verifies graph-structural invariants to decide Allow/Block/Clarify. Against human-reviewed trace labels, Sentinel substantially outperforms a content-only DLP baseline (93.0% vs. 68.8% accuracy) while maintaining high precision, though it still leaves room for improvement on certain violation categories. These results demonstrate what becomes achievable once policy-relevant world state is made available to the enforcement layer.


Authors (2)

Summary

The paper introduces policy-invisible violations, showing that across five frontier models 90–98.3% of risky cases end in a policy violation when the policy-relevant organizational context is hidden from the agent.
It develops PhantomPolicy, a benchmark with 60 violation cases and 60 control cases across eight categories, to rigorously test the limits of context-aware policy compliance.
It proposes Sentinel, a world-state-grounded enforcement system that achieves 92.99% accuracy and 92.71 F1 by simulating action mutations against declarative invariants.

Policy-Invisible Violations in LLM-Based Agents: Failure Modes, Benchmarking, and Enforcement via World-State Grounding

Introduction and Motivation

This paper introduces and formalizes the concept of policy-invisible violations for LLM-based agents operating in organizational contexts. Such violations arise when agents perform actions that are syntactically and semantically appropriate but still violate organizational policy, because the policy-relevant facts are hidden from the agent’s execution context. This failure mode is distinct from adversarial attacks, standard authorization failures, and classical data leakage: even when the LLM faithfully follows correct user instructions and no party has malicious intent, policy violations emerge because crucial world state is not visible to the agent.

The challenge is increasingly consequential for enterprise deployments where LLM-based agents must interact with internal tools (e.g., file sharing, communications) in environments with complex, context-dependent policies. The attributes that decide compliance, such as document audience, user roles, session provenance, or temporal constraints, are rarely present in the prompt or in tool outputs, creating a persistent gap between what the agent can observe and what policy adherence requires.

PhantomPolicy Benchmark: Scope and Design

To empirically characterize this failure mode, the authors present PhantomPolicy, a diagnostic unit-test benchmark targeting policy-invisible violations. It is built around eight violation categories, with 60 violation cases and 60 matched safe-control cases (120 cases in total), so that failures can be attributed precisely to the category under test.

A foundational design principle is that all tool responses return only “clean” business data: no document or file is labeled as sensitive, no path encodes permission scopes, and no recipient metadata is exposed. This ensures that models and enforcement systems cannot depend on visible content artifacts or surface cues, but rather must rely on an explicit world model to detect violations. The benchmark world model includes 30 contacts, 40 documents with diverse sensitivity and audience metadata, and complex project and group structures that reflect real organizational ambiguities such as same-name entities and temporarily misplaced documents.
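The "clean tool response" constraint can be illustrated with a minimal sketch that contrasts what the agent sees with what only a world model knows. Everything here is hypothetical and not taken from PhantomPolicy: the document ID, the metadata schema, the marker list, and the function names are assumptions chosen for illustration.

```python
# Hypothetical illustration; no name or value here comes from the benchmark.

# What the agent sees: plain business data with no policy labels.
tool_response = {
    "file_id": "doc-017",
    "name": "Q3 planning notes.docx",
    "path": "/shared/planning/",
    "text": "Draft milestones and staffing plan for next quarter.",
}

# What only the enforcement layer knows (assumed metadata schema).
WORLD_MODEL = {
    "doc-017": {"sensitivity": "restricted", "audience": {"alice@corp.example"}},
}

# Surface cues a content-only scanner might look for (assumed list).
SENSITIVE_MARKERS = ("confidential", "restricted", "internal only", "secret")


def content_only_scan(response: dict) -> bool:
    """Return True if any visible field contains a sensitivity marker."""
    visible = " ".join(str(value) for value in response.values()).lower()
    return any(marker in visible for marker in SENSITIVE_MARKERS)


def world_state_check(file_id: str, recipient: str) -> bool:
    """Return True if sharing would violate the hidden audience restriction."""
    meta = WORLD_MODEL[file_id]
    return meta["sensitivity"] == "restricted" and recipient not in meta["audience"]
```

In this sketch the content scan finds nothing to flag, while the world-state lookup reports a violation for any recipient outside the document's audience set. The visible response is identical in both cases; only the hidden metadata differs.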

Eight violation categories formalized in the benchmark are:

Context boundary: Preventing internal context data from leaking to external recipients.
Text-output leakage: Detecting confidential information disclosed in the agent's free-text responses, where content-based detection is ineffective.
Oversharing: Preventing inadvertent inclusion of restricted documents in bulk or folder-level operations.
Audience restriction: Enforcement for documents with explicit restricted audience regardless of apparent surface scope.
Accumulated session leakage: Violation when multiple individually safe accesses combine to form unacceptable information flows.
Cross-context dataflow: Blocking actions that would let confidential data flow to external targets it is forbidden to reach.
High-value resource protection: Special handling for resources flagged as high-importance.
Temporal validity: Capturing violations due to out-of-date entities (e.g., sharing with expired contacts).
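As a rough illustration of how a violation case and its matched safe control might differ only in hidden state, the sketch below uses the temporal validity category. The `Case` structure, the contact addresses, the dates, and the request text are all hypothetical; the benchmark's actual case format is not described in this summary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Case:
    case_id: str
    user_request: str         # visible to the agent
    recipient: str            # visible to the agent
    expected_violation: bool  # ground-truth label


# Hidden world state (assumed schema): last day on which a contact is valid.
CONTACT_VALID_UNTIL = {
    "dana@vendor.example": date(2025, 12, 31),  # engagement has ended
    "lee@vendor.example": date(2027, 12, 31),   # still active
}

REQUEST = "Send the onboarding checklist to our vendor contact."

# The pair shares the same request; only the hidden validity window differs.
violation_case = Case("tv-01", REQUEST, "dana@vendor.example", True)
control_case = Case("tv-01-safe", REQUEST, "lee@vendor.example", False)


def violates_temporal_validity(case: Case, today: date) -> bool:
    """True if the recipient's validity window has already closed."""
    return today > CONTACT_VALID_UNTIL[case.recipient]
```

Nothing in the visible request distinguishes the two cases, which is why a matched control helps separate genuine policy reasoning from blanket refusal.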

The benchmark is not designed for statistical generalization but as a high-precision instrument for diagnosing the structural boundaries where policy-invisible failures occur.

Sentinel: World-State-Grounded Enforcement Architecture

Recognizing that content-only baselines are fundamentally limited in this setting, the paper introduces Sentinel, an enforcement framework grounded in a structured organizational knowledge graph. Sentinel intercepts every proposed agent action, simulates its effect as a graph mutation, and checks whether any of seven declarative invariants—formulated over entities, relationships, and accumulated context—are violated.

Key architectural components of Sentinel are:

Typed property graph: Entities (contacts, documents, projects, groups) are nodes with rich metadata (e.g., scope, status, audience).
Action-to-mutation translation: Each tool action is abstracted as a set of graph mutations, with taint-tracking for session provenance.
Counterfactual simulation: Actions are checked by applying mutations speculatively to a forked graph state, enabling sound, compositional verification.
Declarative invariants: Seven invariants—spanning the eight violation categories—are evaluated in a three-valued logic (Allow, Block, Clarify).
O(|M|) verification complexity: Per-action enforcement is independent of world graph size, enabling real-time deployment.
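The components listed above can be sketched as a small verifier. This is a minimal sketch under stated assumptions, not the authors' implementation: the class names, the single example invariant, the node schema, and the severity ordering used to combine verdicts (Block over Clarify over Allow) are all hypothetical. The fork is modeled as an overlay on the base graph so that per-action work scales with the number of mutations rather than with graph size.

```python
from dataclasses import dataclass
from enum import IntEnum


class Verdict(IntEnum):
    # Assumed severity ordering, used only to combine invariant results.
    ALLOW = 0
    CLARIFY = 1
    BLOCK = 2


@dataclass(frozen=True)
class Mutation:
    """A proposed edge: `src` would gain `relation` to `dst`."""
    src: str
    relation: str
    dst: str


class WorldGraph:
    def __init__(self, nodes, edges=()):
        self.nodes = dict(nodes)  # node id -> property dict
        self.edges = set(edges)   # (src, relation, dst) triples


class Fork:
    """Speculative view: base graph plus proposed edges. The base is never modified."""

    def __init__(self, base, mutations):
        self.nodes = base.nodes
        self._base_edges = base.edges
        self.added = {(m.src, m.relation, m.dst) for m in mutations}

    def has_edge(self, src, relation, dst):
        edge = (src, relation, dst)
        return edge in self.added or edge in self._base_edges


def no_restricted_doc_to_external(fork, mutations):
    """Example invariant: restricted documents must not be shared externally."""
    verdict = Verdict.ALLOW
    for m in mutations:  # touches only the proposed mutations
        if m.relation != "shared_with":
            continue
        doc, contact = fork.nodes.get(m.src), fork.nodes.get(m.dst)
        if doc is None or contact is None:
            verdict = max(verdict, Verdict.CLARIFY)  # unknown entity: ask, do not guess
        elif doc.get("scope") == "restricted" and contact.get("external", False):
            verdict = max(verdict, Verdict.BLOCK)
    return verdict


def verify(graph, mutations, invariants):
    """Evaluate every invariant on the forked state and return the strictest verdict."""
    fork = Fork(graph, mutations)
    return max((inv(fork, mutations) for inv in invariants), default=Verdict.ALLOW)


graph = WorldGraph({
    "doc:roadmap": {"type": "document", "scope": "restricted"},
    "contact:ana": {"type": "contact", "external": False},
    "contact:raj": {"type": "contact", "external": True},
})
INVARIANTS = [no_restricted_doc_to_external]
```

With this setup, sharing `doc:roadmap` with `contact:raj` yields `Verdict.BLOCK`, sharing with `contact:ana` yields `Verdict.ALLOW`, and a reference to an entity missing from the graph yields `Verdict.CLARIFY`. The base graph's edge set is unchanged after verification, which is the point of checking against a fork.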

Sentinel achieves high coverage for complex multi-step cases, such as accumulated session leakage, without requiring the LLM itself to reason about policy state. Instead, the agent’s proposed tool calls are intercepted and then allowed, blocked, or sent back for clarification by an external enforcement layer that has access to the world state.
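One way to picture session-level taint tracking for accumulated leakage is the sketch below. It is an assumption-laden illustration rather than Sentinel's mechanism: the label names, the forbidden-combination rule, and the `Session` class are hypothetical, and the labels are assumed to come from the world model, not from document content.

```python
from dataclasses import dataclass, field

# Hypothetical rule: each label is harmless alone, but the pair must not leave together.
FORBIDDEN_COMBINATIONS = [frozenset({"headcount_plan", "salary_bands"})]


@dataclass
class Session:
    taints: set = field(default_factory=set)

    def record_read(self, labels):
        """Accumulate the world-model labels of everything read so far."""
        self.taints |= set(labels)

    def outbound_allowed(self, external: bool) -> bool:
        """An external send is refused once a forbidden combination has accumulated."""
        if not external:
            return True
        return not any(combo <= self.taints for combo in FORBIDDEN_COMBINATIONS)


session = Session()
session.record_read({"headcount_plan"})
allowed_after_first_read = session.outbound_allowed(external=True)
session.record_read({"salary_bands"})
allowed_after_second_read = session.outbound_allowed(external=True)
```

Each read is acceptable in isolation; the external send becomes unacceptable only after the second read, which is why a per-action content check cannot see the problem.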

Experimental Results and Quantitative Characterization

Baseline evaluations were conducted on five state-of-the-art models (GPT-5.4, GPT-5 mini, GPT-5.4 nano, Claude Sonnet 4.6, Claude Opus 4.6), each tested across all 120 benchmark cases using standardized tool-access prompts. Tool calls corresponding to user requests are executed in the absence of any policy metadata.

Core findings:

Prevalence of policy-invisible violations: Across all models, 90–98.3% of risky cases resulted in actual policy violations under manual, trace-level adjudication. Even in safe-control cases, error rates remain nonzero.
Model self-avoidance is insufficient: Occasional refusal to act or hedging is neither consistent nor explainable in terms of underlying policy, and thus does not constitute reliable enforcement.
Policy-in-prompt mitigations are incomplete: Injecting high-level policy rules into system prompts reduces violation frequency, but the effect size is model-dependent and safe-case errors persist.
Content-only DLP approaches have high precision, low recall: These methods are limited to 40.13% recall, since most decisive attributes never appear in visible content.
Sentinel’s world-state-grounded verification achieves 92.99% accuracy and 92.71 F1 against human-reviewed trace labels, with only five false positives and 37 missed violations, mainly attributed to invariant edge cases and unmodeled world-state features.
Detailed category analysis shows near-universal vulnerability in classes such as accumulated session leakage, audience restriction, and oversharing; safe-case precision is highest in models with lower violation rates.
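The relationship between the reported error counts and the headline metrics can be checked with a short calculation. The false-positive and false-negative counts (5 and 37) come from the findings above; the true-positive and true-negative counts are not given in this summary, so the values used below are assumptions chosen only because they reproduce the reported percentages. They are illustrative, not figures from the paper.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# fp and fn are reported; tp and tn are ASSUMED for illustration only.
sentinel = metrics(tp=267, fp=5, fn=37, tn=290)
accuracy_pct = round(sentinel["accuracy"] * 100, 2)
f1_pct = round(sentinel["f1"] * 100, 2)
```

Because F1 ignores true negatives, the 37 missed violations weigh on it far more than the five false positives do, which matches the description of a high-precision system with room to improve on recall.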

Theoretical and Practical Implications

Theoretical Analysis

The efficacy of Sentinel is formally characterized:

Verification complexity is O(number of mutations), independent of base graph size.
Conditional soundness: Sentinel achieves perfect precision and recall provided that the world model and the invariant set fully cover the relevant facts and policies.
Composability: Invariants can be extended or customized compositionally without sacrificing soundness for existing policy constraints.

Practical Deployment and System Design

A central claim supported by empirical and analytic results is that enforcement quality is bottlenecked not by model reasoning nor by invariant sophistication, but by completeness and freshness of the world model. In effect, knowledge acquisition—not model capability or prompt engineering—is the key constraint on system safety in organizational settings.

The paper’s findings motivate explicit privilege separation: LLM-based agents should not be relied upon for policy compliance in isolation. Dedicated enforcement systems with privileged access to organizational state, acting as a mediation layer on tool/action invocation, are necessary. Sentinel’s architecture aligns with this by functionally decoupling execution logic from compliance, analogous to established security and operating systems practice.

Notably, scaling Sentinel beyond a single domain or case suite requires practical ETL-style engineering to synchronize world graphs with dynamic organizational facts (directory services, document management APIs, etc.), and the fidelity of enforcement depends directly on the coverage and recency of that data.
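A simple way to make recency explicit in an enforcement layer is to refuse to trust stale facts. The sketch below is a hypothetical design, not something the paper describes: the 24-hour budget, the function name, and the choice to fall back to a clarification request are all assumptions.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)  # assumed freshness budget


def freshness_verdict(synced_at, now, max_age=MAX_AGE):
    """Return 'allow' only if the fact was synced recently enough, else 'clarify'."""
    if synced_at is None:
        return "clarify"  # fact was never synced: coverage gap
    return "allow" if now - synced_at <= max_age else "clarify"


now = datetime(2026, 4, 14, 12, 0, tzinfo=timezone.utc)
fresh = freshness_verdict(now - timedelta(hours=2), now)
stale = freshness_verdict(now - timedelta(days=3), now)
missing = freshness_verdict(None, now)
```

Here a fact synced two hours ago passes, while a three-day-old or missing fact is routed to clarification instead of being silently trusted.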

Limitations and Directions for Future Work

Key limitations acknowledged by the authors include:

Domain and benchmark scope: PhantomPolicy is single-domain and crafted to isolate violation categories rather than to exhaustively represent real-world distributions.
Prompt dependency: All models use a task-oriented system prompt, which likely inflates violation rates relative to more conservative or safety-primed settings.
World model engineering: Building and maintaining a robust, up-to-date organizational knowledge graph remains an unsolved challenge.
Output monitoring: The current enforcement acts solely on tool calls and does not intercept freeform textual output. Extending coverage to model-generated responses is a future research direction.

This research stands apart from prior work on jailbreaking, prompt injection, data leakage prevention (DLP), and policy compliance in LLM agents. The unique focus here is on the non-adversarial, structural gap between agent-visible context and the distributed organizational policy state. Sentinel’s approach, distinct from content-based DLP or prompt-based defenses, directly tackles the epistemic limitations inherent to current agent architectures.

Recent agent security benchmarks largely target adversarial behavior, attack resilience, or privacy leakage in adversarial contexts; PhantomPolicy and Sentinel instead probe the reliability of nominal operations absent adversarial input but in the context of hidden organizational constraints.

Conclusion

The paper presents a rigorous formalization and empirical analysis of policy-invisible agent failures, demonstrating that state-of-the-art LLM-based agents frequently commit high-risk policy violations when deprived of underlying world-state context. The PhantomPolicy benchmark and Sentinel enforcement architecture provide a concrete path forward, turning policy compliance into a tractable graph-based model-checking problem contingent on world-model coverage.

Future research priorities are clear: building scalable, accurate, and comprehensive organizational knowledge graphs and generalizing enforcement architectures like Sentinel to diverse domains and action types; extending invariant sets to further edge-cases; and integrating enforcement directly into agent deployment pipelines for robust, auditable, and context-aware organizational AI.

Strong result: Sentinel achieves 92.99% accuracy and 92.71 F1 on trace-level, human-adjudicated policy-invisible violations, substantially outperforming content-inspection methods and policy-in-prompt approaches, while making explicit the coverage-conditioned limits of automated enforcement. These findings supply a blueprint for enforcing organizational policies in LLM-based agent systems as world-model architectures and organizational AI maturity advance.