Why Does Agentic Safety Fail to Generalize Across Tasks?

Published 7 May 2026 in cs.LG and stat.ML | (2605.06992v1)

Abstract: AI agents are increasingly deployed in multi-task settings, where the task to perform is specified at test time, and the agent must generalize to unseen tasks. A major concern in such settings is safety: often, an agent must not only execute unseen tasks, but do so while avoiding risks and handling ones that materialize. Empirical evidence suggests that even when the ability to execute generalizes to unseen tasks, the ability to do so safely frequently does not. This paper provides theory and experiments indicating that failures of agentic safety to generalize across tasks are not merely due to limitations of training methods, but reflect an inherent property of safety itself: the relationship between a task and its safe execution is more complex than the relationship between a task and its execution alone. Theoretically, we analyze linear-quadratic control with $H_{\infty}$-robustness, and prove that the mapping from task specification to an optimal controller has higher Lipschitz constant with safety requirements than without, yielding a Lipschitz bound of independent interest. Empirically, we demonstrate our conclusions in simulated quadcopter navigation with a neural network agent and in CRM with an LLM agent. Our findings suggest that current efforts to enhance agentic safety may be insufficient, and point to a need for fundamentally different approaches.


Authors (5)

Summary

The paper demonstrates that imposing safety constraints increases the Lipschitz constant of the task-to-controller mapping, making cross-task generalization substantially harder.
It employs both theoretical analysis in LQR settings and empirical validations in robotics and LLM-based CRM, highlighting the generalization gap.
The study calls for novel representational approaches to decouple safety from task-specific control to enhance robust, safe AI performance.

Failure of Agentic Safety Generalization: Theory and Empirics

Problem Formulation and Motivation

This work analyzes why agentic safety fails to generalize across tasks in multi-task AI agents. In this setting, an agent must execute task specifications provided after deployment, which often requires both task proficiency and safe behavior under environmental risk and uncertainty. Although empirical evidence frequently shows that execution generalizes more readily than safe execution, prior literature has primarily speculated about algorithmic or data-centric explanations. The paper argues, through both theory and experiments, that this generalization gap is not solely due to limitations of training protocols but emerges from fundamental properties of the safety constraints themselves.

The authors formalize "agentic safety" using two operational requirements: risk avoidance (preventing entry into pre-specified, dangerous states) and risk handling (robustness to environmental disturbances or adversarial events). They focus on imitation learning settings, where a student agent learns from a safe (or unsafe) teacher across a subset of possible tasks, measuring generalization error as divergence on unseen tasks of the same family.
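The measurement described above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function name `generalization_error`, the use of Euclidean distance as the divergence, and the toy teacher and student are all assumptions made for the example.

```python
import numpy as np

def generalization_error(student, teacher, tasks):
    """Mean Euclidean distance between student and teacher outputs over a set of tasks.

    `student` and `teacher` are callables mapping a task to an output vector.
    Euclidean distance is an assumed stand-in for the paper's divergence.
    """
    gaps = [np.linalg.norm(np.asarray(student(t)) - np.asarray(teacher(t))) for t in tasks]
    return float(np.mean(gaps))

# Hypothetical toy family: a task is a scalar, the teacher's output is a 2-vector.
teacher = lambda t: np.array([t, t ** 2])
# A student that agrees with the teacher on the seen tasks but extrapolates linearly.
student = lambda t: np.array([t, t])

seen_tasks = [0.0, 1.0]     # tasks used for imitation (student matches teacher here)
unseen_tasks = [2.0, 3.0]   # held-out tasks from the same family

train_error = generalization_error(student, teacher, seen_tasks)
heldout_error = generalization_error(student, teacher, unseen_tasks)
print(f"seen-task error: {train_error}, unseen-task error: {heldout_error}")
```

The quantity of interest is the held-out error: a student can match its teacher exactly on the seen tasks and still diverge on unseen ones.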

Theoretical Analysis: Lipschitz Hardness of Safety-Imposed Mappings

The theoretical contribution derives from the formal characterization of task-to-policy mappings in the context of linear-quadratic control, specifically comparing the mapping from quadratic cost matrices $Q$ to the optimal controller with and without strong safety constraints (implemented via $\mathcal{H}_\infty$-robustness).

The analysis proves that the task-to-controller map is "less smooth" (in the Lipschitz sense) when safety is imposed: the Lipschitz constant of the safe mapping $K_\text{safe}(\cdot)$ is larger than that of the unsafe mapping $K_\text{unsafe}(\cdot)$ by a task-independent multiplicative factor. Specifically, for state, input, and control-cost matrices $A$, $B$, and $R$, under stability and alignment conditions, the authors establish

$\operatorname{Lip}(K_\text{safe}) \geq \alpha \cdot \operatorname{Lip}(K_\text{unsafe}) \, ,$

where $\alpha$ can be arbitrarily large for certain system parameters. This result does not depend on the training regime: even in the infinite-sample regime, generalization is information-theoretically harder for safe agentic mappings. This connects to classic theory linking a higher Lipschitz constant of the target function to increased statistical learning complexity, lower sample efficiency, and larger worst-case generalization error.
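The Lipschitz comparison can be illustrated on a scalar toy problem. The sketch below is not the paper's construction. It assumes a continuous-time scalar system dx/dt = a x + b u + w, the textbook closed-form solutions of the LQR Riccati equation and of the $\mathcal{H}_\infty$ (game) Riccati equation with attenuation level `gamma`, and hypothetical values for `a`, `b`, `r`, `gamma`, and the grid of cost weights `q`. Each map's Lipschitz constant in `q` is estimated by the largest finite-difference slope on the grid.

```python
import numpy as np

def gain_unsafe(q, a=-0.1, b=1.0, r=1.0):
    """LQR gain for dx/dt = a x + b u with cost integral of (q x^2 + r u^2)."""
    s = b ** 2 / r
    # Stabilizing root of the scalar Riccati equation 2 a p - s p^2 + q = 0.
    p = (a + np.sqrt(a ** 2 + s * q)) / s
    return b * p / r

def gain_safe(q, gamma=1.2, a=-0.1, b=1.0, r=1.0):
    """H-infinity state-feedback gain; the disturbance w enters with unit gain (assumed)."""
    s = b ** 2 / r - 1.0 / gamma ** 2
    if s <= 0:
        raise ValueError("this sketch only handles gamma large enough that b^2/r > 1/gamma^2")
    # Same closed form as LQR, with the quadratic coefficient reduced by 1/gamma^2.
    p = (a + np.sqrt(a ** 2 + s * q)) / s
    return b * p / r

def empirical_lipschitz(f, qs):
    """Largest finite-difference slope of f over consecutive grid points."""
    vals = f(qs)
    return float(np.max(np.abs(np.diff(vals)) / np.diff(qs)))

qs = np.linspace(0.1, 10.0, 500)  # hypothetical range of task cost weights
lip_unsafe = empirical_lipschitz(gain_unsafe, qs)
lip_safe = empirical_lipschitz(gain_safe, qs)
ratio = lip_safe / lip_unsafe
print(f"Lip(unsafe) ~ {lip_unsafe:.3f}, Lip(safe) ~ {lip_safe:.3f}, ratio ~ {ratio:.2f}")
```

In this toy setting the robustness term shrinks the quadratic coefficient of the Riccati equation from b²/r to b²/r − 1/γ², which makes the gain more sensitive to `q`. The size of the gap depends on the assumed parameters.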

A further technical contribution is the derivation of nontrivial upper bounds on the Lipschitz constant for the LQR map $K_\text{unsafe}(\cdot)$ as a function of $Q$, filling a gap in control theory.

Empirical Validation Across Domains and Modalities

The empirical section substantiates these theoretical guarantees in three domains of increasing complexity: classical optimal control, simulated robotics, and LLM-based decision-making.

Linear-Quadratic Control (LDS with $\mathcal{H}_\infty$-robustness): Neural network students are trained to imitate optimal policies, either unconstrained (unsafe) LQR or $\mathcal{H}_\infty$-robust (safe), across a subset of tasks differentiated by their cost matrices. On training tasks, both mappings are learned nearly perfectly. However, on unseen tasks, the cross-task generalization error explodes for the safe mapping, proportionally to the increased Lipschitz constant, while remaining moderate for the unsafe mapping.

Simulated Quadcopter Navigation: An NN-based policy controls a quadcopter in a risk-rich physical environment, where safety is either enforced (no-go region avoidance) or ignored. Again, student policies fitted to a safe expert generalize poorly to unseen target locations compared to those trained on an unsafe expert, aligning with the theory.

LLM-Based CRM Automation: An LLM agent is fine-tuned to perform complex, multi-step CRM operations in a web environment, either imitating a safe teacher (required to avoid data leaks, validate input, and provide critical data notifications) or an unsafe teacher (objective-only). Generalization error, measured as action prediction error on held-out CRM task templates, is much higher in the safety-constrained case, even though in-task (seen-template) performance is comparable.

In all cases, the magnitude of the generalization gap tracks the increase in the Lipschitz constant, substantiating the central claim with strong numerical evidence.
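The shape of the linear-quadratic experiment can be mimicked with a deliberately simple stand-in. The sketch below is a toy under stated assumptions, not a reproduction of the paper's setup: it uses a scalar system with textbook closed-form gains, a 1-nearest-neighbor student in place of a neural network, and hypothetical parameters (`a`, `b`, `r`, `gamma`, the task range, and the sample sizes).

```python
import numpy as np

def teacher_gain(q, gamma=None, a=-0.1, b=1.0, r=1.0):
    """Closed-form scalar gain: LQR when gamma is None, H-infinity otherwise (assumed model)."""
    s = b ** 2 / r
    if gamma is not None:
        s -= 1.0 / gamma ** 2  # robustness term; must leave s > 0 in this sketch
    p = (a + np.sqrt(a ** 2 + s * q)) / s
    return b * p / r

def nearest_neighbor_student(train_q, train_k):
    """Student that memorizes training tasks and copies the gain of the closest one."""
    def predict(q):
        q = np.atleast_1d(q)
        idx = np.abs(train_q[None, :] - q[:, None]).argmin(axis=1)
        return train_k[idx]
    return predict

rng = np.random.default_rng(0)
train_q = rng.uniform(0.1, 10.0, size=20)   # seen tasks (cost weights)
test_q = rng.uniform(0.1, 10.0, size=200)   # unseen tasks from the same family

errors = {}
for name, gamma in [("unsafe", None), ("safe", 1.2)]:
    student = nearest_neighbor_student(train_q, teacher_gain(train_q, gamma))
    train_err = np.mean(np.abs(student(train_q) - teacher_gain(train_q, gamma)))
    test_err = np.mean(np.abs(student(test_q) - teacher_gain(test_q, gamma)))
    errors[name] = (float(train_err), float(test_err))
    print(f"{name}: seen-task error {train_err:.4f}, unseen-task error {test_err:.4f}")
```

Both students fit their seen tasks exactly, so any difference shows up only on the held-out tasks. A nearest-neighbor student's error is bounded by the teacher's Lipschitz constant times the distance to the closest seen task, which is why it is a convenient stand-in here.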

Implications and Future Directions

The formal and empirical results of the paper imply that current paradigms for achieving agentic safety through increased demonstration data or improved learning algorithms are fundamentally insufficient to resolve the task-level generalization gap. The difficulty arises from the "entanglement" between safety-relevant features and the structure of the task mapping: safe execution, especially in environments where risk manifests dynamically or adversarially, induces higher functional complexity and sensitivity.

The paper conjectures that progress will require new representational approaches to "factor" or decorrelate safety-relevant aspects from task-specific control, possibly via task- and safety-adaptive embedding architectures or tailor-made regularization. Developing such representations may be essential for closing the gap in safety generalization, especially as LLM-driven agents proliferate to real-world domains where specifications and hazards change post-deployment.

From a regulatory and commercial deployment perspective, these findings underscore that cross-task validation of execution proficiency is not sufficient: safety guarantees require more conservative and targeted approaches. They further indicate that safe policy generalization is an open, structurally challenging problem distinct from generalization of nominal task performance.

Conclusion

The paper delivers a rigorous, systematic exposition showing that inherent mathematical properties of agentic safety requirements render safety generalization across tasks fundamentally more challenging than generalization of task execution alone. Theoretical analysis identifies a higher Lipschitz constant as a primary obstacle, and diverse experiments validate that this effect holds in both classical control and modern neural architectures, including LLM-based agents. Therefore, agentic safety should be treated as an independent axis of generalization, and robust, safe AI design must move beyond scaling imitation on fixed task distributions. Future research must focus on representational innovations to enable safety transfer, as well as robust evaluation protocols that go beyond in-distribution and in-task assessment.
