- The paper introduces the MOSAIC framework that enforces explicit agentic safety decisions (plan, check, act/refuse) to mitigate irreversible harm in multi-step tool use.
- It employs preference-based reinforcement learning to identify safer trajectories, improving refusal rates while maintaining high benign-task completion.
- Empirical results show significant gains in safety benchmarks and token efficiency across both frontier and open-source models.
Explicit Safety Alignment for Agentic LLMs: The MOSAIC Framework
Motivation and Problem Statement
The proliferation of agentic LLMs with tool-use capabilities introduces fundamental safety concerns beyond static text generation. These models plan, call tools, and execute actions over long horizons, where a single misstep, such as unsafe file access or credential entry, can cause irreversible harm. Prior alignment protocols, optimized for conversational models, are inadequate in these sequential, adversarial, tool-mediated environments: they suffer from overconfident reasoning, implicit safety decisions, and vulnerability to prompt injection. The challenge is particularly acute for small language models (SLMs), whose constrained world modeling and context budgets increase susceptibility to anomalous tool feedback and cascading failures.
The MOSAIC Architecture
MOSAIC addresses these gaps via a modular, post-training framework that structurally enforces explicit agentic safety decisions. The core loop is plan/think, check, then act or refuse, with explicit <safety_thoughts> and refusal as first-class, learnable actions. The framework operationalizes safety as a discrete inference stage, allowing selective computation on high-risk steps and concise, auditable trajectories for benign tasks.
Key elements include:
- Structured Reasoning Loop: Agents produce a plan, optionally invoke a safety check via <safety_thoughts>, and select actions (tool call, refuse, answer), with refusal explicitly part of the action space.
- Safety Reasoning Blocks: Modular reasoning targets risk screening, tool hazards, and irreversibility, surfaced before critical actions.
- Dynamic Safety Gating: Invocation of safety checks is learned end-to-end, not fixed or heuristic, enabling token-efficient reasoning.
- Refusal Tool Semantics: Agents can halt execution with justification via a refusal tool, preventing compounding unsafe actions.
This design decouples safety from generic planning, supporting explicit and auditable abstention and verification steps.
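The loop can be sketched in Python as follows; the function and callback names here are hypothetical stand-ins for the learned components, not the paper's API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    kind: str     # "tool_call" | "answer" | "refuse"
    payload: str  # tool arguments, answer text, or refusal justification

def mosaic_step(
    observation: str,
    plan: Callable[[str], str],
    should_check: Callable[[str, str], bool],
    safety_check: Callable[[str, str], Optional[str]],
) -> Action:
    """One loop iteration: plan, optionally run an explicit safety check
    (the learned gate decides whether to spend the tokens), then act or
    refuse. Refusal is a first-class action that halts execution with a
    justification instead of compounding an unsafe trajectory."""
    thought = plan(observation)
    if should_check(observation, thought):            # dynamic safety gating
        verdict = safety_check(observation, thought)  # <safety_thoughts> stage
        if verdict is not None:
            return Action(kind="refuse", payload=verdict)
    return Action(kind="tool_call", payload=thought)

# Toy demo with heuristic stand-ins for the learned components.
risky = mosaic_step(
    "delete all user backups",
    plan=lambda obs: f"call rm on targets in: {obs}",
    should_check=lambda obs, t: "rm" in t,
    safety_check=lambda obs, t: "irreversible deletion" if "rm" in t else None,
)
benign = mosaic_step(
    "list files in the sandbox",
    plan=lambda obs: f"call ls for: {obs}",
    should_check=lambda obs, t: "rm" in t,
    safety_check=lambda obs, t: None,
)
print(risky.kind, benign.kind)  # refuse tool_call
```

The gate (`should_check`) is where the token savings come from: benign steps skip the safety block entirely, while high-risk steps pay for an explicit, auditable check.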
Preference-Based Reinforcement Fine-Tuning
MOSAIC trains policies using preference-based RL, leveraging trajectory-level pairwise comparisons by an LLM judge instead of outcome-based scalar rewards. The LLM judge selects safer trajectories, capturing temporal distinctions such as early refusal versus late abort, which scalar reward models systematically fail to encode. This learning signal is crucial in agentic settings with identical end states but divergent intermediate safety profiles.
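A minimal sketch of how round-robin pairwise judgments can be turned into per-trajectory scores; the judge heuristic below (prefer the earlier refusal) is a hypothetical stand-in for the LLM judge:

```python
from itertools import combinations

def judge_prefers(traj_a, traj_b) -> bool:
    """Stand-in for an LLM judge that picks the safer trajectory.
    Hypothetical heuristic: prefer the trajectory that refuses earlier,
    i.e. executes fewer unsafe steps before halting."""
    def refusal_step(traj):
        for i, step in enumerate(traj):
            if step == "refuse":
                return i
        return len(traj)  # never refused
    return refusal_step(traj_a) < refusal_step(traj_b)

def pairwise_scores(trajectories):
    """Convert pairwise judgments into per-trajectory scores (win counts
    over all opponents). This preserves temporal distinctions, such as
    early refusal vs. late abort, that a scalar reward on the identical
    end state would collapse."""
    wins = [0] * len(trajectories)
    for i, j in combinations(range(len(trajectories)), 2):
        if judge_prefers(trajectories[i], trajectories[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return wins

early = ["plan", "refuse"]
late  = ["plan", "tool_call", "tool_call", "refuse"]
never = ["plan", "tool_call", "tool_call", "tool_call"]
print(pairwise_scores([early, late, never]))  # [2, 1, 0]
```

Note that an outcome-based scalar reward keyed only on the final state would score `early` and `late` identically (both end in refusal); the pairwise comparison separates them.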
Optimization employs Group Relative Policy Optimization (GRPO), maximizing composite rewards including safety alignment, output structure, and token efficiency. Masked gradient updates ensure only model-generated text (not tool output) is trained, focusing learning on decision-making, safety, and action selection.
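The two training mechanics named above can be sketched as follows, assuming NumPy; the composite reward (safety alignment, output structure, token efficiency) is abstracted into a single scalar per trajectory, and the function names are illustrative:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each trajectory's reward is normalized
    against its sampling group (mean/std), with no learned critic."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def masked_policy_loss(logprobs, mask, advantage):
    """Token-level surrogate loss with a generation mask: tokens that came
    from tool outputs (mask=0) are excluded, so gradients flow only through
    model-generated text (plans, safety checks, action selection)."""
    logprobs = np.asarray(logprobs, dtype=float)
    mask = np.asarray(mask, dtype=float)
    return -(advantage * logprobs * mask).sum() / max(mask.sum(), 1.0)

# Composite rewards for a group of 4 sampled trajectories.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])   # -> [1, -1, 1, -1]
# Three tokens, the middle one echoed from a tool response (masked out).
loss = masked_policy_loss([-1.0, -2.0, -3.0], mask=[1, 0, 1], advantage=1.0)
```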
Empirical Findings
Safety and Utility
- Frontier Models: Without explicit safety scaffolding, models such as GPT-4o and GPT-5 fail outright on agent safety metrics, never refusing harmful requests and remaining highly vulnerable to injection attacks. With MOSAIC applied, harmful-task refusal rates rise above 90%, harm scores drop by more than 75%, and benign-task completion remains high (completion rate, CR > 0.93).
- Open-Source SLMs: MOSAIC induces model-adaptive gains:
  - Qwen2.5-7B-Instruct: Harmful-task score drops by half, refusal rises from 0.74 to 0.87, and injection robustness increases, with minimal benign-task utility degradation.
  - Qwen3-4B-Thinking: Benign-task completion rate nearly doubles (0.44 → 0.85), reducing reasoning loops and improving execution reliability; injection robustness also increases.
  - Phi-4: Over-refusal on benign tasks drops by 56% and completion rate improves to 0.91, with a slight increase in injection vulnerability, indicating a calibrated trade-off.
Robustness and Generalization
- MOSAIC-trained open models outperform unscaffolded frontier models on safety benchmarks, closing the gap once explicit safety reasoning is embedded.
- Improvements generalize to cross-domain privacy tasks (PrivacyLens), with up to a 23% reduction in leakage rate while maintaining helpfulness, demonstrating that agentic safety training transfers beyond harm prevention.
Token Efficiency
- Safety reasoning is dynamically invoked based on context risk, maintaining safety tokens below 20% of total token usage even on harmful tasks.
- Explicit safety checks curb excessive internal reasoning in verbose models, yielding up to a 4x reduction in tokens per turn (Qwen3-4B) with no loss in safety or utility.
Ablation Studies
- Explicit Safety Checks: Removing structured safety checks and relying only on generic think blocks causes significant safety regressions, with refusal rates dropping and harm scores increasing. Refusal-only strategies are brittle under prompt injection and adversarial tool outputs.
- Pairwise vs. Pointwise Rewards: Training with scalar outcome rewards fails to distinguish early refusal from late aborts and robust from brittle safety behavior, resulting in higher harm scores and attack success rates. Pairwise preference rewards preserve these temporal distinctions and deliver superior alignment.
Implications and Future Directions
MOSAIC demonstrates that agentic safety is not an emergent property of model scale or naïve refusal mechanisms but requires structured inference, modular reasoning, and trajectory-level preference learning. The framework is extensible to varied domains and agentic architectures, enabling principled, model-adaptive safety alignment in tool-using environments. The explicit separation of planning, safety checking, and action selection sets the stage for more nuanced abstention strategies and verifiable agent operation in real-world settings. Preference-based RL via LLM judges is shown to be essential for stable, data-efficient learning of temporal safety decisions, with implications for future reward modeling and agent auditing.
Conclusion
MOSAIC is a modular safety alignment framework for agentic LLMs operating in multi-step tool-use environments. It enforces explicit safety reasoning and refusal as core decisions, trained via RL on pairwise trajectory preferences. Empirical results highlight robust, generalized safety gains, token-efficient reasoning, and preservation of utility across open and frontier models. The research advances the field by demonstrating that structured alignment protocols and preference-based RL are necessary for reliable agentic behavior, rather than scale or post-hoc moderation. Future directions include integration with more complex tool catalogs, refinement of preference models, and further theoretical analysis of safety-utility trade-offs in sequential agent environments.
Reference: "Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use" (2603.03205)