
SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law

Published 24 Jul 2025 in cs.AI, cs.CL, and cs.CV | (2507.18576v1)

Abstract: We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety `aha' moments. Notably, SafeWork-R1 achieves an average improvement of $46.54\%$ over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.

Summary

  • The paper introduces a multimodal reasoning model that coevolves safety and intelligence using a staged, verifier-guided RL pipeline.
  • It achieves significant safety gains, including a 95.42% harmless response rate under adversarial attacks and superior performance on value alignment benchmarks.
  • The methodology leverages both automated and human-in-the-loop interventions to maintain high general capability while ensuring robust safety compliance.

Introduction and Motivation

SafeWork-R1 is a multimodal reasoning model developed to address the persistent trade-off between safety and intelligence in LLMs. The work is motivated by the AI-45$^{\circ}$ Law, which posits that safety and capability should co-evolve in a balanced manner, rather than being in opposition. Existing LLMs, despite their advanced reasoning and decision-making abilities, often exhibit critical safety vulnerabilities, including susceptibility to adversarial prompts, value misalignment, and over-refusal. SafeWork-R1, built upon the SafeLadder framework, aims to internalize safety as a native capability through progressive, safety-oriented reinforcement learning (RL) post-training, guided by a suite of neural and rule-based verifiers.

SafeLadder Framework: Technical Roadmap

The SafeLadder framework is a staged RL pipeline designed to optimize safety, general capability, efficiency, and knowledge calibration in (multimodal) LLMs. The pipeline consists of four key stages:

  1. CoT-SFT (Chain-of-Thought Supervised Fine-Tuning): Instills structured, human-like reasoning using high-quality, long-chain reasoning data, with rigorous validation and cognitive diversity analysis.
  2. M$^3$-RL (Multimodal, Multitask, Multiobjective RL): Employs a two-stage curriculum and a custom CPGD algorithm to jointly optimize safety, value, knowledge, and general capabilities using multiobjective reward functions.
  3. Safe-and-Efficient RL: Introduces CALE (Conditional Advantage for Length-based Estimation) to promote efficient, concise reasoning, empirically shown to improve both safety and value alignment.
  4. Deliberative Search RL: Formalizes an iterative, action-based process (THINK, SEARCH, READ) for integrating external knowledge, with dynamic reward weighting via Lagrangian optimization to balance accuracy and reliability.

    Figure 1: The roadmap of SafeLadder.
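The iterative action process of Deliberative Search RL can be sketched as a simple control loop. The action names (THINK, SEARCH, READ) follow the paper; the scripted policy, toy retriever, and abstention behavior below are assumptions made for the sketch, not the paper's implementation.

```python
# Illustrative control loop for Deliberative Search RL.
def deliberative_search(question, policy, retriever, max_steps=8):
    """Iterate THINK/SEARCH/READ actions until the policy answers or the
    step budget is exhausted (in which case the model abstains)."""
    context = [("QUESTION", question)]
    for _ in range(max_steps):
        action, payload = policy(context)
        if action == "THINK":
            context.append(("THINK", payload))       # internal reasoning step
        elif action == "SEARCH":
            docs = retriever(payload)                # external knowledge lookup
            context.append(("SEARCH", payload))
            context.append(("READ", docs))           # read retrieved evidence
        elif action == "ANSWER":
            return payload, context
    return None, context                             # abstain rather than guess

# A scripted policy standing in for the trained model: search first, then answer.
def scripted_policy(context):
    if not any(kind == "READ" for kind, _ in context):
        return ("SEARCH", "capital of France")
    return ("ANSWER", "Paris")

answer, trace = deliberative_search(
    "What is the capital of France?",
    scripted_policy,
    lambda query: ["Paris is the capital of France."],
)
```

In the trained model, the policy is the LLM itself and the Lagrangian reward weighting (not shown) trades off answer accuracy against reliability during RL.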

This staged optimization is supported by a scalable RL infrastructure (SafeWork-T1), enabling verifier-agnostic, high-throughput training across thousands of GPUs.

Verifier Suite: Safety, Value, and Knowledge

Safety Verifier

A bilingual, multimodal verifier trained on 45k high-quality samples, covering 10 major and 400 subcategories of safety risks. It achieves leading accuracy and F1 scores on public and proprietary safety benchmarks, outperforming both open-source and proprietary baselines.

Value Verifier

An interpretable, multimodal reward model trained on 80k samples spanning 70+ value-related scenarios. It supports both CoT-style interpretability and continuous scoring, achieving SOTA performance (88.2% average) across public and internal value alignment benchmarks.

Knowledge Verifier

Designed to penalize speculative, low-confidence correct answers, the knowledge verifier is trained on 120k multi-domain questions with explicit confidence annotation. It outperforms proprietary models on point-wise knowledge reward benchmarks, especially at scale.

Figure 2: The development workflow of the knowledge verifier, penalizing low-confidence correct answers to discourage speculative reasoning.
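A minimal point-wise reward consistent with this description might look like the sketch below. The thresholds and reward magnitudes are illustrative assumptions; the paper's verifier is a learned model, not a fixed rule.

```python
def knowledge_reward(correct: bool, confidence: float, tau: float = 0.5) -> float:
    """Toy point-wise knowledge reward that discourages speculative answers.

    A correct answer only earns full reward when the model is confident;
    a correct but low-confidence ("lucky guess") answer is penalized, as is
    a confidently wrong one. Threshold tau and magnitudes are assumptions.
    """
    if correct and confidence >= tau:
        return 1.0      # confident and right: full reward
    if correct:
        return -0.5     # right but speculative: penalized
    if confidence >= tau:
        return -1.0     # confidently wrong: worst case
    return 0.0          # uncertain and wrong: neutral
```

The key asymmetry is that correctness alone is not rewarded: the model must also be calibrated about when it knows the answer.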

M$^3$-RL: Multimodal, Multitask, Multiobjective RL

M$^3$-RL is a two-stage RL framework:

  • Stage 1: Prioritizes general capability.
  • Stage 2: Jointly optimizes safety, value, and general capability.

The CPGD algorithm ensures stable policy updates, and the multiobjective reward function balances visual grounding, helpfulness, format, and task-aware objectives. Multimodal jailbreak data augmentation is used to improve robustness against adversarial attacks.

Figure 3: Overview of the M$^3$-RL training framework, showing sequential optimization of general and safety-related capabilities.

Figure 4: M$^3$-RL data augmentation pipeline for robust multimodal jailbreak resistance.
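One simple way to combine per-objective verifier scores into a single scalar reward is a task-aware weighted average, sketched below. The objective names and weights are illustrative assumptions; the paper's actual reward shaping is not specified at this level of detail.

```python
def multiobjective_reward(scores: dict, weights: dict) -> float:
    """Combine per-objective verifier scores into one scalar reward.

    `scores` maps objective name -> verifier score in [0, 1];
    `weights` is task-aware (e.g. safety-flagged prompts upweight the
    safety term). Both the objective set and weights are assumptions.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total

# Example: a safety-sensitive prompt doubles the safety weight.
r = multiobjective_reward(
    {"safety": 0.9, "helpfulness": 0.7, "format": 1.0, "grounding": 0.8},
    {"safety": 2.0, "helpfulness": 1.0, "format": 0.5, "grounding": 1.0},
)  # weighted mean, ~0.84 with these numbers
```

Normalizing by the weight sum keeps the reward on the verifiers' [0, 1] scale regardless of how the weights shift between tasks.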

Inference-Time Interventions

Automated Intervention: Principled Value Model (PVM) Guidance

At inference, a Gating module generates a Routing Vector to dynamically weight safety, value, and knowledge dimensions. Candidate continuations are scored by specialized PVMs, and the highest-scoring candidate is selected at each step, enabling fine-grained, context-sensitive alignment.

Figure 5: PVM guidance mechanism for inference-time alignment, dynamically prioritizing safety, value, or knowledge as needed.

This method yields a substantial increase in safety scores (from 77.1 to 93.8) without significant loss in value or knowledge performance.
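The step-level selection can be sketched as a routing-weighted argmax over candidate continuations. The scoring functions and weights below are toy stand-ins for the paper's learned PVMs and Gating module.

```python
def pvm_guided_step(candidates, pvms, routing_vector):
    """Select the candidate continuation with the highest routing-weighted
    PVM score. `pvms` maps a dimension name to a scoring function;
    `routing_vector` holds the gating weights for the current context."""
    def weighted_score(cand):
        return sum(w * pvms[dim](cand) for dim, w in routing_vector.items())
    return max(candidates, key=weighted_score)

# Toy example: a safety-sensitive context upweights the safety PVM.
candidates = [
    "I can't help with that, but here are safer alternatives...",
    "Sure, here is how to bypass the content filter...",
]
pvms = {
    "safety": lambda c: 0.0 if "bypass" in c else 1.0,  # toy safety scorer
    "value": lambda c: 0.8,                             # constant placeholder
    "knowledge": lambda c: 0.5,                         # constant placeholder
}
routing_vector = {"safety": 0.6, "value": 0.2, "knowledge": 0.2}
chosen = pvm_guided_step(candidates, pvms, routing_vector)  # the safe reply wins
```

Because the routing vector is regenerated per context, the same mechanism can instead prioritize knowledge on factual queries without retraining the base model.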

Human-in-the-Loop Intervention

A text-editing interface allows users to directly correct reasoning chains (CoT), with edit-distance tracking and iterative simplification. This approach outperforms dialogue-based correction, especially on complex, multi-step tasks, and generalizes well to related queries.

Figure 6: Framework of human intervention on CoT, enabling efficient, fine-grained correction and adaptation.
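Edit-distance tracking quantifies how much of a reasoning chain a human correction changed. Levenshtein distance is one standard choice (shown character-level below); the paper does not specify the exact metric, so treat this as an illustrative assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings via the classic
    two-row dynamic program (O(len(a) * len(b)) time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion from a
                curr[j - 1] + 1,             # insertion into a
                prev[j - 1] + (ca != cb),    # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]
```

For CoT correction, the same dynamic program works over token sequences instead of characters, which makes the tracked distances less sensitive to superficial rewording.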

Empirical Results

Safety and Value Alignment

SafeWork-R1 achieves an average safety rate of 89.2% across four multimodal safety benchmarks, outperforming GPT-4.1 and Claude Opus 4. On value alignment (FLAMES, M$^3$oralBench), it demonstrates a 26.2% improvement over its base model and competitive performance with leading proprietary models.

Figure 7: Performance comparison on safety and general benchmarks.

General Reasoning and Multimodal Capability

SafeWork-R1 improves general reasoning performance by 13.45% over its base model across seven benchmarks, with strong results on MMMU, MathVista, and GAOKAO-MM. Notably, safety improvements do not compromise general capability.

Safety Aha Moments and Representation Analysis

Information-theoretic analysis reveals the emergence of "safety MI peaks": tokens where mutual information between internal representations and safe reference answers surges, corresponding to safety-relevant words. Efficient reasoning training amplifies these safety signals and reduces ambiguous transitions.

Figure 8: (a) Illustration of safety mutual information peaks. (b) Distribution of tokens at MI peaks for SafeWork-R1-Qwen2.5VL-7B.

Figure 9: Frequency of tokens at safety MI peaks for Qwen2.5-VL-7B under different training regimes.
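Once a per-token MI estimate is available, locating peaks reduces to thresholded local-maximum detection, sketched below. The MI series is assumed precomputed (in the paper it comes from an estimator over the model's hidden states), and the mean-plus-one-standard-deviation threshold is an assumption for illustration.

```python
def find_mi_peaks(mi, threshold=None):
    """Return token positions where a per-token mutual-information series
    is a local maximum above a threshold ("safety MI peaks").
    If no threshold is given, use mean + 1 std of the series (assumed)."""
    if threshold is None:
        mean = sum(mi) / len(mi)
        var = sum((x - mean) ** 2 for x in mi) / len(mi)
        threshold = mean + var ** 0.5
    peaks = []
    for i in range(1, len(mi) - 1):   # interior positions only
        if mi[i] > threshold and mi[i] >= mi[i - 1] and mi[i] >= mi[i + 1]:
            peaks.append(i)
    return peaks

# Toy series: positions 2 and 5 stand out as peaks.
peaks = find_mi_peaks([0.1, 0.2, 0.9, 0.2, 0.1, 0.8, 0.1])
```

Mapping the returned positions back to tokens is what produces the "safety-relevant word" distributions shown in Figures 8 and 9.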

Red Teaming and Jailbreak Resistance

SafeWork-R1 achieves a Harmless Response Rate (HRR) of 95.42% (single-turn) and 90.24% (multi-turn) under advanced jailbreak attacks, surpassing GPT-4o and Gemini-2.5, and matching Claude in resilience.

Search with Calibration

SafeWork-R1 demonstrates superior reliability in knowledge-intensive search tasks, with a markedly lower False-Certain (FC%) rate compared to proprietary models, highlighting its ability to avoid overconfident hallucinations.

Human Evaluation

Human studies confirm that SafeWork-R1 is perceived as more trustworthy, rational, and less prone to negative or deceptive communication strategies compared to other leading models.

Figure 10: Distribution of all models in different dimensions of the human evaluation framework.

RL Infrastructure: SafeWork-T1

SafeWork-T1 is a layered RLVR platform supporting efficient, modular, and scalable RL training with verifier colocation and dynamic data balancing. It achieves >30% higher throughput than existing RLHF frameworks and enables rapid prototyping of new verifiers and reward models.

Figure 11: System layer overview of SafeWork-T1, supporting scalable, verifier-agnostic RL training.

Figure 12: RLVR training pipeline of SafeWork-T1, featuring universal colocation and dynamic data balance.

Implications and Future Directions

SafeWork-R1 empirically demonstrates that safety and general capability can co-evolve synergistically, challenging the prevailing assumption of an inherent trade-off. The staged, verifier-guided RL paradigm is generalizable across model architectures and scales, as validated on Qwen2.5-VL-7B, InternVL3-78B, and DeepSeek-70B. The integration of efficient reasoning, inference-time alignment, and human-in-the-loop correction provides a robust foundation for trustworthy, real-world AI deployment.

Key future directions include:

  • Further exploration of efficient, trustworthy reasoning methodologies.
  • Development of error vector databases and test-time adaptation for user alignment.
  • Linguistic calibration mechanisms to optimize user-centered interaction.
  • Extension of SafeLadder to increasingly powerful foundation models in pursuit of safe AGI.

Conclusion

SafeWork-R1, via the SafeLadder framework, establishes a scalable, empirically validated methodology for the coevolution of safety and intelligence in LLMs. The integration of multi-principled verifiers, staged RL, and advanced inference-time interventions results in a model that achieves state-of-the-art safety without sacrificing general capability. The work provides both a practical blueprint and theoretical insights for the development of robust, reliable, and trustworthy general-purpose AI systems.
