Wukong Framework for Not Safe For Work Detection in Text-to-Image systems

Published 1 Aug 2025 in cs.CV, cs.AI, and cs.CR | (2508.00591v1)

Abstract: Text-to-Image (T2I) generation is a popular AI-generated content (AIGC) technology enabling diverse and creative image synthesis. However, some outputs may contain Not Safe For Work (NSFW) content (e.g., violence), violating community guidelines. Detecting NSFW content efficiently and accurately, known as external safeguarding, is essential. Existing external safeguards fall into two types: text filters, which analyze user prompts but overlook T2I model-specific variations and are prone to adversarial attacks; and image filters, which analyze final generated images but are computationally costly and introduce latency. Diffusion models, the foundation of modern T2I systems like Stable Diffusion, generate images through iterative denoising using a U-Net architecture with ResNet and Transformer blocks. We observe that: (1) early denoising steps define the semantic layout of the image, and (2) cross-attention layers in U-Net are crucial for aligning text and image regions. Based on these insights, we propose Wukong, a transformer-based NSFW detection framework that leverages intermediate outputs from early denoising steps and reuses U-Net's pre-trained cross-attention parameters. Wukong operates within the diffusion process, enabling early detection without waiting for full image generation. We also introduce a new dataset containing prompts, seeds, and image-specific NSFW labels, and evaluate Wukong on this and two public benchmarks. Results show that Wukong significantly outperforms text-based safeguards and achieves accuracy comparable to that of image filters, while offering much greater efficiency.


Summary

The paper introduces the Wukong framework that detects NSFW content early in the denoising process of T2I systems.
It employs a U-Net encoder and a transformer decoder with category-specific queries, improving accuracy by more than 20% on average over text-based filters.
The framework reduces computational demands by halting generation early and demonstrates robust performance against adversarial prompts.

Wukong Framework for NSFW Detection in Text-to-Image Systems

The paper "Wukong Framework for Not Safe For Work Detection in Text-to-Image systems" (2508.00591) addresses the detection of Not Safe For Work (NSFW) content in text-to-image (T2I) generation systems such as Stable Diffusion. The framework aims to improve both the efficiency and the accuracy of NSFW detection by using intermediate outputs that the diffusion process already produces.

Introduction

Contemporary T2I systems are built on diffusion models, which turn latent noise into an image through iterative denoising guided by a text prompt. The paper makes two observations: early denoising steps define the semantic layout of the image, and the cross-attention layers in the U-Net are crucial for aligning textual concepts with image regions. Based on these observations, the authors propose Wukong, a framework that detects NSFW content before image synthesis completes.

Figure 1: An illustrative example of modifying the textual condition during the early denoising steps in the Stable Diffusion process.

Wukong Framework

U-Net-Based Encoder

Wukong uses the U-Net from Stable Diffusion as an encoder, extracting intermediate representations during the early denoising steps. Because it reads only these early steps, it can flag NSFW content without generating the full image, which costs far less computation than image filters that analyze the final output.
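To make the idea of reading features from only the first few denoising steps concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: `toy_unet`, `collect_early_features`, the update rule, and the feature shape are all hypothetical stand-ins, and a real pipeline would call the Stable Diffusion U-Net and a proper noise scheduler instead.

```python
import random

def toy_unet(latent, step, prompt_embedding):
    """Hypothetical stand-in for one U-Net call.

    Returns (noise_prediction, features), where features plays the role of
    an intermediate cross-attention output with one value per prompt token.
    """
    mean = sum(latent) / len(latent)
    features = [mean * e for e in prompt_embedding]
    noise_prediction = [0.1 * x for x in latent]
    return noise_prediction, features

def collect_early_features(prompt_embedding, seed, t_c, latent_size=8):
    """Run only the first t_c denoising steps and keep each step's features."""
    rng = random.Random(seed)  # the seed fixes the initial latent noise
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]
    captured = []
    for step in range(t_c):
        noise_prediction, features = toy_unet(latent, step, prompt_embedding)
        captured.append(features)
        # Toy update rule (assumption): subtract the predicted noise.
        latent = [x - n for x, n in zip(latent, noise_prediction)]
    return latent, captured

latent, captured = collect_early_features([0.5, -1.0, 2.0], seed=0, t_c=5)
print(len(captured), "feature snapshots captured")
```

The point of the sketch is only the control flow: the loop stops at `t_c` rather than at the full step count, and the captured features, not a decoded image, are what a detector would consume.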

Transformer-Based Decoder

Wukong's decoder is transformer-based and processes the intermediate outputs of the U-Net's cross-attention layers. It uses category-specific NSFW queries that attend to these intermediate representations to pick out features of unsafe content. Because detection depends on what the model is generating rather than on the prompt wording alone, the approach is designed to be more robust to semantic variation and to adversarial attacks on textual prompts.
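The role of category-specific queries can be illustrated with a single cross-attention pass, sketched below in plain Python. This is an assumption-laden toy: the function `category_scores`, the `readout` vector, and the sigmoid output are hypothetical, and the sketch uses hand-set vectors rather than the pre-trained U-Net cross-attention parameters that the paper says Wukong reuses.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def category_scores(queries, tokens, readout):
    """Score each NSFW category with one scaled dot-product attention pass.

    queries: dict mapping a category name to its query vector (hypothetical)
    tokens:  list of feature vectors taken from early denoising steps
    readout: vector that projects the attended feature to a single logit
    """
    dim = len(readout)
    scale = math.sqrt(dim)
    result = {}
    for name, query in queries.items():
        weights = softmax([dot(query, t) / scale for t in tokens])
        attended = [sum(w * t[i] for w, t in zip(weights, tokens))
                    for i in range(dim)]
        logit = dot(attended, readout)
        result[name] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return result

scores = category_scores(
    queries={"a": [5.0, 0.0], "b": [-5.0, 0.0]},
    tokens=[[1.0, 0.0], [-1.0, 0.0]],
    readout=[1.0, 0.0],
)
print(scores)
```

Each query pulls out a different weighted mix of the same feature tokens, which is why one shared encoder pass can yield a separate score per category.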

Figure 2: Visualization of Attention Maps during the denoising steps showcasing cross-attention layer outputs.

Dataset and Evaluation

The paper introduces a new dataset, Wukong-Demons, which contains text prompts, generator seeds, and NSFW category labels for each generated image. The authors evaluate Wukong on this dataset and on two public benchmarks, reporting higher accuracy than existing text-based safeguards and accuracy comparable to image-based filters at much lower cost.
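One way to picture a record in such a dataset is sketched below. The class `LabeledSample`, its field names, and the helper `index_by_generation` are hypothetical and do not reflect the dataset's actual schema; the sketch only illustrates why labels are attached to a (prompt, seed) pair rather than to the prompt alone.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LabeledSample:
    """Hypothetical record: one generated image identified by prompt and seed."""
    prompt: str
    seed: int
    labels: frozenset = field(default_factory=frozenset)  # NSFW categories present

    @property
    def is_nsfw(self):
        return len(self.labels) > 0

def index_by_generation(samples):
    """Key samples by (prompt, seed): the same prompt can yield different images."""
    return {(s.prompt, s.seed): s for s in samples}

samples = [
    LabeledSample("a crowded street fight", seed=1, labels=frozenset({"violence"})),
    LabeledSample("a crowded street fight", seed=2),  # same prompt, benign image
]
index = index_by_generation(samples)
print(index[("a crowded street fight", 1)].is_nsfw)
print(index[("a crowded street fight", 2)].is_nsfw)
```

Storing the seed makes each labeled image reproducible, which is what allows a label to describe one specific generation instead of every image a prompt might produce.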

Performance

Experimental results show that Wukong outperforms text-based filters (e.g., OpenAI Moderation), with accuracy improvements exceeding 20% on average. Because it halts generation as soon as unsafe content is detected, it is also reported to be several times faster than image-based classifiers.
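The early-halt behavior can be sketched as a guard inside the denoising loop. Everything here is illustrative: `generate_with_guard`, the `threshold` value, and the stub `denoise_step` and `detect` callables are hypothetical names introduced for this example, not part of the paper or of any library.

```python
def generate_with_guard(total_steps, t_c, denoise_step, detect, threshold=0.5):
    """Run denoising, check once after step t_c, and halt if a category is flagged.

    denoise_step(state, step) -> new state   (stand-in for one U-Net + scheduler step)
    detect(state) -> dict of category scores (stand-in for the NSFW detector)
    """
    state = None
    for step in range(1, total_steps + 1):
        state = denoise_step(state, step)
        if step == t_c:
            scores = detect(state)
            flagged = sorted(name for name, s in scores.items() if s >= threshold)
            if flagged:
                # Stop here: the remaining steps are never computed.
                return {"halted": True, "steps_run": step, "flagged": flagged}
    return {"halted": False, "steps_run": total_steps, "flagged": []}

# Stub callables, for illustration only.
def stub_step(state, step):
    return step

unsafe = generate_with_guard(50, 5, stub_step, lambda s: {"violence": 0.9})
safe = generate_with_guard(50, 5, stub_step, lambda s: {"violence": 0.1})
print(unsafe["steps_run"], safe["steps_run"])
```

An image filter, by contrast, can only run after all `total_steps` have completed, which is where the reported speed advantage comes from.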

Robustness Against Adversarial Prompts

The method remains effective under adversarial conditions, identifying NSFW content even when prompts are deliberately obfuscated to bypass conventional safeguards.

Impact of Denoising Step $T_C$

The paper analyzes the impact of $T_C$, the denoising step at which classification is performed. The results indicate substantial efficiency gains and show that meaningful detection is possible very early, with fewer than 10 denoising steps.
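A back-of-the-envelope model shows how the choice of $T_C$ affects average cost. The function below is an illustrative assumption, not a result from the paper: it counts only U-Net evaluations, ignores the detector's own overhead, and takes the share of flagged prompts (`flagged_fraction`, a hypothetical parameter) as given.

```python
def expected_unet_calls(total_steps, t_c, flagged_fraction):
    """Average U-Net evaluations per prompt when flagged prompts stop at step t_c.

    Flagged prompts cost t_c evaluations; all others run the full total_steps.
    """
    if not 1 <= t_c <= total_steps:
        raise ValueError("t_c must lie in [1, total_steps]")
    if not 0.0 <= flagged_fraction <= 1.0:
        raise ValueError("flagged_fraction must lie in [0, 1]")
    return flagged_fraction * t_c + (1.0 - flagged_fraction) * total_steps

# Illustrative numbers only: 50 total steps, 20% of prompts flagged.
for t_c in (5, 10, 25):
    print(t_c, expected_unet_calls(total_steps=50, t_c=t_c, flagged_fraction=0.2))
```

Under this simple model, a smaller $T_C$ saves more computation on every flagged prompt, so the practical question is how early the detector stays accurate enough, which is what the paper's analysis of $T_C$ examines.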

Conclusion

The Wukong framework advances NSFW detection in T2I systems by combining practical efficiency with robustness. It operates inside the existing diffusion pipeline, improving safety without compromising performance in commercial T2I deployments. Future work could refine model-specific safeguard strategies and extend NSFW detection to broader categories within multimedia systems.


Authors (3)
