Can AI-Generated Text be Reliably Detected?
Abstract: LLMs perform impressively well across a wide range of applications. However, the potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concerns about their responsible use. Consequently, the reliable detection of AI-generated text has become a critical area of research. AI text detectors have been shown to be effective under their specific settings. In this paper, we stress-test the robustness of these detectors in the presence of an attacker. We introduce a recursive paraphrasing attack to stress-test a wide range of detection schemes, including watermarking-based schemes as well as neural network-based detectors, zero-shot classifiers, and retrieval-based detectors. Our experiments, conducted on passages of approximately 300 tokens each, reveal the varying sensitivities of these detectors to our attacks. Our findings indicate that while recursive paraphrasing can significantly reduce detection rates, it only slightly degrades text quality in many cases, highlighting potential vulnerabilities of current detection systems in the presence of an attacker. Additionally, we investigate the susceptibility of watermarked LLMs to spoofing attacks aimed at misclassifying human-written text as AI-generated. We demonstrate that an attacker can infer hidden AI text signatures without white-box access to the detection method, potentially leading to reputational risks for LLM developers. Finally, we provide a theoretical framework connecting the AUROC of the best possible detector to the total variation distance between human and AI text distributions. This analysis offers insights into the fundamental challenges of reliable detection as LLMs continue to advance. Our code is publicly available at https://github.com/vinusankars/Reliability-of-AI-text-detectors.
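To make the recursive paraphrasing attack described above concrete, the following is a minimal sketch assuming a Hugging Face seq2seq paraphraser: AI-generated text is passed through the paraphraser, and the output is fed back in for several rounds. The model identifier is a placeholder, not the paraphraser used in the paper, and generation settings are illustrative only.

```python
# Minimal sketch of recursive paraphrasing (illustrative, not the paper's exact setup).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "your-paraphraser-checkpoint"  # placeholder: any T5-style paraphrase model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def paraphrase_once(text: str) -> str:
    """Rewrite `text` with a single pass through the paraphraser."""
    # Note: some checkpoints expect a task prefix such as "paraphrase: " before the input.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, top_p=0.75)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def recursive_paraphrase(text: str, rounds: int = 3) -> str:
    """Apply the paraphraser repeatedly; each round further perturbs detector signals."""
    for _ in range(rounds):
        text = paraphrase_once(text)
    return text
```

The theoretical framework mentioned at the end of the abstract can likewise be summarized with a short worked statement. The bound below is an illustrative restatement of the paper's result (notation ours: M is the AI text distribution, H the human text distribution, TV the total variation distance), not a verbatim quote.

```latex
% For any detector D distinguishing AI-generated text (distribution M) from human text (H):
\[
  \mathrm{AUROC}(D) \;\le\; \frac{1}{2}
    + \mathrm{TV}(\mathcal{M},\mathcal{H})
    - \frac{\mathrm{TV}(\mathcal{M},\mathcal{H})^{2}}{2}.
\]
% As LLMs improve and TV(M, H) -> 0, the right-hand side approaches 1/2,
% i.e., even the best possible detector does no better than random guessing.
```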