ReMoDetect: Reward Models Recognize Aligned LLM's Generations
Abstract: The remarkable capabilities and easy accessibility of LLMs have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging because of the vast number of LLMs; accounting for each model individually is impractical, so it is crucial to identify characteristics common across them. In this paper, we draw attention to a common feature of recent powerful LLMs, namely alignment training, i.e., training LLMs to generate texts that humans prefer. Our key finding is that because aligned LLMs are trained to maximize human preference, they generate texts with higher estimated preference than even human-written texts; such texts are therefore easily detected with a reward model (i.e., an LLM trained to model the human preference distribution). Based on this finding, we propose two training schemes to further improve the reward model's detection ability: (i) continual preference fine-tuning, which makes the reward model prefer aligned LGTs even more strongly, and (ii) reward modeling of human/LLM mixed texts (human-written texts rephrased by aligned LLMs), which serve as a corpus of median-preference texts between LGTs and human-written texts and help the model learn the decision boundary better. We provide an extensive evaluation covering six text domains and twelve aligned LLMs, where our method achieves state-of-the-art results. Code is available at https://github.com/hyunseoklee-ai/ReMoDetect.
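The detection rule and fine-tuning objective described in the abstract can be sketched compactly. The snippet below is a minimal illustration, not the authors' implementation: `toy_reward` is a hypothetical stand-in for a real reward model's scalar score head, and `preference_loss` is the standard Bradley–Terry-style pairwise loss commonly used in reward modeling, applied here so the reward model prefers aligned LGTs over paired human-written texts (scheme (i)).

```python
import math

def detect_lgt(texts, reward_fn, threshold):
    """Flag texts whose estimated human preference exceeds `threshold`.

    Key observation: aligned LLMs are trained to maximize estimated
    human preference, so their outputs tend to score higher under a
    reward model than human-written text; a threshold separates them.
    """
    return [reward_fn(t) > threshold for t in texts]

def preference_loss(r_lgt, r_human):
    """Pairwise Bradley-Terry-style loss for continual preference
    fine-tuning: minimizing it pushes the reward model to assign the
    aligned LGT a higher reward than the paired human-written text."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_lgt - r_human))))

# Hypothetical toy reward: scores longer, more elaborate text higher.
# A real reward model would produce a far more nuanced estimate.
def toy_reward(text):
    return len(text.split()) / 10.0
```

With equal rewards for the LGT and the human text, `preference_loss` equals `log 2`; fine-tuning drives it lower by widening the reward gap, which in turn makes the thresholding in `detect_lgt` more reliable.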