Fine-tuning Large Language Models for Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection

Published 22 Jan 2024 in cs.CL and cs.AI (arXiv:2401.12326v1)

Abstract: SemEval-2024 Task 8 introduces the challenge of identifying machine-generated texts from diverse LLMs in various languages and domains. The task comprises three subtasks: binary classification in monolingual and multilingual settings (Subtask A), multi-class classification (Subtask B), and mixed text detection (Subtask C). This paper focuses on Subtasks A and B. Each subtask is supported by three datasets, one each for training, development, and testing. To tackle this task, two methods are used: 1) traditional machine learning (ML) with natural language processing (NLP) for feature extraction, and 2) fine-tuning LLMs for text classification. The results show that transformer models, particularly LoRA-RoBERTa, outperform traditional ML methods, with majority voting being particularly effective in multilingual contexts for identifying machine-generated texts.
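
The abstract does not say which models are combined or how ties are resolved, so the following is only a minimal sketch of majority voting over the label predictions of several classifiers. The function name `majority_vote`, the label convention (0 = human-written, 1 = machine-generated), and the tie-breaking rule (smallest label wins) are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label predictions by majority vote.

    predictions[m][i] is the label that model m assigns to text i.
    Ties go to the smallest label (an assumption, not from the paper).
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_texts = len(predictions[0])
    if any(len(p) != n_texts for p in predictions):
        raise ValueError("all models must predict the same number of texts")

    combined = []
    # zip(*predictions) yields, for each text, the tuple of votes across models.
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Among the labels with the highest count, pick the smallest one.
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined


# Hypothetical predictions from three classifiers on four texts
# (0 = human-written, 1 = machine-generated).
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
ensemble = majority_vote([model_a, model_b, model_c])
```

With an odd number of binary classifiers no ties can occur, so the tie-breaking rule only matters for an even number of voters or for the multi-class setting of Subtask B.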

