Native Language Identification with Large Language Models

Published 13 Dec 2023 in cs.CL | (2312.07819v1)

Abstract: We present the first experiments on Native Language Identification (NLI) using large language models (LLMs) such as GPT-4. NLI is the task of predicting a writer's first language by analyzing their writing in a second language, and is used in second language acquisition and forensic linguistics. Our results show that GPT models are proficient at NLI classification, with GPT-4 setting a new performance record of 91.7% on the benchmark TOEFL11 test set in a zero-shot setting. We also show that, unlike previous fully supervised approaches, LLMs can perform NLI without being limited to a set of known classes, which has practical implications for real-world applications. Finally, we show that LLMs can justify their choices, with reasoning based on spelling errors, syntactic patterns, and directly translated linguistic patterns.
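
To make the zero-shot setup in the abstract concrete, here is a minimal Python sketch of how such an experiment could be wired together. It is an illustration under stated assumptions, not the authors' method: the prompt wording, the helper names (`build_prompt`, `parse_label`, `classify`, `accuracy`), and the `ask_model` callable are all hypothetical, and the actual prompts used in the paper are not given here. The only grounded detail is the list of the 11 first languages in TOEFL11. Passing `labels=None` to `build_prompt` gives the open-set variant, where the model is not restricted to a known class list.

```python
from typing import Callable, Optional, Sequence

# The 11 first languages (L1s) covered by the TOEFL11 corpus.
TOEFL11_L1S = ("Arabic", "Chinese", "French", "German", "Hindi", "Italian",
               "Japanese", "Korean", "Spanish", "Telugu", "Turkish")


def build_prompt(essay: str, labels: Optional[Sequence[str]] = TOEFL11_L1S) -> str:
    """Build a zero-shot prompt (hypothetical wording, not the paper's).

    With labels=None the prompt is open-set: no class list is given.
    """
    if labels:
        task = ("Classify the native language of the author of the essay "
                "below as one of: " + ", ".join(labels) + ".")
    else:
        task = "Identify the native language of the author of the essay below."
    return f"{task}\nAnswer with the language name only.\n\nEssay:\n{essay}"


def parse_label(reply: str, labels: Sequence[str] = TOEFL11_L1S) -> Optional[str]:
    """Return the known label that appears earliest in the reply, or None."""
    lowered = reply.lower()
    hits = [(lowered.find(label.lower()), label)
            for label in labels if label.lower() in lowered]
    return min(hits)[1] if hits else None


def classify(essays: Sequence[str],
             ask_model: Callable[[str], str],
             labels: Sequence[str] = TOEFL11_L1S) -> list:
    """Run every essay through ask_model, a stand-in for any LLM client."""
    return [parse_label(ask_model(build_prompt(e, labels)), labels)
            for e in essays]


def accuracy(gold: Sequence[str], predicted: Sequence[Optional[str]]) -> float:
    """Fraction of essays whose predicted L1 matches the gold L1."""
    if not gold:
        raise ValueError("gold labels must not be empty")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)
```

Keeping the model call behind a plain callable means the same evaluation loop can be exercised with a stub before any real API is connected; an unparseable reply is scored as `None`, which counts as an error in `accuracy`.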
