Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features

Published 10 Jun 2025 in cs.CL | (2506.09021v2)

Abstract: This study examines the lexical and syntactic interventions of human and LLM proofreading aimed at improving overall intelligibility in identical second language writings, and evaluates the consistency of outcomes across three LLMs (ChatGPT-4o, Llama3.1-8b, Deepseek-r1-8b). Findings show that both human and LLM proofreading enhance bigram lexical features, which may contribute to better coherence and contextual connectedness between adjacent words. However, LLM proofreading exhibits a more generative approach, extensively reworking vocabulary and sentence structures, such as employing more diverse and sophisticated vocabulary and incorporating a greater number of adjective modifiers in noun phrases. The proofreading outcomes are highly consistent in major lexical and syntactic features across the three models.

Summary

  • The paper shows that both human and LLM proofreading significantly improve lexical sophistication and syntactic complexity in L2 writing, with LLMs employing broader generative rewrites.
  • It employs quantitative measures such as bigram mutual information and type-token ratio to compare the effects of three distinct LLMs against traditional human methods.
  • The paper underscores the need for refined assessment frameworks that balance immediate fluency improvements with authentic language development in educational contexts.

A Comparative Analysis of Human and LLM Proofreading in L2 Writing

This study provides a detailed investigation into the comparative effectiveness of human and LLM proofreading in second language (L2) writing, focusing on its impact on lexical and syntactic features. The research examines the performance of three LLMs—ChatGPT-4o, Llama 3.1-8b, and Deepseek-r1-8b—in editing identical L2 texts, assessing how these models differ from traditional human proofreading techniques in terms of lexical sophistication, syntactic complexity, and overall text intelligibility.

Key Findings

The research reveals that both human and LLM interventions significantly enhance the lexical sophistication of L2 writings by improving bigram mutual information, which suggests increased coherence in word sequencing. However, LLMs demonstrate a broader scope in their edits, opting for more generative rewrites that incorporate a diverse and sophisticated vocabulary. This generative approach fundamentally distinguishes LLM proofreading from human proofreading, which generally maintains the original content's structural and stylistic attributes while correcting grammatical and lexical inaccuracies.
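The bigram mutual information measure behind this finding can be sketched in a few lines. This is a toy illustration only: probabilities here are estimated from the input text itself, whereas studies of lexical sophistication typically estimate them from a large reference corpus, and the paper's exact tooling is not specified.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Pointwise mutual information for each adjacent word pair.

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ).
    Probabilities are estimated from the token list itself (toy setup);
    real lexical-sophistication pipelines use large reference corpora.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_joint = count / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        pmi[(w1, w2)] = math.log2(p_joint / (p1 * p2))
    return pmi

scores = bigram_pmi("the cat sat on the mat".split())
```

Higher average PMI over a text's bigrams indicates word pairs that co-occur more often than chance, which is the sense in which the study links bigram measures to coherence between adjacent words.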

Quantitatively, LLM proofreading notably increased lexical diversity, as measured by the moving-average type-token ratio (MATTR), and decreased word concreteness (b_concreteness), indicating a shift toward more abstract vocabulary. LLMs also drew on a broader range of unique word types (ntypes). Syntactically, LLMs increased structural complexity by incorporating more nonfinite clauses, adjective modifiers, and nominalizations, pointing to more complex sentence structures.
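MATTR can be computed with a simple sliding window, as in this minimal sketch (the window size of 50 is the common default in the lexical-diversity literature, not a value stated in this summary):

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio.

    Plain TTR (unique types / total tokens) falls as texts get longer,
    so MATTR computes TTR over every fixed-size window of the text and
    averages the results, giving a length-insensitive diversity score.
    """
    if len(tokens) < window:
        # Text shorter than one window: fall back to plain TTR.
        return len(set(tokens)) / len(tokens)
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)
```

A text that never repeats a word within any window scores 1.0; heavy repetition pushes the score toward 1/window.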

A noteworthy result is the consistency observed across the three LLMs in affecting major lexical and syntactic features. Despite these models being developed independently with different architectures, they registered high internal consistency, particularly in their lexical adjustments, as evidenced by Cronbach’s alpha values.
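Cronbach's alpha, the consistency statistic cited here, treats each model as one "item" and each text as one observation. A minimal sketch (the data layout is assumed for illustration; the paper's actual feature matrices are not reproduced in this summary):

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for internal consistency.

    `scores` is a list of rows, one per text, each row holding the
    measurement from every rater (here: one column per LLM).
    alpha = k/(k-1) * (1 - sum(item variances) / variance of row totals),
    where k is the number of items.
    """
    k = len(scores[0])  # number of items (models)

    def pvar(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [pvar([row[j] for row in scores]) for j in range(k)]
    total_var = pvar([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

When the models move in lockstep across texts, alpha approaches 1; values above roughly 0.8 are conventionally read as high internal consistency, which matches the study's report of strong agreement across the three LLMs.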

Implications

The implications of these findings are manifold. On a practical level, the LLMs' ability to predictably enhance lexical and syntactic features can be advantageous for L2 learners seeking to improve the fluency and perceived proficiency of their writing. However, there is an inherent risk that these changes may obscure a learner's authentic language proficiency by artificially inflating sophistication and diversity metrics. Educators and students should be aware of these tendencies, ensuring any feedback integrates a pedagogical focus on maintaining the learner's original voice and style.

From a theoretical perspective, the study underlines the need for refined constructs in L2 writing assessment that can accurately account for technologically mediated interventions. The findings suggest a necessity for assessment frameworks that differentiate between surface-level fluency improvements and deeper, learner-centric language development.

Future Directions

Looking forward, the study opens avenues for research into optimizing LLM integrations in educational contexts. A greater emphasis on user control and understanding of LLM editing patterns could enhance the utility of these models in educational settings. Moreover, the study calls for a detailed analysis of LLM interventions on a broader range of writing genres and for diverse demographic groups. Such research could inform the development of tailored LLM tools adaptable to specific educational objectives and learner needs.

This paper's rigorous comparative analysis significantly contributes to our understanding of how LLMs can supplement or replace traditional editing practices, paving the way for more nuanced applications of AI in language learning and beyond. It stresses the importance of balancing the immediate gains in readability and style with the overarching goals of long-term language acquisition and authenticity in learner expression.