- The paper introduces a multi-step error extraction and injection method to generate AI essays that accurately mimic the user's grammatical error profile.
- It demonstrates significant improvements in matching user error counts and quality scores, with statistical evidence indicating large effect sizes.
- The findings highlight a scalable approach for personalized peer learning, with potential applications in diverse educational domains.
Imitating Mistakes in a Learning Companion AI Agent for Online Peer Learning
This paper addresses the challenge of constructing AI agents that serve as effective learning companions in online peer learning environments, with a focus on English composition for EFL learners. The central hypothesis is that peer learning is most effective when the companion (here, an AI agent) exhibits a proficiency level and error profile closely matching that of the human learner. The authors propose and empirically validate a method for generating user-like mistakes in AI-generated essays, thereby enabling more authentic and productive peer learning interactions.
Peer learning, particularly in language education, is well-established as a means to foster deeper understanding and metacognitive skills. However, its effectiveness is contingent on the similarity of proficiency levels among peers. In online settings, this is difficult to guarantee, and the use of LLM-based agents as peers introduces new challenges: generic or overly correct AI responses can undermine the intended pedagogical benefits, leading to over-reliance or disengagement.
The paper identifies a gap in current AI-based peer learning systems: existing approaches to simulating learner errors are either too simplistic (e.g., inserting random grammatical errors) or insufficiently tailored to the individual learner's actual proficiency and error patterns.
Proposed Method
The authors introduce a multi-step pipeline for the Learning Companion AI Agent (LCAA) to generate essays that authentically reflect the user's error profile. The process is as follows:
- Error Extraction: The user's essay is corrected by the AI, and all grammatical and structural changes are logged.
- Error Categorization: Each correction is categorized by error type (e.g., subject-verb agreement, article usage), and the frequency of each type is counted.
- Error Injection: The LCAA is prompted to generate a new essay on a given topic, explicitly instructed to insert the same number and types of errors as identified in the user's writing.
- Output Generation: The resulting essay is intended to mirror the user's proficiency, both in terms of error count and qualitative writing characteristics.
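The extraction, categorization, and injection steps can be sketched as a prompt-construction pipeline. The error taxonomy, function names, and prompt wording below are illustrative assumptions; the paper does not publish its exact prompts or data structures:

```python
from collections import Counter

def build_error_profile(corrections):
    """Step 2: count logged corrections by error type.

    `corrections` is a list of (error_type, original, fixed) tuples
    produced when the AI corrects the user's essay (step 1)."""
    return Counter(error_type for error_type, _, _ in corrections)

def build_injection_prompt(topic, profile):
    """Step 3: build a hypothetical error-injection instruction
    telling the LCAA how many errors of each type to insert."""
    error_spec = ", ".join(f"{n} {etype} error(s)" for etype, n in profile.items())
    return (
        f"Write a short essay on the topic '{topic}'. "
        f"Deliberately include exactly these grammatical errors: {error_spec}. "
        "Keep the rest of the essay natural."
    )

# Example: corrections logged from one user essay (made-up data)
corrections = [
    ("subject-verb agreement", "he go", "he goes"),
    ("article usage", "a apple", "an apple"),
    ("subject-verb agreement", "they was", "they were"),
]
profile = build_error_profile(corrections)
prompt = build_injection_prompt("my favorite season", profile)
```

The resulting prompt would then be sent to the LLM to produce the step-4 output essay; the per-type counts give the fine-grained control over error frequency that a single "imitate the user" instruction lacks.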
This approach is inspired by Prompt Insertion and Chain-of-Thought prompting, leveraging LLMs' ability to follow multi-stage instructions and maintain fine-grained control over output characteristics.
Experimental Design
The evaluation involved eight Japanese EFL learners, each providing essays on four topics. For each user essay, two AI-generated essays were produced:
- Comparison Method: The LCAA is prompted to "imitate" the user's proficiency and style, as in prior work.
- Proposed Method: The LCAA follows the multi-step error extraction and injection pipeline.
All essays were analyzed using Grammarly, measuring both the number of grammatical errors and an overall quality score (0–100). The absolute differences between user and agent essays were computed for both metrics.
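The evaluation metric, the mean absolute difference between paired user and agent measurements, can be sketched as follows. The sample numbers are invented for illustration and are not the study's data:

```python
def mean_abs_diff(user_values, agent_values):
    """Mean absolute difference between paired user/agent essay metrics
    (error counts or Grammarly quality scores)."""
    assert len(user_values) == len(agent_values)
    return sum(abs(u - a) for u, a in zip(user_values, agent_values)) / len(user_values)

# Illustrative error counts for four user/agent essay pairs (not study data)
user_errors  = [5, 7, 6, 8]
agent_errors = [4, 7, 5, 9]
gap = mean_abs_diff(user_errors, agent_errors)  # → 0.75
```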
Results
The proposed method demonstrated a substantial improvement over the comparison method:
- Error Count: The average number of errors in user essays was 6.34. The proposed method's essays averaged 6.16 errors, while the comparison method averaged only 0.47.
- Quality Score: User essays averaged 60.06; the proposed method's essays scored 69.19, and the comparison method's essays scored 90.28.
- Absolute Differences: The proposed method reduced the absolute error difference to 2.06 (vs. 5.94 for the comparison method) and the quality score difference to 11.94 (vs. 30.22).
- Statistical Significance: t-tests yielded p-values of 3.0 × 10⁻⁸ (errors) and 1.7 × 10⁻¹⁰ (quality), with Cohen's d values of 3.70 and 2.91, indicating very large effect sizes.
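Statistics of this kind can be computed from the per-essay gap between the two methods' absolute differences. A minimal stdlib sketch of paired Cohen's d and the corresponding t statistic, using made-up numbers rather than the study's data:

```python
import math
import statistics

def paired_effect_size(diffs_a, diffs_b):
    """Cohen's d and paired t statistic for the per-essay gap between
    two methods' absolute differences from the user's metric."""
    deltas = [a - b for a, b in zip(diffs_a, diffs_b)]
    mean_d = statistics.mean(deltas)
    sd_d = statistics.stdev(deltas)   # sample standard deviation of the deltas
    d = mean_d / sd_d                 # Cohen's d for paired samples
    t = d * math.sqrt(len(deltas))    # paired t = d * sqrt(n)
    return d, t

# Illustrative paired absolute differences (not the study's data):
# comparison method vs. proposed method on the same four essays
comparison = [6.0, 5.5, 6.5, 5.8]
proposed   = [2.0, 2.2, 1.8, 2.1]
d, t = paired_effect_size(comparison, proposed)
```

With consistently smaller differences for the proposed method, d is large and positive, mirroring the paper's pattern of very large effect sizes.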
These results strongly support the claim that the proposed method more accurately replicates the user's error profile, both quantitatively and qualitatively.
Implications and Limitations
The findings have several practical implications:
- Personalized Peer Learning: The method enables the construction of AI companions that provide more authentic peer learning experiences, potentially increasing learner engagement and efficacy.
- Scalability: The approach is compatible with existing LLM APIs and agent frameworks (e.g., LangChain), facilitating integration into current online learning platforms.
- Generalizability: While the study focuses on grammatical errors in English composition, the pipeline could be adapted to other domains where error modeling is relevant (e.g., programming, mathematics).
However, the method's current scope is limited to grammatical errors detectable by automated tools like Grammarly. More nuanced aspects of language proficiency—such as idiomatic usage, discourse coherence, or factual accuracy—are not directly addressed. Additionally, the method assumes that error types and frequencies are sufficient proxies for overall proficiency, which may not capture all relevant dimensions of learner ability.
Future Directions
Potential avenues for further research include:
- Extending Error Modeling: Incorporating semantic, pragmatic, and discourse-level errors to better simulate a broader range of learner mistakes.
- Adaptive Feedback: Using the error extraction pipeline to inform not only peer-like essay generation but also targeted feedback and scaffolding.
- Longitudinal Studies: Evaluating the impact of LCAA-driven peer learning on actual learner outcomes over extended periods.
- Cross-Linguistic and Cross-Domain Applications: Adapting the method to other languages and subject areas, assessing its generalizability and effectiveness.
Conclusion
This work presents a methodologically rigorous and practically viable approach to generating user-like mistakes in AI learning companions, addressing a key limitation in current peer learning systems. The strong empirical results suggest that fine-grained error modeling can significantly enhance the authenticity and pedagogical value of AI-based peer learning environments. The approach offers a foundation for more personalized, effective, and scalable AI companions in education, with broad implications for the design of future intelligent tutoring systems.