- The paper introduces a framework that uses PLM-based surrogate models in place of human test takers to estimate and control cloze-test item difficulty through IRT assessment.
- Difficulty is modulated by selecting gaps via the entropy of PLM predictions and by ranking distractors with confidence, semantic-similarity, and edit-distance signals.
- Results show the approach scales to adaptive-testing needs, while highlighting the remaining challenge of generating reliably hard items for advanced assessments.
Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment
Introduction
The paper "Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment" (2403.01456) explores the fundamental role of item difficulty in the context of adaptive testing, focusing on multiple-choice cloze tests. These tests serve as a critical tool in evaluating reading comprehension and language proficiency across standardized exams like TOEFL, TOEIC, and IELTS. Previous research predominantly examined individual components such as distractor generation, often neglecting variable difficulty levels. This paper addresses this gap by proposing a framework utilizing pre-trained LLMs (PLMs) as surrogates within an Item Response Theory (IRT) framework to assess difficulty, thereby eliminating the need for human participants.
Methodology
The methodology has two primary components. First, it uses PLM-based models to estimate item difficulty by simulating human test performance. Second, it applies strategies that manipulate both gaps and distractors to control item difficulty, including ranking rules that minimize the generation of invalid distractors, a frequent source of inaccuracy in item difficulty assessment. By fine-tuning models such as BigBird and ELECTRA, the study evaluates changes in item difficulty and establishes a systematic framework for generating question items at controlled difficulty levels.
IRT Assessment with PLM-based Surrogate Models: The approach sidesteps the traditional reliance on human test takers by using a set of PLMs trained to emulate test takers at different proficiency levels. These surrogate models answer the items in each difficulty-modified test version, and an IRT model fitted to their responses measures how item difficulty shifts across versions.
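As a rough illustration of what a surrogate test taker does, the sketch below has an off-the-shelf masked language model answer a multiple-choice cloze item by picking the option it finds most probable at the gap. The model choice, the single-token-option assumption, and the helper name are illustrative only; the paper's surrogates are PLMs fine-tuned to approximate test takers at different proficiency levels.

```python
# Minimal sketch of a PLM "surrogate test taker": the model answers a
# multiple-choice cloze item by choosing the option it finds most probable
# at the gap. Model name and single-token options are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def answer_cloze(sentence_with_mask: str, options: list[str]) -> str:
    """Return the option with the highest model probability at the [MASK] slot."""
    inputs = tokenizer(sentence_with_mask, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    log_probs = torch.log_softmax(logits, dim=-1)
    # Assumes each option maps to a single vocabulary token, for simplicity.
    option_ids = [tokenizer.convert_tokens_to_ids(o) for o in options]
    scores = [log_probs[i].item() for i in option_ids]
    return options[scores.index(max(scores))]

print(answer_cloze("The student passed the exam because she [MASK] hard.",
                   ["studied", "slept", "cried", "sang"]))
```

Collecting such answers from surrogates of differing ability yields the response matrix that the IRT analysis below operates on.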
Difficulty-controllable Question Generation
Gap Difficulty Control: The paper uses the entropy of PLM confidence scores over a candidate gap as the signal for gap difficulty: high-entropy gaps, where the model is uncertain, yield harder questions, while low-entropy gaps yield easier ones. Ranking gaps from high to low entropy then sets the difficulty level of a question.
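A minimal sketch of this entropy signal follows, assuming a masked language model and a single [MASK] gap per candidate sentence; the model choice and helper name are illustrative rather than the paper's exact setup.

```python
# Sketch of gap-difficulty scoring by prediction entropy: a gap where the
# PLM's distribution is flat (high entropy) is treated as harder than one
# where the model is confident.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def gap_entropy(sentence_with_mask: str) -> float:
    """Shannon entropy (nats) of the PLM's distribution over the gap."""
    inputs = tokenizer(sentence_with_mask, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits[0, mask_pos], dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum().item()

# Rank candidate gaps from hardest (high entropy) to easiest (low entropy).
candidates = [
    "She [MASK] to the store before it closed.",
    "He drank a glass of [MASK] with breakfast.",
]
ranked = sorted(candidates, key=gap_entropy, reverse=True)
```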
Distractor Difficulty Factors: The study presents distractor generation strategies that integrate PLM confidence scoring, semantic similarity, and Levenshtein distance. It adds validity rules that reject plausible yet incorrect distractors the PLM ranks above the correct answer, since such distractors undermine item validity. The strategies, denoted Confidence-Ranking Control and 3-Factor Ranking Control, each provide a structured method for selecting distractors to suit varying difficulty levels.
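The sketch below combines the three factors into a single heuristic score and applies the validity rule by discarding candidates the PLM scores above the correct answer. The equal weighting, the embedding-based similarity measure, and all helper names are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative 3-factor distractor ranking: (1) PLM confidence at the gap,
# (2) semantic similarity to the correct answer, (3) Levenshtein distance.
# Weights and similarity measure are assumptions, not the paper's formulation.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()
embeddings = model.get_input_embeddings().weight.detach()  # static input embeddings

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def rank_distractors(sentence_with_mask: str, answer: str, candidates: list[str],
                     weights=(1.0, 1.0, 1.0)) -> list[str]:
    inputs = tokenizer(sentence_with_mask, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        log_probs = torch.log_softmax(model(**inputs).logits[0, mask_pos], dim=-1)

    def token_id(word: str) -> int:
        return tokenizer.convert_tokens_to_ids(word)

    answer_conf = log_probs[token_id(answer)].item()
    answer_emb = embeddings[token_id(answer)]
    scored = []
    for cand in candidates:
        conf = log_probs[token_id(cand)].item()
        if conf >= answer_conf:   # validity rule: never outrank the correct answer
            continue
        sim = torch.cosine_similarity(embeddings[token_id(cand)], answer_emb, dim=0).item()
        lev = levenshtein(cand, answer)
        # Higher score = more plausible (harder) distractor under this heuristic.
        score = weights[0] * conf + weights[1] * sim - weights[2] * lev
        scored.append((score, cand))
    return [c for _, c in sorted(scored, reverse=True)]
```

Taking the top of this ranking yields harder (more confusable) distractors, while sampling further down yields easier ones, which is the lever the strategies use to shift item difficulty.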
Experiment Design and Results
The experiments use the CLOTH dataset, whose CLOTH-M (middle school) and CLOTH-H (high school) subsets are differentiated by proficiency, to assess how well the proposed framework controls item difficulty across systematic testing folds. A set of PLM surrogates takes the tests at each difficulty level, and an IRT model is fitted directly to their responses to analyze the spread of item difficulties.
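As one concrete way to picture this fitting step, the sketch below fits a simple Rasch (1PL) model to a binary response matrix of surrogate models by items and reads off the estimated item difficulties. The synthetic data and the joint maximum-likelihood fit are assumptions; the paper's exact IRT estimation procedure may differ.

```python
# Minimal Rasch (1PL) fit over a binary response matrix as an illustration of
# estimating item difficulties from surrogate-model answers.
import torch

torch.manual_seed(0)
n_models, n_items = 8, 50                      # surrogate test takers x cloze items
true_theta = torch.randn(n_models)             # latent abilities
true_b = torch.randn(n_items)                  # latent difficulties
p = torch.sigmoid(true_theta[:, None] - true_b[None, :])
responses = torch.bernoulli(p)                 # 1 = item answered correctly

theta = torch.zeros(n_models, requires_grad=True)
b = torch.zeros(n_items, requires_grad=True)
opt = torch.optim.Adam([theta, b], lr=0.1)
for _ in range(500):
    opt.zero_grad()
    logits = theta[:, None] - b[None, :]
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, responses)
    loss.backward()
    opt.step()

# The spread of fitted difficulties across the difficulty-modified test
# versions is what the analysis inspects to check that control worked.
print("estimated item difficulty range:", (b.min().item(), b.max().item()))
```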
The results indicate that while both strategies can manipulate item difficulty effectively, their effect varies across proficiency levels. In particular, the CLOTH-H subset for advanced students showed narrower difficulty distributions, suggesting limited headroom for making items harder. Gap control proves less decisive than distractor control, mainly adding variability among easier question sets.
Discussion and Implications
The study underscores the promise of PLMs as surrogate models in educational testing: the approach is scalable and less reliant on human annotation. By systematically validating the PLM-based IRT framework and the difficulty-control strategies, the paper carries significant implications for adaptive testing. Nevertheless, it acknowledges limits in controlling harder items, particularly in more advanced assessments such as CLOTH-H. Future research is encouraged to refine these models, especially by expanding beyond linguistic assessments and strengthening the validity rules so that invalid distractors are reliably eliminated.
Conclusion
This research contributes a framework for controlling difficulty in cloze-item generation, using PLMs as adaptable surrogates for IRT assessment. It demonstrates that test difficulty at middle-school and high-school proficiency levels can be controlled systematically through combined manipulation of distractors and gaps, offering a scalable approach to adaptive testing. The findings point to further exploration of broader datasets and model optimizations to close the remaining gaps and improve the generation and evaluation processes inherent in educational assessment.