Automated Essay Scoring (AES)
- AES is a machine learning-based system that assigns essay scores according to rubric-based criteria using NLP and deep neural networks.
- It employs diverse methodologies—from feature-based regression to transformer and LLM approaches—to achieve accurate, consistent, and fair scoring.
- AES systems are evaluated using metrics like Quadratic Weighted Kappa and RMSE, ensuring scalability and robustness across multiple languages and prompt types.
Automated Essay Scoring (AES) is a machine learning-based approach to assigning scores to student essays, engineered to closely approximate human grading as defined by rubric-based criteria. AES systems combine advances in natural language processing, deep neural networks, and robust feature engineering to deliver fine-grained, scalable, and consistent assessment across a variety of writing tasks and languages. Contemporary AES research addresses both the predictive accuracy of scoring and the generalizability and fairness necessary for operational deployment in educational environments (Ludwig et al., 2021).
1. Task Definition and Model Paradigms
Automated Essay Scoring is formally defined as the supervised learning problem of predicting a discrete label or continuous score for a natural language text (essay), given labeled data produced by human raters. AES tasks can be formulated as classification (when the rubric is categorical), regression (continuous or ordinal scoring), or multi-task prediction in analytic settings (e.g., scoring across content, grammar, vocabulary, coherence) (Sun et al., 2024, Matsuoka, 2023). Models range from traditional feature-based linear regressors and SVMs to modern end-to-end neural architectures: RNN/CNN hybrids, transformers, and LLMs.
| Model Category | Key Feature | Example System |
|---|---|---|
| Feature-based | Hand-crafted surface, syntactic, discourse, error features | SVM, Gradient Boosting (Portuguese: (Marinho et al., 2021), Arabic: (Bashendy et al., 30 Dec 2025)) |
| Neural (RNN/CNN) | Contextual representations, sequence modeling | DeLAES (CNN+Bi-GRU) (Tashu et al., 2022) |
| Transformer | Contextual, subword-aware, transfer learning | BERT, RoBERTa, BERTimbau (Ludwig et al., 2021), efficient transformers (Ormerod et al., 2021) |
| LLM / Hybrid | Prompt-based, zero/few-shot, feature injection | Mistral-7B + linguistic counts (Hou et al., 13 Feb 2025); CAFES for multimodal AES (Su et al., 20 May 2025) |
AES models are trained to minimize task-appropriate loss functions: cross-entropy for categorical labels, mean squared error (MSE) for regression, or hybrid (e.g., ordinal or margin-ranking) objectives in advanced architectures (Yang et al., 2024, Chakravarty, 17 Aug 2025).
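These objectives can be sketched in plain NumPy (a minimal illustration with function names of our choosing; production systems would use a framework's built-in losses):

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error for continuous/ordinal score regression."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.mean((pred - target) ** 2))

def cross_entropy_loss(logits, labels):
    """Cross-entropy for categorical rubric labels; logits has shape (n, k)."""
    logits = np.asarray(logits, float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

def margin_ranking_loss(score_hi, score_lo, margin=1.0):
    """Hinge-style ranking objective: the higher-rated essay should be
    scored at least `margin` above the lower-rated one."""
    gap = np.asarray(score_lo, float) - np.asarray(score_hi, float) + margin
    return float(np.mean(np.maximum(0.0, gap)))
```

The ranking variant is pairwise: it is driven by score *order* between essay pairs rather than absolute label values, which is why hybrid objectives combine it with MSE.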
2. Data Preprocessing and Input Representation
Preprocessing for AES involves standardized text cleaning (removal of HTML, normalization of whitespace, optional case folding), tokenization tailored to the downstream encoder (e.g., WordPiece/BPE for transformers), and the insertion of special tokens ([CLS], [SEP], [PAD]) as per the architecture’s requirements (Ludwig et al., 2021, Ormerod et al., 2021). Maximum sequence lengths are enforced—typically 512 tokens for BERT, up to 1024 for Longformer and Reformer.
Input formats vary by paradigm:
- Sequence of tokens for end-to-end models,
- Concatenated prompt and essay for prompt-adherence modeling,
- Inclusion of structured signals (argument components, error annotations, scalar features) via embedding or prompt augmentation (Ormerod, 28 May 2025, Chakravarty, 17 Aug 2025).
For multilingual AES, tokenization must respect language-specific morphology and clitics (e.g., Portuguese with BERTimbau (Matsuoka, 2023), Arabic with AraBERT (Ghazawi et al., 2024)).
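The truncation, special-token insertion, and padding steps can be sketched as follows (a simplified, framework-free illustration; real pipelines delegate this to the tokenizer that ships with the chosen encoder):

```python
def prepare_input(tokens, max_len=512, cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """BERT-style input preparation: truncate to max_len - 2 to leave room
    for [CLS]/[SEP], then pad to a fixed length with an attention mask."""
    body = tokens[: max_len - 2]
    seq = [cls] + body + [sep]
    attention_mask = [1] * len(seq) + [0] * (max_len - len(seq))
    seq = seq + [pad] * (max_len - len(seq))
    return seq, attention_mask
```

For prompt-adherence modeling, the same scheme extends to sentence pairs ([CLS] prompt [SEP] essay [SEP]); Longformer/Reformer variants simply raise `max_len`.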
3. Model Architectures and Training Protocols
The evolution of AES architectures reflects the progression of NLP:
- Feature-based learners: Regression on engineered surface, lexical, syntactic, and discourse features. These offer interpretability and cross-prompt robustness but limited representation capacity (Vajjala, 2016).
- RNN/CNN hybrids: Models such as DeLAES combine multichannel 1D convolution (capturing n-gram patterns) and bidirectional GRUs (contextualizing salient features), optimized via MSE (Tashu et al., 2022).
- Transformer-based models: Fine-tuned encoders (e.g., BERT, RoBERTa, BERTimbau, AraBERT) using pooled [CLS] representations with a regression or classification head. Dropout and weight decay regularize training; AdamW is the standard optimizer (Ludwig et al., 2021, Ormerod et al., 2021, Matsuoka, 2023).
- Multi-dimensional/analytic scorers: Architectures output a vector per analytic trait/rubric dimension, concurrently optimizing trait-wise MSE or cross-entropy losses (e.g., AEMS, LAILA) (Sun et al., 2024, Bashendy et al., 30 Dec 2025).
- Hybrid and context-augmented approaches: Augment transformer baselines with input-level enrichment—prompt context, margin-ranking loss, segmentation (EDU/argument structures), prompt-aware cross-attention, trait-similarity regularization, and hand-crafted essay features (Chakravarty, 17 Aug 2025, Do et al., 2023).
- LLM-based and hybrid-injected systems: Combine LLMs (zero- or few-shot) with explicit linguistic feature vectors injected into prompts. Feature sets with high correlation to human scoring—e.g., unique word count, lemma count, complex word count—allow open-source LLMs to close performance gaps to supervised models (Hou et al., 13 Feb 2025).
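The feature-injection idea can be illustrated with a toy sketch (the feature set, rubric wording, and prompt template here are hypothetical simplifications; e.g., the length threshold is a crude stand-in for a syllable-based complex-word count):

```python
def linguistic_features(essay):
    """Surface features of the kind reported to correlate with human scores:
    total, unique, and 'complex' word counts (length >= 7 as a rough proxy)."""
    words = [w.strip(".,;:!?'\"").lower() for w in essay.split()]
    words = [w for w in words if w]
    return {
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "complex_word_count": sum(1 for w in words if len(w) >= 7),
    }

def build_scoring_prompt(essay, rubric="Score 1-6 per the holistic rubric."):
    """Inject explicit feature counts into a zero-shot scoring prompt."""
    feats = linguistic_features(essay)
    feat_str = ", ".join(f"{k}={v}" for k, v in sorted(feats.items()))
    return f"{rubric}\nLinguistic features: {feat_str}\nEssay:\n{essay}\nScore:"
```

The point is that the LLM receives the counts as explicit text rather than having to infer them, which is how open-source models narrow the gap to supervised baselines in the cited work.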
Training typically uses supervised protocols, with cross-validation or hold-out splits, label normalization, and early stopping. Hyperparameters (learning rate, batch size, epochs) are systematically grid- or Bayesian-searched, often subject to resource constraints (GPU VRAM, batch fitting) (Ludwig et al., 2021).
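The search and stopping logic can be sketched independently of any training framework (function names and the patience convention are ours; `train_fn` stands in for one full training run returning a validation score):

```python
import itertools

def grid_search(train_fn, grid):
    """Exhaustive grid search: train_fn maps a config dict to a validation
    score (e.g., QWK on the held-out fold); higher is better."""
    best_cfg, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

def early_stop(val_scores, patience=3):
    """Return the best epoch index once `patience` epochs pass without
    improvement on the validation metric."""
    best, best_ep = float("-inf"), 0
    for ep, s in enumerate(val_scores):
        if s > best:
            best, best_ep = s, ep
        elif ep - best_ep >= patience:
            break
    return best_ep
```

Bayesian search replaces the exhaustive product with a surrogate-guided sampler, but the train/score interface is the same.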
4. Evaluation Metrics and Empirical Findings
The standard for measuring AES–human agreement is the Quadratic Weighted Kappa (QWK):

$$\kappa = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}}, \qquad w_{ij} = \frac{(i-j)^2}{(N-1)^2},$$

where $O_{ij}$ is the observed agreement matrix (counts of essays rated $i$ by the system and $j$ by the human), $E_{ij}$ is the expected agreement matrix under rater independence (the outer product of the two marginal histograms, normalized to the same total), and $N$ is the number of score levels (Ludwig et al., 2021, Matsuoka, 2023, Ghazawi et al., 2024).
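QWK maps directly to code (a minimal NumPy sketch; scikit-learn's `cohen_kappa_score` with `weights="quadratic"` computes the same quantity):

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_labels):
    """QWK between two integer-valued raters over scores 0..n_labels-1."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    O = np.zeros((n_labels, n_labels))          # observed agreement counts
    for i, j in zip(a, b):
        O[i, j] += 1
    hist_a, hist_b = O.sum(axis=1), O.sum(axis=0)
    E = np.outer(hist_a, hist_b) / O.sum()      # expected under independence
    idx = np.arange(n_labels)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_labels - 1) ** 2
    return float(1.0 - (W * O).sum() / (W * E).sum())
```

Perfect agreement yields 1.0; chance-level agreement yields 0; systematic disagreement goes negative, and the quadratic weights penalize large score gaps more than off-by-one errors.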
Additional performance criteria:
- F1, Precision, ROC AUC (binary/multiclass tasks)
- Root Mean Squared Error (RMSE), MAE for absolute deviation (Matsuoka, 2023)
- Exact-match and within-one accuracy (Ghazawi et al., 2024)
- Standardized Mean Difference (SMD), conditional score difference (CSD), and MAED for bias/fairness audits (Yang et al., 2024).
Representative results:
- Transformer baselines surpass BOW/logistic regression (BERT: QWK 0.79, F1 0.97, vs. Logistic: QWK 0.30, F1 0.91) (Ludwig et al., 2021).
- Efficient transformers (Electra-small, Mobile-BERT) reach QWK ≈ 0.82 at sub-25M parameters (Ormerod et al., 2021).
- Contextually augmented models (prompt context, argument components, and scalar features) lift QWK from 0.783 (Longformer) to 0.821 (Chakravarty, 17 Aug 2025).
- Human–human agreement (exact match) is often ~25–50%, while transformer models achieve ≥79% on constrained tasks, 96% within-one (Ghazawi et al., 2024).
- Multidimensional scoring (e.g., AEMS, LAILA) achieves analytic-dimension QWK >0.80 (Sun et al., 2024, Bashendy et al., 30 Dec 2025).
Best practices recommend reporting QWK per trait and globally, always complemented by RMSE/MAE/Pearson’s r and bias diagnostics across demographic axes (Yang et al., 2024, Matsuoka, 2023).
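Of the bias diagnostics, SMD is the simplest to compute (a sketch under the common pooled-standard-deviation convention; operational audits should follow the exact definition in the cited work):

```python
import numpy as np

def standardized_mean_difference(scores, group_mask):
    """SMD of predicted scores between a focal demographic group (mask True)
    and the rest, normalized by the pooled standard deviation."""
    s = np.asarray(scores, float)
    m = np.asarray(group_mask, bool)
    g, r = s[m], s[~m]
    pooled = np.sqrt(
        (g.var(ddof=1) * (len(g) - 1) + r.var(ddof=1) * (len(r) - 1))
        / (len(g) + len(r) - 2)
    )
    return float((g.mean() - r.mean()) / pooled)
```

Values near zero indicate comparable score distributions across the groups; an audit computes this per demographic axis and per trait.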
5. Operational, Fairness, and Scalability Considerations
AES systems are deployed as practical adjuncts or alternatives to manual grading, but several factors guide operational design:
- Class imbalance: Must be handled via class weighting, balanced sampling, or oversampling. Overall accuracy is misleading under heavy skew—QWK and ROC AUC should be primary (Ludwig et al., 2021).
- Domain shift: Models are sensitive to the scope of training data. Prompt-specific systems perform better within domain; cross-prompt generalization benefits from engineered features and prompt-independent modeling (Yang et al., 2024, Do et al., 2023).
- Fairness: Economic status is the most bias-prone attribute; SVMs with interpretable features offer the best trade-off between accuracy and fairness in cross-prompt settings (Yang et al., 2024).
- Scalability: Fine-tuned BERT-based systems process essays at <50 ms/essay on GPU; efficient transformers and ensembling allow for further latency/memory optimizations (Ormerod et al., 2021, Matsuoka, 2023).
- Human-in-the-loop: High-confidence predictions can be auto-scored; low-confidence or adversarially uncertain cases routed to expert raters to reduce manual workload and flag potential model errors (Ludwig et al., 2021).
- Ethical use: Transparency in rubric, model confidence, versioning, and auditing for demographic or language pattern bias are mandatory for high-stakes settings (Ludwig et al., 2021).
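The human-in-the-loop routing step can be sketched as follows (a minimal illustration; the threshold value and the source of the confidence estimate are deployment-specific assumptions):

```python
def route_essays(predictions, threshold=0.9):
    """Split model outputs into auto-scored and human-review queues.
    predictions: iterable of (essay_id, score, confidence) triples."""
    auto, manual = [], []
    for essay_id, score, conf in predictions:
        (auto if conf >= threshold else manual).append((essay_id, score))
    return auto, manual
```

Lowering the threshold trades reviewer workload against the risk of auto-committing uncertain scores; the routed fraction is itself a useful monitoring signal.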
6. Domain Extensions and Multilingual AES
AES research has progressively extended into new languages (Portuguese (Matsuoka, 2023, Marinho et al., 2021), Basque (Azurmendi et al., 9 Dec 2025), Arabic (Bashendy et al., 30 Dec 2025, Ghazawi et al., 2024, Qwaider et al., 22 Mar 2025)), genres (argumentative, narrative, source-dependent), and multimodal settings (text + image) (Su et al., 20 May 2025). Trait-based and analytic scoring is increasingly favored for L2 applications and diagnostic feedback.
Key challenges in new domains include:
- Data scarcity: Addressed with large synthetic corpus generation and error injection informed by real error distributions (Qwaider et al., 22 Mar 2025).
- Morphological/orthographic complexity: Careful tokenization, data normalization, and feature design are required (Matsuoka, 2023, Bashendy et al., 30 Dec 2025).
- Multimodality: Multi-agent frameworks like CAFES orchestrate text and image scoring with feedback-reflective loops, achieving 21% QWK boosts over single-agent LLMs (Su et al., 20 May 2025).
- Rubric alignment: Fine-tuning on criterion-annotated data, feedback generation, and per-criterion error highlighting (e.g., Basque Correctness, Punctuation) (Azurmendi et al., 9 Dec 2025).
7. Interpretability, Adversarial Robustness, and Future Directions
Interpretability analyses (e.g., Integrated Gradients) reveal that deep AES models may over-rely on a “word-soup” of salient tokens, neglecting discourse, grammar, and world knowledge. Sorting tokens by attribution shows the vast majority of model score is explained by a small subset of content words; randomizing or adversarially corrupting essays can inflate scores or escape penalization (Parekh et al., 2020). Recommendations include:
- Hierarchical or document-level encoders to capture global structure (Parekh et al., 2020).
- Incorporation of factuality-check or world-knowledge modules to penalize false assertions.
- Adversarial training (e.g., with lies, shuffling, or noise) to robustify against manipulation.
- Multi-task architectures and regularization to distribute attention and improve generalization to unexpected inputs.
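A sentence-shuffle probe illustrates the kind of adversarial check this line of work motivates (a sketch; splitting on periods is a simplification, and `score_fn` stands in for any trained scorer):

```python
import random

def shuffle_probe(essay, score_fn, n_trials=10, seed=0):
    """Compare the score of the intact essay against sentence-shuffled
    copies. A scorer that ignores discourse structure shows little or
    no mean score drop under shuffling."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in essay.split(".") if s.strip()]
    base = score_fn(". ".join(sentences))
    drops = []
    for _ in range(n_trials):
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        drops.append(base - score_fn(". ".join(shuffled)))
    return sum(drops) / len(drops)  # mean score drop under shuffling
```

A near-zero drop flags the "word-soup" failure mode discussed above; analogous probes inject factual corruptions or noise instead of reordering.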
Continued development in cross-linguistic corpora, multimodal modeling, and fairness-directed auditing is critical. Version control, transparency, and continuous retraining with newly annotated essays are best practices for sustaining accuracy and equity (Ludwig et al., 2021, Matsuoka, 2023).
In sum, Automated Essay Scoring now comprises a spectrum of techniques—feature-driven, deep-learning, transformer-based, prompt- and trait-aware, multimodal, and hybrid—supported by explicit rubric alignment and defensible evaluation protocols. Recent progress substantiates both accuracy and operational viability, while the field remains vigilant to the risks of bias, domain shift, and adversarial exploitation.