Shakkala Neural System for Arabic Diacritization
- Shakkala Neural System is an end-to-end deep learning framework for Arabic diacritization that leverages Bi-LSTMs to model fine-grained orthographic and contextual dependencies.
- It segments input text into character blocks and processes them with embeddings and bidirectional LSTMs, achieving a Diacritic Error Rate of 2.88% on a cleaned Tashkeela benchmark.
- The system minimizes manual feature engineering and outperforms previous rule-based and statistical approaches, making it pivotal for applications like text-to-speech synthesis.
The Shakkala Neural System is an end-to-end deep learning framework for automatic Arabic text diacritization, leveraging character-level bidirectional long short-term memory (Bi-LSTM) networks with learned embeddings. Introduced by Barqawi & Zerrouki, Shakkala addresses the insertion of diacritics (short vowels and other phonetic marks) in Arabic texts—a challenge essential for downstream tasks such as text-to-speech synthesis and language learning—by modeling fine-grained orthographic and contextual dependencies at the character level. On a rigorously cleaned benchmark derived from the Tashkeela Corpus, Shakkala achieves a Diacritic Error Rate (DER) of 2.88%, significantly outperforming prior rule-based and statistical approaches (Fadel et al., 2019).
1. System Architecture and Inference Pipeline
Shakkala operates as a character-level sequence labeling model. At inference, input text is segmented into blocks of up to 315 characters (to accommodate model constraints). Each segment undergoes the following processing pipeline:
- Encoding: Each character is mapped to an integer index over a vocabulary of approximately 60–70 Arabic characters, punctuation marks, and a dedicated “unknown” (UNK) token.
- Embedding: Characters are embedded via a trainable matrix $E \in \mathbb{R}^{|V| \times d}$, producing dense vectors $x_t = E[c_t]$.
- Sequence Encoding: Two LSTMs operate bidirectionally:
- The forward LSTM processes the sequence from $t = 1$ to $t = T$, producing states $\overrightarrow{h}_t$,
- The backward LSTM from $t = T$ to $t = 1$, producing states $\overleftarrow{h}_t$,
- Their hidden states at each time step are concatenated: $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.
- Prediction: Each concatenated state is projected to a diacritic label distribution via a linear transformation and softmax: $p(y_t \mid c_{1:T}) = \mathrm{softmax}(W h_t + b)$.
The predicted diacritic label is $\hat{y}_t = \arg\max_{y \in Y} p(y_t = y \mid c_{1:T})$.
A schematic overview:
```
[c₁,...,c_T] → Embedding (E) → [x₁,...,x_T] → Bi-LSTM → [h₁,...,h_T] → Dense + Softmax → [ŷ₁,...,ŷ_T]
```
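The pipeline above can be sketched as a forward pass in NumPy. This is a minimal illustration, not Shakkala's released implementation: all sizes (vocabulary, embedding dimension, hidden size, label count) are placeholder values, weights are random, and a single Bi-LSTM layer stands in for the full stacked configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only, not Shakkala's actual hyperparameters.
V, D, H, L = 70, 8, 16, 16   # vocab size, embed dim, hidden size, labels
T = 12                        # characters in this toy segment

E = rng.normal(0, 0.1, (V, D))   # trainable character embedding matrix

def lstm_params():
    # One stacked weight matrix/bias for the four gates [i, f, o, g].
    return (rng.normal(0, 0.1, (4 * H, D + H)), np.zeros(4 * H))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def run_lstm(xs, params):
    h, c, out = np.zeros(H), np.zeros(H), []
    for x in xs:
        h, c = lstm_step(x, h, c, *params)
        out.append(h)
    return out

fw, bw = lstm_params(), lstm_params()
W_out, b_out = rng.normal(0, 0.1, (L, 2 * H)), np.zeros(L)

chars = rng.integers(0, V, T)                 # encoded character indices
xs = [E[c] for c in chars]                    # embedding lookup
h_f = run_lstm(xs, fw)                        # forward pass, t = 1..T
h_b = run_lstm(xs[::-1], bw)[::-1]            # backward pass, t = T..1
H_cat = [np.concatenate([f, b]) for f, b in zip(h_f, h_b)]  # [h_fwd; h_bwd]

logits = np.stack([W_out @ h + b_out for h in H_cat])  # (T, L) label scores
preds = logits.argmax(axis=1)                 # per-character diacritic label
```

Each character thus receives exactly one label prediction, which is what makes the task a character-level sequence labeling problem rather than a word-level one.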
2. Mathematical Formulation
Let $c = (c_1, \dots, c_T)$ denote a sequence over the character vocabulary $V$, with a target diacritic label set $Y$ (including “no-diacritic”). The system’s computation decomposes as:
- Embedding: $x_t = E[c_t]$, with $E \in \mathbb{R}^{|V| \times d}$
- Bidirectional LSTM: $\overrightarrow{h}_t = \mathrm{LSTM}_f(x_t, \overrightarrow{h}_{t-1})$, $\overleftarrow{h}_t = \mathrm{LSTM}_b(x_t, \overleftarrow{h}_{t+1})$, $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$
- Output & Prediction: $p(y_t \mid c) = \mathrm{softmax}(W h_t + b)$, $\hat{y}_t = \arg\max_{y \in Y} p(y_t = y \mid c)$
- Training Objective: The cross-entropy loss over the $N$ training sequences: $\mathcal{L} = -\sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p\big(y_t^{(n)} \mid c^{(n)}\big)$
- Evaluation Metric: Diacritic Error Rate (DER): $\mathrm{DER} = \dfrac{\#\,\text{incorrectly diacritized characters}}{\#\,\text{diacritized characters}} \times 100\%$
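The two evaluation metrics are straightforward to compute once gold and predicted labels are aligned. A short sketch, using symbolic toy labels rather than real diacritic codes:

```python
def diacritic_error_rate(gold, pred):
    """DER: percentage of characters whose predicted diacritic
    label differs from the reference label."""
    assert len(gold) == len(pred)
    errors = sum(g != p for g, p in zip(gold, pred))
    return 100.0 * errors / len(gold)

def word_error_rate(gold_words, pred_words):
    """WER: a word counts as wrong if any of its diacritics is wrong."""
    assert len(gold_words) == len(pred_words)
    wrong = sum(g != p for g, p in zip(gold_words, pred_words))
    return 100.0 * wrong / len(gold_words)

# Toy labels ('a' = fatha, 'u' = damma, '-' = no-diacritic).
gold = ['a', 'u', '-', 'a', 'i']
pred = ['a', 'a', '-', 'a', 'i']
print(diacritic_error_rate(gold, pred))  # → 20.0
```

Because a single wrong diacritic flips an entire word, WER is always at least as large as DER, which is visible in the comparison table below (e.g. 2.88% DER vs. 6.53% WER for Shakkala).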
3. Training Data Construction and Preprocessing
Shakkala’s benchmarking relies on a carefully constructed corpus. The procedure is as follows:
- Source material: Aggregation of 97 Classical Arabic books and 293 Modern Standard Arabic (MSA) documents from the Tashkeela Corpus, with additional inclusion of a simplified Qur’an version.
- Cleaning: Automated scripts remove HTML tags, URLs, misplaced/separated diacritics (using regular expressions for specific cases such as ending diacritics and Fathatan + Alif corrections), non-Arabic/Kashida characters, and extraneous whitespace. Numbers are separated from letters, and spaces are collapsed.
- Dataset Statistics: The resulting corpus consists of 55,000 lines (≈2.3M words). Only lines with at least 80% diacritized characters are retained.
- Corpus Split:
| Subset     | Lines  | Words | Avg. Words/Line |
|------------|--------|-------|-----------------|
| Training   | 50,000 | 2.10M | 42              |
| Validation | 2,500  | 102K  | 42              |
| Test       | 2,500  | 107K  | 42              |
- Tokenization: Strictly character-level; the only out-of-vocabulary scenario involves rare characters assigned to the UNK token.
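The character-level tokenization with UNK fallback can be sketched as follows. The vocabulary here is a small hypothetical subset; Shakkala's actual inventory is larger (roughly 60–70 symbols) and is defined in its released dictionary files.

```python
# Hypothetical sample vocabulary: a subset of Arabic letters plus space.
ARABIC_SAMPLE = list("ابتثجحخدذرزسشصضطظعغفقكلمنهوي ")
PAD, UNK = 0, 1  # reserved indices
char2idx = {c: i + 2 for i, c in enumerate(ARABIC_SAMPLE)}

def encode(text):
    """Map each character to its integer index; rare or unseen
    characters collapse to the dedicated UNK token."""
    return [char2idx.get(c, UNK) for c in text]

ids = encode("كتب 123")   # digits fall outside this sample vocabulary
```

With this scheme there is no word-level out-of-vocabulary problem at all: any unseen symbol simply becomes UNK, while every known character keeps a stable index.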
4. Model Hyperparameters and Optimization
The primary source does not disclose exhaustive hyperparameter settings, but indicative values from the authors’ codebase are:
- Character embedding dimension:
- LSTM hidden size: per direction (2 stacked layers)
- Dropout: 50% between LSTM layers
- Optimizer: Adam, learning rate
- Batch size: 64 sequences
- Training duration: 10–20 epochs using early stopping determined by validation DER
The character limit per input (315 characters) is enforced throughout inference.
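A caller therefore has to split longer inputs before inference. The sketch below is one plausible segmentation policy (breaking at the last space inside each window so words stay intact); the released tool's exact splitting behavior may differ.

```python
MAX_LEN = 315  # hard per-segment character limit noted above

def segment(text, max_len=MAX_LEN):
    """Greedily split text into segments of at most max_len characters,
    preferring to break at the last space within the window."""
    segments = []
    while len(text) > max_len:
        cut = text.rfind(" ", 0, max_len + 1)
        if cut <= 0:          # no usable space: hard cut mid-word
            cut = max_len
        segments.append(text[:cut])
        text = text[cut:].lstrip(" ")
    if text:
        segments.append(text)
    return segments

parts = segment("كلمة " * 200)   # ~1000 characters → several segments
```

Each segment is then diacritized independently, which also means context spanning a segment boundary is invisible to the model.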
5. Empirical Performance and Error Analysis
Shakkala is evaluated using two metrics: Diacritic Error Rate (DER) and Word Error Rate (WER), under stringent settings (“no-diacritic” included and “without case ending”). Performance comparison with competing systems is summarized below:
| System | DER (%) | WER (%) |
|---|---|---|
| Farasa | 23.93 | 53.13 |
| Harakat | 17.03 | 32.03 |
| MADAMIRA | 29.94 | 59.07 |
| Mishkal | 13.78 | 26.42 |
| Tashkeela-Model | 52.96 | 94.16 |
| Shakkala | 2.88 | 6.53 |
- Error Analysis: Shakkala exhibits substantial accuracy gains, particularly in final-letter diacritic prediction—a well-established challenge in Arabic NLP. Persistent errors are predominantly observed in rare morphological contexts and with loanwords not well represented in the training set.
- Ablation: No ablation studies are reported.
6. Contributing Factors, Limitations, and Open Challenges
- Key Enablers:
- Character-level embeddings obviate dependence on manual feature engineering, facilitating efficient learning of orthographic/diacritic regularities.
- Bidirectional LSTMs enable holistic, long-range modeling, crucial for the resolution of syntax- or morphology-dependent diacritic ambiguity.
- The scale and cleanliness of the Tashkeela-derived corpus (≈2.1M words) support robust statistical learning.
- Constraints:
- Hard input segment limit of 315 characters per inference pass (necessitating text segmentation).
- Lack of a formal API; interaction is via a web interface or direct code inspection.
- No performance validation on colloquial Arabic or texts outside Classical/MSA genres.
- Error concentration in low-frequency morphological formations suggests a data-driven limitation.
A plausible implication is that further gains may require either corpus expansion with greater genre diversity or architectural augmentation for better handling of data sparsity (Fadel et al., 2019).