
Shakkala Neural System for Arabic Diacritization

Updated 20 January 2026
  • Shakkala Neural System is an end-to-end deep learning framework for Arabic diacritization that leverages Bi-LSTMs to model fine-grained orthographic and contextual dependencies.
  • It segments input text into character blocks and processes them with embeddings and bidirectional LSTMs, achieving a Diacritic Error Rate of 2.88% on a cleaned Tashkeela benchmark.
  • The system minimizes manual feature engineering and outperforms previous rule-based and statistical approaches, making it pivotal for applications like text-to-speech synthesis.

The Shakkala Neural System is an end-to-end deep learning framework for automatic Arabic text diacritization, leveraging character-level bidirectional long short-term memory (Bi-LSTM) networks with learned embeddings. Introduced by Barqawi & Zerrouki, Shakkala addresses the insertion of diacritics (short vowels and other phonetic marks) in Arabic texts—a challenge essential for downstream tasks such as text-to-speech synthesis and language learning—by modeling fine-grained orthographic and contextual dependencies at the character level. On a rigorously cleaned benchmark derived from the Tashkeela Corpus, Shakkala achieves a Diacritic Error Rate (DER) of 2.88%, significantly outperforming prior rule-based and statistical approaches (Fadel et al., 2019).

1. System Architecture and Inference Pipeline

Shakkala operates as a character-level sequence labeling model. At inference, input text is segmented into blocks of up to 315 characters (to accommodate model constraints). Each segment undergoes the following processing pipeline:

  • Encoding: Each character is mapped to an integer index over a vocabulary of roughly 60–70 Arabic characters, punctuation marks, and a dedicated “unknown” (UNK) token.
  • Embedding: Characters are embedded via a trainable matrix $E \in \mathbb{R}^{|V| \times d_c}$, producing dense vectors $x_t \in \mathbb{R}^{d_c}$.
  • Sequence Encoding: Two LSTMs operate bidirectionally:
    • The forward LSTM processes the sequence from $x_1$ to $x_T$,
    • The backward LSTM from $x_T$ to $x_1$,
    • Their hidden states at each time step are concatenated: $h_t = [h^f_t; h^b_t] \in \mathbb{R}^{2d_h}$.
  • Prediction: Each concatenated state $h_t$ is projected to a diacritic label distribution via a linear transformation and softmax:

$$z_t = W_o h_t + b_o,\qquad p(y_t = k \mid c_{1:T}) = \text{softmax}(z_t)_k$$

The predicted diacritic label is $\hat{y}_t = \arg\max_{k \in D} p(y_t = k \mid c_{1:T})$.

A schematic overview:

[c₁,...,c_T] → Embedding (E) → [x₁,...,x_T] → Bi-LSTM → [h₁,...,h_T] → Dense + Softmax → [ŷ₁,...,ŷ_T]
The 315-character input constraint is handled by segmenting input strings; concatenating the per-segment outputs reconstitutes the fully diacritized line.
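
The segmentation-and-reassembly step above can be sketched as follows. This is a minimal illustration, not the authors' code; `diacritize_block` is a hypothetical placeholder for the model's per-segment forward pass.

```python
MAX_LEN = 315  # Shakkala's per-segment character limit

def segment(text: str, max_len: int = MAX_LEN) -> list[str]:
    """Split text into consecutive blocks of at most max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

def diacritize(text: str, diacritize_block) -> str:
    """Run the model block-by-block and reassemble the full line."""
    return "".join(diacritize_block(block) for block in segment(text))
```

Because each block is diacritized independently, context is lost at block boundaries; keeping segments as long as the model allows mitigates this.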

2. Mathematical Formulation

Let $c_{1:T}$ denote a sequence over the character vocabulary $V$, with a target diacritic label set $D$ of size $K$ ($K \approx 9$, including “no-diacritic”). The system’s computation decomposes as:

  • Embedding: $x_t = E \cdot \text{onehot}(c_t)$
  • Bidirectional LSTM:

$$h^f_t = \text{LSTM}_f(x_t, h^f_{t-1}),\quad h^b_t = \text{LSTM}_b(x_t, h^b_{t+1}),\quad h_t = [h^f_t; h^b_t]$$

  • Output & Prediction:

$$z_t = W_o h_t + b_o,\quad p(y_t = k \mid c_{1:T}) = \frac{e^{z_{t,k}}}{\sum_{j=1}^K e^{z_{t,j}}}$$

$$\hat{y}_t = \arg\max_{k \in D} p(y_t = k \mid c_{1:T})$$

  • Training Objective: The cross-entropy loss over $N$ training sequences:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^{T_i}\sum_{k=1}^K \mathbf{1}[y_{i,t}=k]\,\log p\left(y_{i,t}=k \mid c^{(i)}_{1:T_i}\right)$$

  • Evaluation Metric: Diacritic Error Rate (DER):

$$\text{DER} = \frac{\#\,\text{diacritic errors}}{\text{total diacritics}} \times 100\%$$
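
The DER metric above can be computed directly from aligned label sequences. The integer label IDs below are illustrative (e.g. 0 = “no-diacritic”), not the system's actual encoding.

```python
def der(reference: list[int], predicted: list[int]) -> float:
    """Diacritic Error Rate over aligned label sequences, in percent."""
    assert len(reference) == len(predicted)
    errors = sum(r != p for r, p in zip(reference, predicted))
    return 100.0 * errors / len(reference)

ref  = [1, 0, 2, 3, 0, 1, 2, 0]
pred = [1, 0, 2, 1, 0, 1, 2, 2]   # two mismatches out of eight
print(der(ref, pred))  # → 25.0
```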

3. Training Data Construction and Preprocessing

Shakkala’s benchmarking relies on a carefully constructed corpus. The procedure is as follows:

  • Source material: Aggregation of 97 Classical Arabic books and 293 Modern Standard Arabic (MSA) documents from the Tashkeela Corpus, with additional inclusion of a simplified Qur’an version.
  • Cleaning: Automated scripts remove HTML tags, URLs, misplaced/separated diacritics (using regular expressions for specific cases such as ending diacritics and Fathatan + Alif corrections), non-Arabic/Kashida characters, and extraneous whitespace. Numbers are separated from letters, and spaces are collapsed.
  • Dataset Statistics: The resulting corpus consists of 55,000 lines (≈2.3M words). Only lines with at least 80% diacritized characters are retained.
  • Corpus Split:

| Subset     | Lines  | Words | Avg. Words/Line |
|------------|--------|-------|-----------------|
| Training   | 50,000 | 2.10M | 42              |
| Validation | 2,500  | 102K  | 42              |
| Test       | 2,500  | 107K  | 42              |

  • Tokenization: Strictly character-level; the only out-of-vocabulary scenario involves rare characters assigned to the UNK token.
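
Cleaning steps in the spirit of those described above can be sketched with regular expressions. These patterns are illustrative assumptions, not the authors' actual scripts; Arabic diacritics occupy Unicode range U+064B–U+0652, and Kashida (tatweel) is U+0640.

```python
import re

HTML_TAG   = re.compile(r"<[^>]+>")
URL        = re.compile(r"https?://\S+")
KASHIDA    = re.compile("\u0640")
# A diacritic not immediately preceded by an Arabic letter is misplaced.
ORPHAN_DIA = re.compile(r"(?<![\u0621-\u064A])[\u064B-\u0652]")
NUM_LETTER = re.compile(r"(?<=[0-9])(?=[\u0621-\u064A])|(?<=[\u0621-\u064A])(?=[0-9])")
SPACES     = re.compile(r"\s+")

def clean_line(line: str) -> str:
    line = HTML_TAG.sub(" ", line)
    line = URL.sub(" ", line)
    line = KASHIDA.sub("", line)
    line = ORPHAN_DIA.sub("", line)
    line = NUM_LETTER.sub(" ", line)   # separate numbers from letters
    return SPACES.sub(" ", line).strip()
```

Special cases mentioned in the source, such as ending-diacritic and Fathatan + Alif corrections, would require additional targeted rules beyond this sketch.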

4. Model Hyperparameters and Optimization

The primary source does not disclose exhaustive hyperparameter settings, but indicative values from the authors’ codebase are:

  • Character embedding dimension: $d_c \approx 100$
  • LSTM hidden size: $d_h \approx 200$ per direction (2 stacked layers)
  • Dropout: 50% between LSTM layers
  • Optimizer: Adam, learning rate $\approx 1 \times 10^{-3}$
  • Batch size: 64 sequences
  • Training duration: 10–20 epochs using early stopping determined by validation DER

The character limit per input (315 characters) is enforced throughout inference.
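
A back-of-the-envelope parameter count under the indicative values above gives a sense of the model's scale. The vocabulary size (70) and the layer-2 input width ($2 d_h$, assuming the layers are stacked after concatenation) are assumptions, so the total is illustrative, not a reported figure.

```python
V, d_c, d_h, K = 70, 100, 200, 9  # assumed vocabulary, embedding, hidden, label sizes

def lstm_params(d_in: int, d_h: int) -> int:
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (d_in * d_h + d_h * d_h + d_h)

embedding = V * d_c
layer1 = 2 * lstm_params(d_c, d_h)        # forward + backward, input width d_c
layer2 = 2 * lstm_params(2 * d_h, d_h)    # input is the concatenated 2*d_h states
output = (2 * d_h) * K + K                # dense projection to K labels

total = embedding + layer1 + layer2 + output
print(f"{total:,}")  # → 1,453,809 (~1.4M parameters under these assumptions)
```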

5. Empirical Performance and Error Analysis

Shakkala is evaluated using two metrics: Diacritic Error Rate (DER) and Word Error Rate (WER), under stringent settings (“no-diacritic” included and “without case ending”). Performance comparison with competing systems is summarized below:

| System          | DER (%) | WER (%) |
|-----------------|---------|---------|
| Farasa          | 23.93   | 53.13   |
| Harakat         | 17.03   | 32.03   |
| MADAMIRA        | 29.94   | 59.07   |
| Mishkal         | 13.78   | 26.42   |
| Tashkeela-Model | 52.96   | 94.16   |
| Shakkala        | 2.88    | 6.53    |
  • Error Analysis: Shakkala exhibits substantial accuracy gains, particularly in final-letter diacritic prediction—a well-established challenge in Arabic NLP. Persisting errors are predominantly observed in rare morphological contexts and with loanwords not well represented in the training set.
  • Ablation: No ablation studies are reported.

6. Contributing Factors, Limitations, and Open Challenges

  • Key Enablers:
    • Character-level embeddings obviate dependence on manual feature engineering, facilitating efficient learning of orthographic/diacritic regularities.
    • Bidirectional LSTMs enable holistic, long-range modeling, crucial for the resolution of syntax- or morphology-dependent diacritic ambiguity.
    • The scale and cleanliness of the Tashkeela-derived corpus (≈2.1M words) support robust statistical learning.
  • Constraints:
    • Hard input segment limit of 315 characters per inference pass (necessitating text segmentation).
    • Lack of a formal API; interaction is via a web interface or direct code inspection.
    • No performance validation on colloquial Arabic or texts outside Classical/MSA genres.
    • Error concentration in low-frequency morphological formations suggests a data-driven limitation.

A plausible implication is that further gains may require either corpus expansion with greater genre diversity or architectural augmentation for better handling of data sparsity (Fadel et al., 2019).
