
EgyptianTranslation Dataset

Updated 10 December 2025
  • EgyptianTranslation Dataset is a comprehensive resource comprising annotated hieroglyph images and a parallel corpus of Ancient Egyptian texts.
  • It enables end-to-end pipelines combining Detectron2-based image segmentation, ResNet50 glyph classification, and Transformer-based translation, achieving a BLEU score of 42.22.
  • Robust image augmentation and meticulous annotation protocols mitigate class imbalance and segmentation challenges, though issues persist with low-sample and damaged glyphs.

The EgyptianTranslation dataset is a dual-resource compilation providing both an extensive collection of annotated Egyptian hieroglyph images for symbol classification and a parallel corpus of Ancient Egyptian texts aligned with expert English translations. This dataset is a cornerstone for machine learning-based recognition and translation of hieroglyphic writing, enabling end-to-end pipelines capable of converting real-world inscriptions into fluent English renderings. Primarily developed for and evaluated within the HieroGlyphTranslator project, it forms a comprehensive benchmark for symbol detection, transliteration, and automatic translation tasks involving the complex, semasiographic Egyptian writing system (Nasser et al., 3 Dec 2025).

1. Dataset Composition

The EgyptianTranslation dataset comprises two tightly linked subcomponents:

  • Image-Based Hieroglyph Dataset: Aggregates 5,430 high-quality glyph images spanning 291 Gardiner code classes. The images are sourced from:
    • Ten digitized wall-inscription plates (Pyramid of Unas, Morris Franken), providing 4,210 segmented glyph instances across 171 classes.
    • An auxiliary set of 1,220 glyphs, introducing an additional 120 Gardiner classes, with photography and manual annotation.
    • Heavy on-the-fly image augmentation (rotations, flips, color jitter, cropping) yields 102,401 images for training, expanding effective class coverage and alleviating class imbalance.
  • Parallel Text Corpus: Contains 150 Ancient Egyptian texts drawn from funerary, literary, and historical domains, totaling 12,938 sentences. Each source text is paired with a gold-standard English translation sourced from the Fayrouz Rose GitHub repository; the corpus averages approximately 86 sentence pairs per text.

The image and text subdatasets together enable full pipelines from raw inscription images to sentence-level English output.
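
As a concrete illustration, the parallel corpus reduces to aligned source/target sentence pairs. The pairing function and the toy sentences below are illustrative assumptions, not the dataset's documented file format; only the corpus statistics (150 texts, 12,938 sentences) come from the description above:

```python
# Minimal sketch of pairing a parallel corpus, assuming a
# one-sentence-per-line layout (an assumption, not the documented format).

def load_parallel(src_lines, tgt_lines):
    """Pair source (transliteration) lines with target (English) lines."""
    return [(s.strip(), t.strip()) for s, t in zip(src_lines, tgt_lines)
            if s.strip() and t.strip()]

# Toy sentences standing in for real corpus files.
src = ["i.n=f r niwt", "Dd.n=f"]
tgt = ["He came to the town", "He said"]
pairs = load_parallel(src, tgt)

# Reported corpus statistics: 12,938 sentence pairs across 150 texts,
# i.e. roughly 86 sentences per text on average.
avg_per_text = 12938 / 150
```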

2. Annotation Protocols

  • Glyph Image Labeling: Each symbol is tagged with its canonical Gardiner sign code (e.g., V31, Z1, M17). The Franken set includes pre-attributed codes; the 120 newly introduced classes are manually mapped by reference to Gardiner’s Sign List.
  • Parallel Text Alignment:
    • Source Side: Consists of strings of predicted Gardiner codes output by the classification model.
    • Transliteration: Gardiner code sequences are converted to standardized Middle Egyptian phonetic transcriptions using a deterministic finite-state transducer (FST; cf. FSMNLP 2011).
    • Target Side: English translations provided at the sentence level.
  • Preprocessing:
    • Images undergo Hough transform-based cropping, contrast normalization, and binarization.
    • Post-segmentation, partial or irrelevant detections are filtered; bounding box and segmentation normalization ensures consistent resolution (256×256 px for classification).
  • Text Curation: Parallel corpus is refined for sentence alignment; no explicit train/dev/test split is given, but standard OpenNMT conventions (80/10/10 by sentence) are assumed.
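
The assumed 80/10/10 sentence-level split can be sketched as follows; the function and seed are illustrative conventions, not part of any published protocol:

```python
import random

def split_sentences(pairs, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle sentence pairs and split into train/dev/test portions."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_dev = int(n * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

# With 12,938 sentence pairs this yields 10,350 / 1,293 / 1,295.
```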

3. Data Acquisition and Infrastructure

  • Digitization and Photography: The original Pyramid of Unas plates were digitized from Piankoff (1969). Subsequent collections augmented the set with in-situ and museum-grade scans. Commercial image stocks (e.g., AdobeStock, AlamyStock) provided additional samples for segmentation and mask-training.
  • Annotation Tools: LabelMe (JSON) facilitates glyph segmentation labeling, while Roboflow prepares Detectron2-compatible segmentation masks. Images are stored in JPEG or PNG, with classification sets resized and segmentation sets at original resolutions.
  • Quality Control: Manual review eliminates degraded glyphs and visually ambiguous/misclassified examples.
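
LabelMe stores each glyph outline as a labeled polygon in a JSON record. A minimal sketch of extracting a bounding box from such an annotation (the sample record is fabricated; real files also carry image paths and dimensions):

```python
import json

# Fabricated LabelMe-style record; real annotations also store
# imagePath, imageHeight, imageWidth, and other fields.
annotation = json.loads("""
{
  "shapes": [
    {"label": "M17",
     "shape_type": "polygon",
     "points": [[10, 12], [40, 15], [38, 90], [8, 88]]}
  ]
}
""")

def polygon_bbox(points):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a polygon."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

shape = annotation["shapes"][0]
bbox = polygon_bbox(shape["points"])   # (8, 12, 40, 90)
```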

4. Integration with HieroGlyphTranslator Pipeline

The EgyptianTranslation dataset directly supplies the inputs for a five-stage pipeline in the HieroGlyphTranslator architecture:

  1. Preprocessing: Detects column/row structure using Hough transforms and applies denoising.
  2. Segmentation: Employs a hybrid of contour detection and Detectron2 instance segmentation to localize glyphs.
  3. Classification: Each glyph is classified using a ResNet50 model (fine-tuned on ImageNet; 291 output classes).
  4. Gardiner Mapping & Transliteration: Sequences of predicted codes are mapped to Middle Egyptian transliterations using FSMs.
  5. Seq2Seq Translation: Transliterated sentences are translated to English using a Transformer-based encoder-decoder (OpenNMT, 4 layers each, 8 attention heads, model dimension 512).

The dataset's schema thus supports the full workflow from multi-glyph plate images to fluent English output.
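
To make stage 4 concrete, here is a toy stand-in for the Gardiner-to-transliteration FST. The lookup table covers only a few well-known uniliteral signs and ignores multiliterals, determinatives, and ordering rules that the real transducer must handle:

```python
# Toy Gardiner-code-to-transliteration lookup; a real FST also handles
# multiliteral signs, determinatives, and phonetic complements.
UNILITERALS = {
    "M17": "i",   # reed leaf
    "G17": "m",   # owl
    "N35": "n",   # water ripple
    "D21": "r",   # mouth
    "V31": "k",   # basket with handle
}

def transliterate(codes):
    """Concatenate sign transliterations, skipping unknown codes."""
    return "".join(UNILITERALS.get(c, "") for c in codes)

# M17 + G17 + N35 spells "imn" (the divine name Amun).
result = transliterate(["M17", "G17", "N35"])
```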

5. Model Training Regimen and Evaluation

Classification Model (ResNet50):

  • Architecture: ResNet50 backbone, GlobalAveragePooling, Dense(512, ReLU) with Dropout(0.5), Dense(291, softmax).
  • Optimizer: Adam, learning rate ~1×10⁻⁴.
  • Training: Batch size 32, 30 epochs with early stopping; categorical cross-entropy loss.
  • Data split: 60% train, 20% validation, 20% test.

Translation Model (OpenNMT Transformer):

  • Encoder/Decoder: 4 layers each, 512 units, 2048 FFN, 8 attention heads.
  • Dropout: 0.1; Adam optimizer with warm-up.
  • Training: Up to 100,000 steps; token-level cross-entropy loss; convergence by validation perplexity.
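
Expressed as an OpenNMT-py training configuration, these hyperparameters would look roughly like the fragment below. Key names follow recent OpenNMT-py releases and may differ across versions; file paths and the warm-up step count are placeholders:

```yaml
# Sketch of an OpenNMT-py config matching the reported hyperparameters.
# Paths and warmup_steps are placeholders; key names vary by version.
data:
  corpus_1:
    path_src: data/train.translit
    path_tgt: data/train.en
  valid:
    path_src: data/valid.translit
    path_tgt: data/valid.en

encoder_type: transformer
decoder_type: transformer
enc_layers: 4
dec_layers: 4
hidden_size: 512
transformer_ff: 2048
heads: 8
dropout: [0.1]

optim: adam
decay_method: noam
warmup_steps: 8000
train_steps: 100000
```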

Metrics:

  • Classification:
    • Final test accuracy: 81.8%
    • Precision: 0.8724; Recall: 0.8052; F1 score computed as $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
  • Translation:
    • Perplexity: 1.4407
    • BLEU score for hieroglyph→English: 42.22 (outperforming prior state of the art: 22.38).
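
The reported precision and recall imply the following F1 value (pure arithmetic from the numbers above):

```python
# Harmonic mean of the reported test precision and recall.
precision = 0.8724
recall = 0.8052
f1 = 2 * precision * recall / (precision + recall)
# f1 ≈ 0.8375
```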
Model           Main Metric   Result   Reference
Classification  Accuracy      81.8%    (Nasser et al., 3 Dec 2025)
Translation     BLEU          42.22    (Nasser et al., 3 Dec 2025)

MT BLEU is computed as $\operatorname{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$, where $BP$ is the brevity penalty, $p_n$ the modified $n$-gram precision, and $w_n$ the corresponding weight.
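
A direct stdlib implementation of this formula, with uniform weights $w_n = 1/N$; the toy precisions and lengths below are illustrative, not corpus values:

```python
import math

def bleu(precisions, cand_len, ref_len):
    """BLEU from modified n-gram precisions and length statistics."""
    n = len(precisions)
    if any(p == 0 for p in precisions):
        return 0.0
    # Brevity penalty: 1 if the candidate is at least as long as the reference.
    bp = 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)
    log_avg = sum((1 / n) * math.log(p) for p in precisions)
    return bp * math.exp(log_avg)

# Toy example: two n-gram orders, candidate longer than reference.
score = bleu([0.8, 0.6], cand_len=12, ref_len=10)
# score = sqrt(0.8 * 0.6) ≈ 0.6928
```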

6. Principal Limitations and Open Challenges

  • Class Imbalance: Several Gardiner codes have <10 samples; the imbalance is mitigated but not fully eliminated by augmentation.
  • Segmentation Error: Low-contrast or damaged carvings cause failures, despite hybridization with Detectron2; further domain-adaptive masking is suggested.
  • Ambiguous/Damaged Glyphs: Visually similar signs can still be misclassified post-curation.
  • Parallel Corpus Size: The translation set (12,938 sentences) leads to occasional ungrammatical English; increasing corpus breadth and leveraging pre-trained language models are proposed as enhancements.
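
Beyond augmentation, a common complementary mitigation for class imbalance (not described in the paper) is inverse-frequency class weighting during training; a minimal sketch:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its sample count, normalized
    so the weights average to 1 across classes."""
    counts = Counter(labels)
    raw = {cls: 1 / n for cls, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {cls: w * scale for cls, w in raw.items()}

# Rare Gardiner codes (e.g. those with <10 samples) get larger weights.
weights = inverse_frequency_weights(["M17"] * 90 + ["V31"] * 10)
# weights["M17"] = 0.2, weights["V31"] = 1.8
```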

7. Research Impact and Prospects

The EgyptianTranslation dataset supplies a modular resource for benchmarking and advancing deep learning approaches in both ancient script recognition and translation. Applications include:

  • Training and evaluation for end-to-end systems on real-world document photographs.
  • Testing symbol classification and transliteration robustness against the full Gardiner sign inventory.
  • Benchmarking translation systems from Middle Egyptian to English, supporting both linguistic and NLP research.

The dataset is foundational for future work integrating larger parallel corpora, improved segmentation with domain-specific adaptation, and the incorporation of pre-trained LLMs to further raise translation accuracy and naturalness (Nasser et al., 3 Dec 2025).

References (1)
