
HieroGlyphTranslator Project

Updated 22 January 2026
  • HieroGlyphTranslator is an end-to-end system that automates the recognition and translation of ancient Egyptian hieroglyphs using advanced computer vision and neural sequence modeling.
  • It employs a multimodal pipeline combining robust image segmentation, metric-learning-based glyph classification, and Transformer-based translation to handle complex layouts and scarce data.
  • The project leverages synthetic data augmentation and expert annotation to enhance segmentation accuracy, glyph recognition, and translation quality, making it valuable for digital humanities research.

The HieroGlyphTranslator Project focuses on the end-to-end automated recognition and translation of ancient Egyptian hieroglyphs from images into English text, integrating modern computer vision, sequence modeling, and data-augmentation techniques. The research area is characterized by multimodal, multi-stage architectures tailored to the unique graphical structure, class imbalance, scarce data, and linguistic ambiguity inherent in ancient Egyptian script. The state of the art builds upon dense segmentation, metric-learning classifiers for glyph recognition, neural sequence translation, and synthetic data generation to enable scalable text extraction and analysis from historical sources.

1. Image Segmentation and Glyph Isolation

The translation pipeline for hieroglyphic texts begins with robust image segmentation, necessary due to the complex layouts and degraded physical state of Egyptian sources. Preprocessing employs the Hough transform to localize grid structures, followed by morphological rules for cleaning and normalization of column/row geometry. Segmentation fuses contour detection (as in Gong et al. 2018) with instance segmentation via Detectron2 Mask R-CNN, merging outputs to compensate for under-segmentation in either component. The workflow addresses false negatives by supplementing Mask R-CNN miss cases with connected components from contour analysis (Nasser et al., 3 Dec 2025).
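The fusion step above — supplementing Mask R-CNN misses with connected components from contour analysis — can be sketched in a simplified form. This is an illustrative reimplementation, not the project's code: `connected_components` and `merge_detections` are hypothetical helpers operating on a pre-binarized image and axis-aligned bounding boxes.

```python
from collections import deque

def connected_components(binary):
    """4-connected component labeling on a binary grid (list of lists of 0/1).
    Returns a list of components, each a list of (row, col) pixels."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and not seen[r][c]:
                comp, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(comp)
    return components

def merge_detections(mask_rcnn_boxes, components):
    """Supplement Mask R-CNN detections (y0, x0, y1, x1 boxes) with any
    connected component whose bounding box overlaps no existing detection,
    compensating for under-segmentation in either branch."""
    def bbox(comp):
        ys = [p[0] for p in comp]
        xs = [p[1] for p in comp]
        return (min(ys), min(xs), max(ys), max(xs))
    def overlaps(a, b):
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])
    merged = list(mask_rcnn_boxes)
    for comp in components:
        box = bbox(comp)
        if not any(overlaps(box, d) for d in mask_rcnn_boxes):
            merged.append(box)
    return merged
```

In the real pipeline the binary input would come from Hough-guided grid localization and morphological cleanup; here it stands in for that preprocessing.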

For Maya hieroglyphs, analogous segmentation employs fine-tuned foundation models: Segment Anything Model (SAM) with a ViT-Base backbone, where only the mask decoder is updated during training. The loss is a sum of binary cross-entropy and Dice, and expert-curated polygon masks are used for supervision. Fine-tuned models outperform standard SAM and classical UNet/AEs on metrics including Intersection-over-Union (IoU) and Dice, especially under block-specific prompting (e.g., IoU 0.439 with two-point prompts) (Shivam et al., 2024).
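The combined BCE + Dice objective and the IoU metric can be written out directly. This is a minimal NumPy sketch of the standard formulas, not the authors' training code; function names are illustrative.

```python
import numpy as np

def bce_dice_loss(pred, target, eps=1e-6):
    """Sum of binary cross-entropy and Dice loss, the objective used to
    fine-tune the SAM mask decoder. pred: predicted mask probabilities
    in (0, 1); target: binary ground-truth mask (same shape)."""
    pred = np.clip(pred, eps, 1 - eps)
    bce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    intersection = np.sum(pred * target)
    dice = 1 - (2 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)
    return bce + dice

def iou(pred_mask, target_mask):
    """Intersection-over-Union between two binary masks."""
    inter = np.logical_and(pred_mask, target_mask).sum()
    union = np.logical_or(pred_mask, target_mask).sum()
    return inter / union if union else 1.0
```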

2. Glyph Classification and Code Mapping

Segmented glyph images are individually classified, primarily into Gardiner codes—the canonical symbolic ontology for Egyptian signs. HieroGlyphTranslator uses a ResNet-50 CNN, adapted for 291-class classification. The system is trained with data-augmented images to offset domain and sample imbalance; the softmax output is mapped to a code via argmax. Training and validation accuracy are high (∼99.7%/95.8%), but test accuracy is 81.8%, with per-class F1 scores reflecting uneven class sizes (Nasser et al., 3 Dec 2025).
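The softmax-to-Gardiner-code mapping is a small step worth making concrete. In this sketch the code list is a hypothetical subset of the 291-class label set; the real index-to-code table comes from the training label encoder.

```python
import numpy as np

# Hypothetical subset of the 291 Gardiner codes the classifier predicts over.
GARDINER_CODES = ["A1", "D21", "G17", "M17", "V31", "Z1"]

def logits_to_code(logits, codes=GARDINER_CODES):
    """Map raw classifier logits to a Gardiner code via softmax + argmax,
    returning the code and its softmax confidence."""
    z = np.asarray(logits, dtype=float)
    exp = np.exp(z - z.max())  # shift for numerical stability
    probs = exp / exp.sum()
    return codes[int(probs.argmax())], float(probs.max())
```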

The OCR-PT-CT Project proposes a Deep Metric Learning (DML) approach using a MobileNetV2 trunk. The encoder outputs a 128-dimensional L2-normalized embedding, optimized with a contrastive loss L(s, y) = (1 − y)(1 − s)² + y · max(s − m, 0)², with margin m = 0.5, s the cosine similarity between embeddings, and y the binary pair label. After training, classification operates via nearest centroid in embedding space, providing robustness to rare sign classes and incremental extensibility with few-shot samples. The DML system attains 97.70% accuracy on a 140-class test set and >80% F1 even for classes with <10 training examples (Fuentes-Jimenez et al., 30 Dec 2025).
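Both the contrastive loss and the nearest-centroid decision rule are compact enough to sketch. This is an illustrative reimplementation under the convention that y = 1 marks a different-class (negative) pair, which matches the sign structure of the formula; the original paper may use the opposite labeling.

```python
import numpy as np

def contrastive_loss(s, y, m=0.5):
    """Pairwise contrastive loss on cosine similarity s:
    y = 0 (same-class pair) pulls s toward 1;
    y = 1 (different-class pair) pushes s below margin m."""
    return (1 - y) * (1 - s) ** 2 + y * max(s - m, 0.0) ** 2

def nearest_centroid(embedding, centroids):
    """Classify an L2-normalized embedding by highest cosine similarity
    to per-class centroids (dict: label -> unit vector). New or rare
    classes only require adding a centroid, enabling few-shot extension."""
    return max(centroids, key=lambda label: float(embedding @ centroids[label]))
```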

Synthetic‐data approaches accelerate classifier design where annotated photos are scarce. Creed (Creed, 2 Apr 2025) demonstrates neural style transfer (NST)—applying real glyph photo styles to a typeface base—producing synthetic datasets that, when used for classifier training, generalize to unseen images with up to 99% in-domain classification accuracy and 74% on unique external test sets.

3. Sequence Assembly, Transliteration, and Neural MT

Following glyph-level classification, codes are ordered according to document layout (columnar or row-wise), reconstructing encoded symbol sequences as input to translation modules. The canonical pipeline converts recognized codes to phonetic transliteration (e.g., “V31 Z1 M17 …”).
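Ordering classified glyphs into a code sequence amounts to grouping bounding boxes into columns or rows and sorting within each group. The sketch below is a simplified illustration, not the project's implementation; it assumes right-to-left columns read top-to-bottom (one common hieroglyphic convention — actual reading direction varies by monument), and the `tol` pixel threshold is a hypothetical grouping parameter.

```python
def reading_order(boxes, columnar=True, tol=20):
    """Order glyph bounding boxes (x, y, w, h) into a code sequence.
    Columnar layouts: right-to-left columns, top-to-bottom within each;
    row-wise layouts: top-to-bottom rows, left-to-right within each.
    Boxes whose column/row coordinate differs by < tol pixels are grouped."""
    key = (lambda b: b[0]) if columnar else (lambda b: b[1])
    groups = []
    for box in sorted(boxes, key=key, reverse=columnar):
        if groups and abs(key(groups[-1][0]) - key(box)) < tol:
            groups[-1].append(box)  # same column/row as previous box
        else:
            groups.append([box])    # start a new column/row
    inner = (lambda b: b[1]) if columnar else (lambda b: b[0])
    return [b for g in groups for b in sorted(g, key=inner)]
```

The resulting ordered boxes are then replaced by their predicted Gardiner codes to form the translation input sequence.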

For full translation, a sequence-to-sequence Transformer (OpenNMT implementation) processes transliteration as source tokens and outputs English text. The architecture is a standard multi-head self-attention encoder-decoder, with subword vocabulary for transliteration and English. Translation performance, evaluated by BLEU, reaches 42.2 in (Nasser et al., 3 Dec 2025), compared to 22.4 for earlier baselines, and comparable or superior to other contemporary vision-to-language frameworks (Chen et al., 2024).

LogogramNLP (Chen et al., 2024) demonstrates that visual encodings (ViT-MAE-based “PIXEL-MT” encoder-decoder) surpass text-based T5/BPE pipelines when annotated text and symbol inventories diverge, yielding BLEU 29.16 for EGY→EN translation. Direct pixel-based models sidestep the OCR-error-induced performance collapse observed in textual approaches (character error rates reach 50–70% in noisy photographs for ZHO and AKK scripts), and visual pretraining on large parallel corpora is essential for maximizing BLEU.

4. Data Resources, Annotation, and Augmentation

HieroGlyphTranslator and related projects draw on multiple annotated and synthetic datasets:

  • Morris Franken dataset: 4,210 glyphs from real plates, 171 classes, extended to 5,430 images/291 classes via new data (Nasser et al., 3 Dec 2025).
  • Coffin Texts/Pyramid Texts (OCR-PT-CT): 262 original codes, filtered to 140 for benchmarking. Train/val/test/held-out splits are 26,653/5,874/5,874/4,622, with detailed augmentation (rotations, noise, occlusions) to simulate pictorial variance (Fuentes-Jimenez et al., 30 Dec 2025).
  • Synthetic NST datasets: e.g., 3,107 generated hieroglyph images (34 classes, stylized with archaeological photo texture) (Creed, 2 Apr 2025).
  • EgyptianTranslation: 12,938 sentences in Gardiner/transliteration–English pairs, covering funerary, literary, and historical genres (Nasser et al., 3 Dec 2025).
  • LogogramNLP: 2,337 Thot Sign List lines (EGY) for translation, with parallel English (Chen et al., 2024).

Annotation tools include LabelMe, Roboflow, and custom ROI-drawing UIs. Final outputs are typically tabular: spell/witness/tokens, MdC codes, segmentation coordinates, and classifier confidence.

5. Evaluation, Error Analysis, and Best Practices

Performance is systematically measured by accuracy (classification), macro-F1 (rare-class robustness), BLEU (translation), and segmentation IoU/Dice.

Model/Approach           | Glyph Acc (Test) | BLEU (Translation) | F1 (Rare)
ResNet-50 (291-class)    | 81.8%            | 42.22              | 0.805
CNN-End2End (140-class)  | 93.87%           | N/A                | ~0.3
Deep-MML (140-class)     | 97.70%           | N/A                | >0.8
NST-CNN (in-domain, 34)  | 99%              | N/A                | 0.99
PIXEL-MT (LogogramNLP)   | N/A              | 29.16              | N/A

Common failure modes include misclassification of visually similar glyphs, over-/under-segmentation errors, and compounding translation errors. Limited parallel corpora size restricts translation fluency, especially for longer, complex sentences. Augmentation and expert-in-the-loop correction ameliorate these issues.

Best practices, as synthesized from project recommendations, include:

  • Prioritizing synthetic-augmented and expert-annotated datasets.
  • Adopting metric learning for scalable glyph recognition under class imbalance and incremental expansion.
  • Applying visual–seq2seq translation pipelines for maximal robustness where transcription resources are lacking.
  • Exporting machine-readable outputs (CSV, JSON) for downstream NLP, integration with broader Egyptological corpora (e.g., MORTEXVAR).
  • Visualizing the full chain: input image → segmentation → code → transliteration → English, with interface support for expert override at any stage.
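The export recommendation above can be made concrete with stdlib serialization. This is an illustrative sketch: the function name and field names are hypothetical, loosely following the tabular schema described in Section 4 (spell, witness, MdC code, segmentation coordinates, classifier confidence); they would be adjusted to match the target corpus such as MORTEXVAR.

```python
import csv
import io
import json

FIELDS = ["spell", "witness", "mdc_code", "bbox", "confidence"]

def export_rows(records, fmt="csv"):
    """Serialize pipeline output records (list of dicts keyed by FIELDS)
    to CSV or JSON for downstream NLP and corpus integration."""
    if fmt == "json":
        return json.dumps(records, ensure_ascii=False, indent=2)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```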

6. Extensions, Limitations, and Future Research

Current systems are limited by the scale of annotated translation corpora, segmentation quality in noisy field photographs, and coverage of the rarest and compound glyph forms; future research targets these constraints through corpus expansion, more robust segmentation, and broader glyph coverage.

The HieroGlyphTranslator conceptual framework, by applying robust segmentation, metric-learning-based classification, and neural sequence-to-sequence translation, establishes a generalizable pattern for the digital processing and philological study of complex logographic languages. It supports both research and digital humanities workflows, with accuracy exceeding 93% on real-world hieroglyphic manuscripts and modular adaptability to new scripts, translation tasks, and linguistic domains (Fuentes-Jimenez et al., 30 Dec 2025, Nasser et al., 3 Dec 2025, Chen et al., 2024, Creed, 2 Apr 2025, Shivam et al., 2024).
