
Morris Franken Dataset Overview

Updated 10 December 2025
  • Morris Franken Dataset is a hieroglyphic sign-image corpus with annotated Gardiner codes, enabling automated recognition of Egyptian hieroglyphs.
  • It incorporates rigorous preprocessing, including RGB conversion, resizing to 224x224, and normalization, to support CNN-based classification models.
  • The dataset underpins projects like HieroGlyphTranslator by offering structured train/validation/test splits and employing augmentation to tackle class imbalance.

The Morris Franken dataset is a hieroglyphic sign-image dataset extracted from the wall relief plates of the Pyramid of Unas, the last king of the Fifth Dynasty. Originally assembled by Piankoff (1969) and later disseminated by Franken & Van Gemert (2013), it serves as a primary resource for research into the automatic recognition of Egyptian hieroglyphs. Its central aim is to provide annotated examples of individual hieroglyphic signs, each linked to standard Gardiner classification codes, enabling the development and benchmarking of machine learning models for single-sign recognition. The dataset has undergone substantive expansions and preprocessing to meet the specialized demands of deep-learning pipelines, such as those employed in the HieroGlyphTranslator project (Nasser et al., 3 Dec 2025).

1. Historical Origin and Purpose

The “Morris Franken” dataset consists of hieroglyphic sign images systematically extracted from high-resolution photographic plates documenting the Pyramid of Unas. The initial assembly by Piankoff (1969) focused on providing a research corpus for Egyptology. Franken & Van Gemert (2013) later prepared the collection as a machine-readable resource designed for computational studies in sign recognition. The dataset’s core function is to supply example images for each hieroglyph class, annotated according to the Gardiner code system—a canonical scheme specifying both the semantic and graphic identity of hieroglyphic signs. These annotations both enable supervised learning for single-sign identification and serve as ground-truth reference points for downstream translation pipelines.

2. Dataset Composition, Class Distribution, and Extension

The original release comprises 10 photographic plates, manually segmented into 4,210 crop images, each displaying a single hieroglyph. There are 171 distinct Gardiner classes in this set, with class frequencies spanning a considerable range: the most populous class contains approximately 75 instances, while the rarest has about 5, giving an imbalance ratio $\text{IR} \approx 15$. Subsequent extensions to the dataset introduced an additional 120 sign classes—yielding 291 unique Gardiner codes and a total of 5,430 natural sign images.

Dataset State       | Number of Classes (C) | Number of Samples (N)
Original release    | 171                   | 4,210
Extended (+120)     | 291                   | 5,430
Post-augmentation   | 291                   | 102,401

Each class $i$ contains $N_i$ samples, such that $\sum_{i=1}^{C} N_i = N$. The class proportions $p_i = N_i / N$ satisfy $\sum_{i=1}^{C} p_i = 1$. Augmentation during training expanded the effective dataset size to $N_\text{aug} = 102{,}401$ via geometric and photometric perturbations.
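
The bookkeeping above can be sketched in a few lines; the label counts here are illustrative, not the dataset's actual per-class tallies:

```python
# Sketch: class-count bookkeeping and the imbalance ratio. The three
# Gardiner codes and counts below are illustrative stand-ins.
from collections import Counter

labels = ["G17"] * 75 + ["A1"] * 40 + ["D21"] * 5       # hypothetical label list
counts = Counter(labels)

N = sum(counts.values())                                 # total samples
proportions = {code: n / N for code, n in counts.items()}        # p_i = N_i / N
imbalance_ratio = max(counts.values()) / min(counts.values())    # IR = 75 / 5
```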

3. Image Format, Preprocessing, and Standardization

Each hieroglyphic sign is made available as a per-glyph crop in PNG or JPEG format, sourced from color scans at approximately 300 dpi. The image-processing protocol for neural networks involved several standardized steps:

  • Conversion to three-channel RGB.
  • Resizing to $224 \times 224$ pixels, matching the ResNet50 input default.
  • Pixel intensity mapping from $[0,255]$ to $[0,1]$ floating point.
  • Normalization against the ImageNet per-channel mean and standard deviation: $(\mu_R=0.485,\ \mu_G=0.456,\ \mu_B=0.406)$ and $(\sigma_R=0.229,\ \sigma_G=0.224,\ \sigma_B=0.225)$.
  • No binarization or deskewing was performed post-segmentation, as glyph crops were tightly bounded upstream.
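
The steps above can be sketched as a single preprocessing function; the library choices (Pillow and numpy) are an assumption, since the source specifies only the steps themselves:

```python
# Sketch of the standardization steps described above for one glyph crop.
import numpy as np
from PIL import Image

# ImageNet channel statistics, as listed in the bullet points.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(img: Image.Image) -> np.ndarray:
    """Convert, resize, rescale, and normalize a glyph crop."""
    img = img.convert("RGB").resize((224, 224))      # RGB, ResNet50 input size
    x = np.asarray(img, dtype=np.float32) / 255.0    # map [0,255] -> [0,1]
    return (x - IMAGENET_MEAN) / IMAGENET_STD        # per-channel normalization

# Hypothetical usage with a synthetic image standing in for a glyph crop:
demo = Image.new("RGB", (300, 180), color=(128, 64, 32))
out = preprocess(demo)
```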

The channelwise pixel-intensity statistics are computed (after resizing and before augmentation) as:

$$\mu_c = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{x=1}^{H} \sum_{y=1}^{W} I_{n,c}(x,y), \qquad \sigma_c = \sqrt{\frac{1}{NHW} \sum_{n,x,y} I_{n,c}(x,y)^2 - \mu_c^2}, \qquad c \in \{R,G,B\},\ H = W = 224$$
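
These statistics reduce to per-channel moments over the resized image stack; a minimal numpy sketch with synthetic stand-in data:

```python
# Sketch: per-channel mean/std from the formula above, over a stack I of
# shape (N, H, W, 3) with values in [0, 1]. The random data is a stand-in.
import numpy as np

rng = np.random.default_rng(0)
I = rng.random((4, 224, 224, 3))                    # hypothetical image stack

mu = I.mean(axis=(0, 1, 2))                         # mu_c, one value per channel
sigma = np.sqrt((I ** 2).mean(axis=(0, 1, 2)) - mu ** 2)   # sqrt(E[I^2] - mu^2)
```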

4. Annotation Scheme and Label Mapping

Every glyph crop in the dataset is tagged with a standard Gardiner code (e.g., “G17”), which functions as its class label for classification tasks. Image filenames and CSV manifest files maintain the correspondence between image and Gardiner code. Code-to-ID mapping is managed internally by associating each Gardiner code with an integer class ID in $0 \ldots C-1$, and during training, class IDs are represented as one-hot vectors for the ResNet50 softmax classifier. This workflow standardizes the integration of the dataset into contemporary CNN-based recognition architectures.
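
The mapping workflow can be sketched as follows; the three example codes are illustrative, not a prescribed subset:

```python
# Sketch: Gardiner-code <-> integer-ID mapping and one-hot targets, as
# described above. The example codes are illustrative.
import numpy as np

gardiner_codes = sorted(["A1", "D21", "G17"])        # in practice, all C codes
code_to_id = {code: i for i, code in enumerate(gardiner_codes)}
id_to_code = {i: code for code, i in code_to_id.items()}

def one_hot(code: str, num_classes: int) -> np.ndarray:
    """One-hot target vector for the softmax classifier."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[code_to_id[code]] = 1.0
    return vec
```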

5. Segmentation and Spatial Supervision

The original release of the Morris Franken dataset contains only tightly cropped sign images and lacks explicit segmentation masks. To address glyph segmentation in full-scene images, a hybrid front-end was constructed, combining contour-based connected component detection (via OpenCV) with a Detectron2 (Mask R-CNN) model trained on 51 fully-annotated wall-text photographs (LabelMe JSON masks). Bounding boxes generated in this manner were validated by measuring intersection-over-union (IoU) $> 0.5$ against a held-out set of 10 manually segmented Unas plate images, confirming that the detected boxes reliably encompass individual glyphs.
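
The IoU validation step can be sketched as a plain box-overlap check; the box convention here (corner coordinates) is an assumption, and only the 0.5 threshold comes from the text:

```python
# Sketch: the IoU check used to validate detected boxes against manual
# ground-truth crops. Boxes are (x1, y1, x2, y2) corner coordinates.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def detection_is_valid(detected, ground_truth, threshold=0.5):
    """A detected box passes if its IoU with the ground-truth crop exceeds 0.5."""
    return iou(detected, ground_truth) > threshold
```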

6. Role in HieroGlyphTranslator Pipeline

The Morris Franken dataset is central to the “symbol → code” block of the HieroGlyphTranslator deep learning pipeline (Nasser et al., 3 Dec 2025). Its usage spans several key stages:

  • Segmentation validation: A subset of Unas plates from the dataset is processed by the segmentation module, and IoU is computed against ground-truth crops to verify detection accuracy.
  • Classification (symbol-to-code mapping): Each normalized glyph crop is input to a ResNet50 model (ImageNet pre-trained, then fine-tuned on the glyph set), which outputs a softmax prediction over the 291 Gardiner classes, trained against one-hot targets.
  • Train/validation/test splits: The 5,430 real samples are divided into 60% training (3,258), 20% validation (1,086), and 20% test (1,086) subsets.
  • Class imbalance mitigation: Real-time augmentation (rotations, translations, brightness/contrast, and zoom) increases training variability, expanding the pool to ≈43,000 unique examples before each epoch.
  • Sequence-to-sequence translation: Recognized Gardiner code sequences are converted to transliteration signs, tokenized, and input to a Transformer-based encoder–decoder (OpenNMT implementation). End-to-end training, incorporating code sequences from Unas glyph lines and the EgyptianTranslation dataset, produces a BLEU score of 42.2 as measured by the WizardWesenbach BLEU+ sacreBLEU evaluation.
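
The 60/20/20 split sizes quoted above follow directly from N = 5,430; a minimal sketch (the project's actual shuffling procedure is not specified, so the seed and method here are assumptions):

```python
# Sketch: reproduce the 60/20/20 split sizes from N = 5,430. The shuffle
# seed and method are assumptions; only the proportions come from the text.
import random

N = 5430
indices = list(range(N))
random.Random(42).shuffle(indices)         # fixed seed for reproducibility

n_train = round(0.60 * N)                  # 3,258 training samples
n_val = round(0.20 * N)                    # 1,086 validation samples
train = indices[:n_train]
val = indices[n_train:n_train + n_val]
test = indices[n_train + n_val:]           # remaining 1,086 test samples
```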

7. Context, Limitations, and Significance

The Morris Franken dataset remains foundational for advancements in the automatic recognition of Egyptian hieroglyphs as demonstrated by its pivotal role in the HieroGlyphTranslator project. Its strengths include thoroughly curated and annotated sign crops and broad coverage across the Gardiner code spectrum. A plausible implication is that its imbalance ratio (IR ≈ 15) poses challenges for low-frequency class generalization, necessitating aggressive augmentation strategies for practical model performance. The dataset’s tight-bounding crop convention obviates the need for post-classification binarization or deskewing, streamlining integration with convolutional architectures. Future directions may include the release of explicitly labeled segmentation masks or further expansion of sign classes beyond the current 291, to enhance both spatial supervision and overall representation. Its use as a benchmark anchors ongoing research in computational Egyptology, deep learning for visual symbol systems, and low-resource OCR.

