MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language

Published 3 Dec 2018 in cs.CV | (1812.01053v2)

Abstract: Sign language recognition is a challenging and often underestimated problem comprising multi-modal articulators (handshape, orientation, movement, upper body and face) that integrate asynchronously on multiple streams. Learning powerful statistical models in such a scenario requires much data, particularly to apply recent advances of the field. However, labeled data is a scarce resource for sign language due to the enormous cost of transcribing these unwritten languages. We propose the first real-life large-scale sign language data set comprising over 25,000 annotated videos, which we thoroughly evaluate with state-of-the-art methods from sign and related action recognition. Unlike the current state-of-the-art, the data set allows to investigate the generalization to unseen individuals (signer-independent test) in a realistic setting with over 200 signers. Previous work mostly deals with limited vocabulary tasks, while here, we cover a large class count of 1000 signs in challenging and unconstrained real-life recording conditions. We further propose I3D, known from video classifications, as a powerful and suitable architecture for sign language recognition, outperforming the current state-of-the-art by a large margin. The data set is publicly available to the community.

Abstract PDF Upgrade to Chat

Citations (224)

View on Semantic Scholar

Summary

The paper introduces MS-ASL, the first large-scale real-world ASL dataset with over 25K videos and 1000 sign classes, addressing challenges in signer independence and variability.
The paper benchmarks multiple state-of-the-art recognition architectures, finding that 3D CNNs (I3D) achieve top-1 accuracies up to 81.9% compared to other models.
The results underscore the impact of data scale and pretraining on model performance, highlighting the need for enhanced keypoint extraction for fine-grained sign distinctions.

MS-ASL: Establishing a Large-Scale Benchmark for American Sign Language Recognition

Introduction and Motivation

The "MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language" paper addresses the long-standing deficit of large, diverse, and realistically acquired datasets for American Sign Language (ASL) recognition. The work is motivated by the requirements posed by modern deep learning architectures, which demand extensive annotated datasets to facilitate generalization, especially in tasks involving high degrees of inter- and intra-class variability such as sign language recognition. The lack of such resources for ASL had precluded meaningful progress on signer-independent recognition and hindered the application of state-of-the-art computer vision models in this field.

MS-ASL Dataset Design and Characteristics

The paper introduces MS-ASL, the first large-scale, real-world ASL dataset, comprising over 25,000 manually annotated video samples representing 1000 sign classes demonstrated by 222 unique signers. The dataset is constructed through automated mining and expert curation of online ASL teaching and student videos. This methodology introduces substantial variability in background, lighting, camera viewpoint, signer clothing, and performance style. This variability is further amplified by the intrinsic linguistic and dialectal diversity of ASL, including synonymy and polysemy at the gloss level.

The dataset addresses key deficiencies of prior resources by ensuring:

Substantial class count (up to 1000 different signs), enabling scaling experiments.
Large signer count and strict signer-independence splits (distinct sets for training, validation, and testing), directly targeting the generalization gap.
Unconstrained, real-world recording settings.
Accurate manual annotation and trimming, especially of long or low-frame videos, to minimize irrelevant context and label noise.

The resulting distribution of video durations, with a mean of ~60 frames and a Poisson-like profile, reflects the complexity and variability of real-world sign production.

Figure 1: Histogram of frame numbers for ASL1000 video samples, indicating the distribution and duration variability of real-world signing instances.

Four subsets are provided for benchmarking: ASL100, ASL200, ASL500, and ASL1000, containing the 100, 200, 500, and 1000 most frequent signs, respectively. Each split is strictly signer-independent, enabling robust experimental design for generalization-centric studies.

Baseline Architectures and Experimental Protocol

The paper benchmarks four representative recognition paradigms:

2D CNN + LSTM: Framewise feature extraction (e.g., VGG16 or GoogleNet) followed by temporal modeling with LSTMs (256-512 hidden units), reflecting the action recognition pipeline of the previous decade.
Body Key-Point Networks (HCN): Skeleton-based models using 137 keypoints per frame (including hands and face) extracted with state-of-the-art pose estimators [cao2017realtime, simon2017hand]. Input shape is $[64 \times 137 \times 3]$ (per video window) capturing spatial, temporal, and confidence components.

Figure 2: Example of the extracted 137 body key-points for a video sample from MS-ASL, capturing hand, facial, and body articulation critical to sign disambiguation.

CNN-LSTM-HMM hybrid ("Re-Sign"): End-to-end deep sequence models combining CNNs, bidirectional LSTMs, and HMMs to align visual features with gloss sequences.
3D CNNs (I3D): Inflated 3D convolutional networks (I3D), initialized from large-scale action datasets (ImageNet + Kinetics), directly operating on $64 \times 224 \times 224$ RGB input cubes to jointly capture spatial and motion cues.

Key aspects:

Input normalization: Detected bounding box extraction (via SSD), bounding box jittering, resizing, and horizontal flipping (exploiting ASL’s handedness invariance).
Temporal sampling: Fixed-length (64 frames) random clips, with undersized videos padded by frame repetition.
No use of optical flow for any baseline to isolate the effect of RGB cues and architectural choices.

Empirical Results and Analysis

Quantitative Results

I3D achieves the highest performance, with top-1 per-class accuracy of 81.8%, 81.9%, 72.5%, and 57.7% for ASL100, ASL200, ASL500, and ASL1000 respectively, and top-5 per-class accuracy reaching 95.2% on ASL100 and 81.1% on ASL1000. These values are substantially higher than those of 2D CNN + LSTM, keypoint-only, or hybrid HMM architectures.

Impact of Dataset Scale

Expanding the number of classes from 100 to 1000 yields a moderate degradation in accuracy, whereas the drop is less severe when more training samples per class are available. In cross-subset transfer, I3D models pretrained on larger subsets transfer effectively to smaller subsets, outperforming ImageNet+Kinetics pretraining by >3.5%.

Data Efficiency and Frequency

Model performance exhibits a clear dependency on the per-class sample count. For classes with >40 training samples, accuracy plateaus at ~80% or above; for lower frequencies, recognition degrades rapidly—highlighting sample efficiency limits even of powerful video models.

Figure 3: The accuracy of trained models as a function of the frequency of training samples per class, demonstrating that performance is strongly sample-limited for rare classes.

Qualitative and Error Analysis

Analysis of confusion matrices reveals that most misclassifications arise from intrinsic ASL ambiguities (e.g., signs for "good" and "thanks” often are visually minimal pairs), dialectal variation, or context-dependent polysemes. Residual errors in large-sample classes are often due to linguistic ambiguity rather than model capacity, underscoring the need for multimodal integration with LLMs or context disambiguation.

Keypoint-based representations

While keypoint-based recognition (HCN) surpasses classical 2D CNN-LSTM in accuracy, its performance is ultimately capped by the robustness of hand and finger keypoint estimation—currently limiting fine-grained sign discrimination in unconstrained settings.

Implications and Future Directions

The publication of MS-ASL marks a pivotal improvement in the empirical rigor and real-world relevance of sign recognition research. It fills a critical gap for benchmarking deep models under high signer, environmental, and linguistic variability and enables reproducible comparison for a range of approaches. Key takeaways:

3D CNNs with large video datasets and discriminative pretraining are state-of-the-art for signer-independent, large-vocabulary sign classification from RGB.
Pretraining on in-domain video (sign) data decisively outperforms generic action recognition initialization.
Keypoint-based methods require advances in pose extraction, especially for fine hand gestures, to close the gap with spatiotemporal convolutional approaches.

Practically, MS-ASL facilitates new research on transfer learning, data augmentation, adaptation to unseen signers or dialects, multi-modal fusion, and context-based re-scoring for sign language understanding pipelines. The dataset design strategy, including the use of openly available video sources and scalable annotation refinement, is also widely generalizable to other low-resource gestural languages and modalities.

Conclusion

The MS-ASL dataset and benchmark represent a substantial advance in the field of sign language recognition, providing an empirically challenging task with rigorous signer-independence, significant environmental and linguistic variability, and the scale necessary for modern deep architecture development. The accompanying baseline experiments clearly substantiate the superiority of I3D-based video models for signer-independent large vocabulary ASL recognition and provide a reference point for future work. The dataset's public availability will foster accelerated progress on robust, scalable, and inclusive sign language understanding systems.

Markdown Report Issue