- The paper introduces KD-MSLRT, a lightweight sign language recognition model combining Mediapipe landmarks and 3D-to-1D knowledge distillation for efficient processing.
- KD-MSLRT significantly reduces model size (12.93 MB after TensorFlow Lite quantization) while achieving state-of-the-art accuracy, including a 17.4% Word Error Rate (WER) on PHOENIX14.
- The model enables real-time sign language recognition on mobile/edge devices and contributes a large Chinese Sign Language dataset.
An Overview of KD-MSLRT: A Lightweight Sign Language Recognition Model
Yulong Li et al. introduce KD-MSLRT, a lightweight model that combines Mediapipe landmark extraction with knowledge distillation for efficient sign language recognition. The work addresses two critical obstacles to deploying such systems: the heavy computational cost of video-based models and the dearth of comprehensive datasets, particularly for Chinese Sign Language (CSL).
Key Contributions and Methodology
The proposed system, KD-MSLRT, incorporates three main components: knowledge distillation from a 3D to 1D model, landmark-based sign language recognition, and a text correction network. The integration of these components ensures not only effective recognition capability but also a reduction in processing requirements, making the model viable for real-time applications on limited-resource devices.
- Knowledge Distillation:
- The authors distill knowledge from a state-of-the-art video-based teacher, CorrNet, into a lightweight landmark-based student network (MSLR). This distillation bridges the information gap that arises when high-dimensional video is reduced to landmark sequences, significantly improving the student's performance.
- Landmark Data and Augmentation:
- By utilizing Mediapipe for real-time skeletal landmark extraction, the input data complexity is drastically reduced, enabling computational efficiency.
- Specialized data augmentation techniques make the model robust to variations in signing styles and recording conditions, which is essential for effective landmark-based recognition.
- Text Correction:
- A dual-transformer structure refines the recognized output, correcting common recognition errors. A self-supervised training strategy further improves accuracy by addressing word-order and grammatical errors.
- Dataset Contributions:
- To address the scarcity of CSL data, the authors release a substantial dataset of 12,000 samples focused on the long, complex sentences typical of news broadcasts. This dataset is expected to foster further research and development in CSL recognition.
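To make the distillation component concrete, here is a minimal sketch of a standard temperature-scaled distillation loss, in which a landmark-based student is trained to match the softened output distribution of a video-based teacher. This is an illustration of the general technique, not the authors' implementation; the function names and temperature value are hypothetical.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T produces softer distributions."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p_t = softmax(teacher_logits, T)  # soft targets from the 3D video teacher
    p_s = softmax(student_logits, T)  # predictions of the 1D landmark student
    kl = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_t, p_s))
    return (T ** 2) * kl

# Toy example: logits over a five-class gloss vocabulary.
teacher = [4.0, 1.0, 0.2, 0.1, 0.0]
student = [2.5, 1.5, 0.3, 0.2, 0.1]
loss = distillation_loss(student, teacher)
```

The soft targets carry more information than hard labels (relative similarities among glosses), which is one way the student can partially recover what was lost in the 3D-to-1D dimensionality reduction.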
Experimental Findings
The paper provides empirical evidence that KD-MSLRT surpasses existing state-of-the-art models in both efficiency and accuracy. The model consistently lowers Word Error Rate (WER) on benchmark datasets, reaching 17.4% WER on the PHOENIX14 test set and outperforming prior methods. Furthermore, the TensorFlow Lite quantized model is only 12.93 MB, making it deployable on mobile devices and other edge platforms.
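WER, the metric cited above, is the word-level edit distance (substitutions, insertions, and deletions) between a hypothesis and a reference transcript, normalized by the reference length. A minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance:
    (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word in a five-word reference gives WER = 1/5 = 0.2.
print(wer("the cat sat on the", "the dog sat on the"))  # prints 0.2
```

A WER of 17.4% thus means that, on average, roughly one in six reference glosses requires a correction in the model's output.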
Practical and Theoretical Implications
Practically, the KD-MSLRT system is a meaningful step toward making robust sign language recognition accessible and deployable in real-time settings, a crucial factor for improving communication accessibility for the deaf and hard-of-hearing community globally. Theoretically, this work paves the way for further exploration of knowledge distillation across modalities and positions landmark-based models as viable alternatives to resource-heavy video-based systems.
Future Directions
Moving forward, this research opens up various avenues for development. Future work could focus on expanding multilingual sign language datasets, further refining augmented landmark data techniques, and exploring cross-linguistic knowledge transfer methods. The lightweight nature of the KD-MSLRT model suggests potential applications in augmented reality (AR) and virtual reality (VR) systems, where real-time input processing is critical.
In conclusion, Yulong Li and colleagues present a comprehensive approach to sign language recognition through an innovative combination of computational techniques and dataset contributions, offering both practical utility and a foundation for further research in human-computer interaction.