SilentWear: an Ultra-Low Power Wearable System for EMG-based Silent Speech Recognition

Published 3 Mar 2026 in eess.SP | (2603.02847v2)

Abstract: Detecting speech from biosignals is gaining increasing attention due to the potential to develop human-computer interfaces that are noise-robust, privacy-preserving, and scalable for both clinical applications and daily use. However, most existing approaches remain limited by insufficient wearability and the lack of edge-processing capabilities, which are essential for minimally obtrusive, responsive, and private assistive technologies. In this work, we present SilentWear, a fully wearable, textile-based neck interface for EMG signal acquisition and processing. Powered by BioGAP-Ultra, the system enables end-to-end data acquisition from 14 differential channels and on-device speech recognition. SilentWear is coupled with SpeechNet, a lightweight 15k-parameter CNN architecture specifically tailored for EMG-based speech decoding, achieving an average cross-validated accuracy of 84.8$\pm$4.6% and 77.5$\pm$6.6% for vocalized and silent speech, respectively, over eight representative human-machine interaction commands collected over multiple days. We evaluate robustness to repositioning induced by multi-day use. In an inter-session setting, the system achieves average accuracies of 71.1$\pm$8.3% and 59.3\pm2.2% for vocalized and silent speech, respectively. To mitigate performance degradation due to repositioning, we propose an incremental fine-tuning strategy, demonstrating more than 10% accuracy recovery with less than 10 minutes of additional user data. Finally, we demonstrate end-to-end real-time on-device speech recognition on a commercial multi-core microcontroller unit (MCU), achieving an energy consumption of 63.9$μ$J per inference with a latency of 2.47 ms. With a total power consumption of 20.5mW for acquisition, inference, and wireless transmission of results, SilentWear enables continuous operation for more than 27 hours.

Abstract PDF Upgrade to Chat

Summary

The paper introduces SilentWear, a textile-based neckband with 14 dry polymer-elastomer EMG sensors that eliminate the need for conductive gels.
It leverages a lightweight CNN (SpeechNet) with 15K parameters and rapid incremental fine-tuning to achieve up to 84.8% accuracy across varying sessions.
The system demonstrates low-latency, energy-efficient on-device inference (2.47 ms per 800 ms window) on GAP9, enabling real-time silent speech interaction.

SilentWear: Ultra-Low Power Textile-Based EMG Silent Speech Recognition

System Design and Wearability

SilentWear introduces a fully wearable, textile-based neckband for surface EMG acquisition, engineered for unobtrusiveness and comfort. The neckband integrates 14 differential EMG channels using dry polymer-elastomer electrodes with Ag/AgCl coating, eliminating the need for conductive gels and facilitating immediate daily use. The electrodes are strategically arranged to maximize infrahyoid muscle coverage, including the sternocleidomastoid region, critical for phonation and articulation. The acquisition hardware, based on the BioGAP-Ultra platform, provides simultaneous 16-channel biopotential acquisition, embedded deep neural network inference, and BLE wireless transmission within a compact $26\times65\times13$ mm configuration powered by a 150 mAh Li-Po battery.

Figure 1: SilentWear neckband and hardware overview with 14 EMG acquisition channels, leveraging a textile-based form factor for daily use.

Data Collection Protocol and Dataset

The data collection protocol focuses on realistic usage, employing three sessions per subject conducted across multiple days with sensor repositioning between sessions to emulate practical variability. Each session comprises alternating vocalized and silent batches, with subjects executing 8 HMI-oriented commands in random order, repeated 20 times each. The dataset includes four subjects, yielding 2400 vocalized and 2400 silent samples per subject, augmented by rest periods.

Figure 2: Protocol: multi-day sessions, alternating vocalized/silent batches, eight machine commands, random ordering, and session-level repositioning.

Signal Quality and Channel Analysis

EMG signals acquired during command articulation reveal clear distinctions between speech and rest phases in both vocalized and silent modes. Central channels consistently exhibit higher SNR, attributed to the infrahyoid muscle placement, which is physiologically relevant for capturing speech-related activity. Lateral channels provide additional coverage, particularly for sternocleidomastoid engagement.

Figure 3: Visualization of EMG channel signals across 8 commands—marked SNR and activation differences between vocalized (top) and silent (bottom) modalities.

Architecture and Inference: SpeechNet

SpeechNet is a lightweight CNN (15,489 parameters), inspired by EpiDeNet, employing initial temporal convolutions per channel followed by cross-channel spatial integration. The model processes 14-channel input windows ( $1400\,\mathrm{ms}$ ), extracting both temporal and spatial neuromuscular patterns. Training utilizes cross-entropy loss with Adam optimizer, leveraging early stopping and learning rate scheduling. Comparative benchmarking against Random Forest classifiers demonstrates substantial performance improvements in both vocalized and silent speech.

Evaluation: Accuracy and Robustness

Three evaluation settings are adopted:

Global (Leave-One-Batch-Out): All sessions pooled; tests unseen batches within sessions
Inter-Session (Leave-One-Session-Out): Tests robustness to sensor repositioning and day-to-day variance
Incremental Fine-Tuning: Assesses accuracy recovery via brief recalibration (less than 10 minutes of user data)

SpeechNet achieves $84.8\pm4.6\%$ (vocalized) and $77.5\pm6.6\%$ (silent) average accuracy (global), and $71.1\pm8.3\%$ / $59.3\pm2.2\%$ (inter-session), outperforming Random Forest baselines by 11–15%. Performance degradation from sensor repositioning is mitigated by incremental fine-tuning, restoring more than 10% accuracy with minimal additional data collection.

Figure 4: Subject-specific confusion matrices for global and inter-session settings—highlighting robust speech/rest discrimination and session-induced variability.

Information Transfer Rate (ITR) and Window Size Ablation

The ITR analysis ablates EMG window sizes (400–1400 ms), balancing accuracy and throughput. Peak accuracy occurs at $1400\,\mathrm{ms}$ , while maximal ITR (92.7 bit/min vocalized, 69.2 bit/min silent) is observed at $800\,\mathrm{ms}$ , providing a trade-off between latency and reliability. Fine-tuning with as little as one batch yields significant accuracy gains (up to 12%), validating the practicality of rapid recalibration.

Figure 5: ITR (bit/min) and accuracy as functions of window size for vocalized speech—demonstrating throughput/accuracy trade-offs.

Figure 6: ITR and accuracy for silent speech—shorter window yields near-peak ITR with moderate accuracy loss.

Incremental Training and Calibration

Incremental fine-tuning from pre-trained models achieves superior mean accuracy compared to both zero-shot evaluation and training from scratch. After four to five fine-tuning rounds (10 minutes), vocalized accuracy improves to 84%, silent to 79%, with one round sufficing for substantial gains (e.g., 70% to 80% vocalized).

Figure 7: Incremental training—fine-tuning delivers consistent accuracy improvement relative to scratch/zero-shot models.

Figure 8: Silent speech incremental training mirrors vocalized trends, underscoring fine-tuning efficacy.

On-Device Deployment and Power Analysis

SpeechNet is deployed on GAP9 (RISC-V multi-core + NE16 accelerator), with 8-bit quantization reducing model footprint to 15.13 kB. The $800\,\mathrm{ms}$ input window yields $2.47\,\mathrm{ms}$ inference time and 63.9 $\mu$ J per inference (energy-optimal). Aggregate system power is $20.5\,\mathrm{mW}$ (acquisition, inference, transmission), allowing 27.1 hours continuous operation per charge. Sliding window inference every 100 ms ensures real-time responsiveness for HMI use-cases.

Comparative Analysis and Context

SilentWear advances the state-of-the-art in EMG-based SSI by combining textile-based dry electrode neckband, extensive multi-day dataset collection with session variability, and embedded on-device deep learning. Comparisons with prior work highlight advantages in wearability, session robustness, dataset scale, and privacy-preserving local inference. Existing solutions (e.g., face-based, tattoo, and headphone-based form factors) often lack embedded AI, operate under restrictive conditions, or do not address session-induced domain shifts. SilentWear uniquely targets practical, everyday use with rigorous session-level generalization and rapid real-world recalibration.

Practical Implications and Future Directions

The implementation achieves low-latency, privacy-preserving, real-time silent speech recognition viable for wearable HMI in noisy or sensitive environments. Practical implications include restoration of communication for speech-impaired users and seamless machine interaction in public or industrial settings. The open release of hardware, software, and dataset supports further development and benchmarking. Four key open directions:

Dataset expansion to larger and more diverse subject pools, extended vocabulary, and continuous speech
Integration of multimodal sensor fusion (e.g., IMU, EEG) to enhance robustness and reliability
Exploration of advanced models (transformers, foundation models) as dataset scale enables
Real-time, closed-loop evaluations for adaptive recalibration in daily-life scenarios

Conclusion

SilentWear presents a fully functional, wearable SSI platform for EMG-based silent and vocalized speech recognition, integrating a textile neckband and embedded low-power deep learning processing. SpeechNet improves accuracy over prior baselines and provides rapid adaptation to session variability. Real-time inference on GAP9 demonstrates feasibility for continuous use with minimal power requirements, supporting privacy and responsiveness. The open-source ecosystem established by this work positions it as a foundation for further research in wearable biosignal-based HMI, silent speech interfaces, and adaptive edge-AI systems.

Markdown Report Issue