
emg2qwerty: sEMG Typing Dataset

Updated 13 January 2026
  • emg2qwerty is a large-scale public corpus of non-invasive surface EMG recordings captured from wrist muscles during natural QWERTY typing by over 100 users.
  • It provides densely annotated data with synchronized keystroke events and advanced preprocessing techniques, including high-pass filtering and spectral feature extraction.
  • The dataset serves as a benchmark for neural interface research, enabling evaluation of CTC-based models and innovative architectures like SplashNet and MyoText for text decoding.

The emg2qwerty dataset is a large-scale, publicly available corpus of non-invasive surface electromyographic (sEMG) recordings acquired from the wrists of human users during natural touch typing on a QWERTY keyboard. Designed to model and evaluate the problem of character-level text reconstruction directly from neuromuscular activity, emg2qwerty provides synchronized, densely annotated sEMG and keyboard event data spanning over 100 users and more than 1,000 individual recording sessions. Its construction, annotation protocol, baseline modeling strategies, and derived benchmarks have made it the reference dataset for machine learning and neural interface research in the context of wrist-based, always-available human-computer input (Sivakumar et al., 2024, Hadidi et al., 14 Jun 2025, Chowdhury et al., 6 Jan 2026).

1. Dataset Composition and Acquisition

emg2qwerty encompasses 108 users who together contributed 1,135 recording sessions and 346.4 hours of sEMG data (an average of 3.2 hours per user and 18 minutes per session). The corpus comprises approximately 5.26 million keystrokes, implying an average typing rate of 4.4 keys/s across the cohort (Sivakumar et al., 2024, Hadidi et al., 14 Jun 2025).

Each participant wore two sEMG-RD dry-electrode research wristbands (one per wrist), each outfitted with 16 differential channels, resulting in 32 sEMG channels at a sampling rate of 2 kHz (12-bit ADC, ±6.6 mV input range). Electrode arrays formed a gold-plated ring around the circumference of the wrist and were left-right mirrored. On-device analog filtering enforced a 20–850 Hz bandpass (–3 dB).

Participants typed sentences on a physical Apple Magic Keyboard (US English, QWERTY layout) in response to prompts drawn from both random word lists and Wikipedia (with profanity and most punctuation filtered, case-reduced). Keyboard interaction was captured by a keylogger recording key-down/key-up events, with synchronization achieved by aligning the system’s keystroke timestamps to the sEMG clock via a drift-correction procedure (alignment resolution 0.5 ms). To emulate realistic electrode placement variability, each session began with a doff/don cycle of the wristbands (Sivakumar et al., 2024).
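The drift-correction step can be sketched as a least-squares linear map from keylogger time onto the sEMG clock; this is a minimal illustration under the assumption of constant clock drift, not the dataset's actual alignment procedure:

```python
import numpy as np

def fit_clock_map(key_ts, emg_ts):
    """Least-squares linear map t_emg ~= a * t_key + b, absorbing both a
    constant offset and a slow clock drift between the two timelines."""
    a, b = np.polyfit(key_ts, emg_ts, 1)
    return a, b

def to_emg_time(t_key, a, b):
    """Project a keylogger timestamp onto the sEMG clock."""
    return a * t_key + b

# Synthetic example: the keylogger clock runs 0.01% fast with a 3.2 s offset.
key_ts = np.linspace(0.0, 600.0, 200)
emg_ts = 1.0001 * key_ts + 3.2
a, b = fit_clock_map(key_ts, emg_ts)
```

Once fitted, every key-down/key-up timestamp can be projected onto the sEMG timeline at sub-millisecond resolution.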

2. Data Structure, Organization, and Annotation

Each session is stored as a distinct HDF5 file alongside a metadata.csv index, yielding a 1:1 mapping of recording sessions to data artifacts. Principal components include:

  • /emg/left, /emg/right: $[\text{#samples} \times 16]$ arrays of raw sEMG voltages.
  • /keys/text: the prompted text string.
  • /keys/events: tuples (key, key-down time, key-up time), ground-truth for alignment.
  • /timestamps: mapping between sEMG and keylogger times.
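Given key events already expressed on the sEMG clock, mapping them to sample indices is a one-liner at 2 kHz; the helper below is a hypothetical sketch, not part of the released codebase:

```python
FS = 2000  # sEMG sampling rate in Hz

def events_to_samples(events, t0):
    """Map (key, t_down, t_up) tuples, given in seconds on the sEMG clock,
    to sample indices relative to the session start time t0."""
    return [(key,
             int(round((t_down - t0) * FS)),
             int(round((t_up - t0) * FS)))
            for key, t_down, t_up in events]

spans = events_to_samples([("a", 1.0, 1.05), ("b", 1.2, 1.26)], t0=0.5)
```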

The metadata CSV provides session-level quality flags, user identifiers, wristband size, self-reported typing fluency, and session durations. A BIDS-conversion script is supplied for compatibility with standard neuroscience data analysis pipelines. Data and code are hosted under a CC-BY-NC-SA 4.0 license on GitHub and S3 (Sivakumar et al., 2024).

Annotations correspond to individual key events (letters, digits, punctuation, and Backspace) without explicit gesture-level labels, adhering to a character-level recognition taxonomy. All key event sequences are temporally aligned to the sEMG waveform timeline.

Sessions are segmented for modeling and evaluation according to the experimental paradigm, with baseline models using 4-second windows (900 ms left, 100 ms right context), and test evaluations computed on entire session runs (Sivakumar et al., 2024, Hadidi et al., 14 Jun 2025).
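Window extraction with asymmetric context can be sketched as follows (a minimal illustration assuming zero-padding at session boundaries; the baseline's exact boundary handling is not specified here):

```python
import numpy as np

FS = 2000
WINDOW = 4 * FS                 # 4 s training window
LEFT = int(0.9 * FS)            # 900 ms left context
RIGHT = int(0.1 * FS)           # 100 ms right context

def extract_window(emg, start):
    """Slice a 4 s window plus context from [n_samples, n_channels] sEMG,
    zero-padding where the context runs past the session boundary."""
    lo = start - LEFT
    hi = start + WINDOW + RIGHT
    pad_lo = max(0, -lo)
    pad_hi = max(0, hi - len(emg))
    chunk = emg[max(0, lo):min(len(emg), hi)]
    return np.pad(chunk, ((pad_lo, pad_hi), (0, 0)))

emg = np.random.randn(10 * FS, 32)    # a 10 s, 32-channel session
w = extract_window(emg, start=0)      # window at the session start
inner = extract_window(emg, start=2 * FS)
```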

3. Preprocessing and Feature Extraction

emg2qwerty implements a multi-stage signal preprocessing workflow:

  • A 40 Hz high-pass digital filter is applied post-acquisition, augmenting the on-device 20 Hz filter.
  • Sessions with gross artifacts or prolonged dropouts are excised.
  • For modeling, each channel is transformed into a spectral feature vector: a short-time Fourier spectrogram is computed every 8 ms with a 32 ms window, yielding 33 log-spaced frequency bins up to 1 kHz.

For each time point $t$ and channel $c$:

$$x_t^{(c)}(f) = \log \left| \mathrm{STFT}\left[s^{(c)}\right](t, f) \right|$$

Features from all 32 channels are concatenated into $x_t \in \mathbb{R}^{1056}$ and batch-normalized per channel. An additional rotation-invariant feature module and various data augmentations (SpecAugment, channel shift, left/right wrist alignment jitter) are used to regularize the input space (Sivakumar et al., 2024, Hadidi et al., 14 Jun 2025).
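The feature computation can be sketched with a plain rFFT: a 32 ms window (64 samples at 2 kHz) yields 33 frequency bins per channel, hopped every 8 ms. The Hann window and log floor below are assumptions, not the benchmark's exact settings:

```python
import numpy as np

FS = 2000    # sampling rate (Hz)
WIN = 64     # 32 ms window -> 33 rFFT bins up to 1 kHz
HOP = 16     # 8 ms hop

def log_spectrogram(emg):
    """emg: [n_samples, n_channels] -> [n_frames, n_channels * 33]
    log-magnitude STFT features."""
    n_samples, n_channels = emg.shape
    n_frames = 1 + (n_samples - WIN) // HOP
    window = np.hanning(WIN)
    feats = np.empty((n_frames, n_channels, WIN // 2 + 1))
    for f in range(n_frames):
        seg = emg[f * HOP : f * HOP + WIN] * window[:, None]
        # rfft along time -> [33, n_channels]; transpose and log-compress
        feats[f] = np.log(np.abs(np.fft.rfft(seg, axis=0)).T + 1e-6)
    return feats.reshape(n_frames, -1)

x = log_spectrogram(np.random.randn(2 * FS, 32))   # 2 s of 32-channel sEMG
```

With 32 channels the concatenated frame dimension is 32 × 33 = 1056, matching $x_t \in \mathbb{R}^{1056}$.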

Recent analyses have advocated for reduced spectral granularity—aggregation of the 33 original frequency bins into 6 broader bands—to enhance cross-user generalization. Rolling time normalization (causal per-feature z-scoring with warmup) and aggressive channel masking (zeroing up to 55% of features in training) further address domain shift and promote low-order feature reliance (Hadidi et al., 14 Jun 2025).
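Both ideas are simple to express; the sketch below shows causal running z-scoring and per-feature masking under assumed defaults (the warmup handling and mask granularity are illustrative choices, not SplashNet's exact implementation):

```python
import numpy as np

def rolling_normalize(x, warmup=100, eps=1e-6):
    """Causal per-feature z-scoring: each frame is normalized by the running
    mean/std of all frames up to and including itself."""
    n = np.arange(1, len(x) + 1, dtype=float)[:, None]
    mean = np.cumsum(x, axis=0) / n
    var = np.maximum(np.cumsum(x * x, axis=0) / n - mean**2, 0.0)
    out = (x - mean) / np.sqrt(var + eps)
    out[:warmup] = 0.0   # suppress unstable early estimates (one possible choice)
    return out

def mask_features(x, p=0.55, rng=None):
    """Zero each feature column independently with probability p (training only)."""
    rng = rng or np.random.default_rng(0)
    return x * (rng.random(x.shape[1]) >= p)

feats = np.random.randn(500, 64) * 3.0 + 2.0   # un-normalized features
norm = rolling_normalize(feats)
masked = mask_features(norm)
```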

4. Official Splits, Domain Shift, and Evaluation Protocol

Data partitioning enforces rigorous separation for zero-shot and personalization evaluation. The canonical scheme subdivides the 108 users as follows (Sivakumar et al., 2024, Hadidi et al., 14 Jun 2025):

| Split | # Users | Sessions Used | Purpose |
| --- | --- | --- | --- |
| Training pool | 100 | All but last 2 (for ≥4-session users) | Generic pretraining |
| In-domain validation | 96 | Final 2 sessions per user | Model/parameter selection |
| Other-domain validation | 4 | Final session per user | Zero-shot proxy validation |
| Unseen test pool | 8 | All sessions | Zero-shot + finetune |

Zero-shot evaluation trains on the 100-user pool and tests on the 8 held-out users. Personalization benchmarks fine-tune models on each test user's first two sessions and evaluate on their remaining unseen sessions. For fair comparison, all results are reported both with and without a 6-gram character-level language model (modified Kneser–Ney smoothing, trained on WikiText-103) (Sivakumar et al., 2024).

Domain shift is significant: generic zero-shot models exhibit a CER of approximately 52% (±4%) due to inter-user variability (anatomy, muscle synergies, typing style) and session-to-session variance caused by electrode placement (roughly a 4% CER swing within a single user). The annotated user/session/window/time/motor-unit hierarchy expresses these generative dependencies and motivates population-level pretraining followed by user-specific adaptation (Sivakumar et al., 2024, Hadidi et al., 14 Jun 2025).
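CER itself is the Levenshtein edit distance between the decoded and reference strings, normalized by reference length. A minimal single-row dynamic-programming implementation:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein edit distance between hypothesis
    and reference, divided by the reference length."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))               # one rolling DP row
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                          # deletion
                       d[j - 1] + 1,                      # insertion
                       prev + (ref[i - 1] != hyp[j - 1])) # substitution/match
            prev = cur
    return d[n] / max(m, 1)
```

A CER above 100% is possible when the hypothesis contains more errors than the reference has characters.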

5. Baseline and State-of-the-Art Modeling Approaches

The emg2qwerty benchmark established a reference ASR-inspired CTC paradigm: a temporal convolutional encoder with four TDS blocks, receptive field of 1 s, and a rotation-invariant spectral input module. The model is optimized via CTC loss:

$$L_\text{CTC}(\theta) = -\log \sum_{\pi \in \mathcal{K}(y)} \prod_{t=1}^{T} p\left(z_t^{\pi_t}\right)$$

where $p(z_t^{k}) = \mathrm{softmax}(z_t)_k$ and $\mathcal{K}(y)$ is the set of all valid alignments (labelings with blanks) of the target sequence $y$.
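At inference time, the simplest CTC decoder collapses the per-frame argmax path by merging consecutive repeats and then dropping blanks (the benchmark itself uses beam search with a language model instead):

```python
import numpy as np

BLANK = 0

def ctc_greedy_decode(logits):
    """Best-path decoding: take the argmax label per frame, merge
    consecutive repeats, then remove blank tokens."""
    path = logits.argmax(axis=1)
    out, prev = [], None
    for k in path:
        if k != prev and k != BLANK:
            out.append(int(k))
        prev = k
    return out

# Frames emitting: blank, 1, 1, blank, 1, 2 (one-hot scores over 3 classes).
# The blank between frames 3 and 5 separates two genuine emissions of label 1.
logits = np.eye(3)[[0, 1, 1, 0, 1, 2]]
decoded = ctc_greedy_decode(logits)
```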

A beam-search decoder is augmented with the character-level language model and explicit Backspace handling. The best generic (zero-shot) model achieves a test CER of 51.8% ± 4.6% with the LM and 55.4% ± 4.1% without. Personalization substantially lowers error: 6.95% ± 3.6% (fine-tuned, +LM) (Sivakumar et al., 2024).

Advances such as SplashNet introduce:

  • Split-and-Share Encoders: Two hand-specific streams with weight sharing reflecting bilateral neuromuscular structure.
  • Rolling Time Normalization: Online causal z-scoring for session and user harmonization.
  • Aggressive Channel Masking: Randomized feature ablation to prompt generalization.
  • Reduced Spectral Resolution: 33-to-6 frequency band compression.
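The 33-to-6 compression can be sketched as band-averaging of spectrogram bins; equal-width bands are used below as an illustrative assumption, since the exact band edges are not given here:

```python
import numpy as np

N_BINS, N_BANDS = 33, 6

def aggregate_bands(spec):
    """Average contiguous groups of STFT bins into broader bands.
    spec: [..., 33] -> [..., 6]."""
    edges = np.linspace(0, N_BINS, N_BANDS + 1).astype(int)
    return np.stack([spec[..., a:b].mean(axis=-1)
                     for a, b in zip(edges[:-1], edges[1:])], axis=-1)

spec = np.random.randn(122, 32, N_BINS)   # [frames, channels, bins]
bands = aggregate_bands(spec)             # [frames, channels, 6]
```

Coarser bands discard fine spectral detail that tends to be user-specific, which is the stated rationale for improved cross-user generalization.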

SplashNet achieves CERs of 35.7% (zero-shot) and 5.5% (fine-tuned, +LM), a relative improvement over the original baseline of 31% and 21% in the respective settings (Hadidi et al., 14 Jun 2025).

MyoText applies a multi-stage, physiologically inspired approach: finger activations are classified from sEMG streams via a CNN–BiLSTM–Attention network, with QWERTY ergonomic priors constraining letter inference, and final sentences decoded by a transformer (T5). This system attains 85.4% finger classification accuracy and a CER of 5.4% on a subset of 30 users (Chowdhury et al., 6 Jan 2026).

6. Applications, Limitations, and Extensions

emg2qwerty underpins research in neural text decoders, human-machine interface design, and wearables, facilitating benchmarking of both end-to-end and modular architectures. Its character-level alignment and scale permit investigation of fundamental problems including cross-user adaptation, sequence modeling with CTC, and domain shift strategies.

The dataset is specialized to short, QWERTY-typed phrases by touch-typists, limiting assessment of non-expert users, non-standard text, or extended interaction. Demographic details are incompletely reported in subsequent analyses, and session-level factors such as electrode montage variation are controlled only at the doff/don cycle, leaving intra-session dynamics uncharacterized (Sivakumar et al., 2024, Hadidi et al., 14 Jun 2025, Chowdhury et al., 6 Jan 2026). Sampling-rate and filtering details are omitted from some secondary studies, potentially impacting reproducibility (Chowdhury et al., 6 Jan 2026).

Recommended best practices include holding out users for personalization, using robust augmentation (rotation, mask, jitter), and exploring extensions via self-supervised pretraining, RNN-T/seq2seq decoding, and raw waveform models (e.g., wav2vec).

7. Accessibility and Licensing

The dataset, accompanying codebase, pretrained model checkpoints, and configuration files for official splits are publicly accessible at https://github.com/facebookresearch/emg2qwerty (CC-BY-NC-SA 4.0). State-of-the-art derivative models such as SplashNet are available at https://github.com/nhadidi/SplashNet (Sivakumar et al., 2024, Hadidi et al., 14 Jun 2025).

Substantial documentation covers directory conventions, artifact structure, and suggested neuroinformatics conversions (BIDS). This enables direct integration within machine learning pipelines and facilitates reproducible, cross-study benchmarking on the problem of decoding text from surface electromyography.
