
Diotic EEG Dataset for Auditory Decoding

Updated 30 January 2026
  • The diotic EEG dataset is a collection of neural recordings from subjects exposed to identical binaural stimuli, enabling content-based attention analysis.
  • It includes both high-density scalp and in-ear EEG data with standardized preprocessing to align neural and speech features effectively.
  • These datasets demonstrate robust auditory decoding performance, achieving up to 73% accuracy in challenging 'cocktail party' scenarios.

A diotic EEG dataset comprises simultaneous recordings of neural activity captured via electroencephalography (EEG) from subjects exposed to diotic auditory stimuli—two-talker speech mixtures presented identically to both ears, thus devoid of binaural spatial cues. This dataset serves as a foundational resource for research in auditory attention decoding (AAD), especially in the context of environments that replicate real-world auditory scenes, such as the "cocktail party" scenario, where traditional spatial separation is unavailable. Diotic datasets enable investigation into content-based attentional mechanisms by circumventing the confounds inherent to directional encoding.

1. Core Dataset Architectures and Participant Protocols

Two publicly documented diotic EEG datasets are prominent: one employing high-density scalp EEG (Yoshino et al., 23 Jan 2026) and another using a custom ultra-wearable ear-EEG system (Thornton et al., 2024).

In the scalp-EEG dataset (Yoshino et al., 23 Jan 2026), 28 healthy adults (20 male, 8 female) each participated in 120 sessions, with every session delivering a continuous 64-second diotic speech mixture derived from pitch-matched male/female voices. EEG was recorded from 32 scalp electrodes (10–20 system), initially referenced to Cz and re-referenced offline to the Common Average Reference (CAR). Each session yielded 64 one-second, 21 three-second, or 12 five-second decision windows (trials), for 7,680 seconds of raw data per subject.

In the ear-EEG dataset (Thornton et al., 2024), 18 normal-hearing young adults (median age 23) contributed 16 trials each, with trials averaging ~150 seconds in duration. Speech mixtures were audiobook excerpts narrated by fixed male and female voices, summed and delivered diotically at 75 dB SPL via headphones. EEG signals were acquired using two dry-contact in-ear electrodes (one per ear), with FT7 as the scalp reference and the right earlobe as ground. Each trial block instructed subjects to attend to a specified narrator, alternating every four trials.

Table: Diotic EEG Dataset Comparison

Feature          | Scalp EEG (Yoshino et al., 23 Jan 2026) | Ear-EEG (Thornton et al., 2024)
Subjects         | 28 (20M/8F)                             | 18 (10M/8F)
Recording method | 32-channel scalp (10–20)                | 2-channel in-ear + FT7 reference
Sessions/trials  | 120 × 64 s                              | 16 × ~150 s
Stimulus         | Diotic, pitch-matched voices            | Diotic, fixed voices
Window sizes     | 1, 3, 5 s (non-overlapping)             | 1–10 s (non-overlapping)

2. Auditory Stimulus Construction and Signal Features

The diotic presentation ensures that both ears receive identical speech mixtures, eliminating directional cues and requiring attentional differentiation based on speech content or acoustic features alone.

  • In (Yoshino et al., 23 Jan 2026), stimuli sourced from Stoll et al. (diotic subset) involved pitch-lowered female voices to match male fundamental frequency, with mixtures produced by direct waveform summation.
  • In (Thornton et al., 2024), audiobook chapters narrated by a male and a female were algorithmically summed. The temporal envelope was extracted using a gammatone filterbank with 28 sub-bands (center frequencies 50 Hz–5 kHz, ERB scale), with half-wave rectification and sub-band averaging. Onset envelopes were computed as e_{\mathrm{onset}}(t) = \max\left(0, \frac{d}{dt}e(t)\right), and both features were resampled to 64 Hz for later alignment.

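The envelope and onset-envelope computation described above can be sketched in Python. As a minimal stand-in for the gammatone filterbank, the sketch below uses a bank of Butterworth bandpasses centred on ERB-spaced frequencies; the half-octave band edges and filter order are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def erb_space(f_lo=50.0, f_hi=5000.0, n=28):
    """Centre frequencies equally spaced on the ERB-rate scale (Glasberg & Moore)."""
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return inv(np.linspace(erb(f_lo), erb(f_hi), n))

def envelope_and_onset(audio, fs, fs_out=64):
    """Broadband envelope (half-wave rectified, averaged over sub-bands) and onset envelope."""
    bands = []
    for fc in erb_space():
        lo = fc / 2 ** 0.25                        # ~half-octave band around fc (assumption)
        hi = min(fc * 2 ** 0.25, 0.99 * fs / 2)    # clamp below Nyquist
        sos = butter(2, [lo, hi], btype="band", fs=fs, output="sos")
        bands.append(np.maximum(sosfiltfilt(sos, audio), 0.0))  # half-wave rectification
    env = np.mean(bands, axis=0)                   # average across sub-bands
    env = resample_poly(env, fs_out, fs)           # resample to 64 Hz
    onset = np.maximum(np.diff(env, prepend=env[0]) * fs_out, 0.0)  # e_onset = max(0, de/dt)
    return env, onset
```

A true gammatone bank would replace the Butterworth loop, but the rectify-average-differentiate structure is the same.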
3. EEG Data Acquisition and Preprocessing Pipelines

Standardized signal processing pipelines ensure correspondence between neural and stimulus features.

Scalp EEG (Yoshino et al., 23 Jan 2026):

  1. Unit conversion: volts to microvolts.
  2. FIR bandpass filtering (0.5–32 Hz).
  3. Re-referencing to the Common Average Reference (CAR).
  4. Downsampling from 10 kHz to 64 Hz.
  5. Epoching into non-overlapping windows of 1, 3, or 5 s.

Ear-EEG (Thornton et al., 2024):

  1. High-pass filtering at 0.5 Hz (Type-2 Hamming-sinc FIR, order 1691).
  2. EEG–microphone cross-correlation for precise auditory stimulus alignment.
  3. Resampling all channels/features to 64 Hz.
  4. Per-channel standardization (zero mean, unit variance).
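The scalp-EEG pipeline (bandpass → CAR → downsample → epoch) can be sketched as follows; the filter length is an illustrative assumption, and the function expects data already converted to microvolts:

```python
import numpy as np
from scipy.signal import firwin, filtfilt, resample_poly

def preprocess_scalp_eeg(eeg_uv, fs=10_000, fs_out=64, win_s=5, numtaps=4097):
    """Bandpass -> CAR -> downsample -> non-overlapping epochs.
    eeg_uv: (channels, samples) array already in microvolts."""
    b = firwin(numtaps, [0.5, 32.0], pass_zero=False, fs=fs)  # FIR bandpass, 0.5-32 Hz
    x = filtfilt(b, [1.0], eeg_uv, axis=1)                    # zero-phase filtering
    x = x - x.mean(axis=0, keepdims=True)                     # Common Average Reference
    x = resample_poly(x, fs_out, fs, axis=1)                  # 10 kHz -> 64 Hz
    w = int(win_s * fs_out)                                   # samples per decision window
    n = x.shape[1] // w
    return x[:, : n * w].reshape(x.shape[0], n, w).transpose(1, 0, 2)  # (epochs, ch, time)
```

Because filtering and resampling are linear and identical across channels, the CAR property (zero mean across channels) survives the downsampling step.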

For both, speech features are either deep model activations or temporal envelope/onset features (wav2vec 2.0 for scalp EEG; envelope-based for ear-EEG).

4. Model Input Representations and Alignment Strategies

Diotic EEG decoding models necessitate explicit temporal and feature alignment of neural and auditory signals.

  • In (Yoshino et al., 23 Jan 2026), the speech encoder applies wav2vec2-large-960h, retaining activations from the 14th layer, reducing dimensionality via PCA (1024 → 64), and resampling to the EEG rate. The speech encoder output Z_i = f_s(S_i; \theta_s) \in \mathbb{R}^{F \times T} (with F = 64 and T = L \cdot 64) is matched by the EEG encoder Z_E = f_E(E; \theta_E) \in \mathbb{R}^{F \times T}, which implements spatial attention plus a 5-block ResNet. Weighted channel-wise scaling and vectorization yield temporally aligned representations v_\ell = \mathrm{vec}(\tilde{Z}_\ell) \in \mathbb{R}^{F T}.
  • In (Thornton et al., 2024), the feature-to-EEG mapping is computed via temporal response functions (TRFs) over Toeplitz matrices with defined lag windows. Ridge regression solutions \theta = (X^\top X + \lambda I)^{-1} X^\top y are applied both "forward" (stimulus features predicting EEG) and "backward" (EEG reconstructing the stimulus).

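The ridge-regression TRF in the second bullet reduces to building a lagged (Toeplitz-style) design matrix and solving the closed form given above. A minimal sketch, with lag ranges left as a caller's choice:

```python
import numpy as np

def lagged_design(x, lags):
    """Toeplitz-style design matrix: column j holds x delayed by lags[j] samples."""
    X = np.zeros((len(x), len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = x[: len(x) - lag]
        else:
            X[:lag, j] = x[-lag:]
    return X

def ridge_trf(X, y, lam=1.0):
    """Closed-form ridge solution: theta = (X'X + lam*I)^(-1) X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```

For a forward model, X holds lagged stimulus features and y an EEG channel; for a backward (reconstruction) model, X holds lagged EEG channels and y the speech envelope.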
5. Attention Decoding, Similarity Measures, and Decision Rules

Decoding attended speech in diotic datasets relies on statistical and neural-network-based approaches.

  • Cosine Similarity Approach (Yoshino et al., 23 Jan 2026):
    • Similarity scores: s_i = \mathrm{sim}(v_E, v_i) = \frac{v_E^\top v_i}{\|v_E\|\,\|v_i\|}
    • Selection via temperature softmax: p_i = \exp(s_i/\tau) / [\exp(s_1/\tau) + \exp(s_2/\tau)], with \tau = 0.05
    • Cross-entropy loss for supervision; \hat{y} = \arg\max_i p_i = \arg\max_i s_i
  • Linear and Canonical Approaches (Thornton et al., 2024):
    • TRF models use Toeplitz stimulus matrices over defined lag windows, with causal lags for envelope reconstruction.
    • CCA (canonical correlation analysis) provided highest mean decoding accuracy for ear-EEG, though differences among linear and non-linear approaches were not significant for short EEG windows.
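The cosine-similarity decision rule above is straightforward to implement. A minimal sketch, assuming the three vectors are the vectorized encoder outputs from Section 4:

```python
import numpy as np

def cosine_sim(a, b):
    """s = (a . b) / (||a|| ||b||)"""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def decode_attention(v_eeg, v_s1, v_s2, tau=0.05):
    """Softmax over cosine similarities at temperature tau; argmax = attended talker."""
    s = np.array([cosine_sim(v_eeg, v_s1), cosine_sim(v_eeg, v_s2)])
    logits = s / tau
    p = np.exp(logits - logits.max())      # subtract max for numerical stability
    p = p / p.sum()
    return int(np.argmax(s)), p            # argmax of p equals argmax of s
```

At tau = 0.05 the softmax is sharp, so even modest similarity gaps produce near-one-hot probabilities; the hard decision itself is temperature-independent.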

6. Evaluation Methodologies and Performance Metrics

For the scalp-EEG dataset, the train/validation/test split was 20/4/4 subjects, with 7-fold cross-validation ensuring no subject overlap across folds.

  • Metrics:

Accuracy is the fraction of correctly decoded segments out of all segments. For scalp EEG (Yoshino et al., 23 Jan 2026), 5 s windows achieved 72.70% mean accuracy, outstripping retrained direction-based models (DARNet), which remained at chance (50.12%). Window length modulated efficacy (1 s: 60.80%; 3 s: 68.65%; 5 s: 72.70%).

For ear-EEG (Thornton et al., 2024), decoding algorithms were evaluated on non-overlapping windows of 1–10 s. CCA attained the highest mean accuracy, with onset-envelope features outperforming the temporal envelope alone.

7. Neuroscientific Insights, Dataset Limitations, and Accessibility

  • Neuroscientific Control:

Match-mismatch analyses in (Yoshino et al., 23 Jan 2026) affirmed that early acoustic encoding in diotic AAD is unaffected by attention (~84% accuracy for both streams). SHAP interpretability analyses flagged up-weighting of frontal channels during attention decoding.

  • Limitations:

Both datasets are limited to two speakers; generalization to ≥3 simultaneous speakers remains untested. Residual acoustic cues (timbre, formant) may aid segregation; the scalp EEG dataset (Yoshino et al., 23 Jan 2026) is the only publicly available full-length diotic set for high-density arrays.

Ear-EEG data, provided under Zenodo DOI 10.5281/zenodo.10260082, include raw/preprocessed EEG at 256/64 Hz, speech envelopes, onset envelopes, metadata, alignment triggers, and in-ear microphone traces. Standard formats: NumPy arrays, MATLAB .mat files.

8. Significance in Auditory Attention Research

Diotic EEG datasets have critically advanced understanding of content-based attentional selection, demonstrating that AAD in the absence of spatial cues is tractable and robust (∼73% accuracy in 5 s windows for latent space-based models) (Yoshino et al., 23 Jan 2026). These resources provide ground truth for the development of smart hearing aids and objective audiometry systems that function in spatially ambiguous or dynamic listening environments, and establish rigorous benchmarks for models leveraging high-dimensional speech and neural representations.
