Self-Supervised EEG Foundation Models
- Self-supervised foundation models for EEG are large-scale neural architectures trained on unlabeled data using domain-specific self-supervised learning to capture invariant brain signal patterns.
- They leverage techniques like masked autoencoding, contrastive and hybrid objectives, and geometry-aware encoding to enhance interpretability, efficiency, and robustness.
- These models enable sample-efficient transfer across clinical diagnostics, BCI control, and cognitive state decoding while adapting to varied sensor layouts and noisy conditions.
Self-supervised foundation models for EEG are large-scale neural architectures pre-trained on vast unlabeled EEG corpora using domain-adapted self-supervised learning (SSL) objectives. These models generate universal, highly transferable representations that can be specialized for diverse downstream tasks—ranging from clinical diagnostics and brain-computer interface (BCI) control to cognitive state decoding—by fine-tuning with limited labeled data. Recent advances have established EEG foundation models as a central paradigm in scalable, sample-efficient neural decoding, with increasing emphasis on architectural diversity (transformers, state-space models, geometry-aware encoders), task-appropriate pretext objectives, and robust transfer protocols across subject, device, and task boundaries.
1. Pretraining Objectives and Self-Supervision
Self-supervised foundation models for EEG predominantly adopt generative masked autoencoding strategies, but contrastive and hybrid objectives are also employed. The canonical SSL pipeline involves masking or corrupting spatiotemporal sub-regions of the input (channels, time, channel-time patches) and enforcing reconstruction or predictive alignment via neural architectures optimized for EEG’s statistical structure.
Principal SSL objectives:
- Masked Autoencoding (MAE/MaskRec): Models reconstruct masked segments—either raw time series, spectral amplitudes, or quantized codes—using architectures such as Transformers (Chen et al., 2024, Zhou et al., 2024), VQ-VAEs (Chen et al., 2024, Bettinardi et al., 13 Mar 2025), or hybrid CNN-Transformers (Kuruppu et al., 15 Jul 2025). The loss may be MSE, Smooth-L1, or cosine similarity on masked elements.
- Tokenization and Discrete Representation: Several models perform discrete vector-quantized (VQ) encoding of EEG patches prior to masked prediction, enhancing interpretability and compressibility, e.g., EEGFormer (Chen et al., 2024), BioSerenity-E1 (Bettinardi et al., 13 Mar 2025), BrainOmni (Xiao et al., 18 May 2025), HEAR (Chen et al., 14 Oct 2025), CodeBrain (Ma et al., 10 Jun 2025).
- Contrastive Learning: Some models complement generative losses with global-discriminative InfoNCE objectives, notably in hybrid designs like CoMET (Li et al., 30 Aug 2025) or CodeBrain (Ma et al., 10 Jun 2025), using augmentations that perturb spectral bands, electrode layout, or temporal order.
- Geometry- and Domain-Guided Objectives: Inspired by Riemannian geometry, EEG-ReMinD reconstructs sequences of covariance matrices in SPD space, using attention mechanisms defined by Log-Euclidean distances and geodesic means (Wang et al., 14 Jan 2025). Knowledge-guided objectives enforce spectral grounding by augmenting waveform reconstruction with explicit band-power loss terms (Kommineni et al., 2024).
Architectural support for SSL objectives: Models deploy a diverse toolbox—multi-head attention, 3D coordinate embeddings, state-space models (S4, Mamba-2), spectral/temporal/frequency-specific encoders—to ensure alignment between pretext task and the underlying EEG physiology.
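As a concrete illustration, the canonical mask-and-reconstruct pipeline described above can be sketched in a few lines. This is a toy example: the patch size, mask ratio, and architecture dimensions are illustrative assumptions, not any cited model's configuration.

```python
import torch
import torch.nn as nn

# Toy masked-autoencoding pretext task on EEG patches (illustrative only).
torch.manual_seed(0)
C, T, P = 8, 512, 64                 # channels, time samples, patch length
x = torch.randn(1, C, T)             # one unlabeled EEG window
patches = x.unfold(-1, P, P).reshape(1, -1, P)   # (1, N, P) patch tokens
N = patches.shape[1]

mask = torch.rand(N) < 0.5           # mask roughly half the tokens
corrupted = patches.clone()
corrupted[:, mask] = 0.0             # zero out masked tokens

embed = nn.Linear(P, 128)
encoder = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
head = nn.Linear(128, P)             # reconstruct raw patch values

recon = head(encoder(embed(corrupted)))
loss = ((recon[:, mask] - patches[:, mask]) ** 2).mean()  # MSE on masked only
print(recon.shape, float(loss) > 0)
```

The same skeleton accommodates the other objective families by swapping the reconstruction target (spectral amplitudes, quantized codes) or the loss (Smooth-L1, cosine similarity, InfoNCE on augmented views).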
2. Model Architectures: Design Patterns and Innovations
The architectural landscape is dominated by:
- Transformer-based Models: Including Vision Transformers with patch-tokenization (Chen et al., 2024, Zhou et al., 2024, Chen et al., 11 Feb 2025), standard Transformers with channel-wise or spatiotemporal self-attention, and decoupled branch designs for spectral and temporal encoding (Ma et al., 10 Jun 2025, Chen et al., 14 Oct 2025). Dual-axis attention (temporal + spectral) is specifically highlighted in LCM (Chen et al., 11 Feb 2025).
- State-space Models (SSMs): Mamba-2 and S4-based encoders offer linear sequence scaling, favorable for long-context or real-time BCI (Hong et al., 25 Feb 2025, Kommineni et al., 2024, Panchavati et al., 2 Sep 2025). These models natively handle the continuous-time and low SNR regime typical of EEG.
- Geometry-aware and Montage-flexible Encoders: EEG-ReMinD incorporates 3D geometric positional encoding; HEAR utilizes coordinate-based spatial embeddings and spatially-biased transformer blocks, directly supporting heterogeneous EEG device layouts (Wang et al., 14 Jan 2025, Chen et al., 14 Oct 2025). LUNA’s latent-query architecture enables arbitrary electrode geometry while maintaining linear computational cost (Döner et al., 25 Oct 2025).
Modularity: Several reports note the modularity of their encoders and attention/spatial blocks for plug-and-play adaptation to custom tasks (e.g., motor imagery, emotion recognition, seizure detection) with minimal architectural changes (Wang et al., 14 Jan 2025, Hong et al., 25 Feb 2025, Chen et al., 14 Oct 2025).
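The coordinate-based spatial embeddings mentioned above can be sketched as sinusoidal features of 3D electrode positions. The function name, feature dimension, and frequency bands below are hypothetical illustrations, not any specific model's encoding.

```python
import numpy as np

# Sketch of coordinate-based spatial embeddings for montage flexibility.
# Dimensions and frequency bands are illustrative assumptions.
def coord_embedding(xyz, d=32, n_freq=None):
    """Map 3D electrode positions (n_elec, 3) to sinusoidal features."""
    n_freq = d // 6 if n_freq is None else n_freq
    freqs = 2.0 ** np.arange(n_freq)              # geometric frequency bands
    ang = xyz[:, :, None] * freqs                 # (n_elec, 3, n_freq)
    emb = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return emb.reshape(xyz.shape[0], -1)          # (n_elec, 3 * 2 * n_freq)

# Two montages with different electrode counts share the same embedding map,
# so a model conditioned on these features need not fix the channel layout.
montage_a = np.random.rand(19, 3)   # e.g. a 19-channel 10-20 layout
montage_b = np.random.rand(64, 3)   # e.g. a 64-channel high-density cap
print(coord_embedding(montage_a).shape, coord_embedding(montage_b).shape)
```

Because the embedding depends only on each electrode's coordinates, the same encoder can consume recordings from unseen devices without montage-specific retraining.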
3. Pretraining Corpora, Data Protocols, and Generalization
Scale and diversity: Pretraining leverages extensive clinical and research EEG datasets (e.g., Temple University Hospital (TUH)/TUEG, Siena, SEED, HBN, Neurophy-FR1), sometimes aggregating over 20,000 hours of data and >10,000 subjects (Chen et al., 2024, Wang et al., 14 Jan 2025, Xiao et al., 18 May 2025, Döner et al., 25 Oct 2025). Montage heterogeneity (8–1,132 channels), sampling rates (125–5,000 Hz), and sensor types (EEG, MEG, and mixed EMEG) necessitate spatially-agnostic encoders or device-aware spatial modules (Chen et al., 14 Oct 2025, Xiao et al., 18 May 2025, Döner et al., 25 Oct 2025).
Preprocessing: Uniform application of bandpass filtering (0.1–100 Hz typical), downsampling (125–256 Hz), artifact rejection, and 3D coordinate mapping consolidate diverse raw recordings for cross-cohort pretraining (Portmann et al., 3 Feb 2026).
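A minimal numpy sketch of the band-pass-and-downsample step is shown below, using an FFT brick-wall filter for brevity. Real pipelines use proper IIR/FIR filters plus artifact rejection, and the exact bands and rates vary across the cited works.

```python
import numpy as np

# Illustrative preprocessing sketch: FFT-based band-pass to 0.5-40 Hz and
# integer-factor downsampling from 500 Hz to 250 Hz on a synthetic signal.
fs, secs = 500, 4
t = np.arange(fs * secs) / fs
x = np.sin(2 * np.pi * 10 * t) + 0.3 * np.sin(2 * np.pi * 60 * t)  # alpha + line noise

def bandpass_fft(sig, fs, lo, hi):
    spec = np.fft.rfft(sig)
    freqs = np.fft.rfftfreq(sig.size, d=1 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0      # zero out-of-band bins
    return np.fft.irfft(spec, n=sig.size)

clean = bandpass_fft(x, fs, 0.5, 40.0)           # removes the 60 Hz component
down = clean[::2]                                # naive decimation to 250 Hz
print(down.shape)
```

Since all retained content lies below 40 Hz, naive decimation to 250 Hz (Nyquist 125 Hz) introduces no aliasing in this toy case.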
Task transfer and adaptation: Foundation models are typically evaluated under cross-dataset, cross-task, and device transfer settings. Zero-shot generalization is achieved in models with rigorous device/montage handling (Xiao et al., 18 May 2025, Döner et al., 25 Oct 2025, Chen et al., 14 Oct 2025).
4. Fine-Tuning, Probing, and Downstream Applications
Transfer protocols:
- Linear probing: Freezing the encoder and training only a lightweight task head (typically a linear classifier, SVM, or shallow MLP) is standard for rapid low-data adaptation and for quantifying representation quality (Chen et al., 2024, Ma et al., 10 Jun 2025).
- Full and partial fine-tuning: For more challenging domains or to exploit additional label information, partial or full encoder fine-tuning is applied, especially in cross-subject or distribution-shifted settings (Wang et al., 14 Jan 2025, Chen et al., 11 Feb 2025, Wang et al., 30 Sep 2025).
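The linear-probing protocol above reduces to a simple recipe: freeze the encoder, extract features, fit a cheap classifier. The sketch below uses a fixed random projection as a stand-in for a pretrained encoder and a closed-form ridge probe; all shapes and names are illustrative.

```python
import numpy as np

# Linear probing sketch: frozen "encoder" (a fixed random projection standing
# in for a pretrained EEG foundation model) + a closed-form ridge probe.
rng = np.random.default_rng(0)

def frozen_encoder(x, W):
    """Stand-in for a pretrained encoder: project channels, pool over time."""
    return np.tanh(x @ W).mean(axis=1)            # (n, T, C) -> (n, d)

n, T, C, d = 200, 64, 8, 32
W = rng.normal(size=(C, d))                       # frozen weights (not trained)
y = rng.integers(0, 2, size=n)                    # binary labels
x = rng.normal(size=(n, T, C)) + y[:, None, None] * 0.5  # class-shifted toy data

feats = frozen_encoder(x, W)
# Ridge probe: solve (F^T F + lam I) w = F^T y_signed in closed form.
lam = 1e-2
w = np.linalg.solve(feats.T @ feats + lam * np.eye(d), feats.T @ (2 * y - 1))
pred = (feats @ w > 0).astype(int)
acc = (pred == y).mean()
print(acc)
```

Only `w` is learned here; the encoder stays untouched, which is exactly what makes probe accuracy a proxy for representation quality.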
Downstream tasks:
- Clinical disease diagnosis (seizure detection, abnormal/normal EEG, Parkinson’s, Alzheimer’s, depression)
- Cognitive/BCI tasks (motor imagery, emotion, workload)
- Event decoding (sleep staging, artifact/event detection)
- Anomaly detection in continuous monitoring (Chen et al., 2024, Bettinardi et al., 13 Mar 2025, Ma et al., 10 Jun 2025, Kuruppu et al., 15 Jul 2025, Döner et al., 25 Oct 2025)
Performance: Generative masked autoencoder methods consistently outperform contrastive-only methods for clinical diagnosis and sleep tasks; hybrid objectives (e.g., masked + contrastive in CoMET) expand attention diversity and enhance global pattern recognition (Shen et al., 12 Feb 2026, Li et al., 30 Aug 2025). Geometry-aware and channel-adaptive models (HEAR, LUNA, BrainOmni) maintain or improve balanced accuracy in variable layout and device scenarios (Chen et al., 14 Oct 2025, Döner et al., 25 Oct 2025, Xiao et al., 18 May 2025).
5. Robustness, Efficiency, and Interpretability
Label efficiency and noise resistance: Label efficiency is repeatedly demonstrated, with downstream performance reported using as little as 10% of the labeled data; models also generalize robustly under data corruptions (masking, dropout, artifact injection), degrading only gradually under severe corruption (Wang et al., 14 Jan 2025, Bettinardi et al., 13 Mar 2025, Hong et al., 25 Feb 2025).
Topology and device generality: Coordinate- or query-based spatial modules enable a single model to handle unseen sensor layouts, removing the need for montage-specific fine-tuning (Chen et al., 14 Oct 2025, Döner et al., 25 Oct 2025, Xiao et al., 18 May 2025).
Efficiency and scaling: State-space models (Mamba-2, S4) and cross-attention-based latent-query architectures reduce inference memory and time, enabling deployment on edge devices or real-time BCI (Hong et al., 25 Feb 2025, Döner et al., 25 Oct 2025). Empirically, LUNA reports a 300× FLOPs reduction and 10× memory savings over full attention models (Döner et al., 25 Oct 2025).
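The latent-query "compression" idea can be illustrated with plain cross-attention: a fixed set of M learned queries attends over N channel tokens, so cost scales as O(N·M) rather than the O(N²) of full self-attention. This is a sketch in the spirit of LUNA-style designs, not that model's actual implementation; all dimensions are assumptions.

```python
import numpy as np

# Latent-query cross-attention sketch: M learned queries summarize N tokens.
rng = np.random.default_rng(0)
d, M = 16, 4                           # model dim, number of latent queries
Q = rng.normal(size=(M, d))            # learned latent queries (fixed count)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def compress(tokens):                  # tokens: (N, d); N may vary per montage
    attn = softmax(Q @ tokens.T / np.sqrt(d))   # (M, N) attention weights
    return attn @ tokens                        # (M, d) fixed-size summary

# The same module consumes montages with any electrode count:
shapes = [compress(rng.normal(size=(n, d))).shape for n in (8, 64, 256)]
print(shapes)
```

The output is always (M, d) regardless of N, which is also why such modules pair naturally with the topology-agnostic spatial embeddings discussed earlier.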
Interpretability: Discrete codebooks and quantized tokens (EEGFormer, BrainOmni, BioSerenity-E1, CodeBrain) can be mapped to characteristic motifs or clinical events, with n-gram or class-association analyses highlighting neurophysiologically meaningful latent structure (Chen et al., 2024, Bettinardi et al., 13 Mar 2025, Xiao et al., 18 May 2025, Ma et al., 10 Jun 2025). Geometry-guided attention further clarifies the anatomical basis of learned connectivity features (Wang et al., 14 Jan 2025, Chen et al., 14 Oct 2025). Per-channel decoding enables topographical saliency attribution (Sukhbaatar et al., 22 Sep 2025).
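The discrete-codebook mechanism underlying these interpretability analyses is VQ-VAE-style nearest-codeword lookup: continuous patch embeddings are snapped to a finite vocabulary of tokens that can later be inspected as motifs. The sketch below is a toy version; codebook size and dimensions are arbitrary.

```python
import numpy as np

# Toy vector-quantization step: map patch embeddings to discrete token ids.
rng = np.random.default_rng(1)
K, d = 16, 8                           # codebook size, embedding dim
codebook = rng.normal(size=(K, d))     # learned codewords (random here)

def quantize(z):
    """Return nearest-codeword indices for embeddings z of shape (n, d)."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, K)
    return d2.argmin(axis=1)

z = rng.normal(size=(10, d))           # ten patch embeddings
tokens = quantize(z)
print(tokens)
```

Once EEG is expressed as such token sequences, n-gram statistics and token-class associations become directly computable, which is the basis of the motif analyses cited above.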
6. Current Limitations and Prospective Directions
Dataset and benchmark standardization: The field remains dependent on a small number of large clinical datasets (e.g., TUH/TUEG), with limited representation of healthy, cognitive, or multimodal data. There is consensus on the need for “EEG-bench”-style frameworks with unified data splits and metrics across diagnosis, sleep, BCI, and artifact detection (Portmann et al., 3 Feb 2026, Ma et al., 10 Jun 2025, Kuruppu et al., 15 Jul 2025).
Architecture and task alignment: While masked autoencoding remains dominant, exploration of multi-task, autoregressive, and contrastive-hybrid objectives is ongoing. The relationship between model depth, codebook granularity, domain augmentation, and physiological relevance is not fully understood (Shen et al., 12 Feb 2026, Li et al., 30 Aug 2025).
Cross-modality and multi-signal expansion: Integration with MEG, fNIRS, EMG, textual reports, and behavioral data is being realized (e.g., BrainOmni and proposals for multimodal codebooks), but scaling and alignment methods require further investigation (Xiao et al., 18 May 2025, Chen et al., 14 Oct 2025).
Robustness and adaptive transfer: Methods such as domain-specific self-supervised alignment (NeuroTTT) and entropy-minimization test-time adaptation directly address remaining domain shift and pretrain–downstream misalignment, improving generalization under real-world variability (Wang et al., 30 Sep 2025).
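Entropy-minimization test-time adaptation can be sketched in Tent style: adapt a small set of parameters on unlabeled test batches by minimizing prediction entropy. This is an illustrative stand-in, not the NeuroTTT method itself; the head, optimizer settings, and data are assumptions.

```python
import torch

# Tent-style entropy-minimization sketch: adapt a classifier head on
# unlabeled test features by making its predictions more confident.
torch.manual_seed(0)
head = torch.nn.Linear(16, 3)                    # stand-in classifier head
opt = torch.optim.SGD(head.parameters(), lr=0.05)
feats = torch.randn(32, 16)                      # unlabeled test-batch features

def batch_entropy(logits):
    p = torch.softmax(logits, dim=1)
    return -(p * p.clamp_min(1e-8).log()).sum(1).mean()

e0 = batch_entropy(head(feats)).item()           # entropy before adaptation
for _ in range(20):
    loss = batch_entropy(head(feats))
    opt.zero_grad()
    loss.backward()
    opt.step()
e1 = batch_entropy(head(feats)).item()           # entropy after adaptation
print(e1 < e0)
```

In practice such updates are usually restricted to normalization or affine parameters so that adaptation cannot drift far from the pretrained solution.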
Interpretability and trustworthiness: There is growing attention to physiologically interpretable feature spaces, modular pipelines, and explainability via codebook visualization or channel-wise saliency (Chen et al., 2024, Ma et al., 10 Jun 2025, Sukhbaatar et al., 22 Sep 2025).
Scaling and efficiency: Emerging evidence points to a regime of diminishing returns in performance with respect to both model size and data volume beyond certain thresholds, with gains more robustly achieved by improving architectural and domain alignment (Kuruppu et al., 15 Jul 2025, Shen et al., 12 Feb 2026). Sparse or state-space sequence models and cross-attention-based “compression” approaches (LUNA) are favored for scalable, topology-agnostic deployment.
In summary, self-supervised foundation models for EEG now form the backbone of robust, sample-efficient, and generalizable neural decoding pipelines. State-of-the-art approaches combine masked generative pretraining, geometry- and modality-aware design, and extensive multi-corpus pretraining. These models demonstrate strong transferability across a spectrum of EEG analytics, resilience to corruptions and device variability, and increasing degrees of physiological interpretability (Wang et al., 14 Jan 2025, Chen et al., 2024, Xiao et al., 18 May 2025, Chen et al., 14 Oct 2025, Ma et al., 10 Jun 2025, Portmann et al., 3 Feb 2026, Wang et al., 30 Sep 2025). Future research is poised to expand coverage to further modalities, standardize multi-task benchmarks, refine architectural–domain alignment, and deepen physiological integration for clinically and scientifically robust EEG AI systems.