ESM-2 Protein Embeddings Overview
- ESM-2 Protein Embeddings are high-dimensional representations derived from transformer architectures trained on millions of protein sequences, enabling both residue-level and sequence-level analysis.
- They are extracted through tokenization, forward passes, and pooling methods, supporting tasks like structure quality assessment, binding site localization, and function classification.
- These embeddings enhance downstream applications such as ligand binding prediction, homology detection, and protein design, providing scalable and interpretable features for advanced bioinformatics.
ESM-2 Protein Embeddings provide high-dimensional, learned representations of protein sequences via large-scale transformer architectures optimized for biological sequence modeling. Building upon the Evolutionary Scale Modeling (ESM) family, ESM-2 embeddings enable protein-level and residue-level feature extraction that has demonstrated superiority across a range of structural and functional prediction tasks, including structure quality assessment, binding site localization, function classification, sequence generation, and mechanistic interpretability. These embeddings have established themselves as a foundational component in protein informatics pipelines, enabling efficient, scalable, and interpretable analysis of protein sequence space.
1. ESM-2 Model Architecture and Embedding Extraction
ESM-2 models are BERT-style encoder-only transformers, pre-trained on hundreds of millions of protein sequences using masked language modeling (MLM). The architecture includes several size variants, ranging from 6-layer, 8M-parameter models (embedding dimension $d = 320$) to 48-layer, 15B-parameter models ($d = 5120$), with the most commonly used configurations being ESM2-650M (33 layers, $d = 1280$) and ESM2-150M (30 layers, $d = 640$) (Chae et al., 2024, Zhang et al., 2023, Simon et al., 2024). Each amino acid token, along with special tokens (<cls>, <eos>), is embedded and augmented with learned or sinusoidal positional encodings. At each transformer block, the model produces a residue-level embedding matrix $H \in \mathbb{R}^{L \times d}$, where $L$ is the sequence length.
Embedding extraction typically entails:
- Tokenization: mapping residues to integer indices using the ESM-2 vocabulary.
- Forward pass: inputting the token sequence through the pre-trained transformer stack.
- Output selection: retrieving either per-residue embeddings from the final layer (e.g., $d = 1280$ for ESM2-650M or $d = 320$ for ESM2-8M) or pooled per-sequence embeddings via mean-pooling, [CLS]-token, or sliding-window strategies.
- Special cases: for sequences exceeding the model's context window (nominally 1024 tokens), a sliding window with length-weighted averaging is applied; domain-level extraction (e.g., using Pfam annotations) may be used to isolate regions of functional interest (Kumar et al., 29 Nov 2025, Oliveira et al., 13 Jan 2025).
The extracted embeddings are high-dimensional ($320$–$1280$ dimensions per residue, depending on the variant) and can be used directly in downstream architectures, or as inputs for further feature engineering, interpretation, or model fine-tuning.
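The sliding-window strategy for over-length sequences can be sketched as follows. This is a minimal illustration, not any paper's exact implementation: `embed_fn`, the window size, and the stride are placeholders for a real pooled-embedding call.

```python
import numpy as np

def sliding_window_embedding(embed_fn, seq, window=1022, stride=511):
    """Length-weighted average of pooled embeddings computed on
    overlapping windows, for sequences beyond the model's context limit.

    embed_fn: callable mapping a subsequence (str) to a 1-D pooled
    embedding vector; window/stride defaults are illustrative.
    """
    if len(seq) <= window:
        return embed_fn(seq)                 # fits in one pass
    chunks, weights = [], []
    for start in range(0, len(seq), stride):
        sub = seq[start:start + window]
        if not sub:
            break
        chunks.append(embed_fn(sub))
        weights.append(len(sub))             # weight each window by its length
        if start + window >= len(seq):
            break                            # final window reached the end
    w = np.asarray(weights, dtype=float)
    return np.average(np.stack(chunks), axis=0, weights=w)
```

In practice `embed_fn` would wrap a forward pass of the pre-trained model followed by mean-pooling; here any vector-valued function works.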
2. Downstream Applications and Performance Benchmarks
Structural Prediction and Protein Screening
ESM-2 embeddings serve as potent inputs for downstream structure-prediction surrogates. For example, the pLDDT-Predictor feeds precomputed ESM2-t6-8M-UR50D embeddings ($d = 320$) into a dedicated 6-layer transformer with 8 attention heads to predict residue-level pLDDT quality scores in milliseconds, achieving a Pearson correlation of 0.7891 with AlphaFold2-derived pLDDT and high accuracy for high-confidence structure classification, at a large speedup over full structure-prediction models (Chae et al., 2024).
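The surrogate pattern can be sketched with a toy attention head over precomputed per-residue embeddings. This is a deliberately reduced stand-in (single-head attention, random untrained weights, two layers) for the paper's 6-layer, 8-head model; only the input/output shapes mirror the described pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TinyPLDDTHead:
    """Single-head self-attention encoder plus a linear head mapping
    per-residue ESM-2 embeddings (L x d) to per-residue scores in
    [0, 100]. Weights are random here; in practice they are trained
    against AlphaFold2 pLDDT targets."""

    def __init__(self, d=320, depth=2):
        self.layers = [
            {k: rng.normal(0, d ** -0.5, (d, d)) for k in ("Wq", "Wk", "Wv")}
            for _ in range(depth)
        ]
        self.w_out = rng.normal(0, d ** -0.5, (d,))

    def __call__(self, H):
        for p in self.layers:
            Q, K, V = H @ p["Wq"], H @ p["Wk"], H @ p["Wv"]
            A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
            H = H + A @ V                        # residual self-attention
        logits = H @ self.w_out
        return 100.0 / (1.0 + np.exp(-logits))   # squash to the pLDDT range
```

The key design point is that the expensive structure model is replaced by a small network conditioned purely on language-model features.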
Ligand Binding Site and Interaction Prediction
The LaMPSite pipeline leverages ESM-2-650M to provide per-residue embeddings and unsupervised contact maps (computed from averaged attention matrices of the top layers). These representations enable accurate protein–ligand binding site predictions without the need for 3D structural input, achieving competitive performance with structure-based methods and facilitating efficient large-scale bioactivity screening (Zhang et al., 2023).
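The averaged-attention contact proxy can be sketched as below. This is a generic simplification of the idea (LaMPSite's exact layer selection and post-processing are not reproduced), assuming special-token rows are stripped by the caller.

```python
import numpy as np

def attention_contact_map(attentions, top_layers=4):
    """Unsupervised residue-residue contact proxy from transformer
    attention: average the attention maps of the last `top_layers`
    layers across all heads, then symmetrize.

    attentions: array of shape (n_layers, n_heads, L, L).
    Returns a symmetric (L, L) map with a zeroed diagonal.
    """
    top = attentions[-top_layers:]          # last `top_layers` layers
    avg = top.mean(axis=(0, 1))             # average over layers and heads
    sym = 0.5 * (avg + avg.T)               # contacts are symmetric
    np.fill_diagonal(sym, 0.0)              # ignore self-attention
    return sym
```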
Function Classification and Homology Detection
Mean-pooled or domain-specific ESM-2 embeddings are effective for protein family/interface classification, enzyme/non-enzyme prediction, and remote homology detection (Sajol et al., 24 Oct 2025, Xu, 4 Dec 2025). In kinase functional prediction, averaging embeddings over mid-to-late layers ($20$–$33$) rather than relying solely on the final layer significantly improves the adjusted Rand index (ARI) in unsupervised clustering (from $0.268$ to $0.354$) and likewise boosts supervised accuracy over the final-layer baseline (Kumar et al., 29 Nov 2025).
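The layer-band averaging described above reduces to a few array operations. A minimal sketch, assuming the per-layer hidden states have already been extracted into one array:

```python
import numpy as np

def layer_averaged_embedding(hidden_states, layer_range=(20, 33)):
    """Protein-level embedding from mean-pooling over residues and then
    averaging a band of mid-to-late layers (indices follow ESM2-650M's
    33 transformer layers).

    hidden_states: (n_layers + 1, L, d) array of per-layer residue
    embeddings, with the input embedding at index 0.
    """
    lo, hi = layer_range
    band = hidden_states[lo:hi + 1]      # layers lo..hi inclusive
    per_layer = band.mean(axis=1)        # mean-pool residues -> (n_band, d)
    return per_layer.mean(axis=0)        # average selected layers -> (d,)
```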
Sequence Generation and Protein Design
ESM-2 latent representations are well suited to both protein design and generative modeling. For example, DiMA uses per-residue ESM-2 embeddings as the latent space for a continuous diffusion model, achieving superior sequence quality, structural diversity, and distributional fidelity over autoregressive and discrete-diffusion baselines on benchmarks such as SwissProt and AFDBv4 (Meshchaninov et al., 2024).
Sparse autoencoders (SAEs) fine-tuned on ESM-2 embeddings enable model steering for targeted protein design. For low-N function prediction, SAE-transformed ESM2-650M embeddings generalize effectively from as few as 24 labeled sequences, yielding higher Spearman's $\rho$ than direct ESM-2 baselines and enabling steering toward top variants in engineering case studies (Tsui et al., 25 Aug 2025).
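An SAE over pooled embeddings can be sketched as follows. This is an illustrative skeleton with random, untrained weights; a top-k activation stands in for the L1 sparsity penalty used in the published models, and all dimensions are placeholders.

```python
import numpy as np

class SparseAutoencoder:
    """Minimal overcomplete sparse autoencoder over a pooled ESM-2
    embedding: an expansion into d_hidden latents, of which only the
    k largest activations are kept."""

    def __init__(self, d_in=1280, d_hidden=10240, k=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0, d_in ** -0.5, (d_in, d_hidden))
        self.W_dec = self.W_enc.T.copy()     # tied decoder for the sketch
        self.k = k

    def encode(self, x):
        z = np.maximum(x @ self.W_enc, 0.0)              # ReLU latents
        if self.k < z.shape[-1]:
            thresh = np.partition(z, -self.k)[..., -self.k, None]
            z = np.where(z >= thresh, z, 0.0)            # keep top-k only
        return z

    def decode(self, z):
        return z @ self.W_dec

sae = SparseAutoencoder(d_in=64, d_hidden=512, k=8)
x = np.random.default_rng(1).normal(size=(64,))
z = sae.encode(x)
x_hat = sae.decode(z)
```

Steering corresponds to clamping individual latents of `z` before decoding, which the referenced work exploits to bias designs toward specific motifs.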
3. Interpretability, Sparse Representations, and Adapter Models
Sparse Autoencoders and Interpretable Features
InterPLM demonstrates that SAEs trained on ESM-2 embeddings can disentangle opaquely superposed biological concepts, discovering up to 2,548 nearly monosemantic features per layer from a base embedding dimension of 320 (for ESM-2-8M) (Simon et al., 2024). These features correspond to motifs (e.g., kinase active-sites, Nudix boxes), domains (e.g., trypsin peptidase S1), and binding sites, aligning to 143 distinct Swiss-Prot-annotated concepts, versus only 15 recovered by single ESM-2 neurons. Similarly, (Garcia et al., 13 Feb 2025) finds that sparse latents extracted at layer 3 can be statistically associated with protein features (e.g., transmembrane region, zinc finger), and controlled directly to steer sequence generation toward nontrivial structural classes (e.g., generating zinc-finger folds).
Subspace Factorization Adapters
PLM-eXplain (PLM-X) introduces a post-hoc adapter that decomposes ESM-2 embeddings (e.g., $d = 480$ for ESM2-t12-35M) into an interpretable subspace (34 dimensions, e.g., secondary structure, amino-acid identity, hydrophobicity, ASA) and a residual subspace capturing un-modeled variance. The interpretable subspace enables SHAP-based and filter-based biological insight, while the full factorization matches non-adapted ESM-2 in downstream prediction (aggregation, transmembrane helix, EV association), substantially outperforming pure handcrafted baselines (Eck et al., 9 Apr 2025).
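The interpretable-plus-residual decomposition is, at its core, an orthogonal projection. A minimal sketch, assuming an orthonormal interpretable basis has already been fit (here it is arbitrary, generated by QR for illustration):

```python
import numpy as np

def factorize_embedding(x, B_interp):
    """Split an embedding into interpretable coordinates and a residual,
    in the spirit of the PLM-X adapter (simplified: a fixed linear,
    orthonormal basis rather than a learned adapter).

    x:        (d,) embedding vector.
    B_interp: (d, k) matrix with orthonormal columns, one per
              interpretable target (secondary structure, hydrophobicity,
              ...).
    Returns (coords, residual) with x == B_interp @ coords + residual.
    """
    coords = B_interp.T @ x                  # interpretable coordinates
    residual = x - B_interp @ coords         # un-modeled variance
    return coords, residual

# Illustrative basis: 34 interpretable dims out of d = 480.
rng = np.random.default_rng(0)
d, k = 480, 34
B, _ = np.linalg.qr(rng.normal(size=(d, k)))
x = rng.normal(size=(d,))
coords, resid = factorize_embedding(x, B)
```

Because the residual is orthogonal to the interpretable subspace, downstream models can consume both parts without losing information relative to the raw embedding.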
4. Model Extensions: Structure Alignment, Quantization, and Long Sequence Handling
Structure-Aligned ESM-2 Embeddings
SaESM2 aligns transformer-derived representations with residue-level protein graph neural network (pGNN) embeddings (e.g., GearNet) via a latent-level InfoNCE loss and a physical-level structure-token prediction task, selecting informative residues via excess-loss ranking. This pairing enhances structural knowledge, yielding a marked increase in contact-prediction precision (P@L/5 from $54.14$ to $61.02$), improved fold and secondary-structure accuracy, and stronger mutational-effect prediction (correlation up to $0.951$ for fitness) without sacrificing biochemical modeling (Chen et al., 22 May 2025).
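The latent-level alignment objective is a standard symmetric InfoNCE loss between paired embeddings. A generic sketch, not SaESM2's exact implementation; the temperature `tau` is a placeholder:

```python
import numpy as np

def log_softmax(x, axis=-1):
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def info_nce(z_seq, z_struct, tau=0.07):
    """Symmetric InfoNCE between sequence-model and structure-model
    embeddings: matched rows are positives, every other pairing in the
    batch is a negative."""
    a = z_seq / np.linalg.norm(z_seq, axis=1, keepdims=True)
    b = z_struct / np.linalg.norm(z_struct, axis=1, keepdims=True)
    logits = a @ b.T / tau                   # (N, N) cosine similarities
    idx = np.arange(len(logits))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # seq -> struct
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # struct -> seq
    return 0.5 * (loss_ab + loss_ba)
```

Minimizing this pulls each residue's sequence embedding toward its own structure embedding and away from the rest of the batch.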
Quantization and Long Sequence Variants
To address sequence-length, memory, and efficiency limits, ESM-2 architectures have been extended: "long" ESM-2 supports inputs of up to $2,048$ residues via "context copying" positional embeddings and banded local self-attention, while 4-bit (int4) quantized weights offer a $4$–$8\times$ memory reduction. For long proteins, these modified models hold or gain in functional prediction accuracy, particularly in the GO branches BPO, CCO, and MFO, matching or exceeding standard ESM-2 (Oliveira et al., 13 Jan 2025).
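The int4 idea can be illustrated with a symmetric round-to-nearest scheme. This is a didactic sketch, not the production quantizer (real schemes use grouped scales and bit-packing):

```python
import numpy as np

def quantize_int4(W):
    """Symmetric per-row int4 quantization: each row is scaled so its
    values fit the signed 4-bit range [-8, 7], then rounded. Stored as
    int8 here for simplicity; packing two values per byte gives the
    full memory saving."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                  # guard all-zero rows
    q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return q.astype(np.float32) * scale
```

The reconstruction error per weight is bounded by half the row scale, which is why quantization barely moves downstream accuracy for well-conditioned weight matrices.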
5. Technical Best Practices and Integration in Modern Pipelines
Embedding Selection and Pooling
- For protein-level tasks: mean-pooling over residues, optionally averaged across selected layers ($20$–$33$) or restricted to Pfam-annotated domains, outperforms final-layer-only pooling (Kumar et al., 29 Nov 2025).
- For residue-level tasks: preserve the full $L \times d$ embedding matrix from the relevant ESM-2 layer; avoid PCA or normalization unless required by a downstream model (Zeng et al., 2023).
- For sequences exceeding the transformer's nominal limit: Apply sliding window processing with length-weighted averaging (Kumar et al., 29 Nov 2025, Oliveira et al., 13 Jan 2025).
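The pooling choices above can be summarized in one helper. A sketch under the assumption that hidden states are laid out as an array with a [CLS]-style token at position 0 (the layout HuggingFace-style models return):

```python
import numpy as np

def pool_protein(hidden_states, mode="mean", layer=-1, layer_band=None):
    """Pooling strategies for ESM-2 hidden states.

    hidden_states: (n_layers + 1, L, d) per-layer residue embeddings,
    with a [CLS]-style token at sequence position 0.
    """
    if layer_band is not None:                 # protein-level, layer band
        lo, hi = layer_band
        return hidden_states[lo:hi + 1, 1:].mean(axis=(0, 1))
    H = hidden_states[layer]                   # one layer: (L, d)
    if mode == "mean":                         # protein-level task
        return H[1:].mean(axis=0)              # skip the [CLS] position
    if mode == "cls":
        return H[0]
    if mode == "residue":                      # residue-level task:
        return H[1:]                           # keep the full matrix
    raise ValueError(f"unknown mode: {mode}")
```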
Model Integration
ESM-2 embeddings are now integral to graph-based PPI prediction frameworks, such as the DeepDrug Protein Embedding Bank (DPEB), providing context-rich node attributes for GNN-based link prediction, enzyme classification, and family identification on over 22,000 human proteins (Sajol et al., 24 Oct 2025). In nucleic acid-binding residue prediction, ESM2-650M outperforms traditional HMM-derived features in both accuracy (up to +90.7% relative improvement in MCC/AUC/AP) and speed (order-of-magnitude faster feature generation, e.g., 5.52 s per sequence) (Zeng et al., 2023).
Semi-supervised and Low-label Regimes
ESM-2 embeddings afford exceptional label efficiency: in antigenicity prediction for influenza A, models trained with only 25% labeled data achieve strong performance across all major subtypes, robustly exceeding alternatives (ProtVec, ProtBert, ProtT5) and further benefiting from self-training/label-spreading semi-supervised strategies (Xu, 4 Dec 2025). In low-N protein function prediction, sparse-latent variants generalize more effectively to unseen mutants and facilitate motif-focused design (Tsui et al., 25 Aug 2025).
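The self-training loop referenced above can be sketched with a toy nearest-centroid classifier over pooled embeddings; the classifier, confidence rule, and thresholds here are illustrative stand-ins for the actual semi-supervised methods.

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, rounds=3):
    """Nearest-centroid self-training: each round, unlabeled points with
    confident pseudo-labels are folded into the labeled pool and the
    class centroids are refit."""
    X, y = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        classes = np.unique(y)
        cents = np.stack([X[y == c].mean(axis=0) for c in classes])
        cents /= np.linalg.norm(cents, axis=1, keepdims=True)
        U = X_unlab / np.linalg.norm(X_unlab, axis=1, keepdims=True)
        sims = U @ cents.T                      # cosine to each centroid
        probs = np.exp(sims * 10.0)             # temperature-sharpened
        probs /= probs.sum(axis=1, keepdims=True)
        conf, pred = probs.max(axis=1), classes[probs.argmax(axis=1)]
        take = conf >= threshold                # keep confident points only
        if not take.any():
            break
        X = np.vstack([X, X_unlab[take]])
        y = np.concatenate([y, pred[take]])
        X_unlab = X_unlab[~take]
        if len(X_unlab) == 0:
            break
    return X, y
```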
Table: Representative ESM-2 Downstream Performance Metrics
| Task/Domain | ESM-2 Variant / Approach | Metric/Value | Reference |
|---|---|---|---|
| High-throughput pLDDT screening | ESM2-t6-8M + 6-layer transformer | Pearson $r = 0.7891$ | (Chae et al., 2024) |
| Ligand-binding site prediction | ESM2-650M + LaMPSite | No 3D input required; competitive with structure-based methods | (Zhang et al., 2023) |
| Kinase function prediction | ESM2-650M, layers 20–33 pooled | ARI $0.268 \to 0.354$ (unsupervised) | (Kumar et al., 29 Nov 2025) |
| Antigenicity (supervised, 25% labels) | ESM2-150M, mean-pooled | Strong across major subtypes | (Xu, 4 Dec 2025) |
| Protein family classification | ESM2-12L, community clustering | Accuracy | (Jiao et al., 2024) |
| Enzyme/non-enzyme (GNN/DPEB) | ESM2-650M, mean-pooled | Accuracy | (Sajol et al., 24 Oct 2025) |
| Structural contact prediction | SaESM2 (ESM2 + pGNN) | P@L/5 $61.02$ | (Chen et al., 22 May 2025) |
| PPI link prediction | ESM2-650M + GraphSAGE | AUROC | (Sajol et al., 24 Oct 2025) |
| Nucleic-acid binding (MCC, DNA) | ESM2-650M + BiLSTM/MLP | $0.534$ | (Zeng et al., 2023) |
6. Interpretations, Limitations, and Emerging Directions
Recent interpretability frameworks suggest that most biological concepts are encoded in ESM-2 as high-dimensional superpositions; overcomplete sparse autoencoders (SAEs) with L1 regularization are required to isolate disentangled, monosemantic representations aligned to functional and structural motifs (Simon et al., 2024, Garcia et al., 13 Feb 2025, Tsui et al., 25 Aug 2025). Empirical and statistical analyses confirm that only a minority of raw ESM-2 neurons align cleanly to known biological features, but with sparse decompositions, thousands of biological concepts become recoverable, enabling downstream sequence generation to be explicitly motif-conditioned for design.
Extensions such as structure-alignment (SaESM2), quantized/long sequence models, modal-bridging adapters (PLM-X), and combined local-global pretraining objectives (community propagation, contrastive learning) are actively increasing the fidelity and utility of ESM-2 embeddings. These advances are propelling a shift from raw token-level statistics to structurally and functionally interpretable, efficient, and multimodal protein representations capable of driving the next generation of protein design, annotation, and engineering pipelines.