
ESM-2 Protein Embeddings Overview

Updated 11 February 2026
  • ESM-2 Protein Embeddings are high-dimensional representations derived from transformer architectures trained on millions of protein sequences, enabling both residue-level and sequence-level analysis.
  • They are extracted through tokenization, forward passes, and pooling methods, supporting tasks like structure quality assessment, binding site localization, and function classification.
  • These embeddings enhance downstream applications such as ligand binding prediction, homology detection, and protein design, providing scalable and interpretable features for advanced bioinformatics.

ESM-2 Protein Embeddings are high-dimensional, learned representations of protein sequences produced by large-scale transformer architectures optimized for biological sequence modeling. Building upon the Evolutionary Scale Modeling (ESM) family, ESM-2 embeddings enable protein-level and residue-level feature extraction, and have demonstrated strong performance across a range of structural and functional prediction tasks, including structure quality assessment, binding site localization, function classification, sequence generation, and mechanistic interpretability. They have become a foundational component of protein informatics pipelines, enabling efficient, scalable, and interpretable analysis of protein sequence space.

1. ESM-2 Model Architecture and Embedding Extraction

ESM-2 models are BERT-style encoder-only transformers, pre-trained on hundreds of millions of protein sequences using masked language modeling (MLM). The architecture includes several size variants, ranging from a 6-layer, 8M-parameter model (embedding dimension d = 320) to a 48-layer, 15B-parameter model (d = 2560), with the most commonly used configurations being ESM2-650M (33 layers, d = 1280) and ESM2-150M (30 layers, d = 640) (Chae et al., 2024, Zhang et al., 2023, Simon et al., 2024). Each amino acid token, along with the special tokens (<cls>, <eos>), is embedded and augmented with learned or sinusoidal positional encodings. At each transformer block ℓ, the model produces a residue-level embedding matrix H^{(\ell)} \in \mathbb{R}^{L \times d}, where L is the sequence length.

Embedding extraction typically entails:

  1. Tokenization: mapping residues to integer indices using the ESM-2 vocabulary.
  2. Forward pass: inputting the token sequence through the pre-trained transformer stack.
  3. Output selection: retrieving either per-residue embeddings from the final layer (e.g., H^{(33)} for ESM2-650M or H^{(6)} for ESM2-8M) or pooled per-sequence embeddings via mean-pooling, the [CLS] token, or sliding-window strategies.
  4. Special cases: for long sequences (L > 1022), a sliding window with length-weighted averaging is applied; domain-level extraction (e.g., using Pfam annotations) may be used to isolate regions of functional interest (Kumar et al., 29 Nov 2025, Oliveira et al., 13 Jan 2025).
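The pooling step above can be sketched with a toy per-residue matrix. This is a minimal illustration, not ESM-2 API code: the `mean_pool` helper, the (L, d) layout with <cls>/<eos> at the first and last positions, and the random inputs are all assumptions for demonstration.

```python
import numpy as np

def mean_pool(hidden, attention_mask):
    """Mean-pool per-residue embeddings into one per-sequence vector.

    hidden: (L, d) matrix H^(l) from a chosen layer; positions 0 and L-1
    are assumed to hold the <cls> and <eos> tokens (hypothetical layout).
    attention_mask: (L,) vector of 1s for real tokens, 0s for padding.
    """
    residues = hidden[1:-1]              # drop <cls>/<eos> special tokens
    mask = attention_mask[1:-1, None]    # broadcast mask over feature dim
    return (residues * mask).sum(axis=0) / mask.sum()

# toy example: L = 6 tokens (4 residues + 2 specials), d = 4
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))
mask = np.ones(6)
pooled = mean_pool(H, mask)   # per-sequence embedding of shape (4,)
```

With a full mask this reduces to the plain mean over residue rows; padding positions, if present, are excluded by the mask.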

The extracted embeddings are high-dimensional (d = 320–1280 per residue for the commonly used variants) and can be used directly in downstream architectures, or as inputs for further feature engineering, interpretation, or model fine-tuning.

2. Downstream Applications and Performance Benchmarks

Structural Prediction and Protein Screening

ESM-2 embeddings serve as potent inputs for downstream structure-prediction surrogates. For example, the pLDDT-Predictor uses precomputed ESM2-t6-8M-UR50D embeddings (d = 320) as input to a dedicated 6-layer transformer (hidden size h = 1024, 8 attention heads) to predict residue-level pLDDT quality scores in milliseconds, achieving a Pearson correlation of 0.7891 with AlphaFold2-derived pLDDT and 91.2% accuracy for high-confidence structure classification, at a roughly 2.5×10^5-fold speedup over full structure prediction models (Chae et al., 2024).

Ligand Binding Site and Interaction Prediction

The LaMPSite pipeline leverages ESM-2-650M to provide per-residue embeddings and unsupervised contact maps (computed from averaged attention matrices of the top layers). These representations enable accurate protein–ligand binding site predictions without the need for 3D structural input, achieving competitive performance with structure-based methods and facilitating efficient large-scale bioactivity screening (Zhang et al., 2023).
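The idea of turning averaged attention matrices into an unsupervised contact-map proxy can be sketched as follows. The `attention_contact_map` helper and the random attention tensor are hypothetical stand-ins; the averaging-over-layers-and-heads-then-symmetrize structure follows the description above, while everything else is an illustrative simplification of the LaMPSite computation.

```python
import numpy as np

def attention_contact_map(attn):
    """Unsupervised contact-map proxy from transformer attention.

    attn: (layers, heads, L, L) attention tensor from the top layers.
    Average over layers and heads, then symmetrize, since residue
    contact is a symmetric relation.
    """
    avg = attn.mean(axis=(0, 1))     # (L, L) averaged attention map
    return 0.5 * (avg + avg.T)       # enforce symmetry

# toy tensor: 2 top layers, 4 heads, sequence length L = 5
rng = np.random.default_rng(1)
A = rng.random((2, 4, 5, 5))
C = attention_contact_map(A)         # symmetric (5, 5) contact proxy
```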

Function Classification and Homology Detection

Mean-pooled or domain-specific ESM-2 embeddings are effective for protein family/interface classification, enzyme/non-enzyme prediction, and remote homology detection (Sajol et al., 24 Oct 2025, Xu, 4 Dec 2025). In kinase functional prediction, averaging embeddings over mid-to-late layers (\ell = 20–33) rather than relying solely on the final layer significantly improves the adjusted Rand index (ARI) in unsupervised clustering (from 0.268 to 0.354) and boosts supervised accuracy from 70.2% (final layer) to 75.7% (Kumar et al., 29 Nov 2025).
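The mid-to-late layer averaging reported above can be sketched in a few lines. The stacked hidden-state layout (embedding layer at index 0, so a 33-layer model yields 34 states) and the toy shapes are assumptions for illustration.

```python
import numpy as np

def multi_layer_mean(hidden_states, layers=range(20, 34)):
    """Average mean-pooled embeddings over selected layers (l = 20..33),
    rather than using only the final layer.

    hidden_states: (num_layers + 1, L, d) stack of per-layer outputs,
    with the embedding layer at index 0 (hypothetical layout).
    """
    per_layer = hidden_states[list(layers)].mean(axis=1)  # (n_sel, d)
    return per_layer.mean(axis=0)                          # (d,)

# toy stack for a 33-layer model: 34 states, L = 7, d = 8
rng = np.random.default_rng(2)
hs = rng.normal(size=(34, 7, 8))
emb = multi_layer_mean(hs, layers=range(20, 34))
```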

Sequence Generation and Protein Design

ESM-2 latent representations are well suited to both protein design and generative modeling. For example, DiMA uses per-residue ESM-2 embeddings (d = 320) as the latent space for a continuous diffusion model, achieving superior sequence quality, structural diversity, and distributional fidelity over autoregressive and discrete-diffusion baselines on benchmarks such as SwissProt and AFDBv4 (Meshchaninov et al., 2024).

Sparse autoencoders (SAEs) fine-tuned on ESM-2 embeddings enable model steering for targeted protein design. For low-N function prediction, SAE-transformed ESM2-650M embeddings (d_SAE = 4096, sparsity k = 128) generalize effectively from as few as 24 labeled sequences, yielding higher Spearman's \rho than direct ESM-2 baselines and enabling top-variant steering in 83% of engineering cases (Tsui et al., 25 Aug 2025).
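The sparse (top-k) encoding transform can be sketched as below. The weight matrix, bias, and toy dimensions are hypothetical stand-ins for trained SAE parameters (the cited setup uses d_SAE = 4096, k = 128); this shows only the shape of the transform, not the trained model.

```python
import numpy as np

def topk_sae_encode(x, W_enc, b_enc, k=128):
    """Top-k sparse autoencoder encoding of an embedding vector:
    project into a wider latent space, apply ReLU, and keep only
    the k largest pre-activations (all others zeroed).
    """
    z = x @ W_enc + b_enc
    smallest = np.argsort(z)[:-k]   # indices of all but the k largest
    z = np.maximum(z, 0.0)          # ReLU nonlinearity
    z[smallest] = 0.0               # enforce at most k active latents
    return z

# toy sizes: d = 16 embedding, d_SAE = 64 latent space, k = 8
rng = np.random.default_rng(3)
x = rng.normal(size=16)
W = rng.normal(size=(16, 64))
z = topk_sae_encode(x, W, np.zeros(64), k=8)   # sparse latent code
```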

3. Interpretability, Sparse Representations, and Adapter Models

Sparse Autoencoders and Interpretable Features

InterPLM demonstrates that SAEs trained on ESM-2 embeddings can disentangle opaquely superposed biological concepts, discovering up to 2,548 nearly monosemantic features per layer from a base space of 320 dimensions (for ESM-2-8M) (Simon et al., 2024). These features correspond to motifs (e.g., kinase active sites, Nudix boxes), domains (e.g., trypsin peptidase S1), and binding sites, aligning to 143 distinct Swiss-Prot-annotated concepts, versus only 15 recovered by single ESM-2 neurons. Similarly, Garcia et al. (13 Feb 2025) find that sparse latents extracted at layer 3 can be statistically associated with protein features (e.g., transmembrane regions, zinc fingers) and controlled directly to steer sequence generation toward nontrivial structural classes (e.g., generating zinc-finger folds).

Subspace Factorization Adapters

PLM-eXplain (PLM-X) introduces a post-hoc adapter that decomposes ESM-2 embeddings (e.g., d = 480 for ESM2-t12-35M) into an interpretable subspace (34 dimensions, e.g., secondary structure, amino-acid identity, hydrophobicity, ASA) and a residual subspace capturing un-modeled variance. The interpretable subspace enables SHAP-based and filter-based biological insight, while the full factorization matches non-adapted ESM-2 in downstream prediction (aggregation, transmembrane helix, EV association), substantially outperforming pure handcrafted baselines (Eck et al., 9 Apr 2025).
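The interpretable-plus-residual decomposition amounts to a linear subspace split, which can be sketched as follows. The orthonormal basis here is a random hypothetical stand-in for the learned 34-dimensional interpretable subspace; the key property shown is that the two components sum exactly back to the original embedding.

```python
import numpy as np

def factorize(x, P_int):
    """Split an embedding into an interpretable component (orthogonal
    projection onto the subspace spanned by P_int's columns) and a
    residual capturing un-modeled variance, with x = x_int + x_res.
    """
    x_int = P_int @ (P_int.T @ x)   # projection onto interpretable subspace
    return x_int, x - x_int

# hypothetical orthonormal 34-dim basis inside a d = 480 embedding space
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.normal(size=(480, 34)))
x = rng.normal(size=480)
x_int, x_res = factorize(x, Q)
```

Because the projection is orthogonal, the residual is guaranteed to carry no component inside the interpretable subspace.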

4. Model Extensions: Structure Alignment, Quantization, and Long Sequence Handling

Structure-Aligned ESM-2 Embeddings

SaESM2 aligns transformer-derived representations with residue-level protein graph neural network (pGNN) embeddings (e.g., GearNet) via a latent-level InfoNCE loss and a physical-level structure-token prediction task, selecting informative residues via excess-loss ranking. This pairing enhances structural knowledge, yielding a 12.7% increase in contact prediction precision (P@L/5 rising from 54.14 to 61.02), improved fold and secondary-structure accuracy, and stronger mutational effect prediction (\rho up to 0.951 for fitness) without sacrificing biochemical modeling (Chen et al., 22 May 2025).

Quantization and Long Sequence Variants

To address sequence length, memory, and efficiency limits, ESM-2 architectures have been extended: "long" ESM-2 supports inputs of up to 2,048 residues via "context copying" positional embeddings and banded local self-attention, while 4-bit (int4) quantized weights offer a 4–8× memory reduction. For proteins longer than 1,024 residues, these modified models hold or gain functional prediction accuracy (F_max), particularly in the GO branches BPO, CCO, and MFO, matching or exceeding standard ESM-2 (Oliveira et al., 13 Jan 2025).

5. Technical Best Practices and Integration in Modern Pipelines

Embedding Selection and Pooling

  • For protein-level tasks: mean-pooling over residues, potentially across selected layers (\ell = 20–33) or Pfam-annotated domains, is superior (Kumar et al., 29 Nov 2025).
  • For residue-level tasks: preserve the full L \times d matrix from the relevant ESM-2 layer; avoid PCA or normalization unless required by a downstream model (Zeng et al., 2023).
  • For sequences exceeding the transformer's nominal limit: apply sliding-window processing with length-weighted averaging (Kumar et al., 29 Nov 2025, Oliveira et al., 13 Jan 2025).
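The sliding-window strategy with length-weighted averaging can be sketched as follows. The `embed_window` callable is a hypothetical stand-in for a real ESM-2 forward pass over one window; the window/stride values mirror the 1022-residue limit mentioned above, and the overlap scheme is one plausible choice, not a prescribed one.

```python
import numpy as np

def sliding_window_embed(seq_len, embed_window, window=1022, stride=511):
    """Per-sequence embedding for proteins longer than the model limit:
    embed each overlapping window, mean-pool it, then average the window
    vectors weighted by window length.

    embed_window(start, end) -> (end - start, d) per-residue embeddings
    for residues [start, end) (hypothetical interface).
    """
    vecs, weights = [], []
    start = 0
    while start < seq_len:
        end = min(start + window, seq_len)
        vecs.append(embed_window(start, end).mean(axis=0))  # pool window
        weights.append(end - start)                          # its length
        if end == seq_len:
            break
        start += stride
    w = np.asarray(weights, dtype=float)
    return (np.stack(vecs) * w[:, None]).sum(axis=0) / w.sum()

# toy check: a fake 2500-residue protein with d = 4 embeddings
rng = np.random.default_rng(5)
full = rng.normal(size=(2500, 4))
emb = sliding_window_embed(2500, lambda s, e: full[s:e])
```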

Model Integration

ESM-2 embeddings are now integral to graph-based PPI prediction frameworks, such as the DeepDrug Protein Embedding Bank (DPEB), providing context-rich node attributes for GNN-based link prediction, enzyme classification, and family identification on over 22,000 human proteins (Sajol et al., 24 Oct 2025). In nucleic acid-binding residue prediction, ESM2-650M outperforms traditional HMM-derived features in both accuracy (up to +90.7% improvement in MCC/AUC/AP) and speed (an order of magnitude faster, e.g., 5.52 s for L = 500) (Zeng et al., 2023).

Semi-supervised and Low-label Regimes

ESM-2 embeddings afford exceptional label efficiency: in antigenicity prediction for influenza A, models trained with only 25% labeled data achieve F1 > 0.82 for all major subtypes, robustly exceeding alternatives (ProtVec, ProtBert, ProtT5) and further benefiting from self-training/label-spreading semi-supervised strategies (Xu, 4 Dec 2025). In low-N protein function prediction (N = 24), sparse-latent variants generalize more effectively to unseen mutants and facilitate motif-focused design (Tsui et al., 25 Aug 2025).

Table: Representative ESM-2 Downstream Performance Metrics

| Task / Domain | ESM-2 Variant / Approach | Metric / Value | Reference |
|---|---|---|---|
| High-throughput pLDDT screening | ESM2-t6-8M + 6-layer transformer | Pearson r = 0.7891 | (Chae et al., 2024) |
| Ligand-binding site prediction | ESM2-650M + LaMPSite | No 3D input required; state of the art | (Zhang et al., 2023) |
| Kinase function prediction | ESM2-650M, layers 20–33 pooled | Accuracy 75.7% | (Kumar et al., 29 Nov 2025) |
| Antigenicity (supervised, 25% labels) | ESM2-150M, mean-pooled | F1 = 0.845 | (Xu, 4 Dec 2025) |
| Protein family classification | ESM2-12L, community clustering | Accuracy 31.08% | (Jiao et al., 2024) |
| Enzyme/non-enzyme (GNN/DPEB) | ESM2-650M, mean-pooled | Accuracy 73.6% | (Sajol et al., 24 Oct 2025) |
| Structural contact prediction | SaESM2 (ESM2 + pGNN) | P@L/5 = 61.02 | (Chen et al., 22 May 2025) |
| PPI link prediction | ESM2-650M + GraphSAGE | AUROC 84.26% | (Sajol et al., 24 Oct 2025) |
| Nucleic-acid binding (DNA, MCC) | ESM2-650M + BiLSTM/MLP | MCC 0.534 | (Zeng et al., 2023) |

6. Interpretations, Limitations, and Emerging Directions

Recent interpretability frameworks suggest that most biological concepts are encoded in ESM-2 as high-dimensional superpositions; overcomplete sparse autoencoders (SAEs) with L1 regularization are required to isolate disentangled, monosemantic representations aligned to functional and structural motifs (Simon et al., 2024, Garcia et al., 13 Feb 2025, Tsui et al., 25 Aug 2025). Empirical and statistical analyses confirm that only a minority of raw ESM-2 neurons align clearly with known biological features, but with sparse decompositions thousands of biological concepts are recoverable, enabling downstream sequence generation to be explicitly motif-conditioned for design.

Extensions such as structure-alignment (SaESM2), quantized/long sequence models, modal-bridging adapters (PLM-X), and combined local-global pretraining objectives (community propagation, contrastive learning) are actively increasing the fidelity and utility of ESM-2 embeddings. These advances are propelling a shift from raw token-level statistics to structurally and functionally interpretable, efficient, and multimodal protein representations capable of driving the next generation of protein design, annotation, and engineering pipelines.
