
AbGen Benchmark for Antibody Design

Updated 20 February 2026
  • The paper introduces AbGen Benchmark as a comprehensive suite evaluating antibody-antigen complex models using biologically grounded, function-centric metrics.
  • AbGen Benchmark is a multi-antigen evaluation framework that integrates zero-shot prediction and generative design to rank high-affinity antibody variants.
  • It employs rigorous metrics like likelihood–affinity correlation and precision@10 across diverse datasets to drive advancements in therapeutic antibody design.

The AbGen Benchmark is a comprehensive, multi-antigen evaluation suite for the assessment of computational models in antibody binding affinity maturation and design, with an explicit focus on biologically grounded, function-centric metrics. Unlike previous antibody benchmarks that assess isolated antibody properties such as amino acid identity recovery or root-mean-square deviation (RMSD), AbGen (formally AbBiBench) treats the antibody-antigen (Ab–Ag) complex as the fundamental unit of evaluation. This approach emphasizes both zero-shot predictive scoring of binding affinity and the generative design and ranking of antibody variants with enhanced binding properties (Zhao et al., 23 May 2025).

1. Overview and Motivation

AbGen is constructed to evaluate and compare machine learning models for antibody binding affinity and design by quantifying their ability to discriminate high-affinity Ab–Ag complexes and to generate new high-affinity binders. This framework encompasses both zero-shot prediction—using log-likelihood assignment to structurally validated complexes—and generative design, where the ability to propose optimized antibody complementarity-determining region (CDR) variants is tested within the context of an assembled Ab–Ag structure. Evaluation is based on direct correlation to experimentally measured affinities, precision in ranking binders, and success in multi-objective candidate selection.

2. Datasets and Preprocessing

AbGen curates 11 high-throughput affinity-assay datasets covering 9 antigens, including viral (influenza H1N1, H3N2, H9N2; SARS-CoV-2 spike), cancer-related (HER2, VEGF), and other targets (hen-egg-white lysozyme, integrin-α1). Collectively, 155,853 heavy chain (vH) mutated antibody sequences are standardized and evaluated. Dataset selection and preprocessing follow these rules:

  • Only vH-mutated datasets with ≥20 variants per study are retained.
  • Affinity is uniformly defined as $-\log K_d$ when dissociation constants ($K_d$) are available, or as $\log(\text{enrichment})$ for enrichment-based assays, so that higher is better in every study.
  • Structural mapping places each variant into the wild-type Ab–Ag structural context by in silico side-chain replacement. For the generative case study, AlphaFold-3 is employed to predict novel complex structures.
  • Additional biophysical baseline metrics, such as FoldX-calculated binding free energy ($-\Delta G$) and epitope solvent-accessible surface area (SASA), are also computed.

| Dataset | # Variants | Affinity metric |
| --- | --- | --- |
| CR9114–H1 (4fqi_h1) | 65,094 | $-\log K_d$ |
| CR9114–H3 (4fqi_h3) | 65,535 | $-\log K_d$ |
| CR6261–H1 (3gbn_h1) | 1,887 | $-\log K_d$ |
| CR6261–H9 (3gbn_h9) | 1,842 | $-\log K_d$ |
| AAYL49 | 4,312 | $-\log K_d$ |
| AAYL49_ML | 8,953 | $-\log K_d$ |
| AAYL51 | 4,320 | $-\log K_d$ |
| G6.31 (VEGF) | 2,223 | $\log$ enrichment |
| D44.1 (lysozyme) | 1,297 | $\log$ enrichment |
| Trastuzumab (HER2) | 419 | $-\log K_d$ |
| AQC2 (Integrin-α1) | 40 | $-\log K_d$ |
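
The affinity standardization and the minimum-variant filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the function names, the record layout (a `study` key per record), and the choice of base-10 logarithms are all assumptions made for the example.

```python
import math
from collections import defaultdict


def affinity_score(kd_molar=None, enrichment=None):
    """Return a 'higher is better' affinity score.

    Uses -log10(Kd) when a dissociation constant is given, otherwise
    log10(enrichment). The base-10 logarithm is an assumption.
    """
    if kd_molar is not None:
        if kd_molar <= 0:
            raise ValueError("Kd must be positive")
        return -math.log10(kd_molar)
    if enrichment is not None:
        if enrichment <= 0:
            raise ValueError("enrichment must be positive")
        return math.log10(enrichment)
    raise ValueError("need either kd_molar or enrichment")


def filter_studies(records, min_variants=20):
    """Group records by study and keep studies with >= min_variants entries.

    `records` is a list of dicts with a hypothetical 'study' key.
    """
    by_study = defaultdict(list)
    for rec in records:
        by_study[rec["study"]].append(rec)
    return {s: recs for s, recs in by_study.items() if len(recs) >= min_variants}
```

Under this convention a 1 nM binder (`kd_molar=1e-9`) scores higher than a 1 µM binder (`kd_molar=1e-6`), which is what lets $K_d$-based and enrichment-based datasets share a single ranking direction.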

3. Model Categories and Inference Protocols

Fourteen distinct models are systematically evaluated, grouped by machine learning paradigm and structural conditioning:

  • Masked language models (MLMs): ESM-2, AntiBERTy, SaProt, ProSST, CurrAb. These models are trained with a masked-token prediction objective and use either sequence alone or sequence plus local structural features.
  • Autoregressive language models: ProGen2, which generates sequences left to right.
  • Inverse Folding Models: ProteinMPNN, ESM-IF, AntiFold (a fine-tuned ESM-IF). These are SE(3)-equivariant models that generate sequences given the full Ab–Ag structure, thus explicitly encoding intermolecular context.
  • Diffusion-Based Generative Models: DiffAb, AbDiffuser, IgDiff, which generate structural coordinates and sequences through a stepwise denoising process.
  • Geometric Graph Models: MEAN and dyMEAN, graph-based networks for masked CDR completion that use full atom- or residue-level context.

At inference, each model is required to score the likelihood $\mathcal{L}(\mathbf{x}|M)$ of an Ab–Ag pair, with task-specific adjustments for MLMs and generative models to yield harmonized log-likelihood or equivalent model scores. Diffusion and graph models marginalize over latent noise.
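
As a rough illustration of what a harmonized log-likelihood score can look like, the sketch below sums per-position log-probabilities for an observed sequence. It assumes the per-position amino-acid distributions have already been produced by some model (for example, by masking each position in turn for an MLM, or by conditioning on structure for an inverse folding model). The function name and input layout are hypothetical, and the exact per-model adjustments used by the benchmark are not specified here.

```python
import math


def sequence_log_likelihood(sequence, position_probs, positions=None):
    """Sum log p(observed residue) over the chosen positions.

    sequence:       amino-acid string
    position_probs: one dict per position mapping residue -> probability
                    (assumed to come from a model; hypothetical layout)
    positions:      optional 0-based indices to score, e.g. only the
                    mutated CDR sites; defaults to every position
    """
    if len(sequence) != len(position_probs):
        raise ValueError("sequence and position_probs must have equal length")
    indices = range(len(sequence)) if positions is None else positions
    return sum(math.log(position_probs[i][sequence[i]]) for i in indices)
```

Restricting `positions` to the mutated sites is one plausible way to compare variants of the same parent antibody, since the unmutated positions would then not contribute to the score difference.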

4. Evaluation Metrics and Quantitative Criteria

The benchmark employs a set of correlation-based and absolute quantitative metrics:

  • Likelihood–Affinity Correlation: Spearman's $\rho$ (rank) and Pearson's $r$ (linear) correlation between the model-computed log-likelihood $\mathcal{L}$ and the experimental affinity $A$ across $n$ variants, with p-values for statistical significance.
  • Precision@10: Fraction of top-10 model-ranked antibodies whose experimentally measured affinity exceeds 5× wild-type; reported across five random splits.
  • Generative Screening:
    • Phase 1: Structure-free metrics including AntiBERTy log-likelihood (sequence plausibility) and FoldX $\Delta\Delta G$ (relative binding energy improvement).
    • Phase 2: Structure-based metrics computed after refolding with AlphaFold-3: complex pLDDT (interface confidence), epitope $\Delta$SASA, and ProteinMPNN foldability scores. Final selection is based on a Pareto-optimal balance across all metrics.
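
The correlation and precision@10 metrics can be sketched in plain Python as follows. This is a minimal illustration rather than the benchmark's implementation: p-values and the five-split averaging are omitted, the function names are invented for the example, and the affinity threshold is passed in by the caller because how the "5× wild-type" cutoff maps onto the log-scaled affinities is an assumption left open here.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def precision_at_k(scores, affinities, threshold, k=10):
    """Fraction of the k highest-scored variants whose affinity exceeds threshold."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = sum(1 for i in top if affinities[i] > threshold)
    return hits / len(top)
```

Spearman's $\rho$ only depends on the ordering of the scores, so it is insensitive to the different likelihood scales produced by different model families, while precision@10 measures how useful the very top of the ranking is for picking candidates.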

5. Empirical Results and Model Ranking

Benchmark results reveal systematic trends:

  • Inverse folding models (ProteinMPNN, ESM-IF, AntiFold) achieve the highest mean Spearman correlation ($\rho \approx 0.35$–$0.50$) and superior precision@10 ($30$–$45\%$), underscoring the value of explicit structural conditioning.
  • Structure-aware MLMs (SaProt) provide intermediate performance ($\rho \approx 0.25$), outperforming pure sequence-based LMs (ESM-2, AntiBERTy, ProGen2: $\rho < 0.10$).
  • In the generative F045-092 to H1N1 case study, ESM-IF and SaProt variants achieve more favorable $\Delta\Delta G$ shifts ($-29$, $-21$ kcal/mol), higher complex pLDDT ($\sim 83$), and maintain sequence plausibility. DiffAb and MEAN demonstrate lower improvements or more limited candidate diversity.
  • The Pareto front of 18 final CDR-H3 designs exhibits high interface confidence and predicted affinity.
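
The Pareto-optimal selection used to pick final designs can be illustrated with a small sketch. It assumes every objective has been oriented so that higher is better (for example, using $-\Delta\Delta G$ instead of $\Delta\Delta G$). The design names and numbers are invented toy values, not benchmark results, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return names of candidates not dominated by any other candidate.

    candidates: dict mapping name -> tuple of objectives (higher is better).
    """
    return [
        name
        for name, objectives in candidates.items()
        if not any(
            dominates(other, objectives)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Toy values only: (negated ddG in kcal/mol, complex pLDDT).
designs = {
    "design_A": (29.0, 83.0),
    "design_B": (21.0, 84.0),
    "design_C": (14.0, 81.0),
}
front = pareto_front(designs)  # design_C is dominated by design_A
```

Here `design_A` and `design_B` both survive because each is better than the other on one objective, while `design_C` is dropped because `design_A` beats it on both. This pairwise check is quadratic in the number of candidates, which is acceptable for the small design pools typical of a final screening stage.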

Summary Table (selected metrics):

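| Model | Avg. Spearman $\rho$ | Avg. precision@10 | Case-study $\Delta\Delta G$ (kcal/mol) | Case-study pLDDT |
| --- | --- | --- | --- | --- |
| ProteinMPNN | 0.48 | 0.45 | n/a | n/a |
| ESM-IF | 0.42 | 0.40 | -29.3 | 82.8 |
| SaProt | 0.26 | 0.20 | -20.8 | 83.2 |
| MEAN | 0.15 | 0.12 | -13.9 | 81.0 |
| DiffAb | 0.05 | 0.03 | +2.0 | 79.5 |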

6. Interpretation and Biological Significance

AbGen’s complex-level benchmarks expose the limitations of sequence-only modeling for practical antibody design. Models that condition on explicit antigen structure—using message-passing, equivariant neural architectures, or geometric graphs—better recapitulate experimental binding data and propose plausible, high-affinity variants. Structure-aware models also provide natural regularization against spurious sequence artifacts and enable integration of additional developability or biophysical criteria. This suggests that evaluating generative and predictive models in the Ab–Ag context provides more meaningful assessments for real-world engineering tasks.

7. Limitations and Prospective Directions

Current limitations of AbGen include highly variable dataset sizes (e.g., the integrin-α1 dataset contains only 40 variants, whereas each CR9114 dataset contains more than 65,000), the absence of functional neutralization or in vivo potency readouts for generated designs, and the lack of explicit modeling for light-chain contributions or full IgG context. Planned directions encompass:

  • Expanding antigen/assay diversity (e.g., including GPCRs, cytokines).
  • Integrating light-chain and full-antibody conformational modeling.
  • Introducing developability metrics (aggregation propensity, immunogenicity).
  • Testing transfer from in silico to functional and biophysical validation.

AbGen thus provides an integrated, biologically informed foundation for benchmarking and tracking progress in antibody affinity modeling and design, with direct implications for therapeutic antibody discovery and immune engineering (Zhao et al., 23 May 2025).
