AbGen Benchmark for Antibody Design

Updated 20 February 2026

The paper introduces AbGen Benchmark as a comprehensive suite evaluating antibody-antigen complex models using biologically grounded, function-centric metrics.
AbGen Benchmark is a multi-antigen evaluation framework that integrates zero-shot prediction and generative design to rank high-affinity antibody variants.
It employs rigorous metrics like likelihood–affinity correlation and precision@10 across diverse datasets to drive advancements in therapeutic antibody design.

The AbGen Benchmark is a comprehensive, multi-antigen evaluation suite for the assessment of computational models in antibody binding affinity maturation and design, with an explicit focus on biologically grounded, function-centric metrics. Unlike previous antibody benchmarks that assess isolated antibody properties such as amino acid identity recovery or root-mean-square deviation (RMSD), AbGen (formally AbBiBench) treats the antibody-antigen (Ab–Ag) complex as the fundamental unit of evaluation. This approach emphasizes both zero-shot predictive scoring of binding affinity and the generative design and ranking of antibody variants with enhanced binding properties (Zhao et al., 23 May 2025).

1. Overview and Motivation

AbGen is constructed to evaluate and compare machine learning models for antibody binding affinity and design by quantifying their ability to discriminate high-affinity Ab–Ag complexes and to generate new high-affinity binders. This framework encompasses both zero-shot prediction—using log-likelihood assignment to structurally validated complexes—and generative design, where the ability to propose optimized antibody complementarity-determining region (CDR) variants is tested within the context of an assembled Ab–Ag structure. Evaluation is based on direct correlation to experimentally measured affinities, precision in ranking binders, and success in multi-objective candidate selection.

2. Datasets and Preprocessing

AbGen curates 11 high-throughput affinity-assay datasets covering 9 antigens, including viral (influenza H1N1, H3N2, H9N2; SARS-CoV-2 spike), cancer-related (HER2, VEGF), and other targets (hen-egg-white lysozyme, integrin-α1). Collectively, 155,853 antibody sequences carrying heavy-chain (vH) mutations are standardized and evaluated. Dataset selection and preprocessing steps include:

Retaining only vH sequences with ≥20 variants per study.
Affinity is uniformly defined as $-\log K_d$ if $K_d$ (dissociation constant) values are available, or as $\log(\text{enrichment})$ for enrichment-based assays, standardizing “higher is better” across studies.
Structural mapping places each variant into the wild-type Ab–Ag structural context by in silico side-chain replacement. For the generative case study, AlphaFold-3 is employed to predict novel complex structures.
Additional biophysical baseline metrics, such as FoldX-calculated binding free energy ($-\Delta G$) and epitope solvent-accessible surface area (SASA), are routinely computed.
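
The affinity standardization described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the function name `standardized_affinity` and its parameters are hypothetical, and base-10 logarithms are an assumption, since the text does not state the base.

```python
import math


def standardized_affinity(kd_molar=None, enrichment=None):
    """Return a 'higher is better' affinity score.

    Assumption: base-10 logarithms (the source does not state the base).
    kd_molar:   dissociation constant K_d in molar units, if measured.
    enrichment: enrichment ratio from an enrichment-based assay.
    """
    if kd_molar is not None:
        if kd_molar <= 0:
            raise ValueError("K_d must be positive")
        # Smaller K_d means tighter binding, so negate the log.
        return -math.log10(kd_molar)
    if enrichment is not None:
        if enrichment <= 0:
            raise ValueError("enrichment must be positive")
        # Larger enrichment already means better binding.
        return math.log10(enrichment)
    raise ValueError("provide kd_molar or enrichment")
```

Under this convention a 1 nM binder scores higher than a 1 µM binder, so values from both assay types can be ranked in the same direction, although they are not on a shared scale.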

Dataset	# Variants	Affinity Metric
CR9114–H1 (4fqi_h1)	65,094	$-\log K_d$
CR9114–H3 (4fqi_h3)	65,535	$-\log K_d$
CR6261–H1 (3gbn_h1)	1,887	$-\log K_d$
CR6261–H9 (3gbn_h9)	1,842	$-\log K_d$
AAYL49	4,312	$-\log K_d$
AAYL49_ML	8,953	$-\log K_d$
AAYL51	4,320	$-\log K_d$
G6.31 (VEGF)	2,223	$\log$ enrichment
D44.1 (lysozyme)	1,297	$\log$ enrichment
Trastuzumab (HER2)	419	$-\log K_d$
AQC2 (Integrin-α1)	40	$-\log K_d$

3. Model Categories and Inference Protocols

Fourteen distinct models are systematically evaluated, grouped by machine learning paradigm and structural conditioning:

Masked language models (MLMs): ESM-2, AntiBERTy, SaProt, ProSST, CurrAb. These models are trained with a masked-token prediction objective and use only sequence, or sequence plus local structural features.
Autoregressive language models: ProGen2, which generates sequences left to right.
Inverse Folding Models: ProteinMPNN, ESM-IF, AntiFold (ESM-IF fine-tuned). These are SE(3)-equivariant models that generate sequences given full Ab–Ag structure, thus explicitly encoding intermolecular context.
Diffusion-Based Generative Models: DiffAb, AbDiffuser, IgDiff, which generate structural coordinates and sequences through an iterative denoising process.
Geometric Graph Models: MEAN, dyMEAN, graph-based networks for masked CDR completion using full atom/residue-level context.

At inference, each model is required to score the likelihood $\mathcal{L}(\mathbf{x}|M)$ of an Ab–Ag pair, with task-specific adjustments for MLMs and generative models to yield harmonized log-likelihood or equivalent model scores. Diffusion and graph models marginalize over latent noise.
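
The text does not spell out the task-specific adjustment used for masked language models. One common choice, shown here purely as an assumption, is a pseudo-log-likelihood: mask each position of interest in turn and sum the log-probability the model assigns to the true residue. The `MaskedPredictor` interface and the `uniform_predictor` stand-in below are hypothetical placeholders for a real model call.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical interface: given a sequence and the index of the masked
# position, return a probability for each amino acid at that position.
MaskedPredictor = Callable[[str, int], Dict[str, float]]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseudo_log_likelihood(seq: str, positions: Sequence[int],
                          predict: MaskedPredictor) -> float:
    """Sum of log p(true residue | rest of sequence) over the given positions."""
    total = 0.0
    for i in positions:
        probs = predict(seq, i)           # model output with position i masked
        total += math.log(probs[seq[i]])  # log-probability of the true residue
    return total


def uniform_predictor(seq: str, i: int) -> Dict[str, float]:
    """Toy stand-in for a real model: every residue is equally likely."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
```

Restricting `positions` to the mutated residues keeps scores comparable across variants of the same parent antibody; whether the benchmark does this is not stated in the text.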

4. Evaluation Metrics and Quantitative Criteria

The benchmark employs both correlation-based and threshold-based quantitative metrics:

Likelihood–Affinity Correlation: Both Spearman’s $\rho$ (rank) and Pearson’s $r$ (linear) correlation between model-computed log-likelihood $\mathcal{L}$ and experimental affinity $A$ (across $n$ variants), with p-values for statistical significance.
Precision@10: Fraction of top-10 model-ranked antibodies whose experimentally measured affinity exceeds 5× wild-type; reported across five random splits.
Generative Screening:
- Phase 1: Structure-free metrics including AntiBERTy log-likelihood (sequence plausibility) and FoldX $\Delta\Delta G$ (relative binding energy improvement).
- Phase 2: Structure-based metrics computed after refolding with AlphaFold-3: complex pLDDT (interface confidence), epitope $\Delta$SASA, and ProteinMPNN foldability scores. Final selection is based on Pareto-optimal balance across all metrics.

5. Empirical Results and Model Ranking

Benchmark results reveal systematic trends:

Inverse folding models (ProteinMPNN, ESM-IF, AntiFold) achieve the highest mean Spearman correlation ($\rho\approx0.35$–$0.50$) and superior precision@10 ($30$–$45\%$), underscoring the value of explicit structural conditioning.
Structure-aware MLMs (SaProt) provide intermediate performance ($\rho\approx0.25$), outperforming pure sequence-based LMs (ESM-2, AntiBERTy, ProGen2: $\rho<0.10$).
In the generative F045-092 to H1N1 case study, ESM-IF and SaProt variants achieve more favorable $\Delta\Delta G$ shifts (–29, –21 kcal/mol), higher complex pLDDT ($\sim$83), and maintain sequence plausibility. DiffAb and MEAN demonstrate lower improvements or more limited candidate diversity.
The Pareto front of 18 final CDR-H3 designs exhibits high interface confidence and predicted affinity.
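
Pareto-optimal selection of final designs can be illustrated with a small non-dominated filter. This is a sketch under assumptions: the function name and the candidate names in the usage note are hypothetical, and every objective is assumed to be oriented so that larger is better (for example, using $-\Delta\Delta G$ rather than $\Delta\Delta G$).

```python
def pareto_front(candidates):
    """Return names of candidates that no other candidate dominates.

    candidates: dict mapping a design name to a tuple of objective values,
    all oriented so that larger is better (an assumption of this sketch).
    """
    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and better somewhere.
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    front = []
    for name, objectives in candidates.items():
        dominated = any(
            dominates(other, objectives)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front
```

For instance, with made-up objective tuples of ($-\Delta\Delta G$, pLDDT), `pareto_front({"design_a": (29.0, 82.0), "design_b": (21.0, 83.0), "design_c": (14.0, 81.0)})` keeps `design_a` and `design_b`, because `design_c` is worse than `design_a` on both objectives.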

Summary Table (selected metrics):

Model	Avg Spearman $\rho$	Avg precision@10	Case $\Delta\Delta G$ (kcal/mol)	Case pLDDT
ProteinMPNN	0.48	0.45	n/a	n/a
ESM-IF	0.42	0.40	–29.3	82.8
SaProt	0.26	0.20	–20.8	83.2
MEAN	0.15	0.12	–13.9	81.0
DiffAb	0.05	0.03	+2.0	79.5

6. Interpretation and Biological Significance

AbGen’s complex-level benchmarks expose the limitations of sequence-only modeling for practical antibody design. Models that condition on explicit antigen structure—using message-passing, equivariant neural architectures, or geometric graphs—better recapitulate experimental binding data and propose plausible, high-affinity variants. Structure-aware models also provide natural regularization against spurious sequence artifacts and enable integration of additional developability or biophysical criteria. This suggests that evaluating generative and predictive models in the Ab–Ag context provides more meaningful assessments for real-world engineering tasks.

7. Limitations and Prospective Directions

Current limitations of AbGen include highly variable dataset sizes (e.g., the integrin-α1 dataset contains only 40 variants), the absence of functional neutralization or in vivo potency readouts for generated designs, and the lack of explicit modeling of light-chain contributions or full IgG context. Planned directions encompass:

Expanding antigen/assay diversity (e.g., including GPCRs, cytokines).
Integrating light-chain and full-antibody conformational modeling.
Introducing developability metrics (aggregation propensity, immunogenicity).
Testing transfer from in silico to functional and biophysical validation.

AbGen thus provides an integrated, biologically informed foundation for benchmarking antibody affinity modeling and design and for tracking progress in the field, with direct implications for therapeutic antibody discovery and immune engineering (Zhao et al., 23 May 2025).


References (1)

Zhao et al. (2025). Benchmark for Antibody Binding Affinity Maturation and Design.
