AbGen Benchmark for Antibody Design
- The paper introduces AbGen Benchmark as a comprehensive suite evaluating antibody-antigen complex models using biologically grounded, function-centric metrics.
- AbGen Benchmark is a multi-antigen evaluation framework that integrates zero-shot prediction and generative design to rank high-affinity antibody variants.
- It employs rigorous metrics like likelihood–affinity correlation and precision@10 across diverse datasets to drive advancements in therapeutic antibody design.
The AbGen Benchmark is a comprehensive, multi-antigen evaluation suite for assessing computational models of antibody binding-affinity maturation and design, with an explicit focus on biologically grounded, function-centric metrics. Unlike previous antibody benchmarks, which assess isolated antibody properties such as amino acid identity recovery or root-mean-square deviation (RMSD), AbGen (formally AbBiBench) treats the antibody-antigen (Ab–Ag) complex as the fundamental unit of evaluation. This approach emphasizes both zero-shot predictive scoring of binding affinity and the generative design and ranking of antibody variants with enhanced binding properties (Zhao et al., 23 May 2025).
1. Overview and Motivation
AbGen is constructed to evaluate and compare machine learning models for antibody binding affinity and design by quantifying their ability to discriminate high-affinity Ab–Ag complexes and to generate new high-affinity binders. This framework encompasses both zero-shot prediction—using log-likelihood assignment to structurally validated complexes—and generative design, where the ability to propose optimized antibody complementarity-determining region (CDR) variants is tested within the context of an assembled Ab–Ag structure. Evaluation is based on direct correlation to experimentally measured affinities, precision in ranking binders, and success in multi-objective candidate selection.
2. Datasets and Preprocessing
AbGen curates 11 high-throughput affinity-assay datasets covering 9 antigens, including viral (influenza H1N1, H3N2, H9N2; SARS-CoV-2 spike), cancer-related (HER2, VEGF), and other targets (hen-egg-white lysozyme, integrin-α1). Collectively, 155,853 heavy chain (vH) mutated antibody sequences are standardized and evaluated. Dataset selection criteria include:
- Retaining only vH sequences with ≥20 variants per study.
- Affinity is uniformly defined as $-\log K_D$ if $K_D$ values (dissociation constant) are available, or as $\log(\text{enrichment})$ for enrichment-based assays, standardizing “higher is better” across studies.
- Structural mapping places each variant into the wild-type Ab–Ag structural context by in silico side-chain replacement. For the generative case study, AlphaFold-3 is employed to predict novel complex structures.
- Additional biophysical baseline metrics, such as FoldX-calculated binding free energy ($\Delta G$) and epitope solvent-accessible surface area (SASA), are routinely computed.
| Dataset | # Variants | Affinity Metric |
|---|---|---|
| CR9114–H1 (4fqi_h1) | 65,094 | $-\log K_D$ |
| CR9114–H3 (4fqi_h3) | 65,535 | $-\log K_D$ |
| CR6261–H1 (3gbn_h1) | 1,887 | $-\log K_D$ |
| CR6261–H9 (3gbn_h9) | 1,842 | $-\log K_D$ |
| AAYL49 | 4,312 | $-\log K_D$ |
| AAYL49_ML | 8,953 | $-\log K_D$ |
| AAYL51 | 4,320 | $-\log K_D$ |
| G6.31 (VEGF) | 2,223 | enrichment |
| D44.1 (lysozyme) | 1,297 | enrichment |
| Trastuzumab (HER2) | 419 | $-\log K_D$ |
| AQC2 (Integrin-α1) | 40 | $-\log K_D$ |
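The "higher is better" standardization described above can be illustrated with a minimal sketch. It assumes a base-10 logarithm and a dissociation constant given in molar units; neither detail is stated in this summary, and the function name `standardize_affinity` is hypothetical.

```python
import math


def standardize_affinity(value: float, kind: str) -> float:
    """Map a raw assay readout onto a scale where higher means tighter binding.

    kind == "kd":         value is a dissociation constant (assumed molar)
    kind == "enrichment": value is an enrichment ratio
    The base-10 logarithm is an assumption made for this illustration.
    """
    if value <= 0:
        raise ValueError("assay values must be positive")
    if kind == "kd":
        # A smaller K_D means tighter binding, so negate the log.
        return -math.log10(value)
    if kind == "enrichment":
        # Larger enrichment already means better binding.
        return math.log10(value)
    raise ValueError(f"unknown assay kind: {kind!r}")


tight = standardize_affinity(1e-9, "kd")  # a 1 nM binder
weak = standardize_affinity(1e-6, "kd")   # a 1 µM binder
enriched = standardize_affinity(100.0, "enrichment")
```

With this convention the nanomolar binder receives a higher score than the micromolar one, so a single ranking direction applies to both assay types.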
3. Model Categories and Inference Protocols
Fourteen distinct models are systematically evaluated, grouped by machine learning paradigm and structural conditioning:
- Masked language models (MLMs): ESM-2, AntiBERTy, SaProt, ProSST, CurrAb. These models are trained with a masked-token prediction objective and use only sequence, or sequence plus local structural features.
- Autoregressive language models: ProGen2, which generates sequences left to right.
- Inverse Folding Models: ProteinMPNN, ESM-IF, AntiFold (ESM-IF fine-tuned). These are SE(3)-equivariant models that generate sequences given full Ab–Ag structure, thus explicitly encoding intermolecular context.
- Diffusion-Based Generative Models: DiffAb, AbDiffuser, IgDiff, which generate structural coordinates and sequences through stepwise denoising.
- Geometric Graph Models: MEAN, dyMEAN, graph-based networks for masked CDR completion using full atom/residue-level context.
At inference, each model is required to score the likelihood of an Ab–Ag pair, with task-specific adjustments for MLMs and generative models to yield harmonized log-likelihood or equivalent model scores. Diffusion and graph models marginalize over latent noise.
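One common way to obtain a sequence-level score from a masked language model is a pseudo-log-likelihood: mask each position of interest in turn and sum the log-probability of the observed residue. The sketch below illustrates that idea only; it is not confirmed to be the benchmark's exact protocol. The callable `masked_probs` stands in for a real model, and `toy_model` is a hypothetical placeholder that favors tyrosine everywhere.

```python
import math
from typing import Callable, Dict, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseudo_log_likelihood(
    sequence: str,
    positions: Sequence[int],
    masked_probs: Callable[[str, int], Dict[str, float]],
) -> float:
    """Sum log P(observed residue | rest of sequence) over the given positions."""
    total = 0.0
    for pos in positions:
        # masked_probs returns a distribution over residues at `pos`
        # with that position hidden from the model.
        probs = masked_probs(sequence, pos)
        total += math.log(probs[sequence[pos]])
    return total


def toy_model(sequence: str, pos: int) -> Dict[str, float]:
    """Hypothetical stand-in for a real MLM: prefers tyrosine at every position."""
    probs = {aa: 0.04 for aa in AMINO_ACIDS}
    probs["Y"] = 0.24  # 19 * 0.04 + 0.24 = 1.0
    return probs


score_mutant = pseudo_log_likelihood("AYA", [1], toy_model)
score_parent = pseudo_log_likelihood("AAA", [1], toy_model)
```

Restricting `positions` to the mutated or CDR residues keeps scores comparable across variants that share the same framework.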
4. Evaluation Metrics and Quantitative Criteria
The benchmark employs a set of correlation-based and absolute quantitative metrics:
- Likelihood–Affinity Correlation: Both Spearman’s $\rho$ (rank) and Pearson’s $r$ (linear) correlation between model-computed log-likelihood and experimental affinity (across variants), with p-values for statistical significance.
- Precision@10: Fraction of top-10 model-ranked antibodies whose experimentally measured affinity exceeds 5× wild-type; reported across five random splits.
- Generative Screening:
  - Phase 1: Structure-free metrics, including AntiBERTy log-likelihood (sequence plausibility) and FoldX energy change (relative binding energy improvement).
  - Phase 2: Structure-based metrics computed after refolding with AlphaFold-3: complex pLDDT (interface confidence), epitope SASA, and ProteinMPNN foldability scores.
  - Final selection is based on a Pareto-optimal balance across all metrics.
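The correlation and precision@10 metrics above can be sketched in plain Python. This is an illustration under stated assumptions, not the benchmark's code: the affinity threshold is passed in by the caller (the text defines it relative to wild-type), p-values are omitted, and all function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def precision_at_k(scores, affinities, threshold: float, k: int = 10) -> float:
    """Fraction of the k highest-scored variants whose affinity exceeds threshold."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(1 for i in top if affinities[i] > threshold) / len(top)


# Toy values for illustration only.
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
affinity = [1.0, 4.0, 9.0, 16.0, 25.0]
rho = spearman(scores, affinity)
r = pearson(scores, affinity)
p_at_4 = precision_at_k(scores, affinity, threshold=8.0, k=4)
```

In the toy data the relationship is monotonic but nonlinear, so the rank correlation is perfect while the linear correlation is slightly lower, which is why reporting both is informative.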
5. Empirical Results and Model Ranking
Benchmark results reveal systematic trends:
- Inverse folding models (ProteinMPNN, ESM-IF, AntiFold) achieve the highest mean Spearman correlation (up to $\rho \approx 0.50$) and superior precision@10 ($30\%$ and above), underscoring the value of explicit structural conditioning.
- Structure-aware MLMs (SaProt) provide intermediate performance (average Spearman 0.26), outperforming pure sequence-based LMs (ESM-2, AntiBERTy, ProGen2).
- In the generative case study on F045-092 against H1N1, ESM-IF and SaProt variants achieve more favorable FoldX binding-energy shifts (about –29 and –21 kcal/mol, respectively) and higher complex pLDDT (about 83), while maintaining sequence plausibility. DiffAb and MEAN show smaller improvements or more limited candidate diversity.
- The Pareto front of 18 final CDR-H3 designs exhibits high interface confidence and predicted affinity.
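Pareto-optimal selection keeps every candidate that no other candidate beats on all objectives at once. A minimal sketch follows, assuming each objective has been oriented so that higher is better (for example, by negating an energy shift). The function name and the numbers are purely illustrative and are not taken from the paper.

```python
from typing import List, Sequence


def pareto_front(candidates: Sequence[Sequence[float]]) -> List[int]:
    """Return indices of candidates not dominated by any other candidate.

    b dominates a when b is at least as good on every objective and
    strictly better on at least one (higher is better throughout).
    """
    front = []
    for i, a in enumerate(candidates):
        dominated = any(
            all(bv >= av for av, bv in zip(a, b))
            and any(bv > av for av, bv in zip(a, b))
            for j, b in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Toy objectives per design: (negated energy shift, pLDDT, log-likelihood).
designs = [
    (3.0, 80.0, -10.0),
    (2.0, 85.0, -12.0),
    (1.0, 79.0, -15.0),  # worse than the first design on every objective
    (2.5, 84.0, -9.0),
]
selected = pareto_front(designs)
```

The pairwise comparison is quadratic in the number of candidates, which is adequate for the small candidate pools typical of a final screening stage.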
Summary Table (selected metrics):
| Model | Avg Spearman | Avg p@10 | Case FoldX shift (kcal/mol) | Case pLDDT |
|---|---|---|---|---|
| ProteinMPNN | 0.48 | 0.45 | n/a | n/a |
| ESM-IF | 0.42 | 0.40 | –29.3 | 82.8 |
| SaProt | 0.26 | 0.20 | –20.8 | 83.2 |
| MEAN | 0.15 | 0.12 | –13.9 | 81.0 |
| DiffAb | 0.05 | 0.03 | +2.0 | 79.5 |
6. Interpretation and Biological Significance
AbGen’s complex-level benchmarks expose the limitations of sequence-only modeling for practical antibody design. Models that condition on explicit antigen structure—using message-passing, equivariant neural architectures, or geometric graphs—better recapitulate experimental binding data and propose plausible, high-affinity variants. Structure-aware models also provide natural regularization against spurious sequence artifacts and enable integration of additional developability or biophysical criteria. This suggests that evaluating generative and predictive models in the Ab–Ag context provides more meaningful assessments for real-world engineering tasks.
7. Limitations and Prospective Directions
Current limitations of AbGen include highly variable dataset sizes (e.g., the integrin-α1 dataset contains only 40 variants), the absence of functional neutralization or in vivo potency readouts for generated designs, and the lack of explicit modeling of light-chain contributions or full IgG context. Planned directions include:
- Expanding antigen/assay diversity (e.g., including GPCRs, cytokines).
- Integrating light-chain and full-antibody conformational modeling.
- Introducing developability metrics (aggregation propensity, immunogenicity).
- Testing transfer from in silico to functional and biophysical validation.
AbGen thus provides an integrated, biologically informed foundation for the benchmarking and progress measurement of antibody affinity modeling and design, with direct implications for therapeutic antibody discovery and immune engineering (Zhao et al., 23 May 2025).