PepMerge Benchmark: Peptide & LLM Merging
- The paper introduces two benchmark suites: one for peptide–protein complex design and another for LLM merging, set up for rigorous cross-method evaluation.
- The evaluation protocol employs precise geometry, energy, and design metrics to measure generative model performance in structural and functional accuracy.
- The surrogate SMM-Bench provides rapid, cost-effective hyperparameter optimization for LLM merging, accelerating research in neural model surgery.
PepMerge Benchmark refers to two distinct but contextually related benchmark suites in the recent literature: (1) a comprehensive structural benchmark for peptide–protein complex design and evaluation in computational biology, and (2) a surrogate benchmark (“SMM-Bench”) for hyperparameter optimization in the context of LLM merging. Both benchmarks are notable for their rigorous curation, amenability to broad algorithmic evaluation, and growing influence in their respective research communities.
1. Benchmark for Peptide–Protein Complex Design
The PepMerge benchmark in molecular design is an 8,365-entry non-redundant corpus of peptide–protein complexes, designed to support the evaluation and comparison of generative models for targeted peptide design. The dataset is assembled from PepBDB and Q-BioLip, filtered to high-quality entries (X-ray structures at better than 4 Å resolution, peptide length 3–25 residues), and redundancy-reduced by clustering receptors at 40% sequence identity with MMseqs2. This clustering yields 292 clusters, of which 10 clusters (158 complexes) are reserved as a test split; the remaining 282 clusters support model training and validation, with the train/validation split following Li et al. (2024) (Wu et al., 8 Jan 2026).
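The cluster-level hold-out described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's release code; the mapping from complex ID to MMseqs2 cluster ID is assumed to be available as a plain dictionary:

```python
import random

def cluster_split(cluster_of, n_test_clusters=10, seed=0):
    """Hold out entire receptor clusters, so no test receptor shares
    >=40% sequence identity with any training receptor.

    cluster_of: dict mapping complex ID -> cluster ID (e.g. from MMseqs2).
    Returns (train_val_ids, test_ids).
    """
    clusters = sorted(set(cluster_of.values()))
    rng = random.Random(seed)
    test_clusters = set(rng.sample(clusters, n_test_clusters))
    test = [c for c, k in cluster_of.items() if k in test_clusters]
    train_val = [c for c, k in cluster_of.items() if k not in test_clusters]
    return train_val, test
```

Splitting by cluster rather than by complex is what makes the test split structurally non-redundant with the training data.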
2. Evaluation Metrics and Protocol
PepMerge provides a unified evaluation protocol using geometry, energy, and design metrics. These metrics are explicitly defined and enable direct comparison across generative approaches.
- Geometry Metrics:
- Amino Acid Recovery Rate (AAR): Percentage of peptide positions at which the generated sequence exactly matches the native, formalized as $\mathrm{AAR} = \frac{1}{L}\sum_{i=1}^{L}\mathbf{1}[\hat{s}_i = s_i] \times 100\%$, where $L$ is the peptide length,
- Cα RMSD: Root-mean-square deviation between predicted and native Cα coordinates,
- Secondary Structure Similarity Ratio (SSR): Fraction of peptide residues whose secondary-structure assignment matches the native,
- Binding-Site Overlap (BSR): Site-wise overlap between predicted and native binding footprints,
- Energy Metrics:
- Complex Stability (Stb): Fraction of predictions with lower Rosetta all-atom energy than the native,
- Binding Affinity Improvement (Aff): Fraction of predictions with stronger (lower) binding affinity (Rosetta ΔG) than native,
- Design Metrics:
- Designability (Des): Fraction of predictions that refold to within 2 Å RMSD of the original backbone using ESMFold,
- Diversity (Div): Diversity quantified as one minus the mean pairwise TM-score among designs,
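Two of the sequence- and structure-level metrics above can be computed directly from their definitions. The sketch below implements AAR and Div; pairwise TM-scores would in practice come from an external tool such as TM-align, so they are assumed here to be supplied as a precomputed symmetric matrix:

```python
def aar(native_seq, designed_seq):
    """Amino Acid Recovery: % of positions where the design matches the native."""
    assert len(native_seq) == len(designed_seq)
    matches = sum(a == b for a, b in zip(native_seq, designed_seq))
    return 100.0 * matches / len(native_seq)

def diversity(tm_scores):
    """Div = 1 - mean pairwise TM-score among generated designs.

    tm_scores: symmetric NxN matrix of pairwise TM-scores (N >= 2).
    """
    n = len(tm_scores)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mean_tm = sum(tm_scores[i][j] for i, j in pairs) / len(pairs)
    return 1.0 - mean_tm
```

Under these definitions, higher AAR indicates closer sequence recovery, while higher Div indicates structurally less redundant design sets.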
Table: Comparative performance of generative models on the PepMerge test split (Wu et al., 8 Jan 2026):
| Method | AAR % | RMSD (Å) | SSR % | BSR % | Stb % | Aff % | Des % | Div |
|---|---|---|---|---|---|---|---|---|
| Diffusion | 47.04 | 3.28 | 74.89 | 49.83 | 15.34 | 17.13 | 48.54 | 0.57 |
| PepGLAD | 50.43 | 3.83 | 80.24 | 19.34 | 20.39 | 10.47 | 75.07 | 0.32 |
| PPIFlow | 48.35 | 3.59 | 68.13 | 25.94 | 15.77 | 12.08 | 46.53 | 0.51 |
| PepFlow | 51.25 | 2.07 | 83.46 | 86.89 | 18.15 | 21.37 | 65.22 | 0.42 |
| SurfFlow | 54.07 | 1.96 | 85.11 | 87.38 | 22.46 | 22.51 | 73.60 | 0.61 |
For method details, see (Wu et al., 8 Jan 2026).
3. Modeling Innovations and Methodological Advances
PepMerge facilitates direct assessment of next-generation generative modeling architectures. Recent publications demonstrate that surface-based representations—which exploit molecular surface geometry and physicochemical features—yield superior generative performance compared to full-atom backbones alone. The SurfFlow architecture, for example, incorporates the following components (Wu et al., 8 Jan 2026):
Multi-Modality Conditional Flow Matching: Applied separately to continuous (surface points, normals) and categorical (surface annotation) modalities via continuous-time Markov chains, with distinct vector field predictors for each feature type.
Equivariant Surface Geometric Network (ESGN): Models intra- and inter-surface graphs with radial basis function (RBF) and spherical Fourier–Bessel (SBF) message-passing, soft attention across surfaces, and strict equivariance to 3D rigid body transformations (SE(3)-equivariance).
Classifier-Free Guidance: Enables conditional generation (e.g., imposing cyclization) with no need for explicit external classifiers.
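At sampling time, classifier-free guidance interpolates between conditional and unconditional vector-field predictions. The following is a minimal sketch of the standard formulation, not SurfFlow's actual implementation; the guidance scale `s` is an assumed knob, not a value reported in the paper:

```python
def cfg_vector_field(v_uncond, v_cond, s=2.0):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by guidance scale s.

    v_uncond, v_cond: vector-field predictions (same length).
    s = 1.0 recovers the purely conditional field; s > 1.0 strengthens
    the condition (e.g. an imposed cyclization constraint).
    """
    return [vu + s * (vc - vu) for vu, vc in zip(v_uncond, v_cond)]
```

Because both predictions come from the same network (with the condition dropped for the unconditional pass), no external classifier is needed.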
A plausible implication is that benchmarks like PepMerge are catalyzing the transition from sequence-based to surface-based molecular generative design, enabling more accurate modeling of shape complementarity and interfacial chemistry.
4. Surrogate PepMerge (SMM-Bench) for Model-Merging Optimization
The PepMerge name is also used for a surrogate benchmark suite (referred to as SMM-Bench) designed for evaluating hyperparameter optimization (HPO) algorithms in LLM model merging (Akizuki et al., 2 Sep 2025). This benchmark provides a very low-cost proxy for the expensive function evaluations associated with true model merging.
Key elements include:
- Two Search Spaces:
- Parameter-Space (PS) Merging: 64 continuous hyperparameters in [0, 1] representing layer-wise arithmetic weights.
- Data-Flow-Space (DFS) Merging: 32 categorical choices (one per layer: insert the layer from model A, from model B, or neither) and 63 continuous scaling parameters.
- Regression Surrogates: LightGBM regressors trained on large grids of sampled (hyperparameter, true accuracy) pairs, supporting two evaluation datasets (gsm8k-ja, MGSM).
- Instantaneous Evaluation: Surrogate functions return predicted accuracy for any hyperparameter vector in milliseconds per call.
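Candidate configurations for the two search spaces above can be sampled as follows. This is a sketch under the stated dimensionalities; the exact encoding expected by SMM-Bench's surrogates is assumed to match these shapes:

```python
import random

def sample_ps(rng):
    """Parameter-space merge: 64 continuous weights in [0, 1]."""
    return [rng.random() for _ in range(64)]

def sample_dfs(rng):
    """Data-flow-space merge: 32 categorical layer choices
    ('A', 'B', or 'none') plus 63 continuous scaling parameters."""
    layers = [rng.choice(["A", "B", "none"]) for _ in range(32)]
    scales = [rng.random() for _ in range(63)]
    return layers, scales
```

The mixed categorical/continuous structure of the DFS space is what makes it the harder of the two for many black-box optimizers.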
Example usage:
```python
from smm_bench import SurrogatePS, SurrogateDFS

ps_sur = SurrogatePS(dataset="gsm8k-ja")    # or "MGSM"
dfs_sur = SurrogateDFS(dataset="gsm8k-ja")  # or "MGSM"

def objective_PS(h_vector):
    # h_vector: length-64 array in [0, 1]
    return ps_sur.predict(h_vector)  # surrogate accuracy
```
This setup is specifically intended for rapid development and benchmarking of black-box HPO algorithms (e.g., Bayesian optimization, CMA-ES).
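A baseline optimizer over this objective can be sketched as follows. The surrogate itself is replaced by a stand-in quadratic objective so the example runs without SMM-Bench installed; in real use, `objective` would be the surrogate's predict call:

```python
import random

def random_search(objective, dim, n_evals, seed=0):
    """Black-box HPO baseline: sample uniformly in [0, 1]^dim and
    keep the best configuration seen so far."""
    rng = random.Random(seed)
    best_x, best_y = None, float("-inf")
    for _ in range(n_evals):
        x = [rng.random() for _ in range(dim)]
        y = objective(x)
        if y > best_y:
            best_x, best_y = x, y
    return best_x, best_y

# Stand-in for the surrogate's predicted accuracy (hypothetical objective,
# maximized at h = 0.5 in every coordinate).
def mock_surrogate(h_vector):
    return 1.0 - sum((h - 0.5) ** 2 for h in h_vector) / len(h_vector)
```

Because each surrogate call costs milliseconds, even naive baselines like this can be run for thousands of evaluations, which is exactly the regime the benchmark targets.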
5. Experimental Protocols and Best Practices
For molecular design, protocol details include strict train/val/test separation by receptor sequence cluster; thus, test entries are structurally non-redundant with training data, emulating real-world generalization requirements (Wu et al., 8 Jan 2026). For SMM-Bench, all true accuracy values are precomputed and stored, and both surrogate fidelity and HPO optimizer performance can be evaluated via standard metrics:
- Surrogate Metrics: MSE, coefficient of determination (R²), Kendall's Tau (τ).
- Optimizer Trajectories: Best-so-far accuracy versus function evaluations.
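The optimizer-trajectory metric reduces a run to its running maximum; a minimal sketch:

```python
def best_so_far(accuracies):
    """Running maximum of accuracies, one entry per function evaluation.
    Plotting this against evaluation count gives the optimizer trajectory."""
    curve, best = [], float("-inf")
    for a in accuracies:
        best = max(best, a)
        curve.append(best)
    return curve
```

Comparing these monotone curves across optimizers at a fixed evaluation budget is the standard way to rank HPO methods on the benchmark.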
Best practices recommend starting with Parameter-Space merging, using only prescribed hyperparameter domains, and validating optimizers under both uniform-sampled and full datasets to assess surrogate bias and robustness (Akizuki et al., 2 Sep 2025).
6. Applications and Scientific Impact
The PepMerge benchmark in peptide design has already supported the development of advanced models for both L-peptides and D-peptides, including D-Flow (Wu et al., 2024) and SurfFlow (Wu et al., 8 Jan 2026). By offering a standardized, clustered, and biologically realistic testbed, it facilitates fair cross-method comparison and drives methodological innovation—particularly for models leveraging protein surface information.
In HPO and AutoML for model merging, SMM-Bench enables systematic, reproducible assessment of optimization methods without the prohibitive costs of repeated full LLM merges and evaluations. This accelerates the cycle of optimizer design and validation, encouraging methodological advances in neural model surgery and transfer.
7. Related Resources and Outlook
The PepMerge and SMM-Bench benchmarks are accessible, reproducible, and extensible platforms. Source code for peptide generative modeling is available via public repositories (e.g., D-Flow at https://github.com/smiles724/PeptideDesign (Wu et al., 2024)). As benchmarking culture matures, plausible future directions include extending PepMerge to deeper coverage of peptide/receptor chemistries, integrating additional biophysically relevant metrics, and evolving SMM-Bench to higher-dimensional, multi-source LLM merging and more challenging black-box HPO scenarios. The PepMerge framework thus exemplifies the rigorous, metric-driven benchmarking paradigm required for credible progress in molecular design and neural model composition.