Explain ProCALM’s successful generation for EC 7.1.1.2 and 7.1.1.9

Ascertain why ProCALM successfully generated sequences corresponding to EC 7.1.1.2 and EC 7.1.1.9 even though bacterial sequences for these EC classes were held out during training. ProCALM is a ProGen2-based protein language model finetuned with conditional adapters for joint conditioning on Enzyme Commission (EC) number and taxonomy. Identify the model- and data-related factors that enable this outcome.

Background

The study introduces ProCALM, a conditional adapter approach applied to ProGen2 for generating enzyme sequences conditioned on EC numbers and taxonomy, including joint conditioning via parallel adapters. To probe out-of-distribution generalization, the authors held out all bacterial sequences for selected EC classes during training and then attempted to generate bacterial sequences for those classes.
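The held-out split described above can be illustrated with a minimal sketch. It is not the authors' code: the `EnzymeRecord` type, the `split_holdout` helper, the field names, and the toy records are all hypothetical, chosen only to show that a record is withheld when it has both a selected EC class and the bacterial label, while non-bacterial members of the same EC class remain in training.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnzymeRecord:
    """Hypothetical record type; field names are assumptions for illustration."""
    seq_id: str
    ec: str      # EC number, e.g. "7.1.1.2"
    taxon: str   # coarse taxonomy label, e.g. "Bacteria"


def split_holdout(records, held_out_ecs, held_out_taxon="Bacteria"):
    """Withhold records that match BOTH a selected EC class and the taxon.

    Everything else stays in the training set, including non-bacterial
    sequences for the selected EC classes and bacterial sequences for
    other EC classes.
    """
    train, held_out = [], []
    for record in records:
        if record.ec in held_out_ecs and record.taxon == held_out_taxon:
            held_out.append(record)
        else:
            train.append(record)
    return train, held_out


# Toy records (invented, not real data).
records = [
    EnzymeRecord("a", "7.1.1.2", "Bacteria"),
    EnzymeRecord("b", "7.1.1.2", "Eukaryota"),
    EnzymeRecord("c", "7.1.1.9", "Bacteria"),
    EnzymeRecord("d", "1.1.1.1", "Bacteria"),
]
train, held_out = split_holdout(records, {"7.1.1.2", "7.1.1.9"})
```

Under this kind of split, the model still sees each held-out EC class in other taxa and sees bacteria in other EC classes; only the specific combination is absent from training.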

In some cases, ProCALM generated sequences that both matched the target EC and mapped to bacteria, despite the absence of bacterial training examples for those classes. Two highlighted EC classes, 7.1.1.2 and 7.1.1.9, are transmembrane enzymes that are part of large complexes. The authors explicitly note that it is not clear why the model succeeded for these particular functions, which motivates investigating the mechanisms underlying these successful generations.
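The joint success criterion (a generated sequence counts only if it matches the target EC and also maps to bacteria) can be expressed as a simple rate. The sketch below assumes each generated sequence has already been assigned an EC number and a taxon label by some external annotation step; the `joint_success_rate` helper, the dictionary keys, and the toy annotations are hypothetical and do not reproduce the paper's evaluation pipeline.

```python
def joint_success_rate(generated, target_ec, target_taxon="Bacteria"):
    """Fraction of generated sequences meeting both conditions at once.

    `generated` is a list of dicts with hypothetical keys "ec" and "taxon",
    holding labels assigned by an external annotation step (assumed here).
    """
    if not generated:
        return 0.0
    hits = sum(
        1
        for g in generated
        if g["ec"] == target_ec and g["taxon"] == target_taxon
    )
    return hits / len(generated)


# Toy annotations (invented, not real results).
generated = [
    {"ec": "7.1.1.2", "taxon": "Bacteria"},   # meets both conditions
    {"ec": "7.1.1.2", "taxon": "Eukaryota"},  # right EC, wrong taxon
    {"ec": "1.1.1.1", "taxon": "Bacteria"},   # right taxon, wrong EC
    {"ec": "7.1.1.2", "taxon": "Bacteria"},   # meets both conditions
]
rate = joint_success_rate(generated, "7.1.1.2")
```

Requiring both conditions together is what makes the held-out test informative: matching either the EC or the taxon alone could be achieved from combinations seen during training.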

References

7.1.1.2 and 7.1.1.9 are transmembrane enzymes part of large complexes, but it is not clear why these particular functions were successfully generated.

Function-Guided Conditional Generation Using Protein Language Models with Adapters (2410.03634 - Yang et al., 2024) in Discussion