Predictive accuracy of Potts models for overlapping-gene sequences far from training distributions

Establish the empirical predictive accuracy of Direct Coupling Analysis Potts models trained on multiple sequence alignments when applied as fitness proxies to de novo designed overlapping-gene sequences that deviate substantially from the training distribution.

Background

The study designs overlapping genes using Potts models inferred from multiple sequence alignments and uses model energies as proxies for fitness. While Potts models have been validated for generating functional single-family proteins, the authors note that overlapping-gene designs substantially deviate from the training distribution of natural sequences.
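To make "model energy" concrete, here is a minimal sketch of how a Potts energy is evaluated for a single sequence. It assumes the common DCA sign convention E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j), in which lower energy means a better model score. The function name `potts_energy`, the array layouts of `h` and `J`, and the random toy parameters are illustrative assumptions, not the study's code or fitted parameters.

```python
import numpy as np

def potts_energy(seq, h, J):
    """Potts energy E(s) = -sum_i h[i, s_i] - sum_{i<j} J[i, j, s_i, s_j].

    seq: integer-encoded sequence of length L (values in 0..q-1)
    h:   fields, shape (L, q)                 (assumed layout)
    J:   couplings, shape (L, L, q, q)        (assumed layout; only i < j is used)
    """
    seq = np.asarray(seq)
    idx = np.arange(seq.shape[0])
    # Sum of single-site fields evaluated at the observed symbols.
    field_term = h[idx, seq].sum()
    # pair[i, j] = J[i, j, seq[i], seq[j]] for every position pair.
    pair = J[idx[:, None], idx[None, :], seq[:, None], seq[None, :]]
    # Keep each unordered pair once (strict upper triangle, i < j).
    coupling_term = np.triu(pair, k=1).sum()
    return -(field_term + coupling_term)

# Toy parameters drawn at random; NOT inferred from any alignment.
rng = np.random.default_rng(0)
L, q = 5, 21  # 20 amino acids plus a gap symbol (assumed alphabet size)
h = rng.normal(size=(L, q))
J = rng.normal(size=(L, L, q, q))
seq = rng.integers(0, q, size=L)
energy = potts_energy(seq, h, J)
```

For an overlapping-gene design, one such energy would be computed per encoded protein, each under the model of its own family.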

This deviation raises concerns about whether Potts model energies remain reliable indicators of folding and function in this regime, motivating direct experimental assessment of predictive accuracy for the overlapped sequences.
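One common way to quantify this kind of predictive accuracy, though not necessarily the one the authors would choose, is a rank correlation between model energies and measured fitness. The sketch below assumes lower energy should correspond to higher fitness; the helper `spearman_rho` and all numbers are hypothetical placeholders, not data from the study.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation; this simple version assumes no tied values."""
    rx = np.argsort(np.argsort(x))  # 0-based ranks of x
    ry = np.argsort(np.argsort(y))  # 0-based ranks of y
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical placeholder values for five designs (NOT experimental results).
energies = np.array([-12.1, -9.4, -7.8, -3.2, -1.5])
fitness = np.array([0.91, 0.80, 0.62, 0.35, 0.10])

# Negate energies so that a positive rho means "lower energy, higher fitness".
rho = spearman_rho(-energies, fitness)
```

A value of rho near 1 on real measurements would indicate that energies still rank designs well far from the training distribution; a value near 0 would indicate they do not.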

References

Most notably, while Potts models have been validated experimentally for single protein families, their accuracy as fitness predictors for sequences that deviate substantially from the training distribution -- as our overlapping sequences necessarily do -- remains uncertain.

The fitness landscape of overlapping genes (2604.00602 - Kirsch et al., 1 Apr 2026) in Discussion