Representativeness of Yearly UniRef100 Releases to the Natural Protein Distribution

Ascertain how representative each yearly UniRef100 release dataset D_i (2015–2024) is of the underlying distribution of natural protein sequences P*, from which these yearly snapshots are presumed to be drawn in the CoPeP continual pretraining setting.

Background

CoPeP organizes continual pretraining across 10 yearly UniRef100 releases (2015–2024), treating each release as a distinct task while assuming all are sampled from a shared but inaccessible distribution of natural proteins (denoted P*). While distribution shifts occur between consecutive years due to database growth and curation, the relationship between each snapshot and this underlying distribution is not directly observable.

The authors explicitly note that it is unknown how representative each dataset D_i is of P*, and that larger yearly increments do not necessarily imply improved alignment with P*. Establishing the representativeness of each release would clarify how well training on a given year approximates the true natural protein distribution and would inform the design and evaluation of continual pretraining methods in this setting.
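One way to probe this question empirically, without access to P* itself, is to compare coarse statistics of the yearly snapshots against each other. The sketch below is illustrative only and is not from the paper: it compares smoothed amino-acid composition distributions of two toy sequence sets via Jensen-Shannon divergence, a symmetric, bounded proxy for distribution shift. Real analyses would use the actual releases and richer features (e.g., k-mer profiles or model embeddings); all names and the toy data here are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def aa_distribution(sequences, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Smoothed amino-acid frequency distribution over a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1])."""
    m = {k: 0.5 * (p[k] + q[k]) for k in p}
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy stand-ins for two yearly releases D_i, D_j (hypothetical data).
d_2015 = ["MKTAYIAKQR", "GAVLIMFWP"]
d_2024 = ["MKTAYIAKQR", "GAVLIMFWP", "STCYNQDEKRH"]

p = aa_distribution(d_2015)
q = aa_distribution(d_2024)
print(f"JSD(D_2015, D_2024) = {js_divergence(p, q):.4f}")
```

A small pairwise divergence between D_i and D_j would be consistent with both snapshots reflecting a shared distribution, but it cannot by itself establish alignment with P*, which is exactly the gap the question above targets.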

References

Additionally, it is unknown how representative $D_i$ is of $\mathcal{P}^*$, with the added challenge that yearly increments of the dataset do not correlate with improvements of $\mathcal{P}_i$ w.r.t. $\mathcal{P}^*$~\citep{fournierProteinLanguageModels2024,spinnerScalingDataSaturation2025}.

CoPeP: Benchmarking Continual Pretraining for Protein Language Models  (2603.00253 - Patil et al., 27 Feb 2026) in Streaming Protocol, Section 3.2