Representativeness of Yearly UniRef100 Releases to the Natural Protein Distribution
Ascertain how representative each yearly UniRef100 release dataset D_i (2015–2024) is of the underlying distribution of natural protein sequences P*, from which these yearly snapshots are presumed to be sampled in the CoPeP continual pretraining setting.
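As a rough illustration of what a representativeness check could look like, the sketch below compares the amino-acid composition of two snapshots using the Jensen-Shannon divergence. This is an assumption-laden proxy and not the method of the cited paper. P* is not observable, so the code can only compare observed snapshots with each other (for example D_i against D_{i+1}), and residue composition is a very coarse summary of a sequence distribution. The function names `composition` and `js_divergence` are hypothetical.

```python
import math
from collections import Counter

# The 20 standard amino acids; other symbols (X, B, Z, ...) are ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences):
    """Return the relative frequency of each standard residue in a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between two distributions
    given as dicts over the same keys."""
    def kl(a, b):
        # Terms with a[k] == 0 contribute nothing; b is the mixture, so b[k] > 0
        # whenever a[k] > 0.
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)

    m = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy stand-ins for two yearly snapshots (hypothetical data, not UniRef100).
snapshot_a = ["MKTAYIAKQR", "MKLVVAG"]
snapshot_b = ["MKTAYIAKQR", "GGSGGSGGS"]
drift = js_divergence(composition(snapshot_a), composition(snapshot_b))
```

A small divergence between consecutive snapshots would only indicate that they resemble each other on this statistic. It would not show that either one is close to P*, which is the part of the question that remains open.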
References
Additionally, it is unknown how representative $D_i$ is of $\mathcal{P}*$, with the added challenge that yearly increments of the dataset do not correlate with improvements of $\mathcal{P}_i$ w.r.t. $\mathcal{P}*$~\citep{fournierProteinLanguageModels2024,spinnerScalingDataSaturation2025}.
— CoPeP: Benchmarking Continual Pretraining for Protein Language Models
(2603.00253 - Patil et al., 27 Feb 2026) in Streaming Protocol, Section 3.2