Growth of key biological sequence databases between January 1976 and January 2024.

The data used to train biological sequence models comes from a wide range of public databases compiled by government, academic, and private institutions. Epoch groups the major sources into three primary categories:

DNA sequence databases. These have the highest growth rate among the databases analyzed: GenBank saw a 31% increase in the number of sequences stored between 2022 and 2023. Whole genome shotgun sequencing studies have driven the growth of DNA data; the number of entries in all other GenBank divisions, referred to as traditional entries, has grown far more slowly.
Protein sequence databases. The level of detail in protein sequence databases varies. Richly annotated databases such as UniProtKB grow much more slowly (6.7%) than metagenomic databases such as MGnify (20%), which provide protein sequences but lack detailed information about each protein’s structure, function, and origin.
Protein structure databases. Gathering experimental data on protein structures is slow and painstaking, so the Protein Data Bank grows by only 6.5% per year. By contrast, databases that publish protein structures predicted by AI models can quickly generate large volumes of synthetic data. Synthetic databases such as AlphaFoldDB and ESMAtlas have dramatically boosted the supply of available data, though their growth could slow as opportunities for synthetic data are exhausted.
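
To put the growth rates quoted above on a common footing, the sketch below converts each percentage into an approximate doubling time. It assumes that every figure is an annual rate and that the rate stays constant, which the text does not claim; the function names and the dictionary are illustrative only and are not part of any Epoch tooling.

```python
import math

# Growth rates quoted in the text, as fractions.
# Assumption: each is an annual rate that holds constant over time.
ANNUAL_GROWTH = {
    "GenBank": 0.31,
    "MGnify": 0.20,
    "UniProtKB": 0.067,
    "PDB": 0.065,
}


def doubling_time_years(rate: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return math.log(2) / math.log(1 + rate)


def projected_multiple(rate: float, years: float) -> float:
    """Factor by which a database grows after `years` at a constant rate."""
    return (1 + rate) ** years


if __name__ == "__main__":
    for name, rate in ANNUAL_GROWTH.items():
        years = doubling_time_years(rate)
        print(f"{name}: doubles in roughly {years:.1f} years")
```

Under these assumptions, a database growing at 31% per year doubles in under three years, while one growing at about 6.5% per year takes more than a decade.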

The majority of entries in large biological databases such as the International Nucleotide Sequence Database Collaboration (INSDC), MGnify, UniProtKB and PDB pertain to cellular organisms (humans, animals, plants, fungi, yeast, bacteria). For example, 97% of UniProtKB entries are protein sequences from cellular organisms and 2% are viral protein sequences; a subset of these sequences comes from known pathogens.

Trends In Biological Sequence Data
