Trends In Biological Sequence Data
@owid.epoch_database_growth
Growth of key biological sequence databases between January 1976 and January 2024.
Models of biological sequences are trained on data from a wide range of public databases compiled by government, academic, and private institutions. Epoch groups the major sources into three categories:
DNA sequence databases. These have the highest growth rate among the databases analyzed: the number of sequences stored in GenBank rose 31% between 2022 and 2023. Whole genome shotgun sequencing studies have driven most of this growth; entries in all other GenBank divisions, referred to as traditional entries, have increased far more slowly.
Protein sequence databases. The level of annotation varies between protein sequence databases. Richly annotated databases such as UniProtKB grow much more slowly (6.7%) than metagenomic databases such as MGnify (20%), which provide protein sequences but lack detailed information about each protein’s structure, function, and origin.
Protein structure databases. Gathering experimental data on protein structures is slow and painstaking, so the Protein Data Bank grows by only 6.5% per year. By contrast, databases that publish protein structures predicted by AI models can generate large volumes of synthetic data quickly. Synthetic-data databases such as AlphaFoldDB and ESMAtlas have dramatically boosted the supply of available data, though their growth could slow as opportunities for synthetic data are exhausted.
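The growth rates quoted above are simple percentage changes between two counts. As a minimal sketch of that arithmetic (the function names and the sample counts are hypothetical and not taken from Epoch's data), the following shows how a year-over-year rate is computed and how a constant annual rate compounds over several years:

```python
def percent_growth(previous: float, current: float) -> float:
    """Return the change from `previous` to `current` as a percentage."""
    if previous <= 0:
        raise ValueError("previous count must be positive")
    return (current - previous) / previous * 100.0


def project(count: float, annual_growth_percent: float, years: int) -> float:
    """Compound `count` forward at a constant annual growth rate.

    Assumes the rate stays fixed, which real databases need not do.
    """
    return count * (1.0 + annual_growth_percent / 100.0) ** years


if __name__ == "__main__":
    # Hypothetical counts, chosen only to reproduce a 31% increase.
    print(round(percent_growth(100_000, 131_000), 1))

    # Relative size after five years at the two annual rates quoted above,
    # starting from the same (hypothetical) baseline of 1.0.
    print(round(project(1.0, 6.5, 5), 3))
    print(round(project(1.0, 20.0, 5), 3))
```

Under the constant-rate assumption, the gap between a database growing at 6.5% per year and one growing at 20% per year widens quickly, which is why the fast-growing sources dominate the total volume of available data.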
The majority of entries in large biological databases such as the International Nucleotide Sequence Database Collaboration (INSDC), MGnify, UniProtKB and PDB pertain to cellular organisms (humans, animals, plants, fungi, yeast, bacteria). For example, 97% of UniProtKB entries are protein sequences from cellular organisms and 2% are viral protein sequences; a subset of these sequences comes from known pathogens.
CREATE TABLE owid_epoch_database_growth (
"year" INTEGER,
"gb_all_reported_sequences" UINTEGER,
"pdb_total_number_of_entries_available" UINTEGER,
"uniprot_uniprotkb_swiss_prot" UINTEGER,
"alpha_fold_number_of_predicted_structures" UINTEGER,
"esm_atlas_number_of_predicted_structures" UINTEGER,
"refseq_records" UINTEGER
);
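A table with this shape can be queried for year-over-year growth by joining each year to the one before it. The sketch below is illustrative only: it uses Python's built-in sqlite3 module rather than the engine the schema was written for (the UINTEGER type suggests a different one), it keeps just two of the columns, and the helper name and sample rows are hypothetical, not real database counts.

```python
import sqlite3

# Reduced copy of the schema: only the year and one count column.
SCHEMA = """
CREATE TABLE owid_epoch_database_growth (
    "year" INTEGER,
    "pdb_total_number_of_entries_available" INTEGER
)
"""

# Self-join each year to the previous year and compute percentage growth.
GROWTH_QUERY = """
SELECT cur."year",
       100.0 * (cur."pdb_total_number_of_entries_available"
                - prev."pdb_total_number_of_entries_available")
             / prev."pdb_total_number_of_entries_available" AS growth_percent
FROM owid_epoch_database_growth AS cur
JOIN owid_epoch_database_growth AS prev
  ON prev."year" = cur."year" - 1
ORDER BY cur."year"
"""


def yearly_growth(rows):
    """Load (year, count) rows into an in-memory table and return
    a list of (year, growth_percent) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(SCHEMA)
        conn.executemany(
            "INSERT INTO owid_epoch_database_growth VALUES (?, ?)", rows
        )
        return conn.execute(GROWTH_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical counts growing by 10% per year.
    sample = [(2021, 1000), (2022, 1100), (2023, 1210)]
    for year, growth in yearly_growth(sample):
        print(year, growth)
```

The first year in the data has no predecessor, so the join drops it; only years with a recorded previous year appear in the result.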