Dataset Information:
We examined codon usage frequencies in the genomic coding DNA of a large sample of diverse organisms from different taxa tabulated in the CUTG database. We manually curated and harmonized the existing entries by reclassifying CUTG's bacteria (bct) class into archaea ('arc'), plasmids ('plm'), and bacteria proper (keeping the original label 'bct'). The reclassification of the original 'bct' domain was simplified by extracting the genus names of the entries from the files 'qbxxx.spsum.txt' (where xxx = bct (bacteria), inv (invertebrates), mam (mammals), pln (plants), pri (primates), rod (rodents), vrt (vertebrates)) and classifying by genus. There were 514 different genus names. Each genus category was checked and relabeled as 'arc' where appropriate. Among the eubacterial entries, bacterial genomes proper (keeping the original label 'bct') were distinguished from bacterial plasmids (now labeled 'plm').
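The genus-based relabeling described above can be illustrated with a minimal Python sketch. This is not the curation procedure itself, which was done manually. The genus set, the example species names, and the rule that plasmid entries contain the word "plasmid" in their name are all assumptions made for illustration.

```python
# Hypothetical, incomplete set of archaeal genera; the actual curation
# checked all 514 genus names by hand.
ARCHAEAL_GENERA = {"Methanococcus", "Sulfolobus", "Halobacterium"}


def genus_of(species_name: str) -> str:
    """Return the first whitespace-separated token of a species name."""
    parts = species_name.split()
    return parts[0] if parts else ""


def reclassify(label: str, species_name: str) -> str:
    """Map an original CUTG class label to 'arc', 'plm', or the unchanged label."""
    if label != "bct":
        return label  # only the bacterial class is split further
    if genus_of(species_name) in ARCHAEAL_GENERA:
        return "arc"
    # Assumption: plasmid entries mention 'plasmid' in their name.
    if "plasmid" in species_name.lower():
        return "plm"
    return "bct"


# Hypothetical entries as (original label, species name) pairs.
entries = [
    ("bct", "Escherichia coli"),
    ("bct", "Sulfolobus solfataricus"),
    ("bct", "Escherichia coli plasmid"),
    ("mam", "Bos taurus"),
]
relabeled = [reclassify(label, name) for label, name in entries]
print(relabeled)
```

Checking the archaeal genera first and the plasmid rule second mirrors the order in the description, where the plasmid distinction is made only among the remaining eubacterial entries.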
Following these preprocessing steps, the final dataset file comprises all entries of the CUTG database files 'qbxxx.spsum.txt' in one text file. As detailed above, the 'qbbct.spsum.txt' entries were separated into 'bct' (eubacteria), 'plm' (plasmids), and 'arc' (archaea), a distinction not originally made in the CUTG database.
Source:
Bohdan Khomtchouk, Ph.D. University of Chicago, Department of Medicine, Section of Computational Biomedicine and Biomedical Data Science. Email: bohdan '@' uchicago.edu