Codon Usage Dataset

DNA codon usage frequencies of a large sample of diverse biological organisms

@kaggle.meetnagadia_condon_usage_dataset

About this Dataset

Dataset Information:

We examined codon usage frequencies in the genomic coding DNA of a large sample of diverse organisms from different taxa tabulated in the CUTG database. We further manually curated and harmonized the existing entries by reclassifying CUTG's bacteria ('bct') class into archaea ('arc'), plasmids ('plm'), and bacteria proper (keeping the original label 'bct'). The reclassification of the original 'bct' domain was simplified by extracting the distinct genus names of the entries from the files 'qbxxx.spsum.txt' (where xxx = bct (bacteria), inv (invertebrates), mam (mammals), pln (plants), pri (primates), rod (rodents), vrt (vertebrates)) and classifying by genus. There were 514 different genus names. Each genus category was checked and relabeled as 'arc' where appropriate. Among the eubacterial entries, bacterial genomes proper (keeping the original label 'bct') were distinguished from bacterial plasmids (now labeled 'plm').
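
The genus-based reclassification described above can be illustrated with a minimal Python sketch. It is not the curators' actual procedure. It assumes the genus is the first word of an entry's species name, uses a small hypothetical lookup (ARCHAEAL_GENERA) in place of the manually curated list, and assumes that plasmid entries can be recognized by the word "plasmid" in their name. The function names and sample entries are illustrative only.

```python
# Hypothetical stand-in for the manually curated list of archaeal genera.
ARCHAEAL_GENERA = {"Methanococcus", "Sulfolobus", "Halobacterium"}


def genus_of(species_name):
    """Return the genus, assumed here to be the first word of the species name."""
    return species_name.split()[0]


def distinct_genera(species_names):
    """Collect the distinct genus names that would need manual review."""
    return sorted({genus_of(name) for name in species_names})


def relabel(label, species_name, archaeal_genera=ARCHAEAL_GENERA):
    """Split the 'bct' class into 'arc', 'plm', and 'bct'; leave other classes unchanged."""
    if label != "bct":
        return label
    if genus_of(species_name) in archaeal_genera:
        return "arc"
    # Assumption: plasmid entries mention "plasmid" in their name.
    if "plasmid" in species_name.lower():
        return "plm"
    return "bct"


# Illustrative entries as (CUTG class label, species name); "pExample" is made up.
entries = [
    ("bct", "Escherichia coli"),
    ("bct", "Sulfolobus solfataricus"),
    ("bct", "Escherichia coli plasmid pExample"),
    ("pln", "Arabidopsis thaliana"),
]
relabeled = [(relabel(label, name), name) for label, name in entries]
```

Classifying by genus rather than by individual entry keeps the manual review small: only the distinct genus names have to be checked, and every entry of a checked genus inherits its label.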

Following these preprocessing steps, the final dataset file combines all entries of the CUTG database files 'qbxxx.spsum.txt' into one text file. As detailed above, the 'qbbct.spsum.txt' entries were separated into 'bct' (eubacteria), 'plm' (plasmids), and 'arc' (archaea), a distinction not originally made in the CUTG database.

Source:

Bohdan Khomtchouk, Ph.D. University of Chicago, Department of Medicine, Section of Computational Biomedicine and Biomedical Data Science. Email: bohdan '@' uchicago.edu