Dataset: Drosophila Melanogaster Genome

About this Dataset

Drosophila Melanogaster Genome

Drosophila Melanogaster

Drosophila melanogaster, the common fruit fly, is a model organism that has been used extensively in entomological research. It is one of the most studied organisms in biological research, particularly in genetics and developmental biology.

When it's not being used for scientific research, D. melanogaster is a common pest in homes, restaurants, and anywhere else that serves food. It is not to be confused with Tephritidae flies (also known as fruit flies).

https://en.wikipedia.org/wiki/Drosophila_melanogaster

About the Genome

This genome was first sequenced in 2000. It contains four pairs of chromosomes (2, 3, 4 and X/Y). More than 60% of the genome appears to be functional non-protein-coding DNA.

The genome is maintained and frequently updated at FlyBase. This dataset is sourced from the UCSC Genome Bioinformatics download page. It uses the August 2014 version of the D. melanogaster genome (dm6, BDGP Release 6 + ISO1 MT). http://hgdownload.soe.ucsc.edu/downloads.html#fruitfly

Files were modified by Kaggle to be a better fit for analysis on Scripts. This primarily involved turning files into CSV format, with a header row, as well as converting the genome itself from 2bit format into a FASTA sequence file.

Bioinformatics

Genomic analysis can be daunting to data scientists who haven't had much experience with bioinformatics before. We have tried to give a basic explanation of each of the files in this dataset, as well as links to further reading on the biological basis for each. If you haven't had the chance to study much biology before, some light reading (e.g. Wikipedia) on the following topics may help you understand the nuances of the data provided here: Genetics, Genomics (Sequencing/Genome Assembly), Chromosomes, DNA, RNA (mRNA/miRNA), Genes, Alleles, Exons, Introns, Transcription, Translation, Peptides, Proteins, Gene Regulation, Mutation, Phylogenetics, and SNPs.

Of course, if you've got some idea of the basics already - don't be afraid to jump right in!

Learning Bioinformatics

There are a lot of great resources for learning bioinformatics on the web. One cool site is Rosalind - a platform that gives you bioinformatic coding challenges to complete. You can use Kaggle Scripts on this dataset to easily complete the challenges on Rosalind (and see Myles' solutions here if you get stuck). We have set up Biopython, a great library to help you with your analyses, on Kaggle's docker image. Check out their tutorial here; we've also created a Python notebook that applies some of the tutorial to this dataset as a reference.

Files in this Dataset

genome.fa

The assembled genome itself is presented here in FASTA format. Each chromosome is a different sequence of nucleotides. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
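As a rough illustration of the lower-case convention, the sketch below reads FASTA records with plain Python and reports the fraction of each sequence that is soft-masked. The helper names (read_fasta, masked_fraction) and the two tiny sequences are made up for the example; on Kaggle you would open genome.fa instead of the in-memory sample, and Biopython could replace the hand-written parser.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masked_fraction(seq):
    """Fraction of bases written in lower case (soft-masked repeats)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


# Made-up stand-in for genome.fa; real chromosomes are millions of bases long.
sample = io.StringIO(">chrA\nACGTacgt\nACGT\n>chrB\nttttGGGG\n")
fractions = {name: masked_fraction(seq) for name, seq in read_fasta(sample)}
for name, fraction in fractions.items():
    print(name, round(fraction, 3))
```

Because the masking is carried only by letter case, calling seq.upper() recovers the unmasked sequence whenever the repeat annotation is not needed.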

There are 3 additional files with meta information about the genome.

meta-cpg-island-ext-unmasked.csv

This file contains descriptive information about CpG Islands in the genome.

https://en.wikipedia.org/wiki/CpG_site
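The columns of this file (cpgnum, gcnum, pergc, obsexp) can be related back to the raw sequence. The sketch below assumes they follow the usual UCSC CpG island definitions, which this description does not spell out: cpgnum counts CG dinucleotides, gcnum counts C and G bases, and obsexp is the observed/expected CpG ratio, CpG count * length / (C count * G count). The function name cpg_stats and the toy sequence are illustrative only.

```python
def cpg_stats(seq):
    """Return CpG-island style statistics for one DNA sequence."""
    seq = seq.upper()  # ignore soft-masking when counting bases
    length = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact

    if length == 0 or c_count == 0 or g_count == 0:
        obs_exp = 0.0
    else:
        # Assumed formula: observed CpG over the count expected by chance.
        obs_exp = cpg_count * length / (c_count * g_count)

    return {
        "length": length,
        "cpgnum": cpg_count,
        "gcnum": c_count + g_count,
        "pergc": 100.0 * (c_count + g_count) / length if length else 0.0,
        "obsexp": obs_exp,
    }


stats = cpg_stats("CGCGCGAT")
print(stats)
```

Recomputing these values for a few rows, using chromstart and chromend to slice the matching chromosome in genome.fa, is a quick way to check that the assumed definitions and coordinate convention match the file.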

meta-cytoband.csv

This file describes the positions of cytogenetic bands on each chromosome.

https://en.wikipedia.org/wiki/Cytogenetics

meta-simple-repeat.csv

This file describes simple tandem repeats in the genome.

https://en.wikipedia.org/wiki/Repeated_sequence_(DNA)
https://en.wikipedia.org/wiki/Tandem_repeat

Messenger RNA (mRNA) is an intermediate molecule created as part of the cellular process of converting genomic information into proteins. Some mRNA are never translated into proteins and have functional roles in the cell on their own. Collectively, an organism's mRNA information is known as its transcriptome. The mRNA files included in this dataset give insight into the activity of genes in the organism.

https://en.wikipedia.org/wiki/Messenger_RNA

mrna-genbank.fa

This file includes all mRNA sequences from GenBank associated with Drosophila melanogaster.

http://www.ncbi.nlm.nih.gov/genbank/

mrna-refseq.fa

This file includes all mRNA sequences from RefSeq associated with Drosophila melanogaster.

http://www.ncbi.nlm.nih.gov/refseq/

A gene is a segment of DNA on the genome which, through mRNA, is used to create proteins in the organism. Knowing which parts of DNA are coding (genes) or non-coding is difficult, and a number of different systems for prediction exist. This dataset includes the output of several gene prediction systems applied to the Drosophila melanogaster genome.

https://en.wikipedia.org/wiki/Gene_prediction

genes-augustus.csv

AUGUSTUS is a piece of software that predicts genes ab initio using Hidden Markov Models.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC441517/

genes-genscan.csv

GENSCAN is an older ab initio program for predicting genes.
http://genes.mit.edu/GENSCANinfo.html

genes-ensembl.csv
ensembl-gtp.csv
ensembl-pep.csv
ensembl-source.csv
ensembl-to-gene-name.csv

Ensembl provides gene annotation generated by their software Genebuild. This process combines automatic annotation with manual curation.
http://uswest.ensembl.org/info/genome/genebuild/genome_annotation.html

We have also included some supplementary files for these, including predicted protein peptide sequences for each predicted gene.

genes-refseq.csv
genes-xeno-refseq.csv
refseq-link.csv
refseq-summary.csv

We have included two RefSeq gene predictions in this dataset. The first is based solely on information from the Drosophila melanogaster genome. The second (genes-xeno-refseq.csv) uses genes from other organisms as a basis for predicting genes in Drosophila melanogaster.

RefSeq RNAs were aligned against the D. melanogaster genome using blat; those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept.
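The identity rule described above can be sketched in a few lines. This is not the pipeline that produced the files, only an illustration of the two thresholds (within 0.1% of the best identity, and at least 96% identity). The tuple layout, the function name filter_alignments, and the accessions are hypothetical, and the separate 15% alignment cutoff is left out.

```python
from collections import defaultdict


def filter_alignments(alignments, near_best=0.1, min_identity=96.0):
    """Keep alignments close to each RNA's best hit and above a floor.

    alignments: iterable of (rna_id, location, identity_percent) tuples.
    """
    alignments = list(alignments)

    # First pass: best base identity seen for each RNA.
    best = defaultdict(float)
    for rna_id, _location, identity in alignments:
        best[rna_id] = max(best[rna_id], identity)

    # Second pass: apply both thresholds.
    return [
        (rna_id, location, identity)
        for rna_id, location, identity in alignments
        if identity >= min_identity and best[rna_id] - identity <= near_best
    ]


# Made-up hits: one RNA aligning in three places, another with a weak hit.
hits = [
    ("NM_A", "chr2L:100", 99.5),
    ("NM_A", "chr3R:500", 99.45),
    ("NM_A", "chrX:900", 97.0),
    ("NM_B", "chr2R:50", 95.0),
]
kept = filter_alignments(hits)
print(kept)
```

In this toy input the chrX hit is dropped for being more than 0.1% below the best NM_A hit, and NM_B is dropped for falling under 96% identity. An RNA can therefore still appear at more than one genomic location in the resulting table.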

We have also included supplementary files for these which include information about the genes that have been identified.

http://www.ncbi.nlm.nih.gov/refseq/

What can you do with this data?

Genomic data is the foundation of bioinformatics, and there is an incredible array of things you can do with this data. A good place to start is to look at some of the meta supplementary files alongside the genomic sequence itself.

We have a number of different gene prediction systems in the dataset. How do they compare to each other? How do they compare to the mRNA data?
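One simple way to start comparing two prediction sets is to ask what fraction of the genes in one set overlap a gene on the same chromosome and strand in the other. The sketch below assumes each prediction is reduced to a (chrom, strand, txstart, txend) tuple with half-open coordinates, which is the usual UCSC convention but is not stated in this description. The function names and coordinates are invented for illustration.

```python
def overlaps(a, b):
    """True if two (chrom, strand, start, end) records share at least one base."""
    same_place = a[0] == b[0] and a[1] == b[1]
    # Half-open intervals [start, end) overlap when each starts before the other ends.
    return same_place and a[2] < b[3] and b[2] < a[3]


def fraction_overlapping(query, reference):
    """Fraction of query genes that overlap at least one reference gene."""
    if not query:
        return 0.0
    hits = sum(1 for q in query if any(overlaps(q, r) for r in reference))
    return hits / len(query)


# Made-up records standing in for rows of genes-augustus.csv and genes-genscan.csv.
augustus = [
    ("chr2L", "+", 100, 500),
    ("chr2L", "-", 800, 1200),
    ("chrX", "+", 50, 90),
]
genscan = [
    ("chr2L", "+", 450, 900),
    ("chrX", "+", 90, 200),
]

shared = fraction_overlapping(augustus, genscan)
print(round(shared, 3))
```

The nested loop is quadratic, which is fine for a toy example. For the full tables, sorting by chromosome and start position or using an interval tree would scale better.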

Working back from the refseq-summary.csv file, you can look at genes that code for particular proteins - can you find these genes in the genome?
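A minimal sketch of that lookup, using only the standard library: search the summary text for a keyword, then use the matching accessions to pull coordinates from the gene table. It assumes that the name column of genes-refseq.csv holds the same accession as mrnaacc in refseq-summary.csv, which should be checked against the real files. The rows and accessions below are invented.

```python
import csv
import io


def find_gene_locations(summary_handle, genes_handle, keyword):
    """Return (accession, chrom, txstart, txend) for summaries matching keyword."""
    keyword = keyword.lower()
    wanted = {
        row["mrnaacc"]
        for row in csv.DictReader(summary_handle)
        if keyword in row["summary"].lower()
    }
    return [
        (row["name"], row["chrom"], int(row["txstart"]), int(row["txend"]))
        for row in csv.DictReader(genes_handle)
        if row["name"] in wanted
    ]


# Invented stand-ins for refseq-summary.csv and genes-refseq.csv.
summary_csv = io.StringIO(
    "mrnaacc,completeness,summary\n"
    "NM_000001,Unknown,Hypothetical kinase involved in wing development\n"
    "NM_000002,Unknown,Hypothetical odorant receptor\n"
)
genes_csv = io.StringIO(
    "name,chrom,strand,txstart,txend\n"
    "NM_000001,chr2L,+,1000,5000\n"
    "NM_000002,chr3R,-,200,900\n"
)

locations = find_gene_locations(summary_csv, genes_csv, "kinase")
print(locations)
```

With coordinates in hand, the matching stretch of genome.fa can be sliced out of the corresponding chromosome sequence.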

How much of the genome codes for the mRNAs found in our mRNA data? Of the mRNAs we have, how many map to the predicted genes and the predicted peptide sequence data? How much of the mRNA seems to be protein-coding vs how much looks like it is miRNA? Can you find pre-mRNA or splice variants within the mRNA data? Does meta information like cytogenetic bands or CpG sites correspond with splice variants or a lack of mRNA altogether?
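For the first of these questions, one possible starting point is the exonstarts and exonends columns of the gene tables. The sketch below assumes they are comma-separated lists of half-open coordinates, possibly with a trailing comma, as in the UCSC genePred format. It merges overlapping exons from different transcripts so that shared bases are counted once. The helper names and example values are made up.

```python
def parse_coords(field):
    """Turn a string such as '100,300,' into [100, 300]."""
    return [int(part) for part in field.split(",") if part.strip()]


def exon_intervals(exonstarts, exonends):
    """Pair up exon starts and ends for one transcript."""
    starts, ends = parse_coords(exonstarts), parse_coords(exonends)
    if len(starts) != len(ends):
        raise ValueError("exonstarts and exonends differ in length")
    return list(zip(starts, ends))


def covered_bases(intervals):
    """Total length covered by half-open intervals, counting overlaps once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Two made-up transcripts on the same chromosome with overlapping first exons.
transcripts = [("100,300,", "200,400,"), ("150,500,", "250,600,")]
all_exons = []
for starts, ends in transcripts:
    all_exons.extend(exon_intervals(starts, ends))

exonic = covered_bases(all_exons)
print(exonic)
```

Intervals from different chromosomes must be merged separately. Dividing the per-chromosome totals by the chromosome lengths from genome.fa gives a rough exonic fraction.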

Those are just some of many ideas that could get you started.

Looking for Feedback

This is the first genomic dataset on Kaggle and we are looking for feedback from our community about how interesting this dataset is to them, or if there are ways we could improve it to better suit analysis. Please post suggestions for supplementary data, future genomes we could host, bioinformatics packages we should include on scripts, and any other feedback on the dataset forum.

Tables

Ensembl Gtp

@kaggle.mylesoneill_drosophila_melanogaster_genome.ensembl_gtp

610.96 KB
34681 rows
3 columns


CREATE TABLE ensembl_gtp (
  "gene" VARCHAR,
  "transcript" VARCHAR,
  "protein" VARCHAR
);

Ensembl Pep

@kaggle.mylesoneill_drosophila_melanogaster_genome.ensembl_pep

15.12 MB
30318 rows
2 columns


CREATE TABLE ensembl_pep (
  "name" VARCHAR,
  "seq" VARCHAR
);

Ensembl Source

@kaggle.mylesoneill_drosophila_melanogaster_genome.ensembl_source

239.33 KB
34681 rows
2 columns


CREATE TABLE ensembl_source (
  "name" VARCHAR,
  "source" VARCHAR
);

Ensembl To Gene Name

@kaggle.mylesoneill_drosophila_melanogaster_genome.ensembl_to_gene_name

408.02 KB
34681 rows
2 columns


CREATE TABLE ensembl_to_gene_name (
  "name" VARCHAR,
  "value" VARCHAR
);

Genes Augustus

@kaggle.mylesoneill_drosophila_melanogaster_genome.genes_augustus

1.38 MB
13509 rows
16 columns


CREATE TABLE genes_augustus (
  "bin" BIGINT,
  "name" VARCHAR,
  "chrom" VARCHAR,
  "strand" VARCHAR,
  "txstart" BIGINT,
  "txend" BIGINT,
  "cdsstart" BIGINT,
  "cdsend" BIGINT,
  "exoncount" BIGINT,
  "exonstarts" VARCHAR,
  "exonends" VARCHAR,
  "score" BIGINT,
  "name2" VARCHAR,
  "cdsstartstat" VARCHAR,
  "cdsendstat" VARCHAR,
  "exonframes" VARCHAR
);

Genes Ensembl

@kaggle.mylesoneill_drosophila_melanogaster_genome.genes_ensembl

2.47 MB
34681 rows
16 columns


CREATE TABLE genes_ensembl (
  "bin" BIGINT,
  "name" VARCHAR,
  "chrom" VARCHAR,
  "strand" VARCHAR,
  "txstart" BIGINT,
  "txend" BIGINT,
  "cdsstart" BIGINT,
  "cdsend" BIGINT,
  "exoncount" BIGINT,
  "exonstarts" VARCHAR,
  "exonends" VARCHAR,
  "score" BIGINT,
  "name2" VARCHAR,
  "cdsstartstat" VARCHAR,
  "cdsendstat" VARCHAR,
  "exonframes" VARCHAR
);

Genes Genscan

@kaggle.mylesoneill_drosophila_melanogaster_genome.genes_genscan

1.51 MB
16274 rows
11 columns


CREATE TABLE genes_genscan (
  "bin" BIGINT,
  "name" VARCHAR,
  "chrom" VARCHAR,
  "strand" VARCHAR,
  "txstart" BIGINT,
  "txend" BIGINT,
  "cdsstart" BIGINT,
  "cdsend" BIGINT,
  "exoncount" BIGINT,
  "exonstarts" VARCHAR,
  "exonends" VARCHAR
);

Genes Refseq

@kaggle.mylesoneill_drosophila_melanogaster_genome.genes_refseq

3.13 MB
36099 rows
16 columns


CREATE TABLE genes_refseq (
  "bin" BIGINT,
  "name" VARCHAR,
  "chrom" VARCHAR,
  "strand" VARCHAR,
  "txstart" BIGINT,
  "txend" BIGINT,
  "cdsstart" BIGINT,
  "cdsend" BIGINT,
  "exoncount" BIGINT,
  "exonstarts" VARCHAR,
  "exonends" VARCHAR,
  "score" BIGINT,
  "name2" VARCHAR,
  "cdsstartstat" VARCHAR,
  "cdsendstat" VARCHAR,
  "exonframes" VARCHAR
);

Genes Xeno Refseq

@kaggle.mylesoneill_drosophila_melanogaster_genome.genes_xeno_refseq

3.15 MB
51168 rows
16 columns


CREATE TABLE genes_xeno_refseq (
  "bin" BIGINT,
  "name" VARCHAR,
  "chrom" VARCHAR,
  "strand" VARCHAR,
  "txstart" BIGINT,
  "txend" BIGINT,
  "cdsstart" BIGINT,
  "cdsend" BIGINT,
  "exoncount" BIGINT,
  "exonstarts" VARCHAR,
  "exonends" VARCHAR,
  "score" BIGINT,
  "name2" VARCHAR,
  "cdsstartstat" VARCHAR,
  "cdsendstat" VARCHAR,
  "exonframes" VARCHAR
);

Meta Cpg Island Ext Unmasked

@kaggle.mylesoneill_drosophila_melanogaster_genome.meta_cpg_island_ext_unmasked

646.22 KB
27513 rows
11 columns


CREATE TABLE meta_cpg_island_ext_unmasked (
  "bin" BIGINT,
  "chrom" VARCHAR,
  "chromstart" BIGINT,
  "chromend" BIGINT,
  "name" VARCHAR,
  "length" BIGINT,
  "cpgnum" BIGINT,
  "gcnum" BIGINT,
  "percpg" DOUBLE,
  "pergc" DOUBLE,
  "obsexp" DOUBLE
);

Meta Cytoband

@kaggle.mylesoneill_drosophila_melanogaster_genome.meta_cytoband

121.65 KB
6917 rows
5 columns


CREATE TABLE meta_cytoband (
  "chrom" VARCHAR,
  "chromstart" BIGINT,
  "chromend" BIGINT,
  "name" VARCHAR,
  "giestain" VARCHAR
);

Meta Simple Repeat

@kaggle.mylesoneill_drosophila_melanogaster_genome.meta_simple_repeat

1.3 MB
35076 rows
17 columns


CREATE TABLE meta_simple_repeat (
  "bin" BIGINT,
  "chrom" VARCHAR,
  "chromstart" BIGINT,
  "chromend" BIGINT,
  "name" VARCHAR,
  "period" BIGINT,
  "copynum" DOUBLE,
  "consensussize" BIGINT,
  "permatch" BIGINT,
  "perindel" BIGINT,
  "score" BIGINT,
  "a" BIGINT,
  "c" BIGINT,
  "g" BIGINT,
  "t" BIGINT,
  "entropy" DOUBLE,
  "sequence" VARCHAR
);

Refseq Link

@kaggle.mylesoneill_drosophila_melanogaster_genome.refseq_link

16.23 MB
444496 rows
8 columns


CREATE TABLE refseq_link (
  "name" VARCHAR,
  "product" VARCHAR,
  "mrnaacc" VARCHAR,
  "protacc" VARCHAR,
  "genename" BIGINT,
  "prodname" BIGINT,
  "locuslinkid" BIGINT,
  "omimid" BIGINT
);

Refseq Summary

@kaggle.mylesoneill_drosophila_melanogaster_genome.refseq_summary

9.95 MB
135886 rows
3 columns


CREATE TABLE refseq_summary (
  "mrnaacc" VARCHAR,
  "completeness" VARCHAR,
  "summary" VARCHAR
);