Baselight

COVID-19/SARS B-cell Epitope Prediction

A simple dataset for epitope prediction used in vaccine development

@kaggle.futurecorporation_epitope_prediction

Loading...
Loading...

About this Dataset

COVID-19/SARS B-cell Epitope Prediction

Introduction

Due to spread of COVID-19, vaccine development is beingdemanded as soon as possible. Despite the importance of data analysis in vaccine development, there are not many simple data sets that data analysts can handle. We published the dataset and the sample code for Bcell epitope prediction, one of the key research topics in vaccine development, available for free. This dataset was developed during our research process and the data contained in it was obtained from IEDB and UniProt. We would like to express our deepest gratitude to everyone who helped us. We briefly describe the B-cell epitope predictions covered by this dataset.  For details, please refer to our paper, our blog (only in Japanese), and others. B-cells inducing antigen-specific immune responses in vivo produce large amounts of antigen-specific antibodies by recognizing the subregions (epitope regions) of antigen proteins. They can inhibit their functioning by binding antibodies to antigen proteins. Predicting of epitope regions is beneficial for the design and development of vaccines aimed to induce antigen-specific antibody production. We believe that this dataset and code will be widely useful not only for COVID-19 but also for future medical data analysis.

The data: Information on whether or not an amino acid peptide exhibited antibody-inducing activity (marked by an activity label) could be obtained from IEDB, which was used in many previous studies. Accordingly, this information was used as the label data. We also obtained the epitope candidate amino acid sequences (peptides) and the activity label data from the B-cell epitope data provided in IEDB. The presented antibody proteins were restricted to IgG that constituted the most recorded type in IEDB. For convenience, we excluded records representing different quantitative measures of antibody activity for the same peptide from experiments. The epitope data obtained from IEDB corresponded to the five types of activity: "Positive-High," "Positive-Intermediate," "Positive-Low," "Positive," and "Negative." However, due to the limited number of data elements marked with the "Positive-High," "Positive-Intermediate," and "Positive-Low" labels, we equally considered these labels as "Positive", thereby attributing the task to a binary estimation.

Content

This contains three data files:

  • input_bcell.csv : this is our main training data. The number of rows is 14387 for all combinations of 14362 peptides and 757 proteins.
  • input_sars.csv : this is also our main training data. The number of rows is 520.
  • input_covid.csv : this is our target data. there is no label data in columns.

All of three datasets consists of information of protein and peptide:

  • parent_protein_id : parent protein ID
  • protein_seq : parent protein sequence
  • start_position : start position of peptide
  • end_position : end position of peptide
  • peptide_seq : peptide sequence
  • chou_fasman : peptide feature, β turn
  • emini : peptide feature, relative surface accessibility
  • kolaskar_tongaonkar : peptide feature, antigenicity
  • parker : peptide feature, hydrophobicity
  • isoelectric_point : protein feature
  • aromacity: protein feature
  • hydrophobicity : protein feature
  • stability : protein feature

and bcell and sars dataset have antibody valence(target value):

  • target : antibody valence (target value)

Relevant Papers

Epitope Prediction of Antigen Protein using Attention-Based LSTM Network (2020, bioRxiv)

Acknowledgements

Data is provided from The Immune Epitope Database(IEDB) and UniProt.

Inspiration

Automated methods for B-cell epitope prediction.Machine learning helps rapid vaccine development.

Disclaimer

We make no warranties regarding the correctness of the data, and disclaim liability for damages resulting from its use. We cannot provide unrestricted permission regarding the use of the data, as some data may be covered by patents or other rights.

Tables

Input Bcell

@kaggle.futurecorporation_epitope_prediction.input_bcell
  • 794.5 KB
  • 14387 rows
  • 14 columns
Loading...

CREATE TABLE input_bcell (
  "parent_protein_id" VARCHAR,
  "protein_seq" VARCHAR,
  "start_position" BIGINT,
  "end_position" BIGINT,
  "peptide_seq" VARCHAR,
  "chou_fasman" DOUBLE,
  "emini" DOUBLE,
  "kolaskar_tongaonkar" DOUBLE,
  "parker" DOUBLE,
  "isoelectric_point" DOUBLE,
  "aromaticity" DOUBLE,
  "hydrophobicity" DOUBLE,
  "stability" DOUBLE,
  "target" BIGINT
);

Input Covid

@kaggle.futurecorporation_epitope_prediction.input_covid
  • 387.29 KB
  • 20312 rows
  • 13 columns
Loading...

CREATE TABLE input_covid (
  "parent_protein_id" VARCHAR,
  "protein_seq" VARCHAR,
  "start_position" BIGINT,
  "end_position" BIGINT,
  "peptide_seq" VARCHAR,
  "chou_fasman" DOUBLE,
  "emini" DOUBLE,
  "kolaskar_tongaonkar" DOUBLE,
  "parker" DOUBLE,
  "isoelectric_point" DOUBLE,
  "aromaticity" DOUBLE,
  "hydrophobicity" DOUBLE,
  "stability" DOUBLE
);

Input Sars

@kaggle.futurecorporation_epitope_prediction.input_sars
  • 37.53 KB
  • 520 rows
  • 14 columns
Loading...

CREATE TABLE input_sars (
  "parent_protein_id" VARCHAR,
  "protein_seq" VARCHAR,
  "start_position" BIGINT,
  "end_position" BIGINT,
  "peptide_seq" VARCHAR,
  "chou_fasman" DOUBLE,
  "emini" DOUBLE,
  "kolaskar_tongaonkar" DOUBLE,
  "parker" DOUBLE,
  "isoelectric_point" DOUBLE,
  "aromaticity" DOUBLE,
  "hydrophobicity" DOUBLE,
  "stability" DOUBLE,
  "target" BIGINT
);

Share link

Anyone who has the link will be able to view this.