Name: Protein Secondary Sequence
Creator: Kaggle
License: http://opendatacommons.org/licenses/dbcl/1.0/

About this Dataset

# Introduction
Once a protein's 3D structure has been solved by X-ray crystallography or NMR, its secondary structure can be calculated from the atoms' 3D coordinates. DSSP is the tool commonly used for this calculation; it assigns one of the following secondary structure types (https://swift.cmbi.umcn.nl/gv/dssp/index.html) to every amino acid in a protein:

C: Loops and irregular elements (corresponding to the blank characters output by DSSP)
E: β-strand
H: α-helix
B: β-bridge
G: 3-helix (3₁₀-helix)
I: π-helix
T: Turn
S: Bend

However, solving a structure by X-ray crystallography or NMR is expensive. Ideally, we would like to predict a protein's secondary structure directly from its primary sequence, a problem with a long history.

# Dataset

The main dataset lists peptide sequences and their corresponding secondary structures.
Description of columns:

pdb_id: the ID used to locate the entry in the PDB
seq: the sequence of the peptide
sst3: the three-state (Q3) secondary structure
sst8: the eight-state (Q8) secondary structure
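
Because DSSP assigns one label to every amino acid, the seq, sst3 and sst8 strings of a row would be expected to have the same length. The following is a minimal sketch of a per-row sanity check under that assumption; the function name `check_record` and the toy row are hypothetical and not taken from the dataset.

```python
# Hypothetical sanity check for one row; assumes one label per residue.
Q8_STATES = set("HBEGITSC")  # eight-state alphabet
Q3_STATES = set("HEC")       # three-state alphabet


def check_record(seq: str, sst3: str, sst8: str) -> list[str]:
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    if not (len(seq) == len(sst3) == len(sst8)):
        problems.append("length mismatch between seq, sst3 and sst8")
    if set(sst3) - Q3_STATES:
        problems.append("unexpected characters in sst3")
    if set(sst8) - Q8_STATES:
        problems.append("unexpected characters in sst8")
    return problems


# Toy row made up for illustration, not a real PDB entry.
print(check_record("MKVLA", "CHHHC", "CHHHC"))  # prints []
```

Rows that fail such a check may simply follow a convention this sketch does not anticipate, so any reported problem is best inspected by hand before rows are dropped.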

DSSP8 is a secondary structure dataset with eight states (H, B, E, G, I, T, S, C). It contains 5877 non-redundant chains (25%) and was created with DSSP and PDB_select. The three-state reduction is: H = (G, H, I); E = (B, E); C = (T, S, C).
Dataset link: PDB, 31-03-2018
Source site: ccPDB (Compilation and Creation of datasets from PDB)
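
The reduction from eight states to three can be written as a simple character translation. This is a minimal sketch of the mapping H = (G, H, I); E = (B, E); C = (T, S, C); the helper name `q8_to_q3` and the example string are illustrative assumptions, not part of the dataset tooling.

```python
# Three-state reduction: G, H, I -> H; B, E -> E; T, S, C -> C.
Q8_TO_Q3 = str.maketrans("GHIBETSC", "HHHEECCC")


def q8_to_q3(sst8: str) -> str:
    """Collapse an eight-state string into the three-state alphabet."""
    return sst8.translate(Q8_TO_Q3)


# Made-up eight-state string, used only to show the mapping.
print(q8_to_q3("CSTTHHHHGGGEEBC"))  # prints CCCCHHHHHHHEEEC
```

Applying `q8_to_q3` to a row's sst8 value and comparing the result with its sst3 value is one way to confirm that the two columns follow this mapping.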

Tables

Pdb 31–07–2011

@kaggle.tamzidhasan_protein_secondary_sequence.pdb_31_07_2011

7.84 MB
17608 rows
5 columns


CREATE TABLE pdb_31_07_2011 (
  "unnamed_0" BIGINT,
  "pdb_id" VARCHAR,
  "seq" VARCHAR,
  "sst3" VARCHAR,
  "sst8" VARCHAR
);

Pdb 31–12–2012

@kaggle.tamzidhasan_protein_secondary_sequence.pdb_31_12_2012

2.37 MB
5877 rows
5 columns


CREATE TABLE pdb_31_12_2012 (
  "unnamed_0" BIGINT,
  "pdb_id" VARCHAR,
  "seq" VARCHAR,
  "sst3" VARCHAR,
  "sst8" VARCHAR
);
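
To experiment with the schema locally, the table definition can be recreated in an in-memory SQLite database. This is a sketch only: the use of SQLite, the inserted toy row, and the `helix_fraction` summary are assumptions for illustration, and loading the real rows is left out.

```python
import sqlite3

# In-memory database; the schema mirrors the CREATE TABLE statement above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pdb_31_12_2012 (
      "unnamed_0" BIGINT,
      "pdb_id" VARCHAR,
      "seq" VARCHAR,
      "sst3" VARCHAR,
      "sst8" VARCHAR
    )
    """
)

# Toy row made up for illustration, not a real PDB entry.
conn.execute(
    "INSERT INTO pdb_31_12_2012 VALUES (?, ?, ?, ?, ?)",
    (0, "XXXX", "MKVLA", "CHHHC", "CHHHC"),
)

# Fraction of residues labelled H in the three-state column, per entry.
rows = conn.execute("SELECT pdb_id, sst3 FROM pdb_31_12_2012").fetchall()
helix_fraction = {pdb_id: sst3.count("H") / len(sst3) for pdb_id, sst3 in rows}
print(helix_fraction)  # prints {'XXXX': 0.6}
```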