Protein Secondary Structure - 2022

Protein sequences and assigned secondary structures

@kaggle.kirkdco_protein_secondary_structure_2022

About this Dataset

Acknowledgement

This dataset is an update of the Protein Secondary Structure Dataset. I am indebted to alfrandom for the original work, for the wonderful GitHub repository that was used to create this update, and for permission to create it. I hope that this update is helpful and extends the original in useful and meaningful ways.

Background

Proteins are the operational units of life and perform a wide range of fundamental functions, from enzymatic catalysis and immune function to movement and structure. The most basic description of a protein is its primary amino acid sequence: the ordered chain of individual subunits that make up the protein. The sequence of amino acids folds into a few fundamental shapes termed secondary structures. From there, the secondary structures fold further into the 3-dimensional shape of the protein, the tertiary structure. Further, multiple individual proteins may group together into a final functional unit, the quaternary structure.

While AlphaFold has made huge strides toward solving the problem of predicting 3D protein structure just from the primary sequence, it has benefited from decades of work in crystallography and structural biology that created a rich collection of X-ray, cryo-electron microscopy, and NMR structures, as well as fundamental research into how proteins evolve.

Dataset

This dataset provides a collection of protein sequences and their secondary structures observed in 3D crystal structures.

The original dataset was created in 2018 and consisted of 9078 sequences with lengths ranging from 20 to 1632 amino acids. In this update, I have used the latest data from the RCSB-PDB (as of 6 August 2022) and relaxed some of the criteria used for data culling. Specifically, the original dataset used cutoffs of 25% identity between any pair of sequences and 2.0 Angstrom resolution for the crystal structure. In this update, the cutoffs listed below were used, yielding sequences of at least 40 amino acids in length.

2022-08-06 Dataset
UPDATE
I found that this dataset is not current as of 2022-08-06, but instead reflects the PDB as of sometime in July 2020. The ss.txt file downloadable from PDB is dated July 2020 and is no longer updated with new information. I've left the file names here as they correspond to the culled file list, but the actual sequence and structure content is 2 years older.

| Percent Identity | Resolution (Å) | Number of Sequences |
| --- | --- | --- |
| 25% | 2.0 | 7320 |
| 25% | 2.5 | 9646 |
| 30% | 2.5 | 13406 |

UPDATE
I developed new code to download all PDB files in the culled lists (15500+ structures; about 150 could not be downloaded) using BioPython. I then generated all the SST3 and SST8 structural information using BioPython and DSSP. This added over 1000 structures to each file. All code will be updated on the pdb-secondary-structure-2022 GitHub repository. The PDB structures will not be included due to space limitations (10 GB uncompressed).

| Percent Identity | Resolution (Å) | Number of Sequences |
| --- | --- | --- |
| 25% | 2.0 | 8313 |
| 25% | 2.5 | 10931 |
| 30% | 2.5 | 15080 |

Files and Column Descriptions

2022-08-06-pdbintersect-pisces_pc25_r2.0.csv
2022-08-06-pdbintersect-pisces_pc25_r2.5.csv
2022-08-06-pdbintersect-pisces_pc30_r2.5.csv

2022-12-17-pdbintersect-pisces_pc25_r2.0.csv
2022-12-17-pdbintersect-pisces_pc25_r2.5.csv
2022-12-17-pdbintersect-pisces_pc30_r2.5.csv

These files contain the subset of sequences and secondary structures culled based upon specific percent identity and structure resolution cutoffs. PISCES lists were used to create the datasets provided here.

More on Secondary Structure

SST-8 and SST-3 classifications are provided for each protein sequence. SST-8 consists of 8 categories of secondary structure based upon geometric rules of classification. SST-3 groups similar SST-8 categories into a simpler and more general set of structures.

| SST-8 Category | Description | SST-3 Category |
| --- | --- | --- |
| E | β-strand | E |
| B | β-bridge | E |
| H | α-helix | H |
| G | 3-helix | H |
| I | π-helix | H |
| C | Loops and irregular elements | C |
| T | Turn | C |
| S | Bend | C |
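
The reduction from SST-8 to SST-3 in the table above amounts to a per-residue character lookup. The following is a minimal sketch of that lookup, not the code used to build the dataset; the function name `sst8_to_sst3` and the example string are hypothetical.

```python
# Lookup taken from the SST-8 / SST-3 table above.
SST8_TO_SST3 = {
    "E": "E", "B": "E",            # strand-like states
    "H": "H", "G": "H", "I": "H",  # helical states
    "C": "C", "T": "C", "S": "C",  # loops, turns, bends
}


def sst8_to_sst3(sst8: str) -> str:
    """Collapse an SST-8 string into SST-3, one character per residue."""
    try:
        return "".join(SST8_TO_SST3[code] for code in sst8)
    except KeyError as err:
        # Fail loudly on codes that are not in the table.
        raise ValueError(f"unexpected SST-8 code: {err.args[0]!r}") from None


# Hypothetical SST-8 string, one code per residue.
print(sst8_to_sst3("EBHGICTS"))  # → EEHHHCCC
```

Because the mapping is strictly one character to one character, the output always has the same length as the input, so it stays aligned with the amino acid sequence.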

Code

A fork (pdb-secondary-structure-2022) of the original GitHub repository was made. Updates are noted in the code and the README.md.

Tables

N 2022–08–03 Ss Cleaned

@kaggle.kirkdco_protein_secondary_structure_2022.n_2022_08_03_ss_cleaned
  • 85.26 MB
  • 477153 rows
  • 7 columns

CREATE TABLE n_2022_08_03_ss_cleaned (
  "pdb_id" VARCHAR,
  "chain_code" VARCHAR,
  "seq" VARCHAR,
  "sst8" VARCHAR,
  "sst3" VARCHAR,
  "len" BIGINT,
  "has_nonstd_aa" BOOLEAN
);
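
The `len` and `has_nonstd_aa` columns look like they can be derived from `seq`. The sketch below shows one way to recompute them as a sanity check. It assumes that "non-standard" means any character outside the 20 standard one-letter amino acid codes, and that `seq`, `sst8`, and `sst3` carry one character per residue; the dataset's exact rules are not stated here, so treat both as assumptions. The function names and example sequences are hypothetical.

```python
# Assumption: the 20 standard one-letter amino acid codes define "standard".
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


def describe_sequence(seq: str) -> dict:
    """Recompute length and a non-standard-residue flag from a sequence."""
    return {
        "len": len(seq),
        "has_nonstd_aa": any(aa not in STANDARD_AA for aa in seq),
    }


def lengths_agree(seq: str, sst8: str, sst3: str) -> bool:
    """Check that the sequence and both structure strings are aligned."""
    return len(seq) == len(sst8) == len(sst3)


# Hypothetical rows, not taken from the dataset.
print(describe_sequence("MKTAYIAKQR"))  # → {'len': 10, 'has_nonstd_aa': False}
print(describe_sequence("MKTXYIAKQR"))  # → {'len': 10, 'has_nonstd_aa': True}
print(lengths_agree("MKTAYIAKQR", "CCHHHHHHCC", "CCHHHHHHCC"))  # → True
```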

N 2022–08–06 Pdb Intersect Pisces Pc25 R2–0

@kaggle.kirkdco_protein_secondary_structure_2022.n_2022_08_06_pdb_intersect_pisces_pc25_r2_0
  • 3.13 MB
  • 7320 rows
  • 12 columns

CREATE TABLE n_2022_08_06_pdb_intersect_pisces_pc25_r2_0 (
  "pdb_id" VARCHAR,
  "chain_code" VARCHAR,
  "seq" VARCHAR,
  "sst8" VARCHAR,
  "sst3" VARCHAR,
  "len_x" BIGINT,
  "has_nonstd_aa" BOOLEAN,
  "len_y" BIGINT,
  "method" VARCHAR,
  "resol" DOUBLE,
  "rfac" DOUBLE,
  "freerfac" DOUBLE
);
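
With the columns above, the culled tables can be narrowed further, for example to a stricter resolution or a minimum chain length. The sketch below uses only the Python standard library and a small in-memory CSV. The sample rows, PDB identifiers, and `method` values are made up for illustration, and the assumption that the CSV header matches the column names in the schema has not been checked against the actual files.

```python
import csv
import io

# Hypothetical sample in the column layout above; every value is invented.
SAMPLE_CSV = """pdb_id,chain_code,seq,sst8,sst3,len_x,has_nonstd_aa,len_y,method,resol,rfac,freerfac
0AAA,A,ACDEFGHIKL,CCHHHHHHCC,CCHHHHHHCC,10,False,10,XRAY,1.8,0.18,0.21
0BBB,B,MNPQRSTVWY,CEEEEETTCC,CEEEEECCCC,10,False,10,XRAY,2.4,0.20,0.25
"""


def select_rows(handle, max_resol: float, min_len: int = 0) -> list:
    """Keep rows at or below a resolution cutoff and at or above a length."""
    reader = csv.DictReader(handle)
    return [
        row
        for row in reader
        if float(row["resol"]) <= max_resol and int(row["len_x"]) >= min_len
    ]


rows = select_rows(io.StringIO(SAMPLE_CSV), max_resol=2.0)
print([row["pdb_id"] for row in rows])  # → ['0AAA']
```

For a real file, replace the `io.StringIO` handle with `open(path, newline="")`. Rows with an empty `resol` value would raise a `ValueError` in this sketch and would need explicit handling.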

N 2022–08–06 Pdb Intersect Pisces Pc25 R2–5

@kaggle.kirkdco_protein_secondary_structure_2022.n_2022_08_06_pdb_intersect_pisces_pc25_r2_5
  • 4.17 MB
  • 9646 rows
  • 12 columns

CREATE TABLE n_2022_08_06_pdb_intersect_pisces_pc25_r2_5 (
  "pdb_id" VARCHAR,
  "chain_code" VARCHAR,
  "seq" VARCHAR,
  "sst8" VARCHAR,
  "sst3" VARCHAR,
  "len_x" BIGINT,
  "has_nonstd_aa" BOOLEAN,
  "len_y" BIGINT,
  "method" VARCHAR,
  "resol" DOUBLE,
  "rfac" DOUBLE,
  "freerfac" DOUBLE
);

N 2022–08–06 Pdb Intersect Pisces Pc30 R2–5

@kaggle.kirkdco_protein_secondary_structure_2022.n_2022_08_06_pdb_intersect_pisces_pc30_r2_5
  • 5.86 MB
  • 13406 rows
  • 12 columns

CREATE TABLE n_2022_08_06_pdb_intersect_pisces_pc30_r2_5 (
  "pdb_id" VARCHAR,
  "chain_code" VARCHAR,
  "seq" VARCHAR,
  "sst8" VARCHAR,
  "sst3" VARCHAR,
  "len_x" BIGINT,
  "has_nonstd_aa" BOOLEAN,
  "len_y" BIGINT,
  "method" VARCHAR,
  "resol" DOUBLE,
  "rfac" DOUBLE,
  "freerfac" DOUBLE
);

N 2022–12–17 Pdb Intersect Pisces Pc25 R2–0

@kaggle.kirkdco_protein_secondary_structure_2022.n_2022_12_17_pdb_intersect_pisces_pc25_r2_0
  • 3.45 MB
  • 8313 rows
  • 12 columns

CREATE TABLE n_2022_12_17_pdb_intersect_pisces_pc25_r2_0 (
  "pdb_id" VARCHAR,
  "chain_code" VARCHAR,
  "seq" VARCHAR,
  "sst8" VARCHAR,
  "sst3" VARCHAR,
  "len_x" BIGINT,
  "has_nonstd_aa" BOOLEAN,
  "len_y" BIGINT,
  "method" VARCHAR,
  "resol" DOUBLE,
  "rfac" DOUBLE,
  "freerfac" DOUBLE
);

N 2022–12–17 Pdb Intersect Pisces Pc25 R2–5

@kaggle.kirkdco_protein_secondary_structure_2022.n_2022_12_17_pdb_intersect_pisces_pc25_r2_5
  • 4.58 MB
  • 10931 rows
  • 12 columns

CREATE TABLE n_2022_12_17_pdb_intersect_pisces_pc25_r2_5 (
  "pdb_id" VARCHAR,
  "chain_code" VARCHAR,
  "seq" VARCHAR,
  "sst8" VARCHAR,
  "sst3" VARCHAR,
  "len_x" BIGINT,
  "has_nonstd_aa" BOOLEAN,
  "len_y" BIGINT,
  "method" VARCHAR,
  "resol" DOUBLE,
  "rfac" DOUBLE,
  "freerfac" DOUBLE
);

N 2022–12–17 Pdb Intersect Pisces Pc30 R2–5

@kaggle.kirkdco_protein_secondary_structure_2022.n_2022_12_17_pdb_intersect_pisces_pc30_r2_5
  • 6.39 MB
  • 15079 rows
  • 12 columns

CREATE TABLE n_2022_12_17_pdb_intersect_pisces_pc30_r2_5 (
  "pdb_id" VARCHAR,
  "chain_code" VARCHAR,
  "seq" VARCHAR,
  "sst8" VARCHAR,
  "sst3" VARCHAR,
  "len_x" BIGINT,
  "has_nonstd_aa" BOOLEAN,
  "len_y" BIGINT,
  "method" VARCHAR,
  "resol" DOUBLE,
  "rfac" DOUBLE,
  "freerfac" DOUBLE
);
