Multi-label classification
Dataset Description
Background
Enzymes are known to act on molecules that are structurally similar to their substrates. This behaviour is called promiscuity. Scientists working in drug discovery exploit it to design drugs that either block or promote biological actions. However, correctly predicting the EC class(es) of the substrates associated with enzymes remains a challenge in biology. Since there is no shortage of data, ML techniques can be applied to this problem.
Points to keep in mind
- A substrate molecule can belong to multiple EC classes at the same time, because the same molecule can participate in different types of reactions in biology
- The dataset is highly imbalanced in its labels
- An algorithm that can handle label imbalance is needed
- The smallest label count is 1 and the largest is 248
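
To make the imbalance point concrete, here is a minimal sketch of how per-label counts and per-label positive weights could be computed from a multi-hot label matrix. The toy matrix, the function names, and the negative-to-positive weighting scheme are assumptions for illustration only; they are not taken from the dataset or from any method its authors used.

```python
import numpy as np


def label_counts(Y):
    """Number of samples carrying each label (column sums of a multi-hot matrix)."""
    return np.asarray(Y).sum(axis=0)


def positive_weights(Y):
    """Negative-to-positive ratio per label, one common way to up-weight rare labels."""
    Y = np.asarray(Y)
    pos = Y.sum(axis=0)
    neg = Y.shape[0] - pos
    # Guard against division by zero for labels that have no positive samples.
    return neg / np.maximum(pos, 1)


# Hypothetical toy data: 4 molecules, 3 EC-class labels.
# Rows are molecules, columns are labels; a row may contain several 1s.
Y_toy = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
])

counts = label_counts(Y_toy)
weights = positive_weights(Y_toy)
print("label counts:", counts)
print("positive weights:", weights)
```

Weights of this kind can be passed to a loss function or classifier that accepts per-label weighting. Whether that is enough for labels that occur only once is something to check empirically.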
Content
There are 3 files, named mixed_desc.csv, mixed_ecfp.csv and mixed_fcfp.csv, containing chemical, structural and connectivity information.
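
As a starting point, the three files could be loaded with pandas roughly as follows. This is a sketch: the `load_feature_tables` helper and the assumption that the files sit in a single directory are hypothetical, and no column names are assumed because the description does not list them.

```python
from pathlib import Path

import pandas as pd

# Suffixes taken from the file names given in the description.
FEATURE_SETS = ("desc", "ecfp", "fcfp")


def load_feature_tables(data_dir="."):
    """Read every mixed_<name>.csv found in data_dir into a dict of DataFrames."""
    data_dir = Path(data_dir)
    tables = {}
    for name in FEATURE_SETS:
        path = data_dir / f"mixed_{name}.csv"
        # Skip files that are missing rather than failing on the first one.
        if path.exists():
            tables[name] = pd.read_csv(path)
    return tables
```

Calling `load_feature_tables("path/to/data")` returns a dictionary keyed by `desc`, `ecfp` and `fcfp`. Inspecting `df.shape` and `df.columns` on each table is a reasonable first step for finding out which columns hold features and which hold labels.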