Wikipedia Molecules Properties Dataset by Kaggle | Other

About this Dataset

Molecular Properties Dataset from Wikipedia

By Juan Jose

The Wikipedia Molecules - Properties Dataset is an extensive collection of molecular properties for various chemical substances obtained from Wikipedia articles. This dataset provides valuable information about the chemical structures and characteristics of these molecules, including their hydrophobicity, size, weight, and connectivity.

Each entry in the dataset represents a unique molecule and includes several key features covering its chemical structure and properties. For example, the Molecular Weight feature gives the mass of the molecule, obtained by summing the atomic weights of all its atoms. The Largest Chain feature gives the length of the longest chain of atoms in the molecule.

The dataset also includes Mannhold LogP, a logarithmic measure of hydrophobicity: higher values indicate a more water-repellent (lipophilic) compound, and lower values a more water-soluble one. It further includes descriptors such as Topological Polar Surface Area, which quantifies how much of a compound's surface can participate in polar interactions.

Other attributes provide insights into bonding patterns within molecules. For instance, Aromatic Bonds Count denotes how many bonds possess an aromatic character within the structure, while Largest Pi Chain indicates the length of chains formed solely by pi (π) bonds.

Moreover, various numerical measures are included to assess different aspects such as complexity (Fragment Complexity) or atomic connectivity (Vertex adjacency information magnitude). Information on hydrogen bond acceptors and donors further reveals potential sites for intermolecular interactions.

The dataset also includes element counts for each molecule, along with atomic polarizabilities and bond polarizabilities, which describe induced polarization behavior at the level of atoms and bonds.

To support drug-likeness evaluation, the dataset includes a Lipinski's Rule of Five column. It also provides formal charge columns (overall, positive, and negative) that indicate excess positive or negative charge.

The dataset further includes identifiers like the molecule's name and molecular formula, enabling easy reference and identification.

Overall, this Wikipedia Molecules - Properties Dataset offers a comprehensive collection of information on numerous chemical substances, making it a useful resource for fields of research such as pharmaceuticals and materials science.

How to use the dataset

Understanding the Columns:

The dataset consists of multiple columns, each representing a specific property or characteristic of a molecule. Some key columns include:

Molecule: The chemical structure of the molecule in text format.

Mannhold LogP: The logarithm of the partition coefficient, which measures hydrophobicity (higher values mean the compound is less soluble in water).

Molecular Weight: The mass of the molecule, calculated by summing up the atomic weights.

Atomic Polarizabilities: The ability of atoms to undergo induced polarization.

Topological Polar Surface Area: A measure of polar surface area.

Largest Pi Chain: The length of the largest chain of pi (π) bonds in the molecule.
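
As a worked illustration of the Molecular Weight definition above, the sketch below sums atomic weights over a molecular formula. The `ATOMIC_WEIGHTS` table (a small subset with rounded values) and the `molecular_weight` helper are assumptions made for illustration and are not part of the dataset. The parser only handles simple formulas such as C6H6, without parentheses or charges, and raises `KeyError` for elements outside the table.

```python
import re

# Rounded standard atomic weights for a few common elements (illustrative subset).
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06, "Cl": 35.45}

def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula like 'C2H6O' (no parentheses or charges)."""
    total = 0.0
    # Each match is an element symbol followed by an optional count.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total

print(round(molecular_weight("H2O"), 3))   # water   → 18.015
print(round(molecular_weight("C6H6"), 3))  # benzene → 78.114
```

Values computed this way may differ slightly from the dataset's Molecular Weight column, depending on the atomic weights the original tooling used.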

Data Exploration:

Before starting any analysis or modeling task, explore the data thoroughly. Review each column's contents and data type, and check for missing or non-numeric values in columns that are described as numeric.
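
A first pass over the columns could look like the sketch below, which uses only the Python standard library. The `summarize_columns` helper is hypothetical, and the sample rows are made up to resemble `properties.csv`; in practice the `io.StringIO` object would be replaced by the opened CSV file.

```python
import csv
import io

def summarize_columns(rows):
    """For each column, count non-empty values and how many of them parse as floats."""
    summary = {}
    for row in rows:
        for name, value in row.items():
            stats = summary.setdefault(name, {"non_empty": 0, "numeric": 0})
            if value is None or value.strip() == "":
                continue  # treat blank cells as missing
            stats["non_empty"] += 1
            try:
                float(value)
                stats["numeric"] += 1
            except ValueError:
                pass  # text value; counted as non-empty only
    return summary

# Made-up sample; replace with open("properties.csv", newline="") for the real file.
sample = io.StringIO(
    "molecule_name,molecular_weight,xlogp\n"
    "water,18.015,-0.5\n"
    "benzene,78.114,2.1\n"
    "unknown,,n/a\n"
)
rows = list(csv.DictReader(sample))
for name, stats in summarize_columns(rows).items():
    print(name, stats)
```

A column where `numeric` is lower than `non_empty` contains text entries that need cleaning before any numeric analysis.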

Identifying Patterns and Trends:

Look for patterns or trends within specific properties or across different combinations of properties that might be interesting for your research or analysis goals.

Statistical Analysis:

Use summary statistics such as the mean, median, and standard deviation to understand the central tendency and variation of each molecular property.
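
The summary statistics mentioned above can be computed with the standard `statistics` module, as in this minimal sketch. The `describe` helper is hypothetical and the input values are made up; missing entries are assumed to be represented as `None`.

```python
import statistics

def describe(values):
    """Return count, mean, median, and sample standard deviation, skipping None entries."""
    clean = [v for v in values if v is not None]
    return {
        "n": len(clean),
        "mean": statistics.mean(clean),
        "median": statistics.median(clean),
        "stdev": statistics.stdev(clean),  # sample standard deviation; needs at least 2 values
    }

# Made-up molecular weights, with one missing value.
molecular_weights = [18.0, 78.0, 46.0, None, 180.0, 58.0]
print(describe(molecular_weights))
```

Comparing the mean with the median is a quick way to spot skew: here a single heavy molecule pulls the mean well above the median.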

Visualization Techniques:

Use visualization techniques such as histograms, scatter plots, and box plots to gain further insight into the distributions of, and relationships between, molecular properties.
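
A plotting library is the usual choice for this step. As a dependency-free stand-in, the sketch below bins values and prints a text histogram, which shows the same binning logic a plotted histogram relies on. The `histogram` helper and the input values are illustrative assumptions.

```python
def histogram(values, bin_width):
    """Count values per fixed-width bin; keys are the lower edge of each bin."""
    counts = {}
    for v in values:
        lower = (v // bin_width) * bin_width  # floor to the start of the bin
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

# Made-up molecular weights.
weights = [18.0, 46.0, 58.0, 78.0, 92.0, 180.0, 194.0]
for lower, count in histogram(weights, 50).items():
    print(f"{lower:6.0f}-{lower + 50:<6.0f} {'#' * count}")
```

Empty bins are omitted from the output, so gaps in the printed ranges indicate ranges with no molecules.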

Feature Engineering:

Derive new features from existing ones, guided by domain knowledge, to improve the predictive power of your models.
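
As one possible illustration, the sketch below derives two ratio features from columns listed in the schema (`heavy_atoms_count`, `aromatic_atoms_count`, `hydrogen_bond_acceptors`, `hydrogen_bond_donors`). The feature names `aromatic_fraction` and `hbond_sites_per_heavy_atom` are hypothetical, and the input values are made up.

```python
def add_derived_features(row):
    """Return a copy of row with two ratio features; None when there are no heavy atoms."""
    heavy = row["heavy_atoms_count"]
    out = dict(row)
    out["aromatic_fraction"] = row["aromatic_atoms_count"] / heavy if heavy else None
    out["hbond_sites_per_heavy_atom"] = (
        (row["hydrogen_bond_acceptors"] + row["hydrogen_bond_donors"]) / heavy
        if heavy
        else None
    )
    return out

# Made-up values resembling a benzene-like and an ethanol-like entry.
benzene_like = {"heavy_atoms_count": 6, "aromatic_atoms_count": 6,
                "hydrogen_bond_acceptors": 0, "hydrogen_bond_donors": 0}
ethanol_like = {"heavy_atoms_count": 3, "aromatic_atoms_count": 0,
                "hydrogen_bond_acceptors": 1, "hydrogen_bond_donors": 1}
print(add_derived_features(benzene_like)["aromatic_fraction"])  # → 1.0
print(add_derived_features(ethanol_like)["hbond_sites_per_heavy_atom"])
```

Normalizing counts by molecule size in this way keeps large molecules from dominating a model simply because they have more atoms.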

Predictive Modelling:

Apply machine learning algorithms or statistical models to predict certain properties or outcomes based on molecular characteristics.
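
The simplest statistical model of this kind is a one-predictor linear regression. The sketch below fits it by ordinary least squares without external libraries; the `fit_line` helper is hypothetical and the data pairs are made up, so the fitted numbers say nothing about the real dataset.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # undefined if all x values are identical
    return slope, mean_y - slope * mean_x

# Made-up pairs: heavy atom count (x) versus molecular weight (y).
heavy_atoms = [1, 3, 6, 12]
mol_weights = [18.0, 46.0, 78.0, 180.0]
slope, intercept = fit_line(heavy_atoms, mol_weights)
print(f"predicted weight for 8 heavy atoms: {slope * 8 + intercept:.1f}")
```

For real work, hold out part of the data for evaluation and consider a dedicated machine learning library once more than one predictor is involved.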

Drug Discovery Applications:

Utilize the dataset for drug discovery purposes, such as identifying drug-like molecules based on Lipinski's Rule of Five or predicting molecular properties related to activity, toxicity, and solubility.
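
Lipinski's Rule of Five flags a molecule when its molecular weight exceeds 500 Da, its logP exceeds 5, it has more than 5 hydrogen bond donors, or it has more than 10 hydrogen bond acceptors. The sketch below counts violations from those four properties. The `lipinski_violations` helper and the example values are illustrative; how the dataset's own Lipinski's Rule of Five column is encoded is not stated in the description, so compare the two before relying on either.

```python
def lipinski_violations(molecular_weight, logp, h_bond_donors, h_bond_acceptors):
    """Count how many of the four Rule of Five criteria are violated."""
    checks = [
        molecular_weight > 500,  # molecular weight above 500 Da
        logp > 5,                # octanol-water logP above 5
        h_bond_donors > 5,       # more than 5 hydrogen bond donors
        h_bond_acceptors > 10,   # more than 10 hydrogen bond acceptors
    ]
    return sum(checks)

# Made-up property values for two hypothetical molecules.
print(lipinski_violations(180.2, 1.2, 1, 4))   # → 0
print(lipinski_violations(720.9, 6.3, 2, 12))  # → 3
```

A common reading of the rule treats molecules with at most one violation as drug-like, so a filter could keep rows where the count is 0 or 1.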

Chemical Similarity Analysis:

Conduct similarity analysis using molecular fingerprints, such as MACCS keys, to find compounds with similar structural features.
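
Fingerprint comparisons are commonly scored with the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number set in either. The sketch below assumes fingerprints are represented as sets of on-bit indices. The bit sets shown are hypothetical; real fingerprints would be generated from the Molecule column with a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Hypothetical on-bit sets for two molecules.
fp_1 = {1, 4, 7, 9}
fp_2 = {1, 4, 8, 9}
print(tanimoto(fp_1, fp_2))  # → 0.6
```

The score ranges from 0 (no shared bits) to 1 (identical fingerprints).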

Research Ideas

Drug Discovery: The dataset can be used to analyze the molecular properties of different chemical substances and identify potential drug candidates. By examining properties such as lipophilicity (Mannhold LogP) and molecular weight, researchers can assess the drug-likeness of molecules and prioritize compounds for further testing and development.

Structure-Property Relationship Analysis: The dataset provides a wide range of molecular properties, including atomic polarizabilities, bond count, and fragment complexity. Researchers can use this information to understand the relationship between chemical structure and various physical properties, such as polarizability or hydrogen bonding capacity.

Chemical Similarity Clustering: By comparing the molecular formulas or structures (Molecule), researchers can cluster similar compounds together to identify families of related chemical substances. This can be useful for studying the structure-activity relationship within a group of compounds or identifying structural features common among certain classes of chemicals.
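
The coarsest form of the grouping described above is to collect molecules that share a molecular formula, which surfaces candidate isomers. The sketch below does this with the `molecule_name` and `molecular_formula` columns from the schema; the `group_by_formula` helper is hypothetical and the rows are a small hand-written sample.

```python
from collections import defaultdict

def group_by_formula(rows):
    """Map each molecular formula to the names of molecules sharing it (candidate isomers)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["molecular_formula"]].append(row["molecule_name"])
    # Keep only formulas shared by more than one molecule.
    return {formula: names for formula, names in groups.items() if len(names) > 1}

rows = [
    {"molecule_name": "ethanol", "molecular_formula": "C2H6O"},
    {"molecule_name": "dimethyl ether", "molecular_formula": "C2H6O"},
    {"molecule_name": "benzene", "molecular_formula": "C6H6"},
]
print(group_by_formula(rows))  # → {'C2H6O': ['ethanol', 'dimethyl ether']}
```

A shared formula does not imply similar structure or behavior, so this grouping is only a starting point before structure-based clustering.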

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

Columns

File: properties.csv

Column name	Description
Molecule	The chemical structure of the molecule. (Text)
Molecule name	The name of the chemical substance. (Text)
Mannhold LogP	The logarithm of the partition coefficient, a measure of hydrophobicity, as estimated by the Mannhold method. (Numeric)
Atomic Polarizabilities	Polarizability of atoms in the molecule. (Numeric)
Aromatic Atoms Count	Number of aromatic atoms in the molecule. (Numeric)
Aromatic Bonds Count	Number of aromatic bonds in the molecule. (Numeric)
Element Count	Number of elements present in the molecule. (Numeric)
Bond Polarizabilities	Polarizability of bonds in the molecule. (Numeric)
Bond Count	Number of bonds in the molecule. (Numeric)
Fragment Complexity	Complexity of the molecular fragment. (Numeric)
VABC Volume Descriptor	Van der Waals volume of the molecule estimated from atomic and bond contributions (VABC). (Numeric)
Hydrogen Bond Acceptors	Number of hydrogen bond acceptor sites present in the molecule. (Numeric)
Hydrogen Bond Donors	Number of hydrogen bond donor sites present in the molecule. (Numeric)
Largest Chain	Length of the largest chain in the molecule. (Numeric)
Largest Pi Chain	Length of the largest pi chain in the molecule. (Numeric)
Petitjean Number	Topological shape index derived from the radius and diameter of the molecular graph. (Numeric)
Rotatable Bonds Count	Number of rotatable bonds in the molecule. (Numeric)
Lipinski's Rule of Five	Evaluation scores for compliance with Lipinski's Rule of Five, a rule used to assess drug-likeness. (Numeric)
Topological Polar Surface Area	A measure of the polar surface area of the molecule. (Numeric)
Molecular Weight	The mass of the molecule calculated by summing the atomic weights. (Numeric)
XLogP	The logarithm of the partition coefficient, a measure of hydrophobicity, as estimated by the XLogP method. (Numeric)
Molecular Formula	The chemical formula of the molecule. (Text)
Formal Charge	The overall formal charge of the molecule. (Numeric)
Formal Charge (pos)	The positive formal charge of the molecule. (Numeric)
Formal Charge (neg)	The negative formal charge of the molecule. (Numeric)
Heavy Atoms Count	The number of heavy atoms (non-hydrogen atoms) in the molecule. (Numeric)
Molar Mass	The molar mass of the molecule. (Numeric)
SP3 Character	The percentage of sp3 hybridization character in the carbon atoms of the molecule. (Numeric)
Rotatable Bonds Count (non terminal)	The number of rotatable bonds in the molecule, counting only non-terminal ones. (Numeric)

Acknowledgements

If you use this dataset in your research, please credit Juan Jose.

Tables

Properties

@kaggle.thedevastator_wikipedia_molecules_properties_dataset.properties

1.74 MB
15166 rows
34 columns


CREATE TABLE properties (
  "index" BIGINT,
  "row_id" VARCHAR,
  "molecule" VARCHAR,
  "molecule_name" VARCHAR,
  "mannhold_logp" DOUBLE,
  "atomic_polarizabilities" VARCHAR,
  "aromatic_atoms_count" BIGINT,
  "aromatic_bonds_count" BIGINT,
  "element_count" BIGINT,
  "bond_polarizabilities" VARCHAR,
  "bond_count" BIGINT,
  "eccentric_connectivity_index" DOUBLE,
  "fragment_complexity" DOUBLE,
  "vabc_volume_descriptor" VARCHAR,
  "hydrogen_bond_acceptors" BIGINT,
  "hydrogen_bond_donors" BIGINT,
  "largest_chain" BIGINT,
  "largest_pi_chain" BIGINT,
  "petitjean_number" DOUBLE,
  "rotatable_bonds_count" BIGINT,
  "lipinski_s_rule_of_five" BIGINT,
  "topological_polar_surface_area" VARCHAR,
  "vertex_adjacency_information_magnitude" DOUBLE,
  "molecular_weight" VARCHAR,
  "xlogp" DOUBLE,
  "zagreb_index" BIGINT,
  "molecular_formula" VARCHAR,
  "formal_charge" BIGINT,
  "formal_charge_pos" BIGINT,
  "formal_charge_neg" BIGINT,
  "heavy_atoms_count" BIGINT,
  "molar_mass" VARCHAR,
  "sp3_character" DOUBLE,
  "rotatable_bonds_count_non_terminal" BIGINT
);
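
Several columns that hold numbers, such as `molecular_weight`, are declared as VARCHAR in the schema above, so they need an explicit cast before numeric comparison. The sketch below demonstrates this with Python's built-in `sqlite3` module as a stand-in engine, since the platform's actual SQL engine and its casting functions are not stated here. It uses a three-column subset of the schema and made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Subset of the schema above; SQLite accepts these type names through type affinity.
conn.execute(
    'CREATE TABLE properties ("molecule_name" VARCHAR, "molecular_weight" VARCHAR, "xlogp" DOUBLE)'
)
# Made-up rows; note that molecular_weight is stored as text.
conn.executemany(
    "INSERT INTO properties VALUES (?, ?, ?)",
    [("water", "18.015", -0.5), ("benzene", "78.114", 2.1), ("glucose", "180.156", -2.6)],
)
# Cast the text column to a number so the filter and ordering are numeric, not textual.
rows_over_50 = conn.execute(
    "SELECT molecule_name FROM properties "
    "WHERE CAST(molecular_weight AS REAL) > 50 "
    "ORDER BY CAST(molecular_weight AS REAL)"
).fetchall()
print(rows_over_50)  # → [('benzene',), ('glucose',)]
conn.close()
```

Values that cannot be parsed as numbers would need separate handling, because a plain cast can silently turn them into 0 or NULL depending on the engine.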