https://github.com/glambard/molecules_dataset_collection

Collection of data sets of molecules for a validation of properties inference

classification dataset inference machine-learning molecule moleculenet properties rdkit smiles

# Collection of data sets of molecules and properties :gift: :smile:

## What is it?

- Inspired by [Moleculenet.ai](http://moleculenet.ai/)
- Selection of data sets of molecules (SMILES) and physicochemical properties

## Aim?

1. Standardize all SMILES in the data sets with [RDKit](http://www.rdkit.org)
2. Gather the data sets in one place. They are all here!
3. Use them to validate the inference of molecular properties with various machine learning models, as proposed in [Z. Wu et al.](https://arxiv.org/abs/1703.00564)

## Method?

- All data sets are regularized with the [RDKit](http://www.rdkit.org/docs/GettingStartedInPython.html) methods to output isomeric, canonical and kekulized SMILES ([Daylight](http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html))
- If a SMILES could not be regularized, the original SMILES is replaced by a blank in the data set
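
The regularization described above can be sketched as follows. This is a minimal illustration, not the script actually used to build the data sets: the function name `regularize_smiles` is hypothetical, and it assumes RDKit is installed. It relies only on the standard RDKit calls `Chem.MolFromSmiles`, `Chem.Kekulize` and `Chem.MolToSmiles`.

```python
from rdkit import Chem


def regularize_smiles(smiles):
    """Return an isomeric, canonical, kekulized SMILES, or "" on failure.

    Hypothetical helper: a sketch of the regularization described in this
    README, not the author's original code.
    """
    if not isinstance(smiles, str):
        return ""
    # MolFromSmiles returns None when the string cannot be parsed or sanitized.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return ""
    try:
        # Convert aromatic bonds to explicit single/double bonds.
        Chem.Kekulize(mol, clearAromaticFlags=True)
        return Chem.MolToSmiles(
            mol,
            isomericSmiles=True,  # keep stereochemistry and isotope labels
            canonical=True,       # one unique string per molecule
            kekuleSmiles=True,    # write the Kekule form
        )
    except Exception:
        # Any failure leaves a blank, as in the data sets here.
        return ""


if __name__ == "__main__":
    for raw in ["c1ccccc1", "OCC", "not_a_smiles"]:
        print(repr(raw), "->", repr(regularize_smiles(raw)))
```

Returning an empty string rather than raising keeps the output aligned row by row with the original data set, which matches the blank-on-failure convention stated above.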

## But what are these data sets?

- Quantum Mechanics: **QM9**
- Physical Chemistry: **ESOL**, **FreeSolv**, **Lipophilicity**
- Biophysics: **PCBA**, **HIV**, **BACE**
- Physiology: **BBBP**, **Tox21**, **ToxCast**, **SIDER**, **ClinTox**

From [Moleculenet.ai](http://moleculenet.ai/datasets-1), here are their short descriptions, with the inference task in square brackets (for the regularized data sets reported here):

* **QM9**: Geometric, energetic, electronic and thermodynamic properties of DFT-modelled small molecules [regression]

* **ESOL**: Water solubility data (log solubility in mols per litre) for common organic small molecules [regression]
* **FreeSolv**: Experimental and calculated hydration free energy of small molecules in water [regression]
* **Lipophilicity**: Experimental results of octanol/water distribution coefficient (logD at pH 7.4) [regression]

* **PCBA**: Selected from PubChem BioAssay, consisting of measured biological activities of small molecules generated by high-throughput screening [classification]
* **HIV**: Experimentally measured abilities to inhibit HIV replication [classification]
* **BACE**: Quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1) [classification/regression]

* **BBBP**: Binary labels of blood-brain barrier penetration (permeability) [classification]
* **Tox21**: Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways [classification]
* **ToxCast**: Toxicology data for a large library of compounds based on in vitro high-throughput screening, including experiments on over 600 tasks [classification]
* **SIDER**: Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes [classification]
* **ClinTox**: Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons [classification]
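
Because SMILES that could not be regularized are left blank, any loader for the data sets above should skip those rows before training a model. The sketch below uses only the Python standard library. The column name `smiles`, the helper `iter_valid_rows` and the in-memory sample are assumptions made for illustration; check the header of each file for the actual column names.

```python
import csv
import io


def iter_valid_rows(handle, smiles_column="smiles"):
    """Yield CSV rows (as dicts) whose SMILES field is not blank.

    The column name is an assumption; adapt it to the file being read.
    """
    for row in csv.DictReader(handle):
        smiles = (row.get(smiles_column) or "").strip()
        if smiles:
            yield row


# Tiny in-memory stand-in for a data set file (made-up values, illustration only).
sample = io.StringIO("smiles,label\nCCO,1\n,0\nC1=CC=CC=C1,0\n")
rows = list(iter_valid_rows(sample))
print([row["smiles"] for row in rows])  # ['CCO', 'C1=CC=CC=C1']
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open(path, newline="")`.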

# Citation

**Source:** [Moleculenet.ai](http://moleculenet.ai/)

**Paper:** Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande, *MoleculeNet: A Benchmark for Molecular Machine Learning*, [arXiv: 1703.00564, 2017 [cs.LG]](https://arxiv.org/abs/1703.00564)