https://github.com/glambard/molecules_dataset_collection
Collection of data sets of molecules for a validation of properties inference
classification dataset inference machine-learning molecule moleculenet properties rdkit smiles
Last synced: 4 months ago
- Host: GitHub
- URL: https://github.com/glambard/molecules_dataset_collection
- Owner: GLambard
- License: mit
- Created: 2018-06-18T02:11:26.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-06-18T04:52:22.000Z (over 6 years ago)
- Last Synced: 2024-10-09T22:06:54.581Z (4 months ago)
- Topics: classification, dataset, inference, machine-learning, molecule, moleculenet, properties, rdkit, smiles
- Size: 63.1 MB
- Stars: 97
- Watchers: 3
- Forks: 30
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Collection of data sets of molecules and properties :gift: :smile:
## What is it?
- Inspired by [Moleculenet.ai](http://moleculenet.ai/)
- Selection of data sets of molecules (SMILES) and physicochemical properties

## Aim?
1. SMILES in all data sets have been standardized with [RDKit](http://www.rdkit.org)
2. Gather the data sets in one place. They are all here!
3. Use them to validate the inference of molecular properties with various machine learning models, as proposed in [Z. Wu et al.](https://arxiv.org/abs/1703.00564)

## Method?
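
A minimal sketch of the regularization step described in this section: each SMILES is parsed with RDKit and written back as an isomeric, canonical, kekulized SMILES, and an empty string is kept wherever this fails. This is not necessarily the script used to build this collection. The function names, the lazy import, and the choice of `clearAromaticFlags=True` are assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def rdkit_regularize(smiles: str) -> str:
    """Return an isomeric, canonical, kekulized SMILES, or "" on failure."""
    # Assumption: RDKit is installed; it is imported lazily so that the
    # helper below can also be used with another regularizer.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)  # None when the SMILES cannot be parsed
    if mol is None:
        return ""
    try:
        # Convert aromatic bonds to explicit single/double bonds.
        Chem.Kekulize(mol, clearAromaticFlags=True)
        return Chem.MolToSmiles(
            mol, isomericSmiles=True, canonical=True, kekuleSmiles=True
        )
    except Exception:
        # Any RDKit failure is reported as a blank entry.
        return ""


def regularize_column(
    smiles_column: Iterable[str],
    regularize: Callable[[str], str] = rdkit_regularize,
) -> List[str]:
    """Regularize every entry, keeping one output per input row."""
    # Blank inputs stay blank; failures come back as "" from `regularize`.
    return [regularize(s) if s else "" for s in smiles_column]
```

With RDKit available, `regularize_column(["c1ccccc1", "not a smiles"])` should give a kekulized benzene SMILES for the first entry and an empty string for the second, so the output stays aligned row by row with the property columns.
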
- All data sets are regularized with [RDKit](http://www.rdkit.org/docs/GettingStartedInPython.html) methods to output isomeric, canonical and kekulized SMILES ([Daylight](http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html))
- If a SMILES could not be regularized, it is replaced by a blank entry in the corresponding row of the original data set

## But what are these data sets?
- Quantum Mechanics: **QM9**
- Physical Chemistry: **ESOL**, **FreeSolv**, **Lipophilicity**
- Biophysics: **PCBA**, **HIV**, **BACE**
- Physiology: **BBBP**, **Tox21**, **ToxCast**, **SIDER**, **ClinTox**

From [Moleculenet.ai](http://moleculenet.ai/datasets-1), here are their short descriptions, with the inference task in square brackets (for the regularized data sets reported here):
* **QM9**: Geometric, energetic, electronic and thermodynamic properties of DFT-modelled small molecules [regression]
* **ESOL**: Water solubility data (log solubility in mol per litre) for common organic small molecules [regression]
* **FreeSolv**: Experimental and calculated hydration free energy of small molecules in water [regression]
* **Lipophilicity**: Experimental results of octanol/water distribution coefficient (logD at pH 7.4) [regression]
* **PCBA**: Selected from PubChem BioAssay, consisting of measured biological activities of small molecules generated by high-throughput screening [classification]
* **HIV**: Experimentally measured abilities to inhibit HIV replication [classification]
* **BACE**: Quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1) [classification/regression]
* **BBBP**: Binary labels of blood-brain barrier penetration (permeability) [classification]
* **Tox21**: Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways [classification]
* **ToxCast**: Toxicology data for a large library of compounds based on in vitro high-throughput screening, including experiments on over 600 tasks [classification]
* **SIDER**: Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes [classification]
* **ClinTox**: Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons [classification]

# Citation
**Source:** [Moleculenet.ai](http://moleculenet.ai/)
**Paper:** Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande, *MoleculeNet: A Benchmark for Molecular Machine Learning*, [arXiv: 1703.00564, 2017 [cs.LG]](https://arxiv.org/abs/1703.00564)