{"id":15140812,"url":"https://github.com/glambard/molecules_dataset_collection","last_synced_at":"2025-10-23T18:30:25.269Z","repository":{"id":202193846,"uuid":"137701812","full_name":"GLambard/Molecules_Dataset_Collection","owner":"GLambard","description":"Collection of data sets of molecules for a validation of properties inference","archived":false,"fork":false,"pushed_at":"2018-06-18T04:52:22.000Z","size":66217,"stargazers_count":102,"open_issues_count":1,"forks_count":30,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-30T20:05:02.580Z","etag":null,"topics":["classification","dataset","inference","machine-learning","molecule","moleculenet","properties","rdkit","smiles"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GLambard.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-06-18T02:11:26.000Z","updated_at":"2025-01-21T22:19:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"cf6991cd-e8b7-4dd0-aba0-8d58f78110f8","html_url":"https://github.com/GLambard/Molecules_Dataset_Collection","commit_stats":null,"previous_names":["glambard/molecules_dataset_collection"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GLambard%2FMolecules_Dataset_Collection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GLambard%2FMolecules_Dataset_Collection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GLambard%2FMolecules_Dataset_Collection/releases","manifests_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/GLambard%2FMolecules_Dataset_Collection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GLambard","download_url":"https://codeload.github.com/GLambard/Molecules_Dataset_Collection/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237869080,"owners_count":19379263,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","dataset","inference","machine-learning","molecule","moleculenet","properties","rdkit","smiles"],"created_at":"2024-09-26T08:41:30.094Z","updated_at":"2025-10-23T18:30:15.246Z","avatar_url":"https://github.com/GLambard.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Collection of data sets of molecules and properties :gift: :smile:\n\n## What is it? \n\n- Inspired by [Moleculenet.ai](http://moleculenet.ai/)\n- Selection of data sets of molecules (SMILES) and physicochemical properties\n\n## Aim?\n\n1. Standardize all SMILES in the data sets with [RDKit](http://www.rdkit.org)\n2. Gather the data sets in one place. They are all here!\n3. Use them to validate the inference of molecular properties with various machine learning models, as proposed in [Z. 
Wu et al.](https://arxiv.org/abs/1703.00564)\n\n## Method?\n\n- All data sets are regularized with the [RDKit](http://www.rdkit.org/docs/GettingStartedInPython.html) methods to output isomeric, canonical and kekulized SMILES ([Daylight](http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html)) \n- If a SMILES could not be regularized, it is replaced by a blank entry relative to the original data set\n\n## But what are these data sets?\n\n- Quantum Mechanics: **QM9**\n- Physical Chemistry: **ESOL**, **FreeSolv**, **Lipophilicity**\n- Biophysics: **PCBA**, **HIV**, **BACE**\n- Physiology: **BBBP**, **Tox21**, **ToxCast**, **SIDER**, **ClinTox**\n\nFrom [Moleculenet.ai](http://moleculenet.ai/datasets-1), here are their short descriptions, with the inference task in square brackets (for the regularized data sets reported here): \n\n* **QM9**: Geometric, energetic, electronic and thermodynamic properties of DFT-modelled small molecules [regression]\n\n* **ESOL**: Water solubility data (log solubility in mols per litre) for common organic small molecules [regression]\n* **FreeSolv**: Experimental and calculated hydration free energy of small molecules in water [regression]\n* **Lipophilicity**: Experimental results of octanol/water distribution coefficient (logD at pH 7.4) [regression]\n\n* **PCBA**: Selected from PubChem BioAssay, consisting of measured biological activities of small molecules generated by high-throughput screening [classification]\n* **HIV**: Experimentally measured abilities to inhibit HIV replication [classification]\n* **BACE**: Quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1) [classification/regression]\n\n* **BBBP**: Binary labels of blood-brain barrier penetration (permeability) [classification]\n* **Tox21**: Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways [classification]\n* 
**ToxCast**: Toxicology data for a large library of compounds based on in vitro high-throughput screening, including experiments on over 600 tasks [classification]\n* **SIDER**: Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes [classification]\n* **ClinTox**: Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons [classification]\n\n# Citation\n\n**Source:** [Moleculenet.ai](http://moleculenet.ai/)\n\n**Paper:** Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande, *MoleculeNet: A Benchmark for Molecular Machine Learning*, [arXiv: 1703.00564, 2017 [cs.LG]](https://arxiv.org/abs/1703.00564)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglambard%2Fmolecules_dataset_collection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fglambard%2Fmolecules_dataset_collection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglambard%2Fmolecules_dataset_collection/lists"}