Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-chemistry-datasets

overview of datasets for ML in chemistry
https://github.com/kjappelbaum/awesome-chemistry-datasets


text datasets
- BC5CDR - chemical-disease interactions (named entity recognition)
- BioCreative V - chemical-disease interactions.
- BioRxiv XML - Bulk access to the full text of bioRxiv articles for the purposes of text and data mining (TDM) is available via a dedicated Amazon S3 resource.
- ChemTables - Licensed under CC BY-NC 3.0.
- Europe PMC - Bulk download of full text and SI of > 5 million articles.
- IUPAC Gold Book
- LibreText - open-access chemistry textbook.
- MedRxiv XML - Text and data mining is possible via dedicated Amazon S3 resource.
- NLM literature archive - text resource on chemicals in the biomedical literature. It contains 150 full-text journal articles selected both to be rich in chemical mentions and for articles where human annotation was expected to be most valuable.
- OpenStax - released under CC-BY 4.0.
- PubChemSTM
- PubMed
- PubMedQA
- PubMed Central - full-text archive
- S2ORC - corpus of academic papers spanning many academic disciplines (described as the largest publicly available collection of machine-readable academic text). Released under CC BY-NC 4.0.
- Elsevier Corpus - CC BY articles from across Elsevier’s journals, described as the first cross-discipline research data at this scale to support NLP and ML research.
structures
- COCONUT
- Crystallography Open Database - open-access collection of crystal structures of organic, inorganic, metal-organic compounds and minerals, excluding biopolymers. [They also derived SMILES for some compounds.](https://doi.org/10.1186/s13321-018-0279-6)
- Enamine HTS collection
- GDB
- GNPS
- MoNA
- nmrshiftdb2
- zinc20 - accelerated virtual screening
- zinc22 - available compounds for virtual screening
- nCov-Group Data Repository
molecular activity prediction benchmark datasets
- MPCD - lower-sample size and narrow-scaffold inhibitor datasets (LSSNS) and 30 higher-sample size and mixed-scaffold inhibitor datasets (HSSMS); each dataset is visualised by [TMAP](https://bidd-group.github.io/MPCD/dataset/HSSMS/MoleculeACE_benchmark/space/info/CHEMBL4792_Ki.html)
- MoleculeACE
ml structure-property benchmark datasets
- Aquasoldb
- BigSolDB
- BindingDB
- ChEBI-20 - description pairs (for molecule captioning task)
- ESol
- Flashpoint
- Harvard OPV - chemical calculations performed over a range of geometries, each with quantum chemical results using a variety of density functionals and basis sets
- Hydrogen Storage Materials Database
- ILThermo
- Leffingwell Odor Dataset - labeled odor descriptors from the Leffingwell PMP 2001 database
- Limiting activity coefficients - based transformer.
- Lipophilicity
- MD simulated monomer properties
- MoleculeNet - Benchmark suite that contains multiple datasets listed here
- oechem - data submitted under a CC BY (version 4.0) license
- Papyrus - DB combined with smaller datasets.
- QM Datasets
- SolProp - RS calculations and 10145 experimental solvation free energies (originally published as part of [this paper](https://arxiv.org/abs/2012.11730)).
- SOMAS - flow batteries.
- Therapeutic Data Commons
- ThermoML Archive
- LIT-PCBA - confidence PubChem Bioassay data.
- ACNet - cliffs and 380K non-AC MMPs from ChEMBL (version 28).
- FreeSolv
- Photoswitch Dataset
Target identification data
- Open Targets - scale resource that uses human genetics and genomics data for systematic drug target identification and prioritization.
- Probes & Drugs Portal
Pharmacology & ADME & Metabolism
- SIDER dataset - readable form despite the importance of research on drugs and their effects. Creation of this resource is related to the paper (Campillos, Kuhn et al., Science, 2008, 321(5886):263-6.) on the utilization of side effects for drug target prediction. Released under CC BY-NC-SA 4.0.
- Cell Effective Permeability (Caco-2) dataset
- Drug Indications Database (DID) - indication relations. It is intended to facilitate the building of practical, comprehensive, integrated drug ontologies.
- EPA CompTox
- Guide to PHARMACOLOGY - curated resource of ligand-activity-target relationships. It includes activity data even for data with unknown bioactivity value (under CC BY-SA 4.0).
- KEGG PATHWAY Database (KEGG) - resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
- LOTUS - organism pairs (relationships between molecular structures and the living organisms from which they were identified).
- MetXBioDB Metabolite Biotransformations
- PAMPA Permeability and NCATS dataset
- PsychonautWiki - altering substances
- QSAR datasets - Meta-QSAR (phase I & II) - Meta-QSAR: a large-scale application of meta-learning to drug design and discovery.
- The Human Metabolome Database (HMDB)
- KD-DTI - drug-target-interaction triplets (12K training samples, 1K validation samples and 1.1K test samples). See [paper](https://academic.oup.com/bioinformatics/article/38/22/5100/6751771?rss=1#382115390).
- ONSIDES
- The Metabolism and Transport Database
- Clinical Trials
- Drug–Drug–Interaction (DDI) - drug interactions as well as documents describing drug-drug interactions from the DrugBank database.
reactions
- USPTO - mining from United States patents published between 1976 and September 2016.
- RDB7 - mapped SMILES, barrier heights, and reaction enthalpies calculated at CCSD(T)-F12, which is known to be very accurate. Geometries are identified via the growing string method in this [paper](https://www.nature.com/articles/s41597-020-0460-4) while the high-quality energies are computed in this [paper](https://www.nature.com/articles/s41597-022-01529-6).
high-throughput screening data
- Perera - catalysed Suzuki-Miyaura C-C cross-couplings
- Dreher-Doyle - catalysed Buchwald–Hartwig C–N cross-couplings
eln data
