Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-chemistry-datasets

overview of datasets for ML in chemistry
https://github.com/kjappelbaum/awesome-chemistry-datasets


  • text datasets

    • BC5CDR - chemical-disease interactions (named entity recognition).
    • BioCreative V - chemical-disease interactions.
    • BioRxiv XML - Bulk access to the full text of bioRxiv articles for the purposes of text and data mining (TDM) is available via a dedicated Amazon S3 resource.
    • ChemTables - Licensed under CC BY NC 3.0.
    • Europe PMC - Bulk download of full text and SI of > 5 million articles.
    • IUPAC Gold Book
    • LibreText - open-access chemistry textbooks.
    • MedRxiv XML - Text and data mining is possible via a dedicated Amazon S3 resource.
    • NLM literature archive - text resource on chemicals in the biomedical literature. It contains 150 full-text journal articles, selected both to be rich in chemical mentions and to cover articles where human annotation was expected to be most valuable.
    • OpenStax - textbook (2e), released under CC-BY 4.0.
    • PubChemSTM
    • PubMed
    • PubMedQA
    • PubMed Central - full-text archive
    • S2ORC - English-language academic papers spanning many academic disciplines (described as the largest publicly available collection of machine-readable academic text). Released under CC BY-NC 4.0.
    • Elsevier Corpus - CC-BY articles from across Elsevier’s journals, described as the first cross-discipline research dataset at this scale to support NLP and ML research.
  • structures

  • molecular activity prediction benchmark datasets

    • MPCD - sample size and narrow-scaffold inhibitor datasets (LSSNS) and 30 higher-sample size and mixed-scaffold inhibitor datasets (HSSMS). Each dataset is visualised by [TMAP](https://bidd-group.github.io/MPCD/dataset/HSSMS/MoleculeACE_benchmark/space/info/CHEMBL4792_Ki.html).
    • MoleculeACE
  • ml structure-property benchmark datasets

  • Target identification data

  • Pharmacology & ADME & Metabolism

  • reactions

    • USPTO - reactions extracted by text mining from United States patents published between 1976 and September 2016.
    • RDB7 - atom-mapped SMILES, barrier heights, and reaction enthalpies calculated at the CCSD(T)-F12 level, which is known to be very accurate. Geometries are identified via the growing string method in this [paper](https://www.nature.com/articles/s41597-020-0460-4), while the high-quality energies are computed in this [paper](https://www.nature.com/articles/s41597-022-01529-6).
  • high-throughput screening data

    • Perera - Pd-catalysed Suzuki–Miyaura C–C cross-couplings
    • Dreher-Doyle - Pd-catalysed Buchwald–Hartwig C–N cross-couplings
  • eln data